Speech Technology and Research Laboratory


Information Extraction from Speech

Speech and Language Technology for Information Access and Processing


Jerry Hobbs (PI), Andreas Stolcke (co-PI)
Doug Appelt, Dilek Hakkani-Tür, Luc Julia, David Israel, David Martin
Madelaine Plauché, Ze'ev Rivlin, Elizabeth Shriberg, Kemal Sönmez, Gökhan Tür

Project Summary

The goal of this DARPA-funded project is to enhance and integrate current speech and natural language processing technology to enable information extraction from audio sources. This effort is a collaboration of SRI's Speech Technology and Research Laboratory and the Natural Language Program in the Artificial Intelligence Center (AIC). In collaboration with AIC's Perception Program we are also working on the integration of vision-based technologies (such as optical character recognition) with speech and natural language. That effort is also known as SIEVE (Speech and Information Extraction for Video Exploitation).

Research Efforts

We are presently focusing on several fundamental research problems that must be solved to achieve the overall goal of automatically extracting content from audio and video sources. Most of these research efforts are undertaken in collaboration with other ongoing projects in the STAR Lab and the AI Center, and most of the technologies are being developed in the DARPA Broadcast News domain, with the eventual goal of a sophisticated "News-on-Demand" system.

  • Speaker Segmentation and Tracking: This effort develops algorithms for segmenting speech by speaker and for identifying recurring speakers. This will enable retrieval of segments based on speaker identity, as well as enhance other aspects of the system, e.g., by adapting speech recognition models for individual speakers. This work is a collaboration with the Speaker Recognition and the Consistency Modeling projects.
  • Linguistic Segmentation: Here we aim to detect linguistic events that structure the input in ways that are important for further natural language processing. In particular, we want to detect sentence boundaries, so that coherent linguistic units can be archived, retrieved, and handed to an information extraction agent. Other events of importance are speech disfluencies (such as self-repairs), which could pose problems for natural language processing unless detected. This is a collaboration with the Disfluencies and Hidden Event Modeling projects.
  • Topic Segmentation: Above the sentence level, we need to segment the audio stream into stories, or topical units, a step that enables further content-based processing. Topic segmentation is achieved by modeling the changing distribution of topic-specific words in the speech data, as well as by the way speakers pause and modulate their pitch at topic boundaries. By combining both types of cues we achieve considerably better results than with text-based methods alone.
  • Information Extraction and Name Finding: We are adapting SRI's TextPro System, based on technology similar to that of SRI's earlier FASTUS System, to enable information extraction from recognized speech, dealing in particular with the presence of word errors and the absence of cues such as capitalization and sentence punctuation (cf. previous topic). A preliminary stage for information extraction is the identification of names (of people, places, and organizations) in the input.
  • Knowledge Source Integration: Because most information sources associated with audio and video are unreliable, our goal is to enhance recognition technologies by combining them with each other. For example, we can use the results of optical character recognition on video to prime the speech recognizer, or vice-versa. Another example of model adaptation across modalities involves using names extracted from textual sources to update the vocabulary of a speech recognizer.
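
The cue-combination idea behind the segmentation efforts above can be sketched in a few lines. This is a minimal illustration, not the project's actual models: the posterior values, the interpolation weight, and the decision threshold are all hypothetical, and the real system models lexical and prosodic evidence with far richer statistical machinery.

```python
import math

def combine_cues(lexical_posteriors, prosodic_posteriors, weight=0.5, threshold=0.5):
    """Log-linearly interpolate two per-boundary posterior streams and
    threshold the result to decide which candidate boundaries are real."""
    decisions = []
    for p_lex, p_pros in zip(lexical_posteriors, prosodic_posteriors):
        # Weighted combination in the log domain; a small floor avoids log(0).
        log_p = (weight * math.log(max(p_lex, 1e-10))
                 + (1 - weight) * math.log(max(p_pros, 1e-10)))
        decisions.append(math.exp(log_p) >= threshold)
    return decisions

# Three candidate inter-word boundaries (illustrative posteriors): the
# two cue streams agree strongly only on the second position, and the
# prosodic cue alone pushes the third one over the threshold.
lex  = [0.2, 0.9, 0.4]
pros = [0.3, 0.8, 0.9]
print(combine_cues(lex, pros))  # [False, True, True]
```

The combination is multiplicative rather than additive, so either cue stream can veto a weakly supported boundary, which mirrors why combining lexical and prosodic evidence outperforms either source alone.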


A key focus of the SIEVE project is the integration of multiple information sources, both auditory and visual, for information extraction. To browse and visualize many parallel, time-aligned information sources simultaneously, we developed MAESTRO, a graphical interface for Multimedia Annotation and Enhancement via a Synergy of automatic Technologies and Reviewing Operators. The name is motivated by the appearance of its user interface, which resembles a conductor's score in displaying the results of the various "voices", such as speech recognition, name extraction, scene detection, speaker tracking, and topic tracking. A block diagram identifies the components that provide input to the MAESTRO interface.
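
The "score" metaphor above amounts to a set of parallel annotation tracks indexed by time. The following sketch shows one way such time-aligned tracks could be represented and queried; the class and track names are hypothetical and do not reflect MAESTRO's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    track: str      # e.g. "speech", "names", "speaker", "topic"
    start: float    # start time in seconds
    end: float      # end time in seconds
    label: str

class Score:
    """Holds parallel, time-aligned annotation tracks and answers
    point-in-time queries across all of them."""
    def __init__(self):
        self.annotations = []

    def add(self, track, start, end, label):
        self.annotations.append(Annotation(track, start, end, label))

    def at(self, t):
        """Return every annotation, from any track, covering time t."""
        return [a for a in self.annotations if a.start <= t < a.end]

# Illustrative broadcast-news annotations on three tracks.
score = Score()
score.add("speech",  0.0, 2.5, "good evening from washington")
score.add("speaker", 0.0, 8.0, "anchor")
score.add("names",   1.2, 2.0, "Washington")
print([a.track for a in score.at(1.5)])  # tracks active at t = 1.5 s
```

Because every track shares the same time axis, a single query returns the aligned outputs of all component technologies, which is exactly what a conductor's-score display needs to render a column of the score.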

Funding Information

This project is funded through DARPA's Information Technology Office, Allen Sears, Program Manager, under contract N66001-97-C-8544.

Publications and Presentations

L.P. Heck & A. Sankar (1997), Acoustic Clustering and Adaptation for Robust Speech Recognition, Proc. EUROSPEECH, vol. 4, pp. 1867-1870, Rhodes, Greece. (PDF)

A. Stolcke, E. Shriberg, R. Bates, M. Ostendorf, D. Hakkani, M. Plauche, G. Tur, & Y. Lu (1998), Automatic Detection of Sentence Boundaries and Disfluencies based on Recognized Words. Proc. Intl. Conf. on Spoken Language Processing, vol. 5, pp. 2247-2250, Sydney, Australia. (PDF)

Multimodal Technology Integration for News-on-Demand. Presentation for DARPA News-on-Demand Compare-and-Contrast Meeting, September 1998. (PowerPoint source)

Combining Words and Prosody for Information Extraction from Speech. Presentation at the DARPA Broadcast News Workshop, Herndon, VA, March 1999.

A. Stolcke, E. Shriberg, D. Hakkani-Tur, G. Tur, Z. Rivlin, & K. Sonmez (1999), Combining Words and Speech Prosody for Automatic Topic Segmentation. Proc. DARPA Broadcast News Workshop, pp. 61-64, Herndon, VA. (HTML, PDF)

D. Appelt & D. Martin (1999), Named Entity Recognition in Speech: Approach and Results Using the TextPro System. Proc. DARPA Broadcast News Workshop, pp. 51-54, Herndon, VA. (PDF, HTML)

D. Hakkani-Tur, G. Tur, A. Stolcke, & E. Shriberg (1999), Combining Words and Prosody for Information Extraction from Speech. Proc. EUROSPEECH, vol. 5, pp. 1991-1994, Budapest. (PDF)

A. Stolcke, E. Shriberg, D. Hakkani-Tur, & G. Tur (1999), Modeling the Prosody of Hidden Events for Improved Word Recognition. Proc. EUROSPEECH, vol. 1, pp. 307-310, Budapest. (PDF)

Z. Rivlin, D. Appelt, R. Bolles, A. Cheyer, D. Hakkani-Tur, D. Israel, L. Julia, D. Martin, G. Myers, K. Nitz, B. Sabata, A. Sankar, E. Shriberg, K. Sonmez, A. Stolcke, & G. Tur (2000), MAESTRO: Conductor of Multimedia Analysis Technologies, Communications of the ACM 43(2), 57-63, Special Issue on News on Demand, February 2000. (DOI)

E. Shriberg, A. Stolcke, D. Hakkani-Tur, & G. Tur (2000), Prosody-Based Automatic Segmentation of Speech into Sentences and Topics, Speech Communication 32(1-2), 127-154 (Special Issue on Accessing Information in Spoken Audio). (PDF)

G. Tur, D. Hakkani-Tur, A. Stolcke, & E. Shriberg (2001), Integrating Prosodic and Lexical Cues for Automatic Topic Segmentation, Computational Linguistics, 27(1), 31-57. (PDF)

E. Shriberg & A. Stolcke (2001), Prosody Modeling for Automatic Speech Understanding: An Overview of Recent Research at SRI. In M. Bacchiani, J. Hirschberg, D. Litman, & M. Ostendorf (eds.), Proc. ISCA Tutorial and Research Workshop on Prosody in Speech Recognition and Understanding, pp. 13-16, Red Bank, NJ. (PDF)

E. Shriberg, A. Stolcke, & D. Baron (2001), Can Prosody Aid the Automatic Processing of Multi-Party Meetings? Evidence from Predicting Punctuation, Disfluencies, and Overlapping Speech. In M. Bacchiani, J. Hirschberg, D. Litman, & M. Ostendorf (eds.), Proc. ISCA Tutorial and Research Workshop on Prosody in Speech Recognition and Understanding, pp. 139-146, Red Bank, NJ. (PDF)

A. Stolcke & E. Shriberg (2001), Markovian Combination of Language and Prosodic Models for Better Speech Understanding and Recognition. Invited talk at the IEEE Workshop on Speech Recognition and Understanding, Madonna di Campiglio, Italy, December 2001. (PDF)

D. Baron, E. Shriberg, & A. Stolcke (2002), Automatic Punctuation and Disfluency Detection in Multi-Party Meetings Using Prosodic and Lexical Cues. Proc. Intl. Conf. on Spoken Language Processing, vol. 2, pp. 949-952, Denver. (PDF)


©2011 SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025-3493
SRI International is an independent, nonprofit corporation. Privacy policy

Last modified Aug 23, 2022