Modeling and Automatic Labeling of Hidden Events in Speech
Elizabeth Shriberg

Authors: Elizabeth Shriberg and Andreas Stolcke
WWW Page: http://www.speech.sri.com/projects/hidden-events.html
Program Area: Speech and Natural Language Understanding
Keywords: spontaneous speech recognition, speech understanding, disfluency detection, prosody modeling, statistical language modeling, sentence and topic segmentation, dialog modeling

Project Summary
In order to use natural language processing (NLP) successfully with speech recognition, including with spontaneous speech, automatic speech recognizers must be able to provide more than just a stream of words. They must provide information about language structure, at both local and more global levels, including:
The primary goal of this project is to augment standard speech recognition models to enable recognizers to output word sequences annotated for such events. Inspection reveals that these phenomena, regardless of their span, can all be represented as non-overt events occurring at word boundaries. For example, sentence and topic boundaries can be delimited by events occurring at inter-word junctures. Similarly, disfluencies and discourse elements can be demarcated by events surrounding the words involved. Because such phenomena (1) are not overt in the word sequence and (2) can be delimited using inter-word event representations, we refer to them collectively as "Hidden Word-level Events" or simply "events".

Approach
The approach develops more comprehensive speech models that allow the automatic recognition and classification of events to occur in tandem with standard word recognition. Event recognition is based on a combination of prosodic and language models, extending the standard components found in current systems. New models capture the specific prosodic characteristics of events, such as intonation and duration patterns. Information from prosodic features is combined with statistical language models that describe the distribution of events in relation to words and other lexical units. The integrated modeling of events and the addition of prosodic modeling distinguish this work from past efforts, which have been based on post-processing of word-level information.

Results
We have successfully applied the combined prosody and language modeling approach to automatically detect the five types of events listed above, on two standard speech corpora representing different styles. Results for spontaneous, conversational speech (from the Switchboard corpus) show that adding the prosodic information significantly improves detection performance over word-based modeling for disfluency detection, sentence boundary detection, and dialog act classification.
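To make the idea of events at inter-word boundaries concrete, the following is a minimal sketch, not the project's actual models. It assumes that a language model and a prosody model each supply a probability for every event type at every word boundary, and it combines them by log-linear interpolation, which is an assumption chosen for illustration; the project summary does not state the combination method. The event names, the `Boundary` class, the `prosody_weight` parameter, and all probabilities are hypothetical, and each boundary is decided independently, which is a simplification.

```python
import math
from dataclasses import dataclass

# Hypothetical event inventory, for illustration only.
EVENTS = ("NONE", "SENT_BOUNDARY", "DISFLUENCY")


@dataclass
class Boundary:
    """The juncture following one word, with made-up model outputs."""
    lm_probs: dict       # P(event | word context), from a language model
    prosody_probs: dict  # P(event | prosodic features), from a prosody model


def combine(boundary, prosody_weight=0.5):
    """Log-linear interpolation of the two models, renormalized over events.

    Assumes every probability is strictly positive (log(0) is undefined).
    """
    scores = {
        e: (1.0 - prosody_weight) * math.log(boundary.lm_probs[e])
        + prosody_weight * math.log(boundary.prosody_probs[e])
        for e in EVENTS
    }
    # Normalize with log-sum-exp for numerical stability.
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {e: math.exp(s - log_z) for e, s in scores.items()}


def label(words, boundaries, prosody_weight=0.5):
    """Interleave words with the most probable event after each word."""
    output = []
    for word, boundary in zip(words, boundaries):
        output.append(word)
        posterior = combine(boundary, prosody_weight)
        best = max(posterior, key=posterior.get)
        if best != "NONE":
            output.append(f"<{best}>")
    return output


# Toy input: one Boundary per word, describing the juncture after that word.
words = ["okay", "i", "i", "see"]
boundaries = [
    Boundary({"NONE": 0.5, "SENT_BOUNDARY": 0.4, "DISFLUENCY": 0.1},
             {"NONE": 0.2, "SENT_BOUNDARY": 0.7, "DISFLUENCY": 0.1}),
    Boundary({"NONE": 0.5, "SENT_BOUNDARY": 0.1, "DISFLUENCY": 0.4},
             {"NONE": 0.3, "SENT_BOUNDARY": 0.1, "DISFLUENCY": 0.6}),
    Boundary({"NONE": 0.9, "SENT_BOUNDARY": 0.05, "DISFLUENCY": 0.05},
             {"NONE": 0.8, "SENT_BOUNDARY": 0.1, "DISFLUENCY": 0.1}),
    Boundary({"NONE": 0.3, "SENT_BOUNDARY": 0.6, "DISFLUENCY": 0.1},
             {"NONE": 0.2, "SENT_BOUNDARY": 0.7, "DISFLUENCY": 0.1}),
]

labeled = label(words, boundaries)
print(" ".join(labeled))
```

In this toy input, the word-based probabilities alone favor no event after "okay" and after the first "i"; the prosody probabilities tip both decisions, which mirrors the kind of gain the results describe. Setting `prosody_weight` to 0 falls back to the word-based model alone.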
Results for more formal speech (from the Broadcast News corpus, in work jointly funded by the DARPA TRVS program) show that the prosody model _alone_ performs at a level that is comparable to or better than that of the word-based model for sentence and topic segmentation tasks. A further gain is achieved by combining the prosody and word-based models, both when using true words and when using words from a speech recognizer.

A second goal of our research is to improve word recognition itself, since the combined word/event models should be superior to those used by standard recognizers. Recent results show that just by adding prosodic modeling of sentence boundaries and disfluencies, we significantly reduce the word error on Switchboard.

In addition to its use in the development of speech recognition algorithms, automatic event labeling has great potential for practical benefit to speech research in general. The techniques will enable automatic labeling of large corpora of spontaneous speech, reducing the need for human annotators and benefiting other areas of speech and language research.

Project References
A. Stolcke, E. Shriberg, D. Hakkani-Tur, G. Tur, Z. Rivlin, & K. Sonmez (1999), Combining words and prosody for information extraction from speech. To appear in Proc. DARPA Speech Workshop.
A. Stolcke, E. Shriberg, R. Bates, M. Ostendorf, D. Hakkani, M. Plauche, G. Tur, & Y. Lu (1998), Automatic Detection of Sentence Boundaries and Disfluencies based on Recognized Words. Proc. Intl. Conf. on Spoken Language Processing, vol. 5, pp. 2247-2250, Sydney, Australia.
M. Plauche & E. Shriberg (1999), Hierarchical Clustering of Acoustic Features in Repetitions. To appear in Proc. International Congress of Phonetic Sciences, San Francisco.
E. Shriberg, R. Bates, A. Stolcke, P. Taylor, D. Jurafsky, K. Ries, N. Coccaro, R. Martin, M. Meteer, & C. Van Ess-Dykema (1998), Can Prosody Aid the Automatic Classification of Dialog Acts in Conversational Speech? Language and Speech, 41(3-4): 439-487.
E. Shriberg & A. Stolcke (1998), How far do speakers back up in repairs? A quantitative model. Proc. Intl. Conf. on Spoken Language Processing, vol. 5, pp. 2183-2186, Sydney, Australia.

Area Background
Current speech recognition technology is focused on transcribing spoken input into a sequence of words. Natural language processing (NLP), on the other hand, is concerned with the parsing, understanding, and indexing of transcribed utterances and larger linguistic units. In a fully automatic spoken language system, the output of the recognizer typically serves as input to the NLP component. At present, however, a significant gap remains between these two technologies, particularly for the processing of spontaneous speech.

Most NLP techniques have been developed for spoken input resembling read or highly constrained speech. When applied to spontaneous speech, such techniques encounter at least two main difficulties. First, spontaneous speech contains many surface phenomena relating to non-propositional aspects of the input (such as hesitations, repairs, and restarts) that need to be detected and corrected for further processing. Second, speech lacks overt segmentation cues (such as punctuation, paragraphs, and headers) for parsing the input into meaningful units such as utterances or topics. For a variety of NLP tasks, these types of events should be annotated in the input. Current automatic segmentation techniques are word-based and do not use the rich source of information in speech prosody, which plays an important role in human speech understanding.

Area References
P. A. Heeman & J. Allen (1997), Intonational Boundaries, Speech Repairs, and Discourse Markers: Modeling Spoken Dialog. Proc. 35th Annual Meeting of the Association for Computational Linguistics, Madrid.
D. J. Litman & R. J. Passonneau (1995), Combining multiple knowledge sources for discourse segmentation. Proc. 33rd Annual Meeting of the Association for Computational Linguistics, Cambridge, MA.
C. H. Nakatani & J. Hirschberg (1994), A corpus-based study of repair cues in spontaneous speech. Journal of the Acoustical Society of America, 95(3), 1603-1616.
D. Beeferman, A. Berger, & J. Lafferty (1999), Statistical Models for Text Segmentation. Machine Learning (to appear).
P. Price (1996), Spoken Language Understanding. In R. A. Cole (ed.), Survey of the State of the Art in Human Language Technology, Center for Spoken Language Understanding, Oregon Graduate Institute.
H. Sacks, E. A. Schegloff, & G. Jefferson (1974), A simplest systematics for the organization of turn-taking for conversation. Language, 50(4), 696-735.

Related Program Areas
Adaptive Human Interfaces

Potential Related Projects
The project has widespread implications and is therefore related to a number of other efforts currently funded by NSF, DARPA, and other government agencies. Main related projects include those concerning spontaneous speech modeling and recognition, disfluency modeling, topic segmentation, information extraction and retrieval, speech understanding, discourse modeling, and automatic speech annotation.