Modeling and Automatic Labeling of Hidden Word-Level Events in Spontaneous Speech

Project Summary

Current speech recognition technology is focused on transcribing spoken input into a sequence of words. Natural language processing (NLP), on the other hand, is concerned with the parsing, understanding, and indexing of transcribed utterances and larger linguistic units. In a fully automatic spoken language system, the output of the recognizer typically serves as input to the NLP component. At present, however, a significant gap remains between these two technologies, particularly for the processing of spontaneous speech.

Most NLP techniques have been developed for spoken input resembling read or highly constrained speech. When applied to spontaneous speech, such techniques encounter at least two main difficulties. First, spontaneous speech contains many surface phenomena relating to non-propositional aspects of the input. These include disfluencies (hesitations, repairs, and restarts), discourse markers (such as "well"), and other elements. Second, in spontaneous speech there is no overtly marked punctuation available for segmenting the input into meaningful units such as utterances. For optimal NLP performance, these types of phenomena should be annotated in the input; current speech recognizers, however, produce only a raw sequence of words.

The primary goal of this project is to augment standard speech recognition models so that recognizers can output word sequences annotated for the phenomena mentioned above. Inspection reveals that these phenomena can be represented as non-overt events occurring between words. For example, turns and utterances can be delimited by events occurring at inter-word junctures. Similarly, disfluencies and discourse elements can be demarcated by events surrounding the words involved. Because such phenomena (1) are not overt in the word sequence and (2) can be delimited using inter-word event representations, we refer to them collectively as Hidden Word-level Events, or simply HWEs.
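To make the inter-word event representation concrete, here is a minimal sketch in Python. It is an illustration only, not the project's actual annotation scheme: the event labels (DM_START, DM_END, REPAIR, UTT_END), the function name interleave, and the convention that juncture i sits immediately before word i are all assumptions introduced for this example.

```python
from typing import Dict, List


def interleave(words: List[str], events: Dict[int, str]) -> List[str]:
    """Interleave a word sequence with hidden events at inter-word junctures.

    Juncture i lies immediately before words[i]; juncture len(words) follows
    the last word. `events` maps a juncture index to a (hypothetical) label.
    Junctures without an entry carry no event.
    """
    n = len(words)
    for idx in events:
        if not 0 <= idx <= n:
            raise ValueError(f"juncture index {idx} outside 0..{n}")

    annotated: List[str] = []
    for i, word in enumerate(words):
        if i in events:
            annotated.append(f"<{events[i]}>")
        annotated.append(word)
    if n in events:  # event after the final word, e.g. an utterance end
        annotated.append(f"<{events[n]}>")
    return annotated


# Hypothetical example: a discourse marker, a repeated word, an utterance end.
words = ["well", "i", "i", "want", "a", "flight"]
events = {0: "DM_START", 1: "DM_END", 2: "REPAIR", 6: "UTT_END"}
print(" ".join(interleave(words, events)))
# <DM_START> well <DM_END> i <REPAIR> i want a flight <UTT_END>
```

Because the events live only at junctures, removing every bracketed token recovers the raw word sequence that a standard recognizer would produce.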

The approach is to develop more comprehensive speech models that will allow the automatic recognition and classification of HWEs to occur in tandem with standard word recognition. HWE recognition will be based on a combination of acoustic and language models, extending the standard components found in current systems. New models will capture the specific prosodic characteristics of HWEs, such as intonation and duration patterns. Information from prosodic features will be combined with statistical language models that describe the distribution of HWEs in relation to words, parts-of-speech, and other syntactic and lexical units.
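One simple way to picture the combination of prosodic and language-model information is a weighted log-linear combination at a single juncture. The sketch below assumes that a prosodic model and a language model each supply a probability for every candidate event at that juncture. The function name combine_scores, the lm_weight parameter, the event labels, and the probability values are hypothetical, and scoring each juncture independently is a simplification of the integrated recognition described above.

```python
import math
from typing import Dict


def combine_scores(
    prosody_prob: Dict[str, float],
    lm_prob: Dict[str, float],
    lm_weight: float = 1.0,
    floor: float = 1e-12,
) -> Dict[str, float]:
    """Combine two per-event probability estimates for one juncture.

    Computes log P_prosody(e) + lm_weight * log P_lm(e) for every event
    label present in both inputs, then normalizes so the results sum to 1.
    Probabilities are floored to avoid taking log(0).
    """
    labels = prosody_prob.keys() & lm_prob.keys()
    if not labels:
        raise ValueError("no event labels shared by the two models")

    log_scores = {
        e: math.log(max(prosody_prob[e], floor))
        + lm_weight * math.log(max(lm_prob[e], floor))
        for e in labels
    }
    # Normalize with log-sum-exp for numerical stability.
    top = max(log_scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in log_scores.values()))
    return {e: math.exp(s - log_z) for e, s in log_scores.items()}


# Made-up numbers: prosody mildly favors a boundary, the word context does not.
prosody = {"NONE": 0.4, "UTT_END": 0.6}
lm = {"NONE": 0.9, "UTT_END": 0.1}
posterior = combine_scores(prosody, lm)
print(max(posterior, key=posterior.get))  # NONE
```

Setting lm_weight to 0 makes the decision depend on prosody alone, which is one way to examine how much each knowledge source contributes.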

The integrated modeling of HWEs distinguishes the proposed work from past efforts, which have been based on post-processing of word-level information. The integrated approach promises to yield better results than post-processing techniques because it draws on additional information, namely the acoustic and prosodic cues that a word-level transcript does not preserve. Furthermore, acoustic information may be more reliable than word-level information, which is subject to speech recognition errors.

A second goal of our research is to improve word recognition itself, since the combined word/HWE models are expected to be superior to those used by standard recognizers. For example, with the help of an HWE representation, a recognizer would be able to discount hypotheses for which information from prosodic features is inconsistent with the hypothesized word/HWE sequence.

In addition to its use in the development of speech recognition algorithms, automatic HWE labeling has great potential practical benefit for speech research in general. The techniques to be developed will enable automatic labeling of large corpora of spontaneous speech, reducing the need for human annotators and benefiting other areas of speech and language research that rely on large amounts of annotated spontaneous speech data.