Modeling and Automatic Labeling of Hidden Word-Level Events in Spontaneous Speech
Current speech recognition technology is focused on transcribing
spoken input into a sequence of words. Natural language processing
(NLP), on the other hand, is concerned with the parsing, understanding,
and indexing of transcribed utterances and larger linguistic units. In
a fully automatic spoken language system, the output of the recognizer
typically serves as input to the NLP component. At present, however, a
significant gap remains between these two technologies, particularly
for the processing of spontaneous speech.
Most NLP techniques have been developed for spoken input resembling
read or highly constrained speech. When applied to spontaneous speech,
such techniques encounter at least two main difficulties. First,
spontaneous speech contains many surface phenomena relating to
non-propositional aspects of the input, including
disfluencies (hesitations, repairs, and restarts), discourse markers
(such as "well"), and other elements. Second, in spontaneous speech there
is no overtly marked punctuation available for segmenting the input
into meaningful units such as utterances. For optimal NLP
performance, these types of phenomena should be
annotated in the input; current speech recognizers, however, produce
only a raw sequence of words.
The primary goal of this project is to augment standard speech
recognition models to enable recognizers to output
sequences annotated for the phenomena mentioned above.
Inspection reveals that
these phenomena can be represented as non-overt events occurring
between words. For example, turns and utterances can be delimited by
events occurring at inter-word junctures.
Similarly, disfluencies and discourse elements can be demarcated by
events surrounding the words involved. Because
such phenomena 1) are not overt in the word sequence; and 2) can be
delimited using inter-word event representations, we refer to them
collectively as Hidden Word-level Events or simply HWEs.
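As an illustration of the inter-word event representation, the following
minimal Python sketch stores one optional event label at each juncture of a
word sequence (before the first word, between adjacent words, and after the
last word). The label names `<utt>` and `<repair>` and the helper functions
are hypothetical, chosen for this example only; the project description does
not specify a label inventory or data format.

```python
from typing import List, Optional

# Hypothetical event labels (assumptions for illustration only).
UTT_END = "<utt>"      # utterance boundary at this juncture
REPAIR = "<repair>"    # interruption point of a disfluency

def interleave(words: List[str], events: List[Optional[str]]) -> List[str]:
    """Merge words and juncture events into one annotated token sequence.

    events[i] is the event at the juncture before words[i];
    events[len(words)] is the event after the last word.
    """
    if len(events) != len(words) + 1:
        raise ValueError("need exactly one juncture more than words")
    out: List[str] = []
    for i, word in enumerate(words):
        if events[i] is not None:
            out.append(events[i])
        out.append(word)
    if events[-1] is not None:
        out.append(events[-1])
    return out

def split_utterances(words: List[str],
                     events: List[Optional[str]]) -> List[List[str]]:
    """Segment the word sequence at junctures labeled as utterance boundaries."""
    utterances: List[List[str]] = []
    current: List[str] = []
    for i, word in enumerate(words):
        current.append(word)
        if events[i + 1] == UTT_END:
            utterances.append(current)
            current = []
    if current:  # trailing words with no closing boundary event
        utterances.append(current)
    return utterances

words = ["i", "i", "want", "that", "okay"]
events = [None, REPAIR, None, None, UTT_END, UTT_END]
annotated = interleave(words, events)
segments = split_utterances(words, events)
```

In this layout the raw recognizer output is simply `words`, while the hidden
events live in a parallel list, so a downstream NLP component can either
consume the annotated sequence or ignore the events entirely.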
The approach is to develop more comprehensive speech models that will
allow the automatic recognition and classification of HWEs to occur in
tandem with standard word recognition. HWE recognition will be based
on a combination of acoustic and language models, extending the
standard components found in current systems. New models will capture
the specific prosodic characteristics of HWEs, such as intonation and
duration patterns. Information from prosodic features will be combined
with statistical language models that describe the distribution of
HWEs in relation to words, parts-of-speech, and other syntactic and [...]
The integrated modeling of HWEs distinguishes the
proposed work from past efforts, which have been based on
post-processing of word-level information. The integrated approach
promises to yield better results than post-processing techniques
because it draws on acoustic and prosodic information that word-level
post-processing does not use. Furthermore, acoustic
information may be more reliable than word-level information, which is
subject to speech recognition errors.
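One simple way to combine the two knowledge sources at a single inter-word
juncture is a weighted sum of log-probabilities. The sketch below is an
assumption made for illustration, not the project's actual model: the
probability tables, the `prosody_weight` parameter, and the function name
`label_juncture` are all hypothetical, and a full system would decode events
jointly with the word sequence rather than one juncture at a time.

```python
import math
from typing import Dict, Optional

def label_juncture(lm_probs: Dict[str, float],
                   prosody_probs: Dict[str, float],
                   prosody_weight: float = 1.0) -> Optional[str]:
    """Pick the event label with the best combined score at one juncture.

    lm_probs:      hypothetical P(event | surrounding words) from a language model
    prosody_probs: hypothetical P(event | pause, duration, intonation features)
    The score is log P_lm + prosody_weight * log P_prosody.
    Returns None if no event has nonzero probability under both models.
    """
    best_event: Optional[str] = None
    best_score = -math.inf
    for event, p_lm in lm_probs.items():
        p_pros = prosody_probs.get(event, 0.0)
        if p_lm <= 0.0 or p_pros <= 0.0:
            continue  # skip events that either model rules out
        score = math.log(p_lm) + prosody_weight * math.log(p_pros)
        if score > best_score:
            best_event, best_score = event, score
    return best_event

# Made-up numbers: the words alone favor "no event", but the prosody
# (for example, a long pause) favors an utterance boundary.
lm = {"none": 0.7, "utt": 0.3}
prosody = {"none": 0.2, "utt": 0.8}

combined = label_juncture(lm, prosody)                         # both sources
words_only = label_juncture(lm, prosody, prosody_weight=0.0)   # ignores prosody
```

With these invented values the combined decision differs from the
words-only decision, which is the kind of case where prosodic evidence
would be expected to help.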
A second goal of our research is to improve word recognition
itself, since the combined word/HWE models are expected to be superior
to those used by standard recognizers. For example, with the help of an
HWE representation, a recognizer would be able to discount hypotheses for which
information from prosodic features is inconsistent with the
hypothesized word/HWE sequence.
In addition to
its use in the development of speech recognition algorithms, automatic
HWE labeling has great potential practical benefit for speech research
in general. The techniques to be developed will enable automatic labeling
of large corpora of spontaneous speech, reducing the need for human
annotators and benefiting other areas of speech and language research
that rely on large amounts of annotated spontaneous speech data.