Annual NSF Grant Progress Report: Year 1 IRI-9619921

Elizabeth Shriberg and Andreas Stolcke

Summary of Progress

Year 1 of the project focused on two main efforts related to hidden-event modeling for conversational speech. One area was the modeling of hidden dialog-act level events, such as statements and questions, when the sentence boundaries are known. The second area was the detection of disfluencies and sentence boundaries, when neither the words nor the sentence boundaries are known. We briefly describe each of these efforts below.

Area 1: Dialog act modeling

The PI and co-PI attended the 1997 Summer Workshop at the Center for Language and Speech Processing at Johns Hopkins University (WS97), as members of a team of researchers investigating discourse modeling for spontaneous conversational speech. The other researchers on the team included D. Jurafsky (U. Colorado at Boulder, project leader), M. Meteer (BBN), P. Taylor (U. Edinburgh, UK), and C. Van Ess-Dykema (DoD), along with students R. Bates (Boston U.), N. Coccaro (U. Colorado at Boulder), R. Martin (Johns Hopkins U.), and K. Ries (CMU). Dr. Shriberg led the prosody modeling aspects of the project and co-designed the discourse annotation system. Dr. Stolcke co-developed the original project proposal with Dr. Jurafsky, and led the language modeling and model integration work. The six-week workshop was preceded by year-long preparatory work and followed by extensive follow-up research and publication activities.

In the WS97 project on discourse modeling we pursued a two-fold goal. First, we wanted to develop methods for automatically labeling the utterances of a spontaneous human-to-human conversation as to their pragmatic function, such as Statement, Acknowledgment, Agreement, or Question. This kind of analysis of utterance function (or "dialog acts") is a crucial first step in parsing a conversation for automatic speech understanding, e.g., for the purpose of summarizing the conversation. Second, the project investigated whether and how speech recognition might benefit from modeling spoken language in terms of such dialog acts, instead of treating all utterances the same, as is currently done in most recognizer models. This seems plausible: if one could first determine that an utterance is of a certain type (say, a Question), additional constraints could be imposed on the word choice during word recognition, with the aim of improving recognition accuracy.

The WS97 project had close conceptual and technical connections to the Hidden Event Modeling (HWE) project. Both projects aim at a basic level of conversational speech understanding by recovering information that is hidden, given only the words. For example, one of the goals of HWE detection is to find utterance (sentence) boundaries. By using the discourse models developed at WS97, the identified sentential units can then be further categorized as Statements, Questions, and so forth. At the technical level, both projects shared the basic approach of combining word-level information (from a speech recognizer) and evidence from the prosody (melody and rhythm) of the speech to achieve their automatic labeling goals. Consequently, we were able to develop specific techniques, algorithms, and data resources that are now supporting the HWE work. These include a large parallel database of spontaneous speech transcripts, annotations, and prosodic features; training techniques for prosodic decision tree classifiers; and techniques for integrating statistical language models and decision trees.

Experiments and results

All training and testing of discourse models used the Switchboard corpus of spontaneous human-to-human telephone conversations. A dialog act labeling scheme was developed to code discourse functions, and a group of students at CU Boulder labeled a 1.4-million-word subset of the corpus. For modeling purposes, we distinguished 42 distinct dialog acts.

We investigated automatic dialog act classification using a range of techniques and knowledge sources. A full report of the findings is available in [1]. Two figures serve as lower and upper bounds for all results. First, the most prevalent dialog act type (non-opinion statements) accounts for 35% of the labels, so a classifier that always chooses this type achieves a chance performance of 35%. Second, human labelers (looking only at the word transcripts of the conversations) achieve 84% agreement on labels, which is therefore the best we can expect any system to do in the absence of acoustic information.

Given correct word information, we developed a classifier that uses utterance-level and conversation-level language models to achieve 72% labeling accuracy. The utterance-level model indicates how likely each dialog act is, given the words in an observed utterance, while the conversation-level model characterizes the probabilities of sequences of dialog acts. Both models have the same form as the standard N-gram models typically used in speech recognition. The models are combined in a Hidden Markov Model paradigm that models each dialog act as a hidden "conversation state."
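The combination of the two models can be illustrated with a small Viterbi decoder. The following is a minimal sketch under stated assumptions, not the workshop system: the three dialog acts, the unigram word probabilities, the bigram transition probabilities, and the floor for unseen words are all made-up toy values chosen for illustration.

```python
import math

def viterbi_dialog_acts(utterances, acts, word_logprob, trans_logprob, init_logprob):
    """Return the most likely dialog-act sequence for a non-empty list of utterances.

    word_logprob[act][word]  : unigram log P(word | act), the utterance-level model
    trans_logprob[prev][act] : bigram log P(act | prev), the conversation-level model
    init_logprob[act]        : log P(act) for the first utterance
    """
    floor = math.log(1e-4)  # illustrative floor for words unseen with an act

    def emit(act, words):
        return sum(word_logprob[act].get(w, floor) for w in words)

    # best[t][act] = (log score of the best path ending in act at time t, backpointer)
    best = [{a: (init_logprob[a] + emit(a, utterances[0]), None) for a in acts}]
    for words in utterances[1:]:
        column = {}
        for a in acts:
            prev = max(acts, key=lambda p: best[-1][p][0] + trans_logprob[p][a])
            score = best[-1][prev][0] + trans_logprob[prev][a] + emit(a, words)
            column[a] = (score, prev)
        best.append(column)

    # Trace back from the best final state.
    act = max(acts, key=lambda a: best[-1][a][0])
    path = [act]
    for t in range(len(best) - 1, 0, -1):
        act = best[t][act][1]
        path.append(act)
    return path[::-1]

# Toy models (hypothetical numbers, not estimated from Switchboard).
ACTS = ["Statement", "Question", "Backchannel"]
WORD_PROB = {
    "Statement": {"i": 0.3, "think": 0.3, "so": 0.2, "you": 0.1, "do": 0.05, "yeah": 0.05},
    "Question": {"do": 0.35, "you": 0.35, "think": 0.15, "so": 0.1, "i": 0.05},
    "Backchannel": {"yeah": 0.6, "uh-huh": 0.4},
}
TRANS_PROB = {
    "Statement": {"Statement": 0.5, "Question": 0.2, "Backchannel": 0.3},
    "Question": {"Statement": 0.6, "Question": 0.1, "Backchannel": 0.3},
    "Backchannel": {"Statement": 0.6, "Question": 0.2, "Backchannel": 0.2},
}
WORD_LOGPROB = {a: {w: math.log(p) for w, p in d.items()} for a, d in WORD_PROB.items()}
TRANS_LOGPROB = {a: {b: math.log(p) for b, p in d.items()} for a, d in TRANS_PROB.items()}
INIT_LOGPROB = {a: math.log(1.0 / len(ACTS)) for a in ACTS}

conversation = [["do", "you", "think", "so"], ["yeah"], ["i", "think", "so"]]
labels = viterbi_dialog_acts(conversation, ACTS, WORD_LOGPROB, TRANS_LOGPROB, INIT_LOGPROB)
print(labels)
```

In this sketch each utterance is scored against every dialog act by the utterance-level model, and the conversation-level model decides which sequence of acts is most plausible overall.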

When we replaced the true words with the output of a speech recognizer, labeling accuracy dropped to 64%, a surprisingly small degradation given that almost half the words are incorrectly recognized in this type of speech.

We also found encouraging preliminary results using prosodic information. A decision tree trained to classify the five most frequent dialog acts achieved an accuracy of 50% when combined with a conversation-level language model. When the prosodic information was combined with word-level evidence from the speech recognizer, performance improved to 65%, a result better than that obtained using either knowledge source alone. Part of the ongoing work in HWE modeling is concerned with the question of how to design prosodic and word-level models so as to optimize the combined overall result.
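One simple way to combine the two knowledge sources is sketched below. It assumes that words and prosodic features are conditionally independent given the dialog act, and that the tree's posterior can be turned into a scaled likelihood by dividing by the class prior seen in tree training. Both assumptions, the weight parameter, and all numbers are illustrative and are not taken from the report.

```python
import math

def combine_scores(prior, word_loglik, tree_posterior, tree_prior, prosody_weight=1.0):
    """Posterior over dialog acts from word and prosodic evidence.

    prior[act]          : P(act), e.g. from the conversation-level model
    word_loglik[act]    : log P(words | act) from the utterance-level model
    tree_posterior[act] : P(act | prosodic features) from the decision tree (nonzero)
    tree_prior[act]     : class prior in the tree's training data
    """
    log_scores = {}
    for act in prior:
        # Scaled likelihood: P(features | act) is proportional to P(act | features) / P(act).
        prosody_loglik = math.log(tree_posterior[act]) - math.log(tree_prior[act])
        log_scores[act] = (math.log(prior[act]) + word_loglik[act]
                           + prosody_weight * prosody_loglik)
    # Normalize with log-sum-exp for numerical stability.
    top = max(log_scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in log_scores.values()))
    return {act: math.exp(s - log_z) for act, s in log_scores.items()}

# Toy example: the words do not separate the two acts, but the prosody favors a question.
posterior = combine_scores(
    prior={"Statement": 0.7, "Question": 0.3},
    word_loglik={"Statement": math.log(0.01), "Question": math.log(0.01)},
    tree_posterior={"Statement": 0.2, "Question": 0.8},
    tree_prior={"Statement": 0.5, "Question": 0.5},
)
print(posterior)
```

The weight on the prosodic term is one possible knob for the optimization question raised above; setting it to zero reduces the sketch to the word-only classifier.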

Focused prosodic analyses

In addition to the work at the summer workshop, we conducted further experiments at SRI to more closely assess the potential contribution of prosody to dialog act modeling. Results are described in detail in [3]. It turned out that the majority of dialog acts in Switchboard fell into two main classes, sentences and backchannels, which are easy to distinguish based on length alone. Because of the high priors for these two dialog acts, the inherent contribution of prosodic features was obscured when evaluating on overall classification performance. To investigate the degree to which dialog acts are inherently marked prosodically, we thus conducted additional analyses in the no-priors domain, effectively making all dialog acts equally likely. Such analyses provide a better picture of how prosodic features can be used for corpora with less highly skewed distributions. Similarly, they can be applied to the problem of detecting a specific type of dialog act (for example, questions) when a similar-looking type of dialog act (e.g., statements) is more frequent in the naturally occurring data.
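One way to obtain a no-priors setting is to resample the data so that every class is equally frequent. The sketch below does this by downsampling each class to the size of the rarest one. The function name, the data layout, and the class counts are hypothetical, and this summary does not specify the exact resampling procedure used in the analyses.

```python
import random
from collections import Counter, defaultdict

def downsample_to_equal_priors(examples, seed=0):
    """Return a shuffled subset of (features, label) pairs with equal counts per label."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    # Every class is cut down to the size of the rarest class.
    n = min(len(items) for items in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n))
    rng.shuffle(balanced)
    return balanced

# Toy skewed data set: 70 statements, 25 backchannels, 5 questions.
examples = ([((i,), "Statement") for i in range(70)]
            + [((i,), "Backchannel") for i in range(25)]
            + [((i,), "Question") for i in range(5)])
balanced = downsample_to_equal_priors(examples)
counts = Counter(label for _, label in balanced)
print(counts)
```

After resampling, chance performance is one over the number of classes (one third in this toy case), so any accuracy above that level reflects information in the features rather than in the class priors.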

Feature importance

Across analyses we found that a variety of features were useful for dialog act classification. There was considerable feature redundancy among the pause, duration, F0, energy, and speaking-rate features, such that there was little loss in classification when features of a single type were removed from the decision trees. Interestingly, although canonical or predicted features such as F0 for questions were important, less predictable features (such as pause features for questions) showed similar or even greater influence on results. Also importantly, across analyses gender was not used in the trees. This suggests that gender-dependent features such as F0 were sufficiently normalized to allow gender-independent modeling. For specific tasks, however, there were considerable differences across features, largely related to the length of the dialog acts considered. For example, duration features are a powerful predictor when the dialog acts to be classified have considerable length; for tasks distinguishing only among short dialog acts, however, duration is less helpful and other features such as F0 and energy play a stronger role.

Overall, the high degree of feature compensation found across tasks suggests that automatic systems could be successful using only a subset of the feature types. However, we also found that different feature types are used to varying degrees in the different tasks, and it is not straightforward at this point to predict which features will be most important for a given task. Therefore, for best coverage on a variety of classification tasks, it is desirable to have available as many different feature types as possible.

Integration of trees with language models

Not only were the prosodic trees able to classify the data at rates significantly above chance, but they also provided a consistent advantage over word information alone (for an overall task and three subtasks). Furthermore, the advantage was larger when the word information was based on recognizer hypotheses than when it was based on the "true" words (a cheating condition). This pattern of results suggests that prosody can provide significant benefit over word information alone, particularly when word recognition is imperfect.

Area 2: Disfluency and sentence boundary modeling

The first major goal of the current project is to build a baseline system that uses both prosodic-acoustic information and automatically recognized words to detect HWEs in spontaneous speech. In previous NSF-funded work, we investigated the use of prosodic features alone for disfluency detection [2, 4], as well as the use of word-based language models for the detection of utterance-boundary events [5]. Additionally, we investigated language models designed to represent disfluency events [6]. The present research therefore extends and integrates the prior work. We worked on three fronts toward this goal.

Mathematical framework

First, in joint work with the disfluencies project at SRI (NSF-IRI9314967) and with our consultant Prof. Mari Ostendorf at Boston University, we developed a theoretical framework to combine prosodic knowledge with an errorful word transcript obtained from a recognizer for HWE detection. This framework extends the one we used in our earlier work. One chief advantage of the new framework is that it allows word-dependent and prosodic features to be integrated in a way that accounts for possible dependence between these knowledge sources. An addition to the theoretical framework is the use of prosody to discriminate likely from unlikely word and segmentation hypotheses.

Prosodic features database

The second area of effort involved the development of our word-level prosodic database. This work extended the set of features used in previous work, developed improved extraction methods, and added features now available given the new mathematical framework. This includes features based on word and phone level alignment information. Main efforts focused on

The construction of the prosodic database (which will be continuously improved throughout the project) has just been completed. Much of the raw feature computation work was done by Rebecca Bates, a Ph.D. student at Boston University, who worked with us at SRI on the related earlier projects. The database will also be used for her work under Dr. Ostendorf's STIMULATE project.

Baseline recognizer

Our third area of effort was in the development of a baseline HWE recognizer. This work required building a new parallel database of Switchboard transcripts, human HWE annotations, and prosodic features. We made use of the experience gained from earlier work, especially during the WS97 JHU discourse modeling project, to add new prosodic features, improve extraction and normalization techniques, and clean up deficiencies in the corpus data. A major element in the new database is the addition of a portion based on automatically generated transcripts from a speech recognizer. This portion of the database uses the top 100 hypotheses from the recognizer instead of the human transcripts. A part of this N-best database is needed to train the new prosodic model that assesses word and segmentation likelihood; the remainder is set aside for evaluation purposes.
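The role of a prosodic model over N-best hypotheses can be sketched as a reranking step. The sketch below assumes a weighted sum of log scores, which is an illustrative choice and not necessarily the combination defined by the project's framework. The prosodic scorer is a hypothetical stand-in that only checks whether the number of words fits the observed duration.

```python
def rescore_nbest(nbest, prosody_logscore, weight=1.0):
    """Rerank (words, recognizer_logscore) pairs by a weighted sum of log scores."""
    scored = [(rec_score + weight * prosody_logscore(words), words)
              for words, rec_score in nbest]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [words for _, words in scored]

def toy_prosody_logscore(words, observed_duration=1.2, seconds_per_word=0.3):
    # Hypothetical stand-in for a trained prosodic model: penalize hypotheses whose
    # word count is a poor match for the observed duration of the speech region.
    return -abs(observed_duration - seconds_per_word * len(words))

# Two competing hypotheses with recognizer log scores (toy values).
nbest = [
    (["i", "think", "so"], -10.0),
    (["i", "think", "that", "so"], -10.2),
]
ranked = rescore_nbest(nbest, toy_prosody_logscore, weight=1.0)
print(ranked[0])
```

With a weight of zero the recognizer's own ranking is kept; with a positive weight the prosodic evidence can promote a hypothesis that the recognizer scored slightly lower.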

Current status

We have completed first versions of all the necessary components of the detection system. We plan to run our first integrated experiments and report first results for HWE detection at the upcoming STIMULATE grantees workshop.

Publications and Manuscripts in Year 1

During Year 1 of the project, we coauthored four papers (including one journal publication), as well as a labeling document for dialog acts. We are currently working on additional manuscripts. The papers are listed and attached at the end of this report.

Plans for Year 2

With a baseline HWE system in place, Year 2 of the project will focus on experimental work addressing a range of empirical questions.

The current baseline system detects utterance boundaries and disfluencies, the two event types we were most familiar with from previous work. The scope of the system will be extended to discourse markers, the third major class of events addressed in the project.

We will also explore alternative statistical modeling approaches. In place of the currently used decision trees, one could use neural networks, Bayes networks, or others. Also, different types of classifiers could be combined in parallel or cascades. We recently interviewed a graduate student from Stanford University's Department of Statistics who is interested in helping us pursue this line of research as a summer intern at SRI.

We also plan to combine some of the discourse modeling results with HWE detection. For example, the event/word language model could be conditioned on the type of dialog act. This is intuitively promising since some dialog act types are more likely to have certain HWEs than others (e.g., statements are more prone to disfluencies than acknowledgements).

Statement of Funds to Remain Unobligated

[Omitted]

Proposed Budget

[Omitted]

Current and Pending Support

[Omitted]

Contribution to Education and HR Development

Not applicable.

Animal/Biohazard/Human Subjects Certification

Not applicable.

References

[1] D. Jurafsky, R. Bates, N. Coccaro, R. Martin, M. Meteer, K. Ries, E. Shriberg, A. Stolcke, P. Taylor, and C. Van Ess-Dykema. Switchboard discourse language modeling project final report. Research Note 30, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, January 1998.

[2] E. Shriberg, R. Bates, and A. Stolcke. A prosody-only decision-tree model for disfluency detection. In Proceedings 5th European Conference on Speech Communication and Technology, volume 5, pages 2383-2386, Rhodes, Greece, September 1997.

[3] E. Shriberg, R. Bates, P. Taylor, A. Stolcke, D. Jurafsky, K. Ries, N. Coccaro, R. Martin, M. Meteer, and C. Van Ess-Dykema. Can prosody aid the automatic classification of dialog acts in conversational speech? Language and Speech, 1998. To appear.

[4] E. E. Shriberg, R. A. Bates, and A. Stolcke. Integrated acoustic and language modeling of speech disfluencies. Journal of the Acoustical Society of America, 100(4):2848, October 1996. Talk abstract from the Third ASA/ASJ Joint Meeting, Honolulu, Hawaii.

[5] A. Stolcke and E. Shriberg. Automatic linguistic segmentation of conversational speech. In Proceedings of the International Conference on Spoken Language Processing, volume 2, pages 1005-1008, Philadelphia, October 1996.

[6] A. Stolcke and E. Shriberg. Statistical language modeling for speech disfluencies. In Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, volume 1, pages 405-408, Atlanta, May 1996.

Attached Publications and Manuscripts

Elizabeth Shriberg
Fri Apr 3 17:40:19 PST 1998