Next: Acknowledgments Up: Statistical Language Modeling for Previous: Results from related work

Discussion and Conclusions

 

The preceding analysis shows that a disfluency model based on the intuition underlying the Cleanup Model yields only very small improvements in model perplexity, even though the cleanup assumption appears to hold for the Switchboard data used in our experiments. Our local perplexity analysis shows that the word positions at and immediately following DF events can be predicted with lower perplexity, in some cases significantly lower, but the effect on overall perplexity is very small because DF events are infrequent.
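The dilution effect can be illustrated with a small calculation. The sketch below assumes that overall perplexity is the exponential of the token-weighted average log perplexity over groups of word positions; the function name and all numbers are hypothetical and are not taken from our experiments.

```python
import math

def overall_perplexity(groups):
    """Combine local perplexities into one overall perplexity.

    groups: list of (fraction_of_tokens, local_perplexity) pairs.
    The overall value is the exponential of the token-weighted
    mean of the local log perplexities.
    """
    total = sum(frac for frac, _ in groups)
    log_pp = sum(frac * math.log(pp) for frac, pp in groups) / total
    return math.exp(log_pp)

# Hypothetical illustration: 2% of word positions are affected by DF events.
affected = 0.02
# Baseline: affected positions are twice as hard to predict as the rest.
baseline = overall_perplexity([(affected, 200.0), (1 - affected, 100.0)])
# Improved model: local perplexity at the affected positions is halved.
improved = overall_perplexity([(affected, 100.0), (1 - affected, 100.0)])
relative_gain = 1 - improved / baseline

print(f"baseline={baseline:.2f} improved={improved:.2f} gain={relative_gain:.2%}")
```

Under these assumed numbers, halving the local perplexity at the affected positions lowers overall perplexity by less than 2%.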

An interesting (and prima facie unexpected) result is that the Cleanup Model does not lower perplexity for filled pauses in acoustically segmented utterances. We attribute this to the particular way in which the cleanup assumption is violated by filled pauses at linguistic segment boundaries internal to an acoustic segment. Segmentation is correlated with other types of DFs as well, but the effect on the LM should be smaller in those cases, because the bigram contexts for following words are not changed as radically by different segmentations. Our findings highlight the need for more careful modeling (possibly with automatic recovery) of linguistic structures in conversational speech, a topic we plan to address in future work.

However, even for repetitions and deletions, it does not follow that recognition accuracy would necessarily improve with better local perplexities. In fact, we tested a trigram DF model (modeling only REP and DEL events) against a standard trigram on a Switchboard test set of 1192 segments, and found virtually no difference in overall word error rate (49.5% in both cases). This can be attributed to a number of factors. First, the REP/DEL model affects only a small portion of the total corpus (less than two cases per 100 words). Second, its advantage in modeling REP/DEL contexts should rarely come into effect due to the high error rate on adjacent words.

There are other reasons why lower perplexity may not lead to reduced word error rate. For instance, it could be that DFs tend to involve words of high frequency for which good acoustic models exist, so that a slightly improved LM would not affect recognition accuracy.

The overall conclusion is that, contrary to high hopes in parts of the LM community, DF modeling at the LM level should not be expected to yield a significant improvement in word recognition performance. The main reason is that DFs are inherently local phenomena that are modeled surprisingly well by standard N-grams, even without context "cleanup."

On the positive side, our results confirm that DFs have a systematic, nonrandom distribution that can be partly captured even with simple N-gram-like models; it is therefore conceivable that more sophisticated approaches could benefit recognition accuracy.

One potential source of improved DF modeling is the correlation of DF behavior with speaker identity. For example, [9] found that speakers can be grouped into those preferring deletions over repetitions ("deleters") and those with the opposite tendency ("repeaters"). Such cross-utterance effects could be modeled in the LM using standard techniques, e.g., adaptive interpolation of specialized models.
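As a rough illustration of adaptive interpolation, the sketch below re-estimates the mixture weights of two specialized models with EM on the DF events a speaker has produced so far. The "repeater" and "deleter" component probabilities, the event history, and the function name are all hypothetical; this is one standard way to adapt interpolation weights, not a description of an implemented system.

```python
def em_mixture_weights(component_probs, weights, iterations=20):
    """Re-estimate interpolation weights by EM.

    component_probs: one entry per observed event, each a list holding the
        probability that every component model assigns to that event.
    weights: initial interpolation weights (positive, summing to 1).
    """
    for _ in range(iterations):
        counts = [0.0] * len(weights)
        for probs in component_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            # E-step: posterior responsibility of each component for this event.
            for k, (w, p) in enumerate(zip(weights, probs)):
                counts[k] += w * p / mix
        # M-step: renormalize the expected counts into new weights.
        total = sum(counts)
        weights = [c / total for c in counts]
    return weights

# Hypothetical specialized models: P(event | model) for REP and DEL events.
REPEATER = {"REP": 0.8, "DEL": 0.2}
DELETER = {"REP": 0.2, "DEL": 0.8}

# Hypothetical DF events observed so far from one speaker.
history = ["REP", "REP", "DEL", "REP", "REP"]
probs = [[REPEATER[e], DELETER[e]] for e in history]
adapted = em_mixture_weights(probs, [0.5, 0.5])

print(adapted)  # the weight of the repeater model should dominate
```

In a running system, the weights would be updated as each utterance of the speaker is processed, so that predictions for later utterances lean toward the better-matching specialized model.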

Finally, we note that the language modeling techniques described could also be used for automatic disfluency tagging and removal. Given a sequence of words and a probabilistic DF model of the type used here, one can use a Viterbi-style backtrace to recover the most likely sequence of DF events underlying the word sequence. This is another application we plan to study in the future.
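The tagging idea can be sketched as a Viterbi search over one hidden tag per word. The version below is a deliberately simplified stand-in for the DF model: it uses only three tags (NONE, FP, REP), and the transition and emission scores are hand-picked toy values rather than probabilities estimated from data. All names and numbers are assumptions made for illustration.

```python
import math

STATES = ("NONE", "FP", "REP")
FILLED_PAUSES = {"uh", "um"}

def log_emit(state, word, prev_word):
    """Toy emission scores (assumed values, not trained probabilities)."""
    if state == "FP":
        return math.log(0.9 if word in FILLED_PAUSES else 1e-6)
    if state == "REP":
        return math.log(0.9 if word == prev_word else 1e-6)
    # NONE: filled pauses and immediate repeats are unlikely to be fluent words.
    unlikely = word in FILLED_PAUSES or word == prev_word
    return math.log(0.05 if unlikely else 0.3)

def log_trans(prev_state, state):
    """Toy transition scores: a DF is slightly more likely right after a DF."""
    after_df = prev_state in ("FP", "REP")
    if state == "NONE":
        return math.log(0.6 if after_df else 0.8)
    return math.log(0.2 if after_df else 0.1)

def viterbi_tags(words):
    """Return the highest-scoring tag sequence for the given words."""
    if not words:
        return []
    scores = {s: log_trans(None, s) + log_emit(s, words[0], None) for s in STATES}
    backptrs = []
    for t in range(1, len(words)):
        new_scores, ptr = {}, {}
        for s in STATES:
            # Best predecessor for state s at position t.
            prev = max(STATES, key=lambda p: scores[p] + log_trans(p, s))
            new_scores[s] = (scores[prev] + log_trans(prev, s)
                             + log_emit(s, words[t], words[t - 1]))
            ptr[s] = prev
        scores = new_scores
        backptrs.append(ptr)
    # Backtrace from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    tags = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        tags.append(state)
    tags.reverse()
    return tags

words = "i i uh think so".split()
tags = viterbi_tags(words)
cleaned = [w for w, tag in zip(words, tags) if tag == "NONE"]
print(list(zip(words, tags)), cleaned)
```

Removing the words tagged FP or REP yields the cleaned-up word sequence. A full treatment would replace the toy scores with the N-gram DF model probabilities and would also hypothesize deletion events, which this sketch omits.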



Andreas Stolcke
Fri Jun 28 19:31:43 PDT 1996