
Overall results

We trained a trigram model for FP, REP, and DEL disfluencies as described above, using 1.4 million words of Switchboard data labeled for DF events (see note 2). The model was then evaluated on a test set of 17,500 words. Table 1 compares baseline trigram and DF models.

Table 1: Overall results

As can be seen, there is no significant difference in recognition word error rates. While this may be due to a number of factors (some of which we discuss in Section 4), we would have expected at least a reduction in perplexity for the DF model; this was not the case. We wanted to know whether this was because our underlying assumptions were wrong, or whether it was due to other factors, so we decided to analyze the DF model performance in detail.

We note with regard to these and later results that some types of disfluencies may contain word fragments (from speakers cutting themselves off in mid-word). According to [9], 20 to 25% of repetitions and deletions in Switchboard contain word fragments; however, filled pauses, as classified here, never involve word fragments. Fragments are usually not part of the vocabulary of current recognizers, and are not modeled in our system. They were therefore omitted from the transcripts used for our perplexity computations. We can expect an additional benefit from successful fragment recognition, since fragments would serve as extra evidence for repetitions and deletions, as well as for other DF events.
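As an illustration of the fragment handling described above, the following minimal sketch drops fragments from a transcript before computing perplexity. It is not the code used in our experiments. It assumes that fragments are marked with a trailing hyphen, and the names `strip_fragments`, `perplexity`, and the `logprob` callback are hypothetical; the uniform model stands in for a real trigram or DF model.

```python
import math

def strip_fragments(tokens):
    """Drop word fragments (assumption: marked with a trailing hyphen)."""
    return [t for t in tokens if not t.endswith("-")]

def perplexity(tokens, logprob):
    """Perplexity of a token sequence under a language model.

    logprob(history, word) is a hypothetical callback returning the
    natural-log probability of `word` given the preceding tokens.
    """
    if not tokens:
        raise ValueError("cannot compute perplexity of an empty sequence")
    total = 0.0
    for i, word in enumerate(tokens):
        total += logprob(tokens[:i], word)
    # Perplexity is the exponential of the negative mean log probability.
    return math.exp(-total / len(tokens))

# Toy stand-in model: uniform over an assumed 1000-word vocabulary.
def uniform(history, word):
    return math.log(1.0 / 1000)

transcript = "i wa- i was going to th- the store".split()
clean = strip_fragments(transcript)
ppl = perplexity(clean, uniform)
```

Because the fragments are removed before scoring, they contribute neither to the word count nor to the histories of the following words, which matches the treatment of the transcripts described above.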



Andreas Stolcke
Fri Jun 28 19:31:43 PDT 1996