Speech disfluencies (DFs) are prevalent in spontaneous speech and are among the characteristics that distinguish it from planned or read speech. DFs are one of many potential factors contributing to the relatively poor performance of state-of-the-art recognizers on this type of speech, such as that found in the Switchboard corpus [2].
Past work on disfluent speech has focused on disfluency detection, using either acoustic features [6, 7] or recognized word sequences [1, 3]. Our goal in this work is to develop a statistical language model (LM) that can be used for speech decoding or rescoring, and that improves upon standard LMs by explicitly modeling the most frequent DF types. The main reason to expect DF modeling to improve the LM is that standard N-gram models predict each word from its local context, and intervening DFs make those contexts less uniform. Other researchers have recently started exploring approaches to DF modeling based on similar assumptions [4, 8].
Section 2 describes a simple N-gram-style DF model, based on the intuition that DF events need to be predicted and then edited from the context to improve the prediction of the words that follow. Section 3 compares the DF model with a baseline LM in terms of both perplexity and word error rate on Switchboard data. The emphasis is on a detailed analysis of the model at DF positions and at the word positions that follow them. Section 4 provides a general discussion of the results.
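The intuition of editing DF events from the context can be illustrated with a small sketch. This is not the model of Section 2, only a toy illustration under stated assumptions: the DF inventory (`FILLED_PAUSES`) and the helper names `edit_context` and `ngram_context` are hypothetical, and only filled-pause-like tokens are handled. It shows how the history used by a trigram model changes once a DF token is removed.

```python
# Hypothetical DF tokens, chosen for illustration only.
FILLED_PAUSES = {"uh", "um"}


def edit_context(history, df_tokens=FILLED_PAUSES):
    """Return the word history with DF tokens removed (the 'edited' context)."""
    return [w for w in history if w not in df_tokens]


def ngram_context(history, order=3, edit_dfs=False):
    """Return the (order - 1) most recent words used to predict the next word."""
    words = edit_context(history) if edit_dfs else list(history)
    n = order - 1
    # Guard n == 0: words[-0:] would return the whole list, not an empty one.
    return tuple(words[-n:]) if n > 0 else ()


history = ["i", "want", "to", "uh"]
print(ngram_context(history))                 # ('to', 'uh')
print(ngram_context(history, edit_dfs=True))  # ('want', 'to')
```

Without editing, the next word is predicted from a context ending in the DF token; with editing, the context is the fluent word sequence that an N-gram model is more likely to have seen in training.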