Modeling Disfluencies in Spontaneous Speech

Project Summary

Spoken language is the medium humans use first and foremost for accurate and efficient interactive problem solving. As an input modality for human-computer interaction, spoken language offers: (1) accessibility to a growing number of people, including those with little or no training; (2) increased access to a growing set of data resources by telephone, without a computer terminal; (3) increased power for those already familiar with computer technology; (4) an additional communication channel for more robust communication, for use in unusual environments, and for assistive devices for the disabled; (5) greater flexibility in how humans use computers generally; and (6) new applications and job opportunities in areas that will grow out of broader public exposure to the technology's potential.

Although there has been significant work on some spontaneous speech phenomena, such as "slips of the tongue," other much more frequent types of spontaneous speech "disfluencies" have been largely ignored, e.g., false starts, hesitations, filled pauses, and related phenomena. Such disfluencies are highly prevalent in normal human communication. Although disfluencies are less frequent in human-machine dialogue, the causes and costs (e.g., in terms of cognitive load on the user) of this discrepancy are unknown. Further, because current speech understanding systems do not model disfluencies well, disfluencies, when they do occur, are correlated with speech recognition and understanding errors. As spoken language systems evolve to allow more natural human-machine dialogue, the rate of disfluencies is likely to rise toward the rates observed in human-human communication. A better understanding of the interdisciplinary aspects of disfluencies is therefore critical to developing a principled treatment of these highly frequent attributes of spontaneous speech.

This project models disfluencies at the lexical, syntactic, and acoustic-prosodic levels. The goal is to gain insight into human communication and to develop algorithms that robustly recognize speech containing disfluencies. The approach involves analysis of disfluencies in existing digitized corpora and in speech collected in controlled experiments. The investigation is undertaken by a team with expertise in different, complementary disciplines, including linguistics, psycholinguistics, and cognitive psychology. As the project enters its final phase, recent efforts at SRI have investigated how results of the descriptive research can be integrated into SRI's speech understanding system. In particular, SRI has developed methods for automatically detecting disfluencies, using acoustic-prosodic information combined with specialized language models. Related studies at Stanford have focused on syntactic properties of disfluencies and on their functional aspects. Additional related work at MIT aims to understand the articulatory mechanisms involved in self-interruption, as well as the relationship between speech errors and sentence prosody.
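To make the detection idea concrete, the following is a minimal illustrative sketch, not SRI's actual system: it combines a lexical cue (a small filled-pause lexicon) with a simple acoustic-prosodic cue (a long pause before a repeated word). The word list, pause threshold, and function name are all hypothetical choices for illustration; a real system would use trained language models and richer prosodic features.

```python
# Illustrative sketch only (not the project's actual method):
# flag candidate disfluencies from time-aligned word transcripts.

FILLED_PAUSES = {"uh", "um", "er"}   # hypothetical filled-pause lexicon
PAUSE_THRESHOLD = 0.35               # seconds; hypothetical tuning value


def flag_disfluencies(words, start_times, end_times):
    """Return indices of words flagged as likely disfluent.

    words       -- token strings, in spoken order
    start_times -- start time (s) of each word
    end_times   -- end time (s) of each word
    """
    flagged = []
    for i, word in enumerate(words):
        # Lexical cue: the word is a known filled pause.
        if word.lower() in FILLED_PAUSES:
            flagged.append(i)
            continue
        # Acoustic-prosodic cue: a long silent pause before a
        # word that repeats the previous word (a simple restart cue).
        pause = start_times[i] - end_times[i - 1] if i > 0 else 0.0
        is_repetition = i > 0 and word.lower() == words[i - 1].lower()
        if is_repetition and pause > PAUSE_THRESHOLD:
            flagged.append(i)
    return flagged
```

For example, on the utterance "i uh i i want", the sketch flags the filled pause "uh" and the pause-preceded repetition of "i", leaving the fluent words untouched.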