SRI Speech Technology and Research Laboratory

Seminar Series

2002 Seminars


Speaker: Marco Orlandi, SRI
Time: Tue Oct 15, 2002. 3:00 PM. (Note unusual time)
Venue: EJ 124
Title: Dialogue Error Correction in Automatic Speech Recognition

Abstract:

Current recognition systems usually show rather static and rigid behavior regarding recognition errors. Human-machine interaction often leads repeatedly to the same recognition error. This talk focuses on methods to adapt the recognizer using the error feedback.

We collected data in the spoken address recognition domain and studied how users try to correct recognition errors during interaction with the machine. We developed recognition grammars to match their error-correcting behavior and propose a more natural way to interact in the presence of errors.

Furthermore, when an error happened, we attempted to use information from the error to remove the wrong hypothesis and all its semantic equivalents from the list of allowable hypotheses. We then incorporated these constraints in a new recognition grammar applied in a rescoring phase.
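As a rough illustration of the pruning idea (not the speakers' implementation), the sketch below drops a rejected hypothesis and everything that means the same thing from an N-best list, then picks the best survivor. The talk describes constraints built into a recognition grammar for rescoring; the N-best list, the function names, and the toy address normalizer used here are all assumptions made for the example.

```python
def normalize_address(text):
    # Toy canonical form (an assumption): lowercase and expand abbreviations,
    # so that "12 Main St" and "12 Main Street" count as semantic equivalents.
    abbreviations = {"st": "street", "ave": "avenue", "rd": "road"}
    words = text.lower().replace(".", "").split()
    return " ".join(abbreviations.get(word, word) for word in words)


def prune_rejected(nbest, rejected, interpret):
    """Keep only hypotheses whose interpretation has not been rejected.

    nbest:     list of (text, score) pairs, higher score is better
    rejected:  set of interpretations the user has already refused
    interpret: function mapping hypothesis text to a canonical meaning
    """
    return [(text, score) for text, score in nbest
            if interpret(text) not in rejected]


# Hypothetical recognizer output for one dialog turn.
nbest = [
    ("12 Main St", -10.0),
    ("12 Main Street", -10.5),
    ("12 Maine Street", -11.0),
    ("20 Main Street", -12.5),
]

# The user said the top hypothesis was wrong, so its meaning is excluded.
rejected = {normalize_address("12 Main St")}
remaining = prune_rejected(nbest, rejected, normalize_address)
best = max(remaining, key=lambda pair: pair[1])
print(best[0])
```

Without the equivalence step, the second hypothesis would be offered next even though it repeats the error the user has just rejected.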

Preliminary experimental results using these methods show a significant reduction in the number of dialog turns needed to reach a given error rate.

This is joint work with Christopher Culy and Horacio Franco.


Speaker: Arun C. Surendran, Bell Labs
Time: Wed Oct 9, 2002. 11:00 AM. (Note unusual time)
Venue: EJ 124
Title: Robust and Efficient Adaptation of Hidden Markov Models: A Bayesian Approach

Abstract:

A common problem in machine learning is one of mismatch between training and testing data. Specifically in speech recognition, statistical models that are learned in one environment (e.g. a quiet office room or a speaker independent model) need to be adapted to new environments (e.g. noisy airport, new speaker) to improve performance. Adaptation techniques normally derive point estimates of unknown parameters that are supposed to bridge this gap in learning. These approaches face shortcomings in terms of accuracy and flexibility of the model, accuracy in estimating the parameters, and efficiency of data usage. Some of these problems, like estimation errors, are prominent when the data available is limited, while others, like modeling inflexibility, manifest when the amount of data available is large.

In this talk, I will present a unified framework called Structural Bayesian Prediction (SBP) to address these problems. Instead of estimating parameters and blindly treating them as true values, the technique explicitly takes into account uncertainties associated with them to create predictive densities in the decision rule. Further, by using hierarchical techniques to estimate the prior densities, it makes efficient use of data. Also, by using a hierarchical Bayesian approach to smartly exploit relationships between different spectral features, SBP improves the quality of learning for those features that are noisier and hence more difficult to learn. The approaches developed in this talk are applicable to many other techniques and problems.
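The difference between plugging in a point estimate and using a predictive density can be seen in a textbook conjugate case. The sketch below is not SBP itself: it assumes a single Gaussian with known variance and a Gaussian prior on its mean, and all function and variable names are chosen for the example. The plug-in density treats the estimated mean as exact, while the predictive density adds the remaining uncertainty about the mean to the variance.

```python
import math


def posterior_of_mean(data, sigma2, mu0, tau2):
    """Posterior mean and variance of a Gaussian mean.

    Assumes observations ~ N(mean, sigma2) with sigma2 known,
    and a prior mean ~ N(mu0, tau2).
    """
    precision = 1.0 / tau2 + len(data) / sigma2
    post_var = 1.0 / precision
    post_mean = post_var * (mu0 / tau2 + sum(data) / sigma2)
    return post_mean, post_var


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def plug_in_density(x, data, sigma2, mu0, tau2):
    # Treats the point estimate of the mean as if it were the true value.
    post_mean, _ = posterior_of_mean(data, sigma2, mu0, tau2)
    return gaussian_pdf(x, post_mean, sigma2)


def predictive_density(x, data, sigma2, mu0, tau2):
    # Integrates over the uncertainty in the mean: the variances add.
    post_mean, post_var = posterior_of_mean(data, sigma2, mu0, tau2)
    return gaussian_pdf(x, post_mean, sigma2 + post_var)


# One adaptation sample only, so the mean is still quite uncertain.
data, sigma2, mu0, tau2 = [2.0], 1.0, 0.0, 1.0
post_mean, post_var = posterior_of_mean(data, sigma2, mu0, tau2)
tail_plug_in = plug_in_density(4.0, data, sigma2, mu0, tau2)
tail_predictive = predictive_density(4.0, data, sigma2, mu0, tau2)
print(post_mean, post_var, tail_plug_in, tail_predictive)
```

With little data the predictive density is broader and assigns more probability to observations far from the estimated mean; as more data arrives the posterior variance shrinks and the two densities converge.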

I will present experiments to effectively demonstrate the strength of SBP compared to earlier techniques.


Speaker: Barbara Peskin, ICSI
Time: Wed Oct 2, 2002. 12:00 noon. (Note unusual time)
Venue: EJ 124
Title: SuperSID: Exploiting High-Level Information for High-Performance Speaker Recognition

Abstract:

Most current state-of-the-art Speaker Recognition systems rely on short-term cepstral features for their analysis, ignoring higher-level features such as prosody, pronunciations, word usage, and conversational style. While there have been many attempts to incorporate prosodic features, esp. pitch, into speaker models, results have generally been disappointing, with relatively poor performance when such features are used in isolation and only modest improvements when used in conjunction with traditional cepstral-based systems.

At this summer's JHU Workshop, WS2002, the SuperSID (SID = Speaker ID) team launched a concerted effort to establish the value of higher-level information for the Speaker Recognition task. We were the beneficiaries of substantial preparatory work (from SRI, and others) developing databases of information at a number of levels -- prosodic, phonetic, lexical, ... -- using as a testbed NIST's Extended Data SID task, based on conversations from the Switchboard corpus. We explored a wide variety of knowledge sources and a number of different modelling and classification schemes. By summer's end, we had created a collection of Speaker Recognition modules which combined to match the performance of the cepstral-feature system, and in addition yielded significant gains when merged with the cepstral scores.


Speaker: Qifeng Zhu, UCLA
Time: Wed Sep 25, 2002. 11:00 AM. (Note unusual time)
Venue: EJ 124
Title: Noise robust front-end processing and feature extraction for speech recognition

Abstract:

The performance of current automatic speech recognition (ASR) systems degrades greatly under noise. This talk focuses on the front-end approach to improving the noise robustness of ASR systems. Several novel algorithms are developed for feature extraction. An analysis-based non-linear feature extraction approach is proposed inspired by a quantitative model of how speech amplitude spectra are affected by additive noise. Acoustic features are extracted based on the noise-robust parts of speech spectra without losing discriminative information. Two nonlinear processing algorithms, harmonic demodulation and spectral peak-to-valley ratio locking, are designed to minimize mismatch between clean and noisy speech features. A previously studied method, peak isolation, is also discussed with this model. These algorithms do not require noise estimation and are effective in dealing with both stationary and non-stationary noise backgrounds. A noise removal algorithm derived directly from the additive noise model is also tested and compared with the other new algorithms in this dissertation and with the linear and nonlinear spectral subtraction methods. Another algorithm is variable frame rate analysis, which is inspired by human speech perception. It uses a high frame rate for rapidly-changing segments of high energy and a low frame rate for relatively steady segments. The proposed front-end processing algorithms are tested in Hidden Markov Model (HMM) based speech recognition experiments with the TI46 database and the Aurora 2 database. Significant improvement is observed by using these algorithms. For the Aurora 2 connected digit-string database, the average recognition rate across different noise types, including non-stationary noise background, and SNRs is improved from 58% to 83%. The limitations and the further improvements and extensions on these algorithms will also be discussed.

Speaker: Robert Eklund, Telia Research AB
Time: Wed Aug 14, 2002. 12:00 noon. (Note unusual time)
Venue: EJ 124
Title: Foreign Speech Sounds in Swedish and implications for ASR and synthesis

Abstract:

In recent years, both automatic speech recognition (ASR) and text-to-speech (TTS) conversion systems have attained quality levels that allow inclusion in everyday applications. One remaining problem to be solved in both these types of applications is that alleged phone inventories of specific languages are commonly expanded with phones from other languages, a problem that becomes all the more acute in an increasingly internationalized world where multilingual automatic speech-based services and devices are beginning to reach the market. This talk investigates the nature of phone set expansion in Swedish, recapping studies that have been carried out since 1996 at Telia Research, Farsta, Sweden. The status of these phones is discussed, and since such added phones do not have a phonemic (or allophonic) function, the term 'xenophones' was suggested (1998). The analysis is based on a production study involving 491 subjects, and the observed xenophonic expansion is described in terms of three categories along the "awareness" and the "fidelity" dimensions. The results show that very few subjects resort to full rephonematization and that xenophonic expansion is the rule, although there is an uneven distribution depending on particular phones, spanning from phones produced by most subjects, to phones produced by almost no subjects. Of the possible explanatory factors analyzed -- regional background, gender, age and educational level -- the last is by far the most important.

Speaker: Erik McDermott, Nippon Telegraph and Telephone.
Time: Thu Jul 25, 2002. 3:00 AM. (Note unusual time)
Venue: EJ 124
Title: Recent Speech Research Activities at NTT.
Comments: Informal talk.

Abstract:

I'll talk a little bit about what my colleagues and I are doing: Finite State Transducer based decoding for large vocabulary tasks (Willett), Trajectory modeling using HMMs in speech generation mode (Minami), Blind Source Separation using Independent Component Analysis (Makino). I'll then talk in more detail about my own recent work: recent experimental results for discriminative training based on Minimum Classification Error (MCE), as well as recent theoretical work in which I show that the MCE loss function can be seen as the result of Parzen estimation of the theoretical Bayes classification risk. The latter work will be presented at the IEEE workshop on Neural Nets for Signal Processing and at ICSLP.

Speaker: Sachin Kajarekar, Oregon Graduate Institute.
Time: Tue Jul 9, 2002. 11:00 AM.
Venue: EJ 124
Title: Analysis of variability in speech with applications to speaker recognition
Comments: Interview Talk.

Abstract:

The speech signal has variability due to language, speakers, and communication channels. In this work, I propose a study of the nature and the contribution of different types of variability in speech, where variability in speech refers to the variability in the features extracted from the speech signal. The analysis is performed using multivariate analysis of variance (MANOVA). Variability is measured as the covariance of the features due to three factors - language, speaker, and channel. Variability due to language is measured as variability due to different phones in the language. Variability due to speaker and channel is referred to as speaker and channel variability. Using these factors, we analyze the most commonly used features in speech and speaker recognition. The features are analyzed in spectral and temporal domains. The results of MANOVA are applied in two ways. First, we show that the contribution of the variabilities in features is related to their performance on speech and speaker recognition tasks. Second, we show that the results of MANOVA can be used for deriving discriminant features for a given task. In general, we show that the performance of the speech recognition system improves when information from a longer time-span is included in the features.

Speaker: Peter Beyerlein, Philips Research Labs.
Time: Wed Jul 3, 2002. 3:00 PM (Note unusual time).
Venue: EJ 124
Title: Towards Large Margin Speech Recognizers by Boosting and Discriminative Training

Abstract:

We show that boosting improves the best error rates obtained with maximum likelihood and discriminative training in state-of-the-art automatic speech recognition (ASR). Specifically we discuss the application of AdaBoost.M2 on the level of spoken utterances ("utterance level approach") and its combination with discriminative training. Analyzing our experiments, we show that both boosting and discriminative training increase the "classification margins" of patterns classified with low margin. In particular, this effect turns out to be additive when combining boosting and discriminative training. Boosting is a general method for improving the accuracy of almost any learning algorithm (Schapire, 1990). The technique works by sequentially training and combining a collection of classifiers in such a way that the later classifiers focus more and more on hard-to-classify examples. To this end, in the well-known AdaBoost algorithm (Freund & Schapire, 1997) a probability distribution is introduced and maintained on the input space. Initially, every training example gets the same weight. In the following iterations, the weights of hard-to-classify examples are increased relative to the easy ones, according to the classification result of the current classifier on the training set. Using the calculated training weights, a new classifier is trained. This process is repeated, thus producing a set of individual classifiers. As in discriminative model combination (Beyerlein, 1998), which combines arbitrary models/classifiers into one log-linear distribution, the output of the final classifier is determined from a log-linear combination of the individual classifier probabilities, i.e. a linear combination of scores (log-probabilities). However, the weights of the combination are determined from the boosting theory, whereas the weights in discriminative model combination are trained discriminatively.
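The reweighting loop described above can be sketched in a few lines. This is a minimal sketch of plain two-class AdaBoost with one-dimensional threshold classifiers ("stumps"), not the utterance-level AdaBoost.M2 setup of the talk; the final classifier here is a weighted vote over +1/-1 decisions rather than a log-linear combination of probabilities, and the toy data and all names are assumptions made for the example.

```python
import math


def stump_predict(x, threshold, polarity):
    # Predicts `polarity` at or above the threshold and the opposite below it.
    return polarity if x >= threshold else -polarity


def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n          # every example starts with the same weight
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples gain weight, correctly classified ones lose it.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy set a single stump misclassifies two of the six points, while the combination of three stumps trained on the reweighted data classifies all six correctly.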

Speaker: Shawn Chang, ICSI, Berkeley.
Time: Tue Jun 25, 2002. 11:00 AM
Venue: EJ 124
Title: A Hierarchical, Syllable-Based Model of Speech Recognition.

Abstract:

Current-generation speech recognition systems assume that words are readily decomposable into constituent phonetic components ("phonemes"). A detailed evaluation of state-of-the-art speech recognition systems indicates that the conventional phonemic "beads-on-a-string" approach is of limited utility, particularly with respect to informal, conversational material. In this study an alternative approach is adopted, one in which the syllable assumes pre-eminent status, and is melded to both the lower and higher tiers of linguistic organization through incorporation of prosodic information (such as stress accent). Such an approach can provide an accurate and parsimonious model of pronunciation variation in spontaneous discourse, as well as a computational mechanism with which to relate the fine phonetic structure observed at the articulatory-acoustic feature level with higher-level phonetic and lexical structure. A computational model is proposed that explicitly relates articulatory-acoustic features, syllable structure and stress accent to lexical representation and utilizes this framework for automatic speech recognition. The advantages of this syllable-oriented approach are illustrated for a limited-vocabulary task using both automatically computed and fabricated data.

Speaker: C. Chandra Sekhar, Center for Integrated Acoustic Information Research, Nagoya University, Nagoya, Japan.
Time: Mon May 20, 2002. 11:00 AM. (Note unusual time).
Venue: EJ 124
Title: Recognition of Subword Units of Speech using Support Vector Machines

Abstract:

Learning in support vector machines (SVMs) is based on the structural risk minimization principle of statistical learning theory. It has been demonstrated that SVMs give a good generalization performance for complex pattern recognition tasks. We address the issues in recognition of the large number of subword units of speech using SVMs. In conventional approaches for multi-class pattern recognition using SVMs, learning involves discrimination of each class against all the other classes. We propose a close-class-set discrimination method suitable for large-class-set pattern recognition problems. In the proposed method, learning involves discrimination of each class against a subset of classes confusable with it and included in its close-class-set. We consider different criteria such as the description of classes, similarity measure between example patterns of classes, and the margin of pairwise classification SVMs, for identification of close-class-sets.
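To make the close-class-set idea concrete, the sketch below builds such sets from class centroids and counts how many pairwise discrimination problems remain. It is only an illustration under stated assumptions: Euclidean distance between centroids stands in for the similarity criteria listed above, the set size is fixed, and the class labels, centroid values, and function names are invented for the example. No SVM is trained here.

```python
import math


def close_class_sets(centroids, size):
    """Map each class to the `size` classes whose centroids lie nearest to it."""
    sets = {}
    for label, center in centroids.items():
        others = sorted((other for other in centroids if other != label),
                        key=lambda other: math.dist(center, centroids[other]))
        sets[label] = set(others[:size])
    return sets


def pairwise_problems(sets):
    """Unordered class pairs that still need a pairwise classifier."""
    return {frozenset((label, member))
            for label, members in sets.items() for member in members}


# Hypothetical 2-D centroids for five consonant-vowel classes.
centroids = {
    "ba": (0.0, 0.0),
    "pa": (1.0, 0.0),
    "da": (5.0, 0.0),
    "ta": (6.5, 0.0),
    "ga": (12.0, 0.0),
}

sets = close_class_sets(centroids, size=1)
reduced = pairwise_problems(sets)
all_pairs = len(centroids) * (len(centroids) - 1) // 2
print(sets)
print(len(reduced), "of", all_pairs, "pairwise problems kept")
```

In this toy case the sets are not symmetric: "ta" is in the close-class-set of "ga", but "ga" is not in that of "ta".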

We study the effectiveness of the proposed method in reducing the complexity of multi-class pattern recognition systems based on the one-against-the-rest and one-against-one approaches. We discuss the effects of symmetry and uniformity in size of the close-class-sets on the performance for these approaches. We present our studies on recognition of isolated utterances of 80 highly confusable Stop_Consonant-Vowel units, and on recognition of continuous speech segments of 86 frequently occurring Consonant-Vowel units in a continuous speech database of broadcast news.


Speaker: Wen Wang, School of Electrical and Computer Engineering, Purdue University.
Time: Fri May 3, 2002. 11:00 AM. (Note unusual time).
Venue: EJ 124
Title: Investigating the Effectiveness of Tightly Integrating Multiple Knowledge Sources in Language Modeling

Abstract:

We believe that significant improvement in speech recognition accuracy can be obtained by gaining as much knowledge as possible about the kinds of utterances that can occur in a domain. We have developed a constraint dependency grammar almost-parsing based language model incorporating multiple knowledge sources. Word, lexical category, lexical feature, and syntactic constraints are tightly integrated into a uniform structure that is associated with a word in the lexicon, which was developed based upon the concept of Constraint Dependency Grammars. This model achieves a perplexity reduction and word accuracy improvement compared to a trigram, part-of-speech, and several prominent parser-based language models. We investigated the relative contributions of the various knowledge sources to the strength of our model by using constraint relaxation at the level of the knowledge source. We have found that although each knowledge source contributes to the language model quality, lexical features are an outstanding contributor when they are tightly integrated with word and syntactic constraints. Finally, current work on developing a full CDG parser-based language model and the proposal of using these probabilistic CDG language models in spontaneous speech will be briefly discussed.

About the Speaker:
Wen Wang is presently finishing her doctorate at Purdue University's School of Electrical and Computer Engineering (ECE). Her research interests have focused on Language Modeling, Parsing and Grammar Induction, and Constraint Satisfaction; in particular, using constraint dependency grammar as a framework to combine multiple knowledge sources in language modeling. Wen received her B.S. and M.S. from Shanghai Jiao Tong University (P.R. China). Wen is a finalist for the 2001 IBM Ph.D. Research Fellowship and was the only nominee from ECE at Purdue University. She studied under the Shanghai Jiao Tong University Fellowship and several other fellowships in 1995-1998. She has also won the Championship in the China National College-Student Electronic Instrument Design Contest. Wen is a student member of IEEE and ACL.

Speaker: Steve Renals, International Computer Science Institute, Berkeley, California.
Time: Thu May 2, 2002. 11:00 AM. (Note unusual time).
Venue: EJ 124
Title: Lexical and Prosodic Feature Selection for Voicemail Summarization

Abstract:

This talk is about some work Costis Koumpis and I have been doing at Sheffield on the automatic summarization of voicemail messages. Our basic approach involves transcribing a voicemail message using a speech recognizer, and then choosing a subset of the transcribed words as the summary. Recognition of voicemail speech is not straightforward: we obtain an average word error rate of about 40%. To augment standard text processing techniques on the speech recognizer output, we have investigated prosodic features and recognition confidences. These features are used as the input to classifiers which are trained to select the "content words" in a transcription. Since there are very many potential features, we have looked at how to choose the important ones for the task. Our approach to this problem uses an analysis based on ROC curves, building on the intuition that different feature sets might be optimal at different points in precision-recall space.
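For readers unfamiliar with ROC analysis, the sketch below shows how ROC points are obtained from classifier scores by sweeping a decision threshold, and how the area under the curve summarizes them. The scores and labels are made-up values standing in for a "content word" classifier, the function names are assumptions, and distinct scores are assumed so that ties need no special handling.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) points.

    labels are 1 for content words and 0 for other words; one point is
    produced for each threshold placed just below a score.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for _, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    # Trapezoidal rule over consecutive ROC points.
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier scores for six transcribed words.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points)
print(auc)
```

Computing such curves for different feature sets makes it possible to see which set is preferable at a given operating point, rather than comparing a single summary number.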

Speaker: Wen Wang, School of Electrical and Computer Engineering, Purdue University, West Lafayette.
Time: Fri Mar 15, 2002. 3:30 PM (Note unusual time)
Venue: EJ 124
Title: Rescoring Effectiveness of Language Models Using Different Levels of Knowledge and Their Integration (approximate title).
Duration: 30 minutes. (approx).

Abstract:

A new almost-parsing language model incorporating multiple sources of knowledge is presented in this paper. Word, lexical features, and syntactic constraints are tightly integrated into a uniform structure called a SuperARV, which was developed based upon the concept of Constraint Dependency Grammars. The SuperARV language model reduces perplexity and word error rate compared to trigram, part-of-speech, and parser-based language models. The effect of building the language model employing joint versus conditional probabilistic estimations is also evaluated.

Last updated $Date: 2002/10/15 04:42:54 $ by anand@speech.sri.com