SRI Speech Technology and Research Laboratory

Seminar Series

2000 Seminar Talks

Hector Raul Javkin, Univ. of California at Santa Barbara.

Time: Tuesday, Feb 8th, 2000, 10:30 AM
Title: Text to Speech applied to Aids for Speech Therapy and Deaf Education
Abstract:
Although the output of text-to-speech synthesis would normally be synthesized speech, the elements of a text-to-speech system can also provide something else of use - a sequence of phonetic specifications to serve as a model in speech pathology and in speech training for the deaf. Providing models in this way can give deaf students independence in their speech training, because they can type any utterance they would like to practice and immediately view model parameters to imitate. In deaf education, some of the acoustic parameters produced by TTS (fundamental frequency, vowel formants, amplitude) can be used directly. Some articulatory parameters, such as nasalization, can be derived by a simple transformation of the TTS parameters which generate that feature. For example, in a TTS system using the Klatt synthesizer to turn parameters into sound, the difference between the nasal zero and nasal pole (which increases to produce varying amounts of nasalization) becomes the indicator for nasalization.
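The nasalization indicator described above can be illustrated with a minimal sketch. It assumes each synthesis frame is a dictionary holding the nasal zero and nasal pole frequencies in Hz under the keys `FNZ` and `FNP`; the key names, the frame layout and the sample values are assumptions made for illustration, not details taken from the talk.

```python
def nasalization_indicator(fnz_hz, fnp_hz):
    """Separation between nasal zero and nasal pole, in Hz.

    A value of 0 means the pole and zero coincide (no nasalization);
    larger values indicate more nasalization.
    """
    return abs(fnz_hz - fnp_hz)


def nasalization_track(frames):
    """Map a sequence of parameter frames to a nasalization contour."""
    return [nasalization_indicator(f["FNZ"], f["FNP"]) for f in frames]


# Hypothetical frames: an oral vowel followed by a nasalized one.
frames = [
    {"FNZ": 250.0, "FNP": 250.0},
    {"FNZ": 450.0, "FNP": 250.0},
]
print(nasalization_track(frames))
```

A display for the student could plot this contour against time next to the measured nasalization.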

This paper will report on a more difficult task - providing articulatory models for articulatory parameters which have a complex relationship to the synthesis parameters. In this case, TTS provides only phoneme labels and timing information. Deaf children receive feedback as to their tongue-palate contact from a dynamic palatograph. The system generates models of the movements of the tongue on the palate, using targets and rules for moving between targets. The optimal contact areas for different consonants vary markedly from speaker to speaker. The system was therefore designed to permit customization of targets for individual speakers, and the rules controlling the movement from one target to the next were made general. The improvement of these rules is an ongoing problem.

Anand Venkataraman, Computer Science Group, Institute of Information Sciences and Technology, Massey University, New Zealand

Time: Tuesday, Feb 15th, 2000, 10:30 AM
Title: Word discovery in fluent speech - A Statistical Approach
Abstract:
A statistical model for word discovery in fluent child-directed speech is presented. An incremental unsupervised learning algorithm to infer word boundaries based on this model is described. Although the algorithm is presented as an unsupervised learner, empirical results (and a possible demonstration of the software) are presented showing improvement in performance with training size that is consistent with predictions from learning theory.

A paper on the seminar topic is presently archived at:
http://xxx.lanl.gov/abs/cs/9910011
and it has references to related literature in the area.
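As a rough illustration of inferring word boundaries from a statistical model, the sketch below finds the most probable segmentation of an unsegmented string under a fixed unigram lexicon by dynamic programming. It is not the incremental algorithm of the talk: the lexicon, the per-character penalty for unseen chunks and the parameter names are all assumptions.

```python
import math


def segment(utterance, word_probs, unknown_prob=1e-6, max_len=10):
    """Return the most probable split of `utterance` into words.

    Known words are scored by `word_probs`; unseen chunks are
    penalized with `unknown_prob` per character (an assumption).
    """
    n = len(utterance)
    # best[i] = (log-probability of best split of utterance[:i], start of last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            chunk = utterance[start:end]
            p = word_probs.get(chunk, unknown_prob ** len(chunk))
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, start)
    # Trace the boundaries back from the end of the utterance.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(utterance[start:end])
        end = start
    return words[::-1]


lexicon = {"the": 0.3, "dog": 0.2, "do": 0.1}  # hypothetical probabilities
print(segment("thedog", lexicon))
```

An incremental learner would additionally update the lexicon counts after each utterance, so that later utterances are segmented with a better model.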

Yoon Kim, Center for Computer Research in Music and Acoustics, Stanford University

Time: Friday, March 3rd, 2000, 10:30 AM, at CDL
Title: Signal Modeling for Robust Speech Recognition with Frequency Warping and Convex Optimization
Abstract:

In speaker-independent speech recognition, compensating for the differences pertaining to speaker-dependent acoustics is crucial. Acoustic mismatches between the test speakers and the statistical model result in considerable degradation of recognition performance. In this work, we dissect the entire signal modeling process into feature extraction and compensation, and propose a technique to alleviate the aforementioned problem using Bark-frequency warping and convex optimization.

For feature extraction, we propose a new method of obtaining parameters from speech that is based on frequency warping of the vocal-tract spectrum, rather than the FFT spectrum. Frequency warping is performed by taking the non-uniform DFT (NDFT) of the impulse response corresponding to the vocal-tract transfer function using a Bark-warped frequency grid. The warped spectrum is then modeled by another linear prediction, which provides a good estimate of the spectral envelope, especially near the peaks. Results of vowel classification experiments show that the proposed method effectively captures linguistic information while reducing the effects of speaker-dependent information.
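The warping step can be sketched as follows: build a frequency grid that is uniform on the Bark scale, then evaluate the Fourier transform of an impulse response at those non-uniform frequencies. The Traunmüller Bark formula, the function names and the one-pole example response are assumptions chosen for illustration; the abstract does not say which Bark approximation was used.

```python
import cmath
import math


def hz_to_bark(f):
    # Traunmüller's approximation of the Bark scale (assumed here).
    return 26.81 * f / (1960.0 + f) - 0.53


def bark_to_hz(z):
    # Algebraic inverse of hz_to_bark.
    return 1960.0 * (z + 0.53) / (26.28 - z)


def bark_grid(num_points, sample_rate):
    """Frequencies in Hz, uniformly spaced in Bark from 0 to Nyquist."""
    z_lo = hz_to_bark(0.0)
    z_hi = hz_to_bark(sample_rate / 2.0)
    step = (z_hi - z_lo) / (num_points - 1)
    return [bark_to_hz(z_lo + k * step) for k in range(num_points)]


def ndft(impulse_response, freqs_hz, sample_rate):
    """Non-uniform DFT: evaluate the spectrum at arbitrary frequencies."""
    spectrum = []
    for f in freqs_hz:
        w = 2.0 * math.pi * f / sample_rate
        spectrum.append(sum(h * cmath.exp(-1j * w * n)
                            for n, h in enumerate(impulse_response)))
    return spectrum


# Hypothetical vocal-tract impulse response: a decaying one-pole filter.
h = [0.9 ** n for n in range(64)]
grid = bark_grid(16, 8000.0)
warped = [abs(v) for v in ndft(h, grid, 8000.0)]
```

The grid is dense at low frequencies and sparse at high ones, which is what gives the warped spectrum its perceptual resolution.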

While robust feature extraction helps in reducing the acoustic variability among speakers, feature compensation based on optimization is needed to "minimize" the mismatches with respect to a recognition model. For the compensation, we propose a linear transformation of the cepstral space, which can be viewed as the Fourier dual of the log spectral space. If we restrict the linear transformation to be that of convolution, which is equivalent to filtering in the log spectrum domain, the resulting normalization matrix exhibits a Toeplitz structure, simplifying the parameter estimation. The problem of finding the optimal matrix coefficients that maximize the likelihood of the utterance with respect to an existing Gaussian-based model then becomes nothing but a constrained least-squares problem, which is convex in nature, yielding a unique optimum. Applying the optimal linear transformation to the test feature space resulted in considerable improvements over the baseline system in frame-based vowel recognition using data from 23 British speakers, and in isolated digit recognition using the TI digits database, consisting of over 300 speakers.
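The link between convolution and a Toeplitz matrix can be checked with a small sketch. It assumes a causal filter and a truncated convolution, which may differ from the exact form used in the work; the filter and cepstral values are made up.

```python
def toeplitz_from_filter(h, size):
    """Lower-triangular Toeplitz matrix with T[i][j] = h[i - j]."""
    return [[h[i - j] if 0 <= i - j < len(h) else 0.0
             for j in range(size)]
            for i in range(size)]


def matvec(matrix, vector):
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def convolve_truncated(h, c):
    """Causal convolution of c with h, truncated to len(c) outputs."""
    return [sum(h[k] * c[i - k] for k in range(len(h)) if i - k >= 0)
            for i in range(len(c))]


h = [1.0, 0.5]            # hypothetical filter coefficients
c = [1.0, 2.0, 3.0]       # hypothetical cepstral vector
T = toeplitz_from_filter(h, len(c))
print(matvec(T, c), convolve_truncated(h, c))
```

Because every diagonal of the matrix repeats a single coefficient, only `len(h)` free parameters have to be estimated instead of a full square matrix.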

Sebastien Grange, Star Lab/CHIC!/EPFL

Time: Wed. March 8th, 3:00pm, CDL
Title: Vision-based sensor fusion for active interfaces: H.O.T.-Human Oriented Tracking
Abstract:

Sensor fusion can be used to build a robust tracker and to analyze human activity for automatic input monitoring. The goal of this project is to develop new sensor fusion techniques using vision sensors (stereo cameras, panoramic cameras). These techniques will enable creation of active interfaces for intelligent environments, for example smart meeting rooms. This project describes the implementation of a vision-based people tracker, which is used to create a geometric and dynamic model of a person. The tracker uses sensor fusion from two input modalities, namely color images and stereovision, to locate particular human features. A model of the human pose is then built from the tracking results, and human movements are segmented and parameterized. The model is finally used to monitor the person's activity. The purpose of activity monitoring is primarily to allow for better human-computer interaction by giving the computer knowledge about the human. A benefit of activity modeling is that it can provide the human with appropriate feedback, at the right moment, and using the best available output modality. The result is a software library, called H.O.T. (for Human Oriented Tracking), which can be used as a reusable interface between a human-computer interaction application and a human user interacting with the machine. H.O.T. is specifically designed for smart-space applications, but can be used in many other fields involving human monitoring, such as mobile robotics and computer-assisted teleoperation.

Erik McDermott, NTT Communication Science Labs and ATR Human Information Processing Labs, Kyoto, Japan.

Time: Friday, March 10th, 2000, 10:30 AM, EJ 124
Title: MCE-based discriminative training for speech recognition
Abstract:
At the heart of the widespread practice of using Maximum Likelihood Estimation (MLE) to choose a classification system's parameters lies a fundamental mismatch between method and goal. Simply stated, maximizing likelihood is not the same thing as optimizing classification performance. Though MLE-designed systems perform well when the correct model structure (i.e., a sufficient number of the right type of pdfs) is used, it is well known that optimal classification performance is not guaranteed when an incorrect model structure is used. In many cases, the MLE solution is different from the optimal (for classification) solution even though the model structure is fully capable of representing that solution.

The Minimum Classification Error (MCE) framework attempts to eliminate the mismatch between the approach used for parameter design and the goal of optimal classification performance. The central idea in MCE is to use a smooth, differentiable approximation of classification performance as the criterion function for optimization. Choosing the system parameters (e.g., means, covariances and mixing weights) so as to minimize the MCE loss function results in a much more efficient use of the parameters than in MLE. Stated differently, MCE typically yields better classification performance than MLE, even when using a (much) smaller number of parameters. It should be mentioned that the fact that the MCE loss function is a smoothed approximation of classification performance not only enables gradient-based optimization, but also offers a way of controlling generalization, in a manner similar to the choice of smoothing parameters in Parzen window or k-nearest neighbors estimation.
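One common way to write such a smoothed loss is sketched below: a misclassification measure compares the correct class score with a soft maximum over the rival scores, and a sigmoid turns it into a value between 0 and 1. This is a widely used textbook form, assumed here for illustration; the parameter names `alpha` and `eta` and the example scores are not taken from the talk.

```python
import math


def misclassification_measure(scores, correct, eta=2.0):
    """Soft maximum of rival scores minus the correct class score.

    Negative values mean the token is correctly classified.
    """
    rivals = [g for j, g in enumerate(scores) if j != correct]
    m = max(rivals)  # subtract the max for numerical stability
    mean_exp = sum(math.exp(eta * (g - m)) for g in rivals) / len(rivals)
    soft_max = m + math.log(mean_exp) / eta
    return soft_max - scores[correct]


def mce_loss(scores, correct, alpha=1.0, eta=2.0):
    """Sigmoid of the misclassification measure: a smooth 0/1 error."""
    d = misclassification_measure(scores, correct, eta)
    return 1.0 / (1.0 + math.exp(-alpha * d))


scores = [2.0, 0.0, -1.0]  # hypothetical discriminant values per class
print(mce_loss(scores, correct=0), mce_loss(scores, correct=1))
```

Larger values of `alpha` make the sigmoid steeper, so the loss approaches a hard error count; smaller values smooth it further, which is the generalization control mentioned above.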

In this talk I will go over the definition of the MCE loss function, and discuss different approaches to HMM optimization with MCE. I will also try to put MCE in a historical setting and contrast it with other discriminative approaches to classifier design, such as Maximum Mutual Information (MMI). If there is time, I will present recent MCE results from a challenging commercial speech recognition application, to be presented at ICASSP2000 (see below). If there is not enough time, that could be the subject for another talk.

------------------------------------------------------------------

I include the abstract of our ICASSP2000 paper here:

"Discriminative Training for Large Vocabulary Telephone-Based Name Recognition", McDermott, Biem, Tenpaku & Katagiri.

This paper describes progress on a commercial application of the MECS recognition system to the task of recognizing Japanese family names spoken by customers into the answering machines of a large marketing / human resource company. The task is thus speaker-independent, open vocabulary, and is characterized by large variation in caller speaking styles, telephone types and acoustic environments. Our results show that context-independent hidden Markov models trained discriminatively with the Minimum Classification Error criterion are a practical alternative to context-dependent models based on phonetic decision trees, yielding better performance with a much smaller number of parameters. On this difficult task we have obtained 59% correct family name recognition. A phoneme-based confidence measure enables us to obtain 85% correct name recognition for accepted utterances, at an overall utterance acceptance rate of 15%.

Debby Hindus, Interval Research Corporation, Palo Alto.

Time: Thursday March 16, 10:30AM, CDL
Title: The Importance of Homes in Technology Research
Abstract:
In this talk, I will argue for the importance of home-related research on technology. Several important differences between researching homes and researching workplaces are described, and several issues in conducting home-related research are discussed in the context of specific research efforts. Ways to advance home-related research as a discipline are presented, including an existing course on technology design with a home focus.

Bio:

Debby Hindus has been a Member of the Research Staff at Interval Research Corporation in Palo Alto, CA, since 1992. Her current research interests include broadband applications in the home and wireless technologies. Ms. Hindus has co-authored several studies of novel communications technology for workplaces and homes. In 1999, Ms. Hindus taught a new Stanford course on The Design of Domestic and Consumer Technologies. Earlier research addressed a new kind of computer-mediated communication, the audio space, and the design of user interactions within an audio space. Ms. Hindus holds an MS degree from the MIT Media Lab and a BSCS degree from the University of Michigan. While in the Media Lab’s Speech Research group, her work focused on innovative speech applications for interacting with computers.

Annette Adler, Xerox PARC, Palo Alto.

Time: Monday March 20; 3-4 pm CDL
Title: Goldfish: Document Use in the Real World
Abstract:
This talk is about how documents are used. It describes a research project, Goldfish, that I led for over two years. Goldfish considered how people in a variety of office, home and mobile settings use documents, emerging document types and document appliances, particularly in the context of:

- multiple locations for document use, some predictable and others unpredictable

- the many different (and sometimes overlooked) sorts of devices, such as pagers, palms, VM, involved in document use and some of the reasons why people use these

- intermingled virtual (computers and telephones) and physical (pencil and paper) document environments and how people negotiate getting back and forth between them

- the intertwined personal and professional nature of document work and how people juggle this

More than this, though, Goldfish was both an evolving perspective on documents and their use, and a methodology for how to go about looking at that. My goal was to identify some of the initial assumptions about the world underlying Xerox strategic directions and consider these through iterative field interviews, with an eye towards quickly understanding where to dig further (or not). As such, Goldfish was an early-stage research approach that has led to more targeted research and development since.

Words About Me:

I am presently in a new Xerox PARC lab responsible for transferring research from PARC into Xerox business groups. Prior to this, I did research on document use, particularly digital/physical transitions, and technology-mediated collaboration and communities. I have a formal background in social anthropology and have also worked as part of a systems architecture team at Xerox and, on the business side, in knowledge-based systems consulting and software product marketing. Integrating these experiences has led to many interesting and fruitful tensions that provide me with a strong appreciation for the value of interdisciplinary collaboration.

Steven Greenberg, International Computer Science Institute, Berkeley

Time: Wednesday May 10; 11:00AM, EJ-124
Title: Understanding Spoken Language in the Twenty-First Century
Abstract:
Spoken language is destined to play a major role in the coming digital age by virtue of its importance in human communication. The performance of current-day speech applications such as voice synthesis and dictation, natural-language query recognition, foreign-language-learning tools and auditory prostheses is still primitive in comparison to the level of quality associated with mature technology. The development of robust, reliable applications will depend on deep insight into the nature of spoken language and the melding of this knowledge with sophisticated engineering and design. One means by which to foster the development of future-generation technology is through the establishment of an advanced science and technology research center whose charter would cover a broad spectrum of intellectual domains germane to spoken language. This presentation will discuss a proposal for establishing such a center, as well as its prospects for collaboration with SRI (especially the STAR Laboratory and the Artificial Intelligence Center).

Dimitra Vergyri, Center for Language and Speech Processing, Johns Hopkins University

Time: Thursday May 25th; 10:30AM, EJ-124
Title: Integration of Multiple Knowledge Sources in Speech Recognition using Minimum Error Training
Abstract:
Modern automatic speech recognition systems employ two statistical models: a language model and an acoustic model. These two models are estimated independently of each other, most often using a maximum likelihood criterion. They are then combined to compute the a posteriori probability of a word sequence given an acoustic signal. During combination, a static tunable parameter is used to scale the score of one of the models relative to the other.

In this talk a general formulation will be presented for combining scores from several models -- knowledge sources -- into a single log-linear model to compute sentence probabilities. The parameters of the new model are the weights of the log-linear combination. The combination can be performed either statically, with constant weights, as in the traditional combination of the acoustic and language models, or dynamically, where the parameters may vary for different segments of a hypothesis. In the dynamic combination the weights aim to capture the changing confidence in each of the combined models. In order to achieve robust estimation of the parameters, each segment is automatically assigned to one of a small number of categories, or classes, and a single set of parameters is used for each segment class. Different techniques will be described to estimate the parameters in order to achieve minimum word error rate.
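The static case of such a log-linear combination can be sketched as a weighted sum of log scores, normalized over a list of competing hypotheses. The hypotheses, scores and weights below are invented for illustration, and normalizing over an N-best list is an assumption rather than a detail given in the abstract.

```python
import math


def combine(log_scores, weights):
    """Weighted sum of per-model log scores for one hypothesis."""
    return sum(w * s for w, s in zip(weights, log_scores))


def rescore(nbest, weights):
    """Posterior over hypotheses under the log-linear model."""
    combined = {hyp: combine(scores, weights) for hyp, scores in nbest.items()}
    m = max(combined.values())  # subtract the max for numerical stability
    z = sum(math.exp(v - m) for v in combined.values())
    return {hyp: math.exp(v - m) / z for hyp, v in combined.items()}


# Hypothetical (acoustic, language model) log scores per hypothesis.
nbest = {
    "recognize speech": (-10.0, -4.0),
    "wreck a nice beach": (-9.0, -7.0),
}
posterior = rescore(nbest, weights=(1.0, 1.0))
print(max(posterior, key=posterior.get))
```

In the dynamic case the weight vector would be looked up per segment class instead of being shared by the whole hypothesis.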

Three applications of this approach will be presented:

- The combination of several acoustic models trained using speech from resource-rich languages in order to obtain a recognition system for a language with little acoustic training data. The segments for which a set of parameters is defined correspond to hypothesized phones and the classes for the dynamic combination are chosen using phonological knowledge.

- The dynamic combination of the baseline acoustic and language models. Different ways are investigated for clustering word links in a recognition lattice, and the language model weight and insertion penalty parameters are estimated for each cluster. Features previously used to predict confidence (correctness) of the recognized words are utilized here to define the link clusters. This way of dynamically modifying the language model weight may be interpreted as acoustic sensitive language modeling.

- The integration in the model of side-information, available during a first pass recognition. The new model is used to rescore the hypotheses.

Experimental results will be reported for all three applications.

Jun Wu, Center for Language and Speech Processing, Johns Hopkins University

Time: Friday, August 4th; 10:30AM
Title: Maximum Entropy language modeling for Exploiting Syntactic, Semantic and Collocational Dependencies in Language Modeling
Abstract:
A new statistical language model is presented which incorporates collocational dependencies with two important sources of long-range statistical dependence: the syntactic structure and the topic of a sentence. These dependencies or constraints are integrated using the maximum entropy method. Substantial improvements are demonstrated over a trigram model in both perplexity and speech recognition accuracy on the Switchboard task. A detailed analysis of the performance of this language model is provided in order to characterize the manner in which it performs better than a standard N-gram model. It is shown that topic dependencies are most useful in predicting words which are semantically related by the subject matter of the conversation. Syntactic dependencies on the other hand are found to be most helpful in positions where the best predictors of the following word are not within N-gram range due to an intervening phrase or clause. It is also shown that these two methods individually enhance an N-gram model in complementary ways and the overall improvement from their combination is nearly additive.

Harry Printz, IBM Watson Research Center

Time: Tuesday, September 12th; 10:30AM
Title: Confusability
Abstract:
A language model is a function that returns the probability that any given sequence of words will appear in a very large corpus of naturally generated text. Such models lie at the heart of statistical speech recognition and machine translation systems.

One common way of constructing a language model is to define it in terms of a collection of parameters, and then adjust those parameters to maximize the probability that the model assigns to a training corpus. This is an instance of maximum likelihood modeling; it is equivalent to minimizing the model's perplexity on the given corpus. But it is well known that the performance of speech recognition systems is not well correlated with language model perplexity, hereafter called lexical perplexity. In particular it can easily happen that some new insight or technique lowers the lexical perplexity, but raises the word error rate.
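The equivalence between maximizing likelihood and minimizing perplexity can be seen in a toy unigram setting, sketched below. The corpus and the helper names are made up; a real language model would condition each word on its history.

```python
import math
from collections import Counter


def perplexity(corpus, probs):
    """exp of the negative average log-probability per word."""
    log_prob = sum(math.log(probs[w]) for w in corpus)
    return math.exp(-log_prob / len(corpus))


def ml_unigram(corpus):
    """Maximum likelihood unigram estimate: relative frequencies."""
    counts = Counter(corpus)
    return {w: c / len(corpus) for w, c in counts.items()}


corpus = "a b a a".split()           # hypothetical training corpus
uniform = {"a": 0.5, "b": 0.5}
print(perplexity(corpus, ml_unigram(corpus)), perplexity(corpus, uniform))
```

Since perplexity is a monotone decreasing function of the corpus likelihood, the maximum likelihood estimate is also the minimum perplexity one on the training data.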

In this talk, we argue that this conundrum arises from designing the model in isolation from the channel with which it will be used. Essentially, we propose that language models should be built to help with the hard parts of speech recognition, or the source-channel decoding task at hand. After all, it's not that hard to tell "nostril" from "rutabaga." But when you can separate "Austin" and "Boston," you know you're doing well.

In the first part, we analyze the operation of a language model in a source-channel decoding scheme. We define acoustic perplexity, a statistic that incorporates the characteristics of both the source (language model) and the channel (acoustic model). We show how this notion depends upon a still more fundamental expression, the acoustic confusability of word pairs.

In the second part, we present an algorithm for computing acoustic confusability, which can be applied directly to the well-known hidden Markov model paradigm, and which encompasses ALL possible paths through such models, of ALL possible lengths. From these confusability numbers, we show how to obtain both the acoustic perplexity, and another new measure of goodness for language models: the synthetic acoustic word error rate. We present experimental evidence that demonstrates that these measures are better correlated with word error rate than lexical perplexity.

In the third part, we show how a language model may be directly trained to minimize the synthetic acoustic word error rate, and give recognizer performance for a simple language model, so trained. We finish by discussing other applications of these ideas, notably to such varied areas as vocabulary definition, feature selection for maximum entropy models, and statistical machine translation.

This is joint work with Peder Olsen.

Martin Graciarena, School of Engineering, University of Buenos Aires

Time: Thursday, November 2nd, 2000, 10:30 AM, at EJ 124
Title: Maximum Likelihood Noise HMM Estimation In Model-Based Robust Speech Recognition
Abstract:
The performance of speech recognition systems in real-world applications may suffer severe degradation in unknown noisy situations. The source of this performance degradation is an environmental mismatch between training and testing conditions. The goal of robust speech recognition systems is to reduce this mismatch in order to bring performance back to the level of matched conditions.

In model-based techniques, the mismatch is reduced by constructing a noisy speech model from a clean speech model and a noise model for a particular environment. Among the most important works in this group are Gales' Parallel Model Combination (PMC), Rose et al.'s Integrated Parametric Model (IPM), Logan's work and Sankar's Stochastic Matching.

Most model-based techniques assume that the noise model is available a priori for a particular environment. The noise model is estimated from an available noise signal or from silent parts of noisy speech. However, it is desirable that the estimation be carried out so as to maximize the likelihood of the noisy speech model given noisy speech adaptation data. Logan's work is the first to propose an estimation technique for the noise model.

First, a generalization of Rose's IPM, from Gaussian mixture models to the Gaussian mixture HMM formulation, is introduced. Observations from the clean speech HMM and the noise HMM are combined, through a corruption function in the log spectrum domain, to generate noisy speech observations.

In addition, a maximum likelihood estimation algorithm for the Gaussian mixture noise HMM parameters is provided within the framework of the proposed noisy speech model with noisy speech adaptation data. For parameter estimation, in order to produce closed-form expressions, the "max" approximation is used as the corruption function. The adaptation data can be provided either in supervised mode (with transcriptions) or unsupervised mode (without transcriptions).
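The effect of the "max" approximation can be illustrated by comparing it with an exact log-domain sum. The sketch assumes that speech and noise add in the linear spectral domain, so that the exact combination of two log spectral values is a log-add; that additive model and the sample values are assumptions, not statements from the abstract.

```python
import math


def log_add(x, n):
    """Exact log(exp(x) + exp(n)), computed stably."""
    hi, lo = max(x, n), min(x, n)
    return hi + math.log1p(math.exp(lo - hi))


def max_approx(x, n):
    """The 'max' approximation: the stronger component dominates."""
    return max(x, n)


# Hypothetical (speech, noise) log spectral values in one frequency bin.
for speech, noise in [(10.0, 0.0), (5.0, 4.0), (3.0, 3.0)]:
    error = log_add(speech, noise) - max_approx(speech, noise)
    print(speech, noise, round(error, 4))
```

The approximation error never exceeds log 2, reached when speech and noise are equal, and it vanishes quickly as one component dominates the other.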

Noisy digit recognition experiments with NOISEX-92 show that the proposed model achieves almost the same performance whether the noise model is calculated from silent sections of several utterances or estimated from a single noisy utterance.


Last updated $Date: 2001/05/11 04:48:57 $ by hef@speech.sri.com