STAR Laboratory: Seminar Series

Speech Technology and Research (STAR) Laboratory Seminar Series

Past talks: 2009

Speaker: Sibel Yaman, ICSI, Berkeley
Time: Thursday, November 5th, 2009, 11:00 am
Venue: STAR Lab, EJ 124
Title: Single- and Multi-Objective Programming-Based Approaches to Automatic Spoken Language Identification
Abstract: Automatic language identification (LID) is the problem of identifying the language being spoken by an unknown speaker from a sample of speech. LID is often used as a front-end system to a language-specific speech recognition system for applications such as directory assistance, machine translation, and multi-lingual information retrieval. It is possible to distinguish four categories of approaches which differ in the information sources being used. These four categories are (i) acoustic approaches, (ii) phonotactic approaches, (iii) approaches that use prosodic and duration information, and (iv) discriminative approaches. In my PhD research, we developed a single-objective programming (SOP)-based and a multi-objective programming (MOP)-based approach to LID, where the standard detection performance evaluation measures, the false-rejection (FR, or miss) and false-acceptance (FA, or false alarm) rates for a number of languages, were to be simultaneously minimized. To obtain an approximation of the empirical FR and FA rates, the minimum classification error (MCE) framework was followed using linear discriminant functions (LDFs). When the LID task is defined as detecting the language spoken in a speech utterance, the actual goal is to minimize the FA and FR rates for each of the target languages rather than to minimize their average. Therefore, we formulated the LID problem as an MOP problem with a total of 2M objectives, where each individual objective is either an FA or an FR rate for a target language/dialect. The MOP-based approach under discussion directly attempts to find how to make individual error rates as small as possible without significant degradation in any one of them.

Speaker: Douglas W. Oard, University of Maryland
Time: Thursday, October 1st, 2009, 11:00 am
Venue: STAR Lab, EJ 124
Title: Combining LVCSR and STD for Robust Speech Retrieval

Abstract: Well-tuned Large-Vocabulary Continuous Speech Recognition (LVCSR) has been shown to generally be more effective than vocabulary-independent techniques as a basis for topic-based ranked retrieval of spoken content. Tuning LVCSR systems to a topic domain can be costly, however, and Out-Of-Vocabulary (OOV) query terms can adversely affect retrieval effectiveness when that tuning is not performed. I will show, however, that retrieval effectiveness for queries with OOV terms can be substantially improved by combining evidence from LVCSR with additional evidence from utterance-scale Spoken Term Detection (STD). The combination is performed by using relevance judgments from held-out topics to learn generic (i.e., topic-independent), smooth, non-decreasing transformations from LVCSR and STD system scores to relevance probabilities. I'll describe an evaluation using a test collection that includes conversational speech audio from an oral history collection, topics based on actual requests for information in that collection, and relevance judgments made by trained experts. For short queries, our combined system recovers 57% of the mean average precision that could have been obtained through LVCSR domain tuning. This is joint work with Scott Olsson.
Douglas Oard is an Associate Professor at the University of Maryland, College Park, with joint appointments in the College of Information Studies and the Institute for Advanced Computer Studies; he is on sabbatical at Berkeley's iSchool for the Fall 2009 semester. Dr. Oard earned his Ph.D. in Electrical Engineering from the University of Maryland. His research interests center around the use of emerging technologies to support information seeking by end users, with recent work on interactive techniques for cross-language information retrieval and techniques for search and sense-making in conversational media.

Speaker: Constantine Kotropoulos, Aristotle University of Thessaloniki
Time: Thursday, May 14th, 2009, 11:00 am
Venue: STAR Lab, EJ 124
Title: Speech Emotion Recognition

Abstract: The emotion information facilitates human/computer communication by exploiting the non-linguistic content of speech. Theories, feature extraction, classification methods, and research initiatives are surveyed first. Next, our results on basic research problems (e.g. feature selection, Gaussian mixture modeling) within the context of speech emotion recognition are presented. Subset feature selection performed by the sequential floating forward selection (SFFS) is addressed, when the criterion employed for feature selection is the correct classification rate of the Bayes classifier assuming that the features obey the multivariate Gaussian distribution. By modeling the number of correctly classified utterances as a hypergeometric random variable, an accurate estimate of the variance of the correct classification rate during cross-validation is derived, which is exploited in order to develop a fast SFFS variant. By studying the distribution properties of the Mahalanobis distance and employing proper statistical tests, we improve the expectation-maximization (EM) algorithm for Gaussian mixture modeling. Experimental results are presented against 7 other EM variants, which demonstrate that the proposed EM variant has an increased capability to find the underlying model, while maintaining a low execution time. Furthermore, the total variation between the distribution of the Mahalanobis distance for an infinite training set and that for a finite training set when either cross-validation or re-substitution is used for parameter estimation allows us to set a lower limit for the correct classification rate achieved by the Bayes classifier. This lower limit is exploited in subset feature selection. Finally, gender classification by processing emotionally colored speech is briefly addressed. Perfect classification is reported.
Constantine Kotropoulos received the Diploma degree with honors in Electrical Engineering in 1988 and the PhD degree in Electrical & Computer Engineering in 1993, both from the Aristotle University of Thessaloniki. He is currently an Associate Professor in the Department of Informatics at the Aristotle University of Thessaloniki. Since September 1st, 2008 and for one calendar year, he has been a visiting research scholar in the Department of Electrical and Computer Engineering at the University of Delaware, U.S.A. He also conducted research in the Signal Processing Laboratory at Tampere University of Technology, Finland during the summer of 1993. He has co-authored 38 journal papers, 144 conference papers, and contributed 6 chapters to edited books in his areas of expertise. He is co-editor of the book "Nonlinear Model-Based Image/Video Processing and Analysis" (J. Wiley and Sons, 2001). His current research interests include audio, speech, and language processing; signal processing; pattern recognition; multimedia information retrieval; biometric authentication techniques; and human-centered multimodal computer interaction. Prof. Kotropoulos was a scholar of the State Scholarship Foundation of Greece and the Bodossaki Foundation. He is a senior member of the IEEE and a member of EURASIP, IAPR, ISCA, and the Technical Chamber of Greece. He is a member of the Editorial Board of Advances in Multimedia journal and serves as a EURASIP local liaison officer for Greece.

Speaker: Richard M. Stern, Carnegie Mellon University
Time: Thursday, April 2nd, 2009, 11:00 am
Venue: STAR Lab, EJ 124
Title: Robust Automatic Speech Recognition in Reverberant Environments

Abstract:
As core speech recognition technology continues to improve, along with the means to mitigate the effects of many types of additive noise sources, reverberation in natural environments is becoming an increasingly dominant impediment to the accuracy of automatic speech recognition systems. Reverberation is a natural part of virtually every indoor acoustical environment, and it is a factor in many outdoor settings with reflective surfaces as well. The presence of even a relatively small amount of reverberation destroys the temporal structure of speech waveforms, which degrades the recognition accuracy that is obtained from speech systems deployed in public spaces, homes, and offices. This talk will discuss the nature of the reverberation problem and will compare a number of classical and contemporary approaches to its remediation in speech recognition systems. We will discuss approaches to reverberation compensation that are motivated by the putative auditory processing of reverberant sounds, as well as approaches that are motivated purely by signal processing or statistical considerations. Finally, we will compare the efficacy of selected approaches to reverberation compensation in reducing errors in automatic speech recognition and speaker identification.
Richard M. Stern received the S.B. degree from the Massachusetts Institute of Technology in 1970, the M.S. from the University of California, Berkeley, in 1972, and the Ph.D. from MIT in 1977, all in electrical engineering. He has been on the faculty of Carnegie Mellon University since 1977, where he is currently a Professor in the Electrical and Computer Engineering, Computer Science, and Biomedical Engineering Departments, the Language Technologies Institute, and the School of Music. Much of Dr. Stern's current research is in spoken language systems, where he is particularly concerned with the development of techniques with which automatic speech recognition can be made more robust with respect to changes in environment and acoustical ambience. He has also developed sentence parsing and speaker adaptation algorithms for earlier CMU speech systems. In addition to his work in speech recognition, Dr. Stern has worked extensively in psychoacoustics, where he is best known for theoretical work in binaural perception. Dr. Stern is a Fellow of the Acoustical Society of America, the 2008-2009 Distinguished Lecturer of the International Speech Communication Association, a recipient of the Allen Newell Award for Research Excellence in 1992, and he served as General Chair of Interspeech 2006. He is also a member of the IEEE and the Audio Engineering Society.

Speaker: Joaquin Gonzalez-Rodriguez, Universidad Autonoma de Madrid
Time: Wednesday, March 18th, 2009, 10:30 am
Venue: STAR Lab, EJ 124
Title: Forensic Automatic Speaker Recognition: Fiction or Science?

Abstract:
The expectations of Courts, juries and fact finders about what can be obtained from the scientific analysis of the forensic evidence are usually unrealistic, clearly represented in the well-known CSI-effect. Additionally, the actual analysis and reporting procedures in most forensic identification areas, including speaker recognition, have usually been far from truly scientific. However, three different key factors have developed and consolidated in the last 20 years which clearly define the proper way to analyze and report forensic evidence. First, the requirements for admissibility of forensic evidence are becoming increasingly demanding, exemplified by the US Daubert ruling (1993) and US Federal Rule of Evidence 702 (2000), forcing old forensic identification sciences to move from expert-based approaches to more scientific and data-based ones. Secondly, an approach to forensic identification based on Bayesian inference of the identity of sources, which fully respects and clarifies the Court and scientist roles, is widely accepted as the best way to provide useful information to Courts in the presence of uncertainty (always present in forensic cases) and is being progressively adopted by more and more scientists and laboratories across different countries and forensic identification disciplines. Last but not least, DNA typing has shown that it is possible to successfully fulfill both previous requirements, with a truly scientific and data-based approach. DNA typing has become the new gold standard that classical forensic identification sciences should emulate, and we will show in this talk that this approach can also be followed with voice evidence. However, DNA profiling has a better knowledge of the sources of variation of DNA markers in a given population, and much greater discrimination ability among individuals than speaker recognition (especially with forensic voice recordings).
Therefore, caution is still needed, as the information reported to a court will be a reliable estimate of the weight of the evidence only if the conditions of assessment and calibration of the system and the type of speech reasonably match those in the forensic case at hand, which requires proper knowledge of the variations of the acoustic and/or linguistic features in use across the sociolinguistic populations involved in the case.
Joaquin Gonzalez-Rodriguez received the M.S. degree in 1994, and the Ph.D. degree "cum laude" in 1999, both in electrical engineering, from Univ. Politecnica de Madrid (UPM), Spain. After 15 years of research and lecturing at UPM, he has been, since May 2006, an Associate Professor at the Computer Science Department at Univ. Autonoma de Madrid, where he leads the Speech group of the ATVS-Biometric Recognition Group. He has led ATVS participations in NIST speaker (2001, 2002, 2004, 2005, 2006, and 2008) and language (2005 and 2007) recognition evaluations, and in the 2003 NFI-TNO forensic speaker recognition evaluation.
Dr. Gonzalez-Rodriguez has been an invited member of FSAAWG (Forensic Speech and Audio Analysis Working Group) at ENFSI (European Network of Forensic Science Institutes) since 2000, and has focused his research work on the proper use of automatic speaker recognition in forensic science. He is a member of ISCA and the Signal Processing Society of IEEE, and is also a member of the program committee of the ISCA Odyssey Conferences on Speaker and Language Recognition, having been vice-chair of Odyssey 2004 in Toledo, Spain.

Speaker: Shasha Xie, ICSI and UT Dallas
Time: Wednesday, March 11th, 2009, 11:00 am
Venue: STAR Lab, EJ 124
Title: Using Supervised and Unsupervised Approaches for Extractive Meeting Summarization

Abstract:
Meeting summarization provides a concise and informative summary of lengthy meetings and is an effective tool for efficient information access. In this talk, the focus is extractive summarization, where salient sentences are selected from the meeting transcripts to form a summary. First, we explored an unsupervised learning approach within the framework of maximal marginal relevance (MMR), and evaluated different measures to better capture the similarity between texts. Then, we adopted a supervised learning approach for this task and used a classifier to determine whether to select a sentence for the summary based on a rich set of features. We addressed three important problems associated with this supervised classification approach: the imbalanced data problem, human annotation disagreement, and the effectiveness of different features.

Speaker: Carol Espy-Wilson, University of Maryland
Time: Friday, February 27, 2009, 1:30 pm
Venue: STAR Lab, EJ 124
Title: Speech Enhancement based on the Modified Phase-Opponency Model

Abstract:
A major issue for speech recognition technology is its performance in everyday noisy environments. In this talk I will discuss a speech enhancement algorithm we have developed that is based on the auditory phase-opponency (PO) model proposed for detection of tones in noise. The PO model includes a physiologically realistic mechanism for processing the information in neural discharge times and exploits the frequency-dependent phase properties of the tuned filters in the auditory periphery by using cross-auditory-nerve-fiber coincidence detection for extracting temporal cues. An important feature of the PO model is that it does not need to estimate the noise characteristics, nor does it assume that the noise satisfies any statistical model. We modified the PO model, yielding the modified phase-opponency (MPO) model, so that its basic functionality was maintained, but the properties of the model can be analyzed and modified independently. In addition, we improved on its performance by coupling the MPO model with our Aperiodicity, Periodicity, Pitch (APP) detector. I will also show perceptual data showing the effectiveness of the MPO-APP speech enhancement algorithm for people with hearing impairments. Presently, we are investigating additional processing to further improve the performance of the MPO-APP, especially at low signal-to-noise ratios. Some of these techniques involve variable frame rate analysis and the application of single-channel speech segregation techniques based on a new paradigm wherein the mixture signal is shared among the participating speakers, rather than divided amongst them as done in present approaches.
Dr. Carol Espy-Wilson is a Professor in the Electrical and Computer Engineering Department at the University of Maryland. She directs the Speech Communication Lab at the University of Maryland. Her research interests include the integration of engineering, linguistics and speech science to study speech communication. She is developing an approach to speech recognition based on landmarks and gestural phonology to address the limitations of present recognizers (e.g., effective handling of prosodically-guided variability). She also conducts research in the areas of speech production, speech enhancement, speaker recognition, single-channel speaker separation and language and genre detection in audio content analysis. Presently, Dr. Espy-Wilson is on sabbatical as a Radcliffe Fellow at Harvard University.