Speech Technology and Research (STAR) Laboratory Seminar Series
Past talks: 2009
-
Speaker: Douglas W. Oard, University of Maryland
Time: Thursday, October 1st, 2009, 11:00 am
Venue: STAR Lab, EJ 124
Title: Combining LVCSR and STD for Robust Speech Retrieval
Abstract:
Well tuned Large-Vocabulary Continuous Speech Recognition (LVCSR) has been
shown to generally be more effective than vocabulary-independent
techniques as a basis for topic-based ranked retrieval of spoken content.
Tuning LVCSR systems to a topic domain can be costly, however, and
Out-Of-Vocabulary (OOV) query terms can adversely affect retrieval
effectiveness when that tuning is not performed. I will show, however,
that retrieval effectiveness for queries with OOV terms can be
substantially improved by combining evidence from LVCSR with additional
evidence from utterance-scale Spoken Term Detection (STD). The combination
is performed by using relevance judgments from held-out topics to learn
generic (i.e., topic-independent), smooth, non-decreasing transformations
from LVCSR and STD system scores to relevance probabilities. I'll describe
an evaluation using a test collection that includes conversational speech
audio from an oral history collection, topics based on actual requests for
information in that collection, and relevance judgments made by trained
experts. For short queries, our combined system recovers 57% of the mean
average precision that could have been obtained through LVCSR domain
tuning. This is joint work with Scott Olsson.
Douglas Oard is an Associate Professor at the University of Maryland,
College Park, with joint appointments in the College of Information
Studies and the Institute for Advanced Computer Studies; he is on
sabbatical at Berkeley's iSchool for the Fall 2009 semester. Dr. Oard
earned his Ph.D. in Electrical Engineering from the University of
Maryland. His research interests center around the use of emerging
technologies to support information seeking by end users, with recent work
on interactive techniques for cross-language information retrieval and
techniques for search and sense-making in conversational media.
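The abstract above mentions learning non-decreasing transformations from system scores to relevance probabilities using held-out relevance judgments. As a rough illustration only, the sketch below fits a non-decreasing step function with the pool-adjacent-violators procedure (isotonic regression). This is a swapped-in, generic technique, not the speakers' method: it is not smooth, and the function names `fit_monotone` and `predict` and the toy data are hypothetical.

```python
def fit_monotone(scores, labels):
    """Fit a non-decreasing step function from scores to relevance rates.

    Illustrative pool-adjacent-violators sketch; labels are 0/1 relevance
    judgments from held-out data. Returns (upper_score, probability) steps.
    """
    blocks = []  # each block: [sum_of_labels, count, highest_score_in_block]
    for s, y in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == s:
            # Tied scores are pooled into the same block.
            blocks[-1][0] += y
            blocks[-1][1] += 1
        else:
            blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]


def predict(model, score):
    """Look up the estimated relevance probability for a raw system score."""
    for top, prob in model:
        if score <= top:
            return prob
    return model[-1][1]  # scores above the training range get the last step


# Toy held-out data (hypothetical): raw scores and 0/1 relevance judgments.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 1, 0, 0, 1, 1]
model = fit_monotone(scores, labels)
```

In a setup like the one described, one such mapping could be fit per system (LVCSR and STD) so that their outputs become comparable probabilities before being combined; how the talk's system actually combines them is not stated in the abstract.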
-
Speaker: Constantine Kotropoulos, Aristotle University of Thessaloniki
Time: Thursday, May 14th, 2009, 11:00 am
Venue: STAR Lab, EJ 124
Title: Speech Emotion Recognition
Abstract:
The emotion information facilitates human/computer communication by exploiting the non-linguistic content of speech. Theories, feature extraction, classification methods, and research initiatives are surveyed first. Next, our results on basic research problems (e.g. feature selection, Gaussian mixture modeling) within the context of speech emotion recognition are presented. Subset feature selection performed by the sequential floating forward selection (SFFS) is addressed, when the criterion employed for feature selection is the correct classification rate of the Bayes classifier assuming that the features obey the multivariate Gaussian distribution. By modeling the number of correctly classified utterances as a hypergeometric random variable, an accurate estimate of the variance of the correct classification rate during cross-validation is derived, which is exploited in order to develop a fast SFFS variant. By studying the distribution properties of the Mahalanobis distance and employing proper statistical tests, we improve the expectation-maximization (EM) algorithm for Gaussian mixture modeling. Experimental results are presented against 7 other EM variants, which demonstrate that the proposed EM variant has an increased capability to find the underlying model, while maintaining a low execution time. Furthermore, the total variation between the distribution of the Mahalanobis distance for an infinite training set and that for a finite training set when either cross-validation or re-substitution is used for parameter estimation allows us to set a lower limit for the correct classification rate achieved by the Bayes classifier. This lower limit is exploited in subset feature selection. Finally, gender classification by processing emotionally colored speech is briefly addressed. Perfect classification is reported.
Constantine Kotropoulos received the Diploma degree with honors in Electrical Engineering in 1988 and the PhD degree in Electrical & Computer Engineering in 1993, both from the Aristotle University of Thessaloniki. He is currently an Associate Professor in the Department of Informatics at the Aristotle University of Thessaloniki. Since September 1st, 2008 and for one calendar year, he has been a visiting research scholar in the Department of Electrical and Computer Engineering at the University of Delaware, U.S.A. He also conducted research in the Signal Processing Laboratory at Tampere University of Technology, Finland during the summer of 1993. He has co-authored 38 journal papers, 144 conference papers, and contributed 6 chapters to edited books in his areas of expertise. He is co-editor of the book "Nonlinear Model-Based Image/Video Processing and Analysis" (J. Wiley and Sons, 2001). His current research interests include audio, speech, and language processing; signal processing; pattern recognition; multimedia information retrieval; biometric authentication techniques; and human-centered multimodal computer interaction. Prof. Kotropoulos was a scholar of the State Scholarship Foundation of Greece and the Bodossaki Foundation. He is a senior member of the IEEE and a member of EURASIP, IAPR, ISCA, and the Technical Chamber of Greece. He is a member of the Editorial Board of Advances in Multimedia journal and serves as a EURASIP local liaison officer for Greece.
-
Speaker: Richard M. Stern, Carnegie Mellon University
Time: Thursday, April 2nd, 2009, 11:00 am
Venue: STAR Lab, EJ 124
Title: Robust Automatic Speech Recognition in Reverberant Environments
Abstract:
As core speech recognition technology continues to improve, along with the means to mitigate the effects of many types of additive noise sources, reverberation in natural environments is becoming an increasingly dominant impediment to the accuracy of automatic speech recognition systems. Reverberation is a natural part of virtually every indoor acoustical environment, and it is a factor in many outdoor settings with reflective surfaces as well. The presence of even a relatively small amount of reverberation destroys the temporal structure of speech waveforms, which degrades the recognition accuracy that is obtained from speech systems deployed in public spaces, homes, and offices. This talk will discuss the nature of the reverberation problem and will compare a number of classical and contemporary approaches to its remediation in speech recognition systems. We will discuss approaches to reverberation compensation that are motivated by the putative auditory processing of reverberant sounds, as well as approaches that are motivated purely by signal processing or statistical considerations. Finally, we will compare the efficacy of selected approaches to reverberation compensation in reducing errors in automatic speech recognition and speaker identification.
Richard M. Stern received the S.B. degree from the Massachusetts Institute of Technology in 1970, the M.S. from the University of California, Berkeley, in 1972, and the Ph.D. from MIT in 1977, all in electrical engineering. He has been on the faculty of Carnegie Mellon University since 1977, where he is currently a Professor in the Electrical and Computer Engineering, Computer Science, and Biomedical Engineering Departments, the Language Technologies Institute, and the School of Music. Much of Dr. Stern's current research is in spoken language systems, where he is particularly concerned with the development of techniques with which automatic speech recognition can be made more robust with respect to changes in environment and acoustical ambience. He has also developed sentence parsing and speaker adaptation algorithms for earlier CMU speech systems. In addition to his work in speech recognition, Dr. Stern has worked extensively in psychoacoustics, where he is best known for theoretical work in binaural perception. Dr. Stern is a Fellow of the Acoustical Society of America, the 2008-2009 Distinguished Lecturer of the International Speech Communication Association, a recipient of the Allen Newell Award for Research Excellence in 1992, and he served as General Chair of Interspeech 2006. He is also a member of the IEEE and the Audio Engineering Society.
-
Speaker: Joaquin Gonzalez-Rodriguez, Universidad Autonoma de Madrid
Time: Wednesday, March 18th, 2009, 10:30 am
Venue: STAR Lab, EJ 124
Title: Forensic Automatic Speaker Recognition: Fiction or Science?
Abstract:
The expectations of Courts, juries and fact finders about what can be
obtained from the scientific analysis of the forensic evidence are
usually unrealistic, clearly represented in the well-known CSI-effect.
Additionally, the actual analysis and reporting procedures in most
forensic identification areas, including speaker recognition, have
usually been far from truly scientific. However, three different
key factors have developed and consolidated in the last 20 years which
clearly define the proper way to analyze and report forensic evidence.
First, the requirements for admissibility of forensic evidence are
becoming increasingly demanding, exemplified by the US Daubert ruling
(1993) and US Federal Rule of Evidence 702 (2000), forcing old forensic
identification sciences to move from expert-based approaches to more
scientific and data based ones. Secondly, an approach to forensic
identification based on Bayesian inference of the identity of sources,
which fully respects and clarifies the Court and scientist roles, is
widely accepted as the best way to provide useful information to Courts
in the presence of uncertainty (always present in forensic cases) and
is being progressively adopted by more and more scientists and
laboratories across different countries and forensic identification
disciplines. Last but not least, DNA typing has shown that it is
possible to successfully fulfill both previous requirements, with a
truly scientific and data-based approach. DNA typing has become the
new gold standard that classical forensic identification sciences
should emulate, and we will show in this talk that this approach can
also be followed with voice evidence. However, DNA profiling has a
better knowledge of the sources of variation of DNA markers in a given
population, and much greater discrimination ability among individuals
than speaker recognition (especially with forensic voice recordings).
Therefore, caution is still needed, as the information reported to a
court will be a reliable estimate of the weight of the
evidence only if the conditions of assessment and calibration of the
system and the type of speech reasonably match those in the forensic
case at hand, which requires proper knowledge of the variations of the
acoustic and/or linguistic features in use across the sociolinguistic
populations involved in the case.
Joaquin Gonzalez-Rodriguez received the M.S. degree in 1994, and
the Ph.D. degree "cum laude" in 1999, both in electrical engineering,
from Univ. Politecnica de Madrid (UPM), Spain. After 15 years of
research and lecturing at UPM, he has been since May 2006 an Associate
Professor at the Computer Science Department at Univ. Autonoma de
Madrid, where he leads the Speech group of the ATVS-Biometric
Recognition Group. He has led ATVS participations in NIST speaker
(2001, 2002, 2004, 2005, 2006, and 2008) and language (2005 and 2007)
recognition evaluations, and in the 2003 NFI-TNO forensic speaker recognition
evaluation.
Dr. Gonzalez-Rodriguez has been since 2000 an invited member of FSAAWG
(Forensic Speech and Audio Analysis Working Group) at ENFSI (European
Network of Forensic Science Institutes), and has focused his research
work on the proper use of automatic speaker recognition in forensic
science. He is a member of ISCA and the Signal Processing Society of
IEEE, and is also a member of the program committee of the ISCA Odyssey
Conferences on Speaker and Language Recognition, having been vice-chair of
Odyssey 2004 in Toledo, Spain.
-
Speaker: Shasha Xie, ICSI and UT Dallas
Time: Wednesday, March 11th, 2009, 11:00 am
Venue: STAR Lab, EJ 124
Title: Using Supervised and Unsupervised Approaches for Extractive Meeting Summarization
Abstract:
Meeting summarization provides a concise and informative summary of
lengthy meetings and is an effective tool for efficient information
access. In this talk, the focus is extractive summarization, where
salient sentences are selected from the meeting transcripts to form a
summary. First, we explored an unsupervised learning approach within
the MMR framework, and evaluated different measures to better capture
the similarity between texts. Then, we adopted a supervised learning
approach for this task and used a classifier to determine whether to
select a sentence for the summary based on a rich set of features. We
addressed three important problems associated with this supervised
classification approach: the imbalanced data problem, human annotation
disagreement, and the effectiveness of different features.
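For readers unfamiliar with MMR (commonly expanded as Maximal Marginal Relevance), the following is a minimal sketch of greedy MMR sentence selection. It is not the speaker's system: the bag-of-words cosine similarity, the use of the whole transcript as the relevance reference, the function name `mmr_select`, and the weight `lam` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(sentences, k, lam=0.5):
    """Greedily pick k sentence indices, trading relevance for novelty.

    score(i) = lam * sim(sentence_i, whole transcript)
               - (1 - lam) * max sim(sentence_i, already selected)
    """
    vecs = [Counter(s.lower().split()) for s in sentences]
    doc = Counter()
    for v in vecs:
        doc.update(v)

    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(vecs[i], doc)
            redundancy = max(
                (cosine(vecs[i], vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # restore transcript order


# Toy transcript (hypothetical) with one repeated sentence.
transcript = [
    "budget review for the project",
    "budget review for the project",
    "schedule the next demo",
]
balanced = mmr_select(transcript, k=2, lam=0.5)
relevance_only = mmr_select(transcript, k=2, lam=1.0)
```

With `lam=0.5` the repeated sentence is penalized for redundancy, so the second pick is the demo sentence; with `lam=1.0` only relevance counts and both copies are chosen. Swapping `cosine` for another measure is where different similarity measures, as mentioned in the abstract, would plug in.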
-
Speaker: Carol Espy-Wilson, University of Maryland
Time: Friday, February 27, 2009, 1:30 pm
Venue: STAR Lab, EJ 124
Title: Speech Enhancement based on the Modified Phase-Opponency Model
Abstract:
A major issue for speech recognition technology is its performance in
everyday noisy environments. In this talk I will discuss a speech
enhancement algorithm we have developed that is based on the auditory
Phase-Opponency (PO)
model proposed for detection of tones in noise. The PO model includes a
physiologically realistic mechanism for processing the information in
neural discharge times and exploits the frequency-dependent phase
properties of the tuned filters in the auditory periphery by using a
cross-auditory-nerve-fiber coincidence detection for extracting temporal
cues. An important feature of the PO model is that it does not need to
estimate the noise characteristics, nor does it assume that the noise
satisfies any statistical model. We modified the PO model (MPO) so that
its basic functionality was maintained, but the properties of the model
can be analyzed and modified independently. In addition, we improved on
its performance by coupling the PO model with our Aperiodicity,
Periodicity, Pitch (APP) detector. I will also present perceptual data
showing the effectiveness of the MPO-APP speech enhancement algorithm
for people with hearing impairments. Presently, we are investigating
additional processing to further improve the performance of the MPO-APP,
especially at low signal-to-noise ratios. Some of these techniques
involve variable frame rate analysis and the application of
single-channel speech segregation techniques based on a new paradigm
wherein the mixture signal is shared among the participating speakers,
rather than divided amongst them as done in present approaches.
Dr. Carol Espy-Wilson is a Professor in the Electrical and Computer
Engineering Department at the University of Maryland. She directs the
Speech Communication Lab at the University of Maryland. Her research
interests include the integration of engineering, linguistics and speech
science to study speech communication. She is developing an approach to
speech recognition based on landmarks and gestural phonology to address
the limitations of present recognizers (e.g., effective handling of
prosodically-guided variability). She also conducts research in the
areas of speech production, speech enhancement, speaker recognition,
single-channel speaker separation and language and genre detection in
audio content analysis. Presently, Dr. Espy-Wilson is on sabbatical as a
Radcliffe Fellow at Harvard University.