Speech Technology and Research (STAR) Laboratory Seminar Series
Past talks: 2008
-
Speaker: Kevin Knight,
USC/Information Sciences Institute
Time: Tuesday, August 12th, 2008, 11:00 am
Venue: STAR Lab, EJ 124
Title: Language Translation as Code-Breaking
Abstract:
In 1949, Warren Weaver suggested applying cryptanalytic approaches to
the problem of automatic language translation. ("When I look at an
article in Russian, I say: this is really written in English, but it has
been coded in some strange symbols. I will now proceed to decode").
Claude Shannon had just laid a foundation of information theory for
cryptology, while Alan Turing and others had developed practical
techniques and machinery. Shannon's work was declassified in 1949, and
Turing's in 1996. The history of postwar cryptology has not yet been
written.
The 1990s actually saw Weaver's language translation idea picked up.
Since then, there has been tremendous progress in statistical language
translation. We take large, human-translated text collections (up to
half a billion words) and train models. Some models can be viewed as
word substitution/transposition ciphers, while others are linguistically
more sophisticated.
The need for large translated texts often perplexes newcomers, who ask:
(1) can I train translation systems without parallel text, and (2) how
much text do I need? These questions annoy those working in the
field, but Turing and Shannon would relate: as codebreakers, they did
not have the luxury of parallel plaintext/ciphertext collections, and
short ciphers were the epitome of data sparsity. We'll look at these
two questions in this talk.
-
Speaker: Korbinian Riedhammer, ICSI/Berkeley
Time: Thursday, June 19th, 2008, 11:00 am
Venue: STAR Lab, EJ 124
Title: Baselines and Limits in Extractive Meeting Summarization
Abstract:
After a short introduction to summarization, I'll describe two meeting
data sets (AMI and ICSI) and their annotations. Current
state-of-the-art automatic evaluation methods include ROUGE, which is
rooted in text summarization, and a weighted precision measure. In
preparation for understanding the limits of extractive summarization,
I'll give detailed examples of these measures. Baseline and limit
results matter because prior work on meeting summarization has varied
preprocessing, summary lengths, and evaluation criteria, which makes
it very hard to compare algorithms and results. Accompanying new
results with baseline and limit results for the same conditions allows
such a comparison. To do so, I introduce two simple baselines for summarization
(random selection and longest utterances). To determine the upper
limit, we mapped the summarization problem to a knapsack problem,
searching for the best subset of utterances to achieve the best
evaluation score while satisfying a given length constraint. We solve
that optimization problem with an integer linear program and give
results for manual transcripts and ASR data. Finally, I give a brief
outlook on further work to do in meeting summarization.
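
The knapsack view of the upper limit can be illustrated with a minimal
sketch. It assumes each utterance has a fixed, additive score, which
is a simplification (measures such as ROUGE are not strictly additive
across utterances), and it uses textbook dynamic programming rather
than the integer linear program mentioned in the abstract. The
function name, the scores, and the lengths are hypothetical.

```python
from typing import List, Tuple


def best_summary(lengths: List[int], scores: List[float],
                 budget: int) -> Tuple[float, List[int]]:
    """0/1 knapsack: pick utterances maximizing total score within a word budget.

    Assumes positive integer lengths and additive per-utterance scores
    (an illustrative simplification, not the evaluation used in the talk).
    """
    n = len(lengths)
    # table[i][b] = best score using only the first i utterances within b words
    table = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, score = lengths[i - 1], scores[i - 1]
        for b in range(budget + 1):
            table[i][b] = table[i - 1][b]          # skip utterance i-1
            if length <= b:
                candidate = table[i - 1][b - length] + score
                if candidate > table[i][b]:        # take utterance i-1
                    table[i][b] = candidate

    # Walk back through the table to recover which utterances were taken.
    chosen = []
    b = budget
    for i in range(n, 0, -1):
        if table[i][b] != table[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    chosen.reverse()
    return table[n][budget], chosen


if __name__ == "__main__":
    # Hypothetical utterance lengths (words) and scores.
    total, picked = best_summary([5, 4, 3], [6.0, 5.0, 4.0], budget=7)
    print(total, picked)
```

In this toy case the two shorter utterances together beat the single
highest-scoring one, which is the kind of trade-off a greedy baseline
such as "longest utterances" can miss.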
-
Speaker: Francoise Beaufays and Brian Strope, Google, Mountain View, CA
Time: Thursday, May 15th, 2008, 11:00 am
Venue: STAR Lab, EJ 124
Title: Deploying GOOG-411: Early Lessons in Data, Measurement, and Testing
Abstract:
We describe our early experience building and optimizing GOOG-411, a
fully automated, voice-enabled, business finder. We show how taking an
iterative approach to system development allows us to optimize the
various components of the system, thereby progressively improving
user-facing metrics. We show the contributions of different data
sources to recognition accuracy. For business listing language models, we see a
nearly linear performance increase with the logarithm of the amount of
training data. To date, we have improved our correct accept rate by
25% absolute, and increased our transfer rate by 35% absolute.
Brian Strope has been working on building, testing, deploying, and
re-optimizing GOOG-411 for the last couple of years. Before that he worked
on acoustic modeling, speech detection, and application tuning at
Nuance. His PhD from UCLA is on signal processing, perceptual
experiments, and ASR robustness. In a past life he designed
workstation hardware for HP, and he currently spends a lot of his
spare time playing golf with his 5 3/4 year old son.
Francoise Beaufays is a research scientist at Google where she
develops speech recognition products, and researches ways to
optimize their performance. For the last 2+ years she has focussed
mostly on building and growing GOOG-411. Prior to Google, she was
a researcher in speech recognition at SRI and then Nuance. She
holds a PhD in EE from Stanford. Francoise spends a lot of her spare
time with her 5 and 7 year old daughters, Gina and Barbara.
-
Speaker: Kemal Oflazer, Sabanci University, Istanbul, Turkey
Time: Friday, May 9th, 2008, 11:00 am
Venue: STAR Lab, EJ 124
Title: Statistical Machine Translation into a Morphologically Complex Language
Abstract:
In this talk, we present some results of our on-going work on English
to Turkish statistical machine translation. Turkish is an
agglutinative language with very rich inflectional and derivational
morphology. Turkish also has free constituent order, with almost no
formal ordering constraints at the sentence level. These properties,
and the fact that Turkish-English parallel corpora are a scarce
resource compared to those for other languages popular in SMT research,
raise interesting issues for SMT involving Turkish. After a discussion of
the highlights of relevant aspects of Turkish, we investigate
different representational granularities for sub-lexical
representation. We find that (i) representing both Turkish and
English at the morpheme-level but with some selective
morpheme-grouping on the Turkish side of the training data, (ii)
augmenting the training data with "sentences" comprising only the
"content words" of the original training data, and (iii) re-ranking
the n-best decoder outputs with a word-level language model by combining
translation model scores with word-level language model scores,
provide a non-trivial improvement over a fully word-based baseline
model. Additional improvements are obtained by iterative model
training (which may very loosely be called "statistical
post-editing"), augmenting training data with phrase-pairs which are
high-probability translations of each other, and by "word-repair" --
automatically identifying and correcting morphologically malformed words.
Despite our relatively limited training data, we improve from 19.77 BLEU for the
baseline to 28.41 BLEU, for a 42% relative improvement. We also
touch briefly on the suitability of BLEU for languages like Turkish
and present an overview of our BLEU+ tool, which considers root and
morphological proximity when comparing candidate sentence words to
reference sentence words and also provides various oracle BLEU scores.
Kemal Oflazer received his PhD in Computer Science from Carnegie Mellon
University in 1987. He is currently a faculty member at Sabanci
University, associated with the Computer Science program. He directs
the Human Language and Speech Processing Laboratory. He is mainly
interested in Natural Language Processing with specific applications
to Turkish. Currently he is working on statistical machine translation
(MT) between English and Turkish and developing NLP-based applications
for language learning. He is especially well known for his work on
applying finite state methods to language processing and on
error-tolerant finite state recognition. Two recent, very interesting
studies are an extension of BLEU, called BLEU+, for the evaluation of
MT systems for morphologically rich languages, and the adaptation of
the Turkish MT system to other Turkic languages, such as Uzbek or
Turkmen. He has co-authored more than 100 international conference and
peer-reviewed journal papers. Prof. Oflazer is on the editorial boards
of Computational Linguistics, Machine Translation, and a number of
other journals. He is on the organizing committees of EACL'09,
IJCNLP'08, EACL'06, ACL'05, ACL'04, EACL'03, and many others.
-
Speaker: Daniel Cer, University of Colorado at Boulder (visiting at Stanford University)
Time: Thursday, May 1st, 2008, 3:00 pm
Venue: STAR Lab, EJ 124
Title: Improvements in MERT
Abstract:
Minimum error rate training (MERT) is a widely used learning
procedure for statistical machine translation models. I will
contrast three search strategies for MERT: Powell's method,
the variant of coordinate descent found in the Moses MERT
utility, and a novel stochastic method that outperforms both
of these. I will also present a method for regularizing the
MERT objective that achieves statistically significant gains
when combined with both Powell's method and coordinate descent.
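
To make the idea of a coordinate-wise search concrete, here is a
minimal sketch. It is not Powell's method, the Moses MERT utility, or
the stochastic method from the talk: it simply tries a fixed grid of
values for one weight at a time and keeps any change that lowers the
error. It also assumes per-hypothesis error counts that can be summed
over sentences, which is a simplification (BLEU, for example, does not
decompose this way). All names and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

# One hypothesis = (feature vector, error count); one n-best list per sentence.
Hypothesis = Tuple[Sequence[float], float]


def corpus_error(weights: Sequence[float],
                 nbest_lists: List[List[Hypothesis]]) -> float:
    """Total error of the hypotheses the model ranks first under `weights`."""
    total = 0.0
    for nbest in nbest_lists:
        top = max(nbest,
                  key=lambda h: sum(w * f for w, f in zip(weights, h[0])))
        total += top[1]
    return total


def coordinate_search(nbest_lists: List[List[Hypothesis]],
                      start: Sequence[float],
                      grid: Sequence[float],
                      sweeps: int = 5) -> Tuple[List[float], float]:
    """Grid-based coordinate search: adjust one weight at a time."""
    weights = list(start)
    best_err = corpus_error(weights, nbest_lists)
    for _ in range(sweeps):
        improved = False
        for i in range(len(weights)):
            for value in grid:
                trial = weights[:i] + [value] + weights[i + 1:]
                err = corpus_error(trial, nbest_lists)
                if err < best_err:       # keep only strict improvements
                    weights, best_err, improved = trial, err, True
        if not improved:                 # a full sweep changed nothing
            break
    return weights, best_err


if __name__ == "__main__":
    # Toy data: in both sentences the second hypothesis has fewer errors.
    data = [
        [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)],
        [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)],
    ]
    print(coordinate_search(data, start=[1.0, 0.0], grid=[-1.0, 0.0, 1.0]))
```

Because the error surface is piecewise constant in the weights, a
search like this can stall in poor local optima, which is one
motivation for comparing search strategies and for regularizing the
objective.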
-
Speaker: John DeNero, UC Berkeley
Time: Thursday, April 17th, 2008, 11:00 am
Venue: STAR Lab, EJ 124
Title: Inference in Phrase Alignment Models
Abstract:
Models that align phrases instead of words offer an
appealing alternative to the standard relative frequency estimates of
phrase translation probabilities. But, while some effective word
alignment models (Model 1, Model 2 & HMM) can be estimated tractably
with EM, phrase alignment models cannot. I'll talk about how to show
that estimation and inference under these models are intractable.
Then, I'll present two useful approximation techniques.
First, I'll talk about how to cast phrase alignment search as an
integer linear programming (ILP) problem and find the optimal
alignment reliably and quickly with off-the-shelf ILP software. Some
applications of this technique include training phrase alignment
models and interpreting the output of word alignment models.
Second, we'll look at how to estimate translation probabilities under
a phrase alignment model using a Gibbs sampling procedure. The
sampler has some nice asymptotic convergence properties and also seems
to produce good results in practice. I'll walk through the different
models we've trained and how they performed.