
Speech Technology and Research (STAR) Laboratory Seminar Series

Past talks: 2008

  • Speaker: Kemal Oflazer, Sabanci University, Istanbul, Turkey
    Time: Friday, May 9th, 2008, 11:00 am
    Venue: STAR Lab, EJ 124
    Title: Statistical Machine Translation into a Morphologically Complex Language

    Abstract:

    In this talk, we present some results of our ongoing work on English-to-Turkish statistical machine translation. Turkish is an agglutinative language with very rich inflectional and derivational morphology. Turkish also has free constituent order, with almost no formal ordering constraints at the sentence level. These properties, together with the fact that Turkish-English parallel corpora are scarce compared to those for other languages popular in SMT research, raise interesting issues for SMT involving Turkish. After a discussion of the relevant aspects of Turkish, we investigate different representational granularities for sub-lexical representation. We find that (i) representing both Turkish and English at the morpheme level, but with some selective morpheme grouping on the Turkish side of the training data, (ii) augmenting the training data with "sentences" comprising only the "content words" of the original training data, and (iii) re-ranking the n-best decoder outputs by combining translation model scores with word-level language model scores, together provide a non-trivial improvement over a fully word-based baseline model. Additional improvements are obtained by iterative model training (which may very loosely be called "statistical post-editing"), by augmenting the training data with phrase pairs that are high-probability translations of each other, and by "word repair", that is, automatically identifying and correcting morphologically malformed words. Despite our relatively limited training data, we improve from 19.77 BLEU for the baseline to 28.41 BLEU, for a 42% relative improvement. We also touch briefly on the suitability of BLEU for languages like Turkish and present an overview of our BLEU+ tool, which considers root and morphological proximity when comparing candidate sentence words to reference sentence words and also provides various oracle BLEU scores.

    Kemal Oflazer received his PhD in Computer Science from Carnegie Mellon University in 1987. He is currently a faculty member at Sabanci University, associated with the Computer Science program, where he directs the Human Language and Speech Processing Laboratory. He is mainly interested in Natural Language Processing, with specific applications to Turkish. He is currently working on statistical machine translation (MT) between English and Turkish and on developing NLP-based applications for language learning. He is especially well known for his work on applying finite-state methods to language processing and on error-tolerant finite-state recognition. Two recent and very interesting studies are an extension of BLEU, called BLEU+, for evaluating MT systems for morphologically rich languages, and the adaptation of the Turkish MT system to other Turkic languages, such as Uzbek or Turkmen. He has co-authored more than 100 international conference and peer-reviewed journal papers. Prof. Oflazer is on the editorial boards of Computational Linguistics, Machine Translation, and a number of other journals. He has served on the organizing committees of EACL'09, IJCNLP'08, EACL'06, ACL'05, ACL'04, EACL'03, and many others.
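
As a rough illustration of the n-best re-ranking idea mentioned in the abstract (step iii), the Python sketch below combines a translation-model score with a word-level language-model score and sorts the hypotheses by the sum. It is not the speaker's system: the function names, the toy unigram language model, the `lm_weight` parameter, and the placeholder tokens are all assumptions made for this example.

```python
import math

def toy_word_lm(sentence, unigram_probs, unk_prob=1e-6):
    """Hypothetical word-level LM: sum of unigram log-probabilities.

    Words missing from the table (for example, malformed words) get unk_prob.
    """
    return sum(math.log(unigram_probs.get(word, unk_prob))
               for word in sentence.split())

def rerank(nbest, lm_score, lm_weight=1.0):
    """Re-rank (hypothesis, tm_logprob) pairs by tm_logprob + lm_weight * LM score."""
    scored = [(tm_logprob + lm_weight * lm_score(hyp), hyp)
              for hyp, tm_logprob in nbest]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [hyp for _, hyp in scored]

# Placeholder vocabulary: "w1" and "w2" are known words, "zzz" is not.
probs = {"w1": 0.5, "w2": 0.25}
nbest = [("w1 zzz", -1.0), ("w1 w2", -1.5)]  # the decoder prefers "w1 zzz"
reranked = rerank(nbest, lambda s: toy_word_lm(s, probs))
print(reranked)
```

In this toy case the decoder's first choice contains an unknown word, so the word-level language model pushes it below the second hypothesis.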

  • Speaker: Daniel Cer, University of Colorado at Boulder (visiting at Stanford University)
    Time: Thursday, May 1st, 2008, 3:00 pm
    Venue: STAR Lab, EJ 124
    Title: Improvements in MERT

    Abstract:

    Minimum error rate training (MERT) is a widely used learning procedure for statistical machine translation models. I will contrast three search strategies for MERT: Powell's method, the variant of coordinate descent found in the Moses MERT utility, and a novel stochastic method that outperforms both of these. I will also present a method for regularizing the MERT objective that achieves statistically significant gains when combined with both Powell's method and coordinate descent.
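
To make the setting concrete, here is a heavily simplified Python sketch of coordinate-descent weight tuning over n-best lists. It is not the Moses MERT utility, Powell's method, or the stochastic method from the talk: it replaces the exact line search with a fixed grid of candidate values, and the data layout (feature vector plus a per-hypothesis error) and all names are assumptions made for illustration.

```python
def corpus_error(weights, nbest_lists):
    """Total error of the top-scoring hypothesis in each n-best list.

    Each hypothesis is a (feature_vector, error) pair; the model score
    is the dot product of the weights and the feature vector.
    """
    total = 0.0
    for hyps in nbest_lists:
        best = max(hyps, key=lambda h: sum(w * f for w, f in zip(weights, h[0])))
        total += best[1]
    return total

def coordinate_descent(nbest_lists, weights, grid, max_rounds=10):
    """Tune one weight at a time, trying each grid value, until no improvement."""
    weights = list(weights)
    best_err = corpus_error(weights, nbest_lists)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(weights)):
            for value in grid:
                trial = weights[:i] + [value] + weights[i + 1:]
                err = corpus_error(trial, nbest_lists)
                if err < best_err:  # accept strict improvements only
                    best_err, weights, improved = err, trial, True
        if not improved:
            break
    return weights, best_err

# Two toy n-best lists with two features per hypothesis.
toy_lists = [
    [((1.0, 0.0), 1.0), ((0.0, 1.0), 0.0)],
    [((1.0, 0.2), 1.0), ((0.5, 1.0), 0.0)],
]
tuned, err = coordinate_descent(toy_lists, [1.0, 0.0], grid=[-1.0, 0.0, 1.0, 2.0])
print(tuned, err)
```

Because the error surface is piecewise constant in the weights, gradient-based optimizers do not apply directly, which is one reason search strategies such as those contrasted in the talk matter.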

  • Speaker: John DeNero, UC Berkeley
    Time: Thursday, April 17th, 2008, 11:00 am
    Venue: STAR Lab, EJ 124
    Title: Inference in Phrase Alignment Models

    Abstract:

    Models that align phrases instead of words offer an appealing alternative to the standard relative frequency estimates of phrase translation probabilities. However, while some effective word alignment models (Model 1, Model 2, and HMM) can be estimated tractably with EM, phrase alignment models cannot. I'll talk about how to show that estimation and inference under these models are intractable. Then, I'll present two useful approximation techniques. First, I'll talk about how to cast phrase alignment search as an integer linear programming (ILP) problem and find the optimal alignment reliably and quickly with off-the-shelf ILP software. Some applications of this technique include training phrase alignment models and interpreting the output of word alignment models. Second, we'll look at how to estimate translation probabilities under a phrase alignment model using a Gibbs sampling procedure. The sampler has some nice asymptotic convergence properties and also seems to produce good results in practice. I'll walk through the different models we've trained and how they performed.
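
For a sense of scale, the short Python sketch below counts the candidate phrase alignments of a sentence pair under one common formulation, assumed here and not stated in the abstract: both sentences are split into the same number of contiguous phrases, and the phrases are matched one-to-one. The count does not prove intractability, but it suggests why exhaustive enumeration is not a practical search strategy. The function name is hypothetical.

```python
from math import comb, factorial

def count_phrase_alignments(src_len, tgt_len):
    """Count bijective alignments between contiguous phrase segmentations.

    For each phrase count k: choose k - 1 of the src_len - 1 word boundaries
    on the source side, likewise on the target side, then match the k source
    phrases to the k target phrases in any of k! orders.
    """
    total = 0
    for k in range(1, min(src_len, tgt_len) + 1):
        total += (comb(src_len - 1, k - 1)
                  * comb(tgt_len - 1, k - 1)
                  * factorial(k))
    return total

for n in (1, 2, 3, 5, 10):
    print(n, count_phrase_alignments(n, n))
```

Even for equal-length sentences of modest size, the factorial term dominates, which motivates the ILP search and Gibbs sampling approximations described in the abstract.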

 

©2006 SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025-3493
SRI International is an independent, nonprofit corporation. Privacy policy

Last modified May 10, 2008