Re: KN discounting and zeroton words

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Sat, 11 Jun 2005 20:40:28 PDT

In message <1118075939.16700.23.camel@localhost> you wrote:
>
> A small correction: with KN discounting, too, zeroton words get the
> same unigram probability as words discounted to zero (using -gt1min).
> What I don't understand is why this probability can be higher than that
> of words that are not discounted to zero.
>
> E.g., for a very small test set, and using '-gt1min 2', zeroton and
> singleton words get a log probability of -0.7323937, but a word
> occurring twice gets -1.556303.
>
> I believe this is some magic property of KN discounting, in which case I
> apologize for polluting the list and go back to reading the description
> of the algorithm.

The unigram probabilities for zeroton words are obtained by distributing
the backoff mass left by the non-zeroton words evenly over all the zerotons
(this corresponds to backing off to a uniform distribution).
If the number of zerotons is small, each of them can end up with more
probability than a low-count observed unigram.
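
To make the effect concrete, here is a minimal Python sketch. It is not
SRILM's code: the function name, the toy counts, and the fixed absolute
discount of 0.75 (standing in for the KN discount) are all assumptions
chosen for illustration. Words below min_count are treated as discounted
to zero, and the leftover mass is split evenly over all zero-probability
words.

```python
import math

def backoff_unigrams(counts, vocab, discount, min_count):
    """Toy backoff unigram estimate (hypothetical helper, not SRILM code)."""
    total = sum(counts.values())
    probs = {}
    for word in vocab:
        c = counts.get(word, 0)
        # Words below min_count are treated as if discounted to zero.
        probs[word] = (c - discount) / total if c >= min_count else 0.0
    # Mass left over after discounting the retained words.
    leftover = 1.0 - sum(probs.values())
    # Spread it evenly over the zero-probability words (uniform backoff).
    zerotons = [w for w in vocab if probs[w] == 0.0]
    for w in zerotons:
        probs[w] = leftover / len(zerotons)
    return probs

# Assumed toy data: "yak" is unseen, "emu" is a singleton dropped by min_count=2.
counts = {"the": 6, "cat": 2, "dog": 2, "emu": 1}
vocab = ["the", "cat", "dog", "emu", "yak"]
backoff_probs = backoff_unigrams(counts, vocab, discount=0.75, min_count=2)

for word in vocab:
    print(word, round(math.log10(backoff_probs[word]), 4))
```

With these assumed numbers, "emu" and "yak" each receive 1.625/11, while
"cat" and "dog" (seen twice) receive only 1.25/11, which is the same
ordering as in the question.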

The -interpolate1 option should prevent this since it distributes the
backoff mass over ALL unigrams (adding to the probability of those words
that were observed).
Please check whether this is the case, and if not, send me a test case so
I can look into why it doesn't work as intended.
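
The difference can be sketched in Python as follows. Again, this is only
an illustration under assumptions, not what SRILM does internally: the
function name, the toy counts, and the fixed discount of 0.75 are made
up. Here the leftover mass is spread over every word in the vocabulary
and added to whatever discounted probability the word already has.

```python
def interpolated_unigrams(counts, vocab, discount, min_count):
    """Toy interpolated unigram estimate (hypothetical helper, not SRILM code)."""
    total = sum(counts.values())
    discounted = {}
    for word in vocab:
        c = counts.get(word, 0)
        # Words below min_count are treated as if discounted to zero.
        discounted[word] = (c - discount) / total if c >= min_count else 0.0
    # Mass left over after discounting the retained words.
    leftover = 1.0 - sum(discounted.values())
    # Every word, observed or not, gets an equal share added on top.
    share = leftover / len(vocab)
    return {word: discounted[word] + share for word in vocab}

# Same assumed toy data: "yak" is unseen, "emu" is dropped by min_count=2.
counts = {"the": 6, "cat": 2, "dog": 2, "emu": 1}
vocab = ["the", "cat", "dog", "emu", "yak"]
interp_probs = interpolated_unigrams(counts, vocab, discount=0.75, min_count=2)

for word in vocab:
    print(word, round(interp_probs[word], 4))
```

With these assumed numbers, "cat" gets 1.25/11 + 0.65/11 = 1.9/11, while
"emu" and "yak" get only the shared 0.65/11, so a retained word can no
longer fall below a zeroton.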

--Andreas
