Understanding lm-files and discounting

From: "Deniz Yuret" <dyuret at ADDRESS HIDDEN>
Date: Mon, 3 Dec 2007 12:38:29 +0200

I spent last weekend trying to figure out the discrepancies between the
SRILM KN-discounting implementations and my earlier implementations.
Basically I am trying to go from the text file to the count file to
the model file to the probabilities assigned to the words in the test
file.  This took me on a journey from man pages to debug outputs to
the source code.  I figured a lot of it out, but it turned out to be
nontrivial to go from paper descriptions to the numbers in the ARPA
ngram format to the final probability calculations.  If you can help
me with a couple of things, I promise I'll write a man page detailing
all discounting calculations in SRILM.

1. Sometimes the model seems to use smaller ngrams even when longer
ones are in the training file.  An example from a letter model:

E i s e n h o w e r
       p( E | <s> )    = [2gram] 0.0122983 [ -1.91016 ] / 1
       p( i | E ...)   = [3gram] 0.0143471 [ -1.84324 ] / 1
       p( s | i ...)   = [4gram] 0.308413 [ -0.510867 ] / 1
       p( e | s ...)   = [5gram] 0.412852 [ -0.384206 ] / 1
       p( n | e ...)   = [6gram] 0.759049 [ -0.11973 ] / 1
       p( h | n ...)   = [7gram] 0.397406 [ -0.400766 ] / 1
       p( o | h ...)   = [4gram] 0.212227 [ -0.6732 ] / 1
       p( w | o ...)   = [3gram] 0.0199764 [ -1.69948 ] / 1
       p( e | w ...)   = [4gram] 0.165049 [ -0.782387 ] / 1
       p( r | e ...)   = [4gram] 0.222122 [ -0.653408 ] / 1
       p( </s> | r ...)        = [5gram] 0.492478 [ -0.307613 ] / 1
1 sentences, 10 words, 0 OOVs
0 zeroprobs, logprob= -9.28505 ppl= 6.98386 ppl1= 8.48213

This is an -order 7 model and the training file does have the word
Eisenhower.  So I don't understand why it goes back to using lower
order ngrams after the letter 'h'.
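
For reference, the summary line can be checked against the per-token
log probabilities.  The sketch below is plain Python, not part of
SRILM, and assumes the usual definitions of the two perplexity
figures: ppl normalizes by words plus sentence ends, ppl1 by words
only, with OOVs and zeroprobs left out of both denominators.

```python
# Log10 probabilities copied from the debug output above, one per token
# (ten letters plus the end-of-sentence token </s>).
logprobs = [-1.91016, -1.84324, -0.510867, -0.384206, -0.11973,
            -0.400766, -0.6732, -1.69948, -0.782387, -0.653408, -0.307613]

sentences, words, oovs, zeroprobs = 1, 10, 0, 0

total = sum(logprobs)

# ppl includes the </s> token in the denominator; ppl1 does not.
ppl = 10 ** (-total / (words - oovs - zeroprobs + sentences))
ppl1 = 10 ** (-total / (words - oovs - zeroprobs))

print(f"logprob= {total:.5f} ppl= {ppl:.5f} ppl1= {ppl1:.5f}")
```

The values should agree with the reported logprob, ppl and ppl1 up to
rounding in the last printed digit, since the per-token numbers above
are themselves rounded.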

2. Not all (n-1)-grams have backoff weights in the model file, why?
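
To make the lookup concrete, here is a minimal Python sketch of the
standard backoff recursion over ARPA-style tables.  It is not SRILM
code: the dictionaries, the function name backoff_logprob and every
number in it are made up for illustration.  It assumes the usual ARPA
convention that a history with no listed backoff weight is treated as
having weight log10(1) = 0.

```python
# Toy letter model in the spirit of an ARPA file: log10 probabilities and
# log10 backoff weights. All numbers here are hypothetical.
logprob = {
    ("a",): -0.5, ("b",): -0.4, ("c",): -0.7,
    ("a", "b"): -0.2, ("b", "c"): -0.3,
    ("a", "b", "c"): -0.1,
}
backoff = {
    ("a",): -0.3, ("b",): -0.25,   # ("c",) has no listed weight
    ("a", "b"): -0.15,             # ("b", "c") has no listed weight
}

def backoff_logprob(word, history):
    """Return (log10 p(word | history), order of the n-gram that was found)."""
    ngram = history + (word,)
    if ngram in logprob:
        return logprob[ngram], len(ngram)
    if not history:
        raise KeyError(f"unknown word: {word}")
    # A history without a listed backoff weight contributes log10(1) = 0.
    bow = backoff.get(history, 0.0)
    # Drop the oldest history token and retry with the shorter context.
    lp, order = backoff_logprob(word, history[1:])
    return bow + lp, order

print(backoff_logprob("c", ("a", "b")))   # trigram is listed, used directly
print(backoff_logprob("c", ("a", "a")))   # backs off to the unigram for "c"
print(backoff_logprob("a", ("b", "c")))   # every missing weight counts as 0
```

The second element of the result plays the role of the [Ngram] label
in the debug output: it reports the order at which the lookup
succeeded, not the order of the model.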

3. What exactly does SRILM do with Google ngrams?  Can you give an
example usage?  Does it do things like extract a small subset useful
for evaluating a test file?

4. Since the Google ngrams omit all ngrams with count below 40, the KN
discount constants, which are estimated from the number of ngrams
with low counts, cannot be computed.  Also, I found empirically that
the best discount constant for the highest order is close to 40, not
in the [0,1] range.  How does SRILM handle this?
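
The problem can be seen from the textbook closed-form estimates, which
use n1..n4, the number of distinct ngrams seen exactly 1..4 times.
The Python sketch below follows those published formulas (original
KN: D = n1 / (n1 + 2*n2); modified KN adds D1, D2, D3+).  It is an
illustration with a hypothetical function name and toy counts, and it
does not show what SRILM itself does in this situation.

```python
from collections import Counter

def kn_discounts(ngram_counts, modified=False):
    """Estimate Kneser-Ney discounts from count-of-count statistics.

    ngram_counts maps each n-gram to its training count.
    Returns [D] for original KN or [D1, D2, D3+] for modified KN.
    """
    # n[k] = number of distinct n-grams that occur exactly k times.
    n = Counter(ngram_counts.values())
    if n[1] == 0:
        raise ValueError("no n-grams with count 1; estimate is undefined")
    y = n[1] / (n[1] + 2 * n[2])
    if not modified:
        return [y]
    if n[2] == 0 or n[3] == 0:
        raise ValueError("n2 or n3 is zero; modified estimates are undefined")
    return [1 - 2 * y * n[2] / n[1],
            2 - 3 * y * n[3] / n[2],
            3 - 4 * y * n[4] / n[3]]

# Toy counts (hypothetical): n1=3, n2=2, n3=1, n4=1.
full = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 2, "f g": 3, "g h": 4}
print(kn_discounts(full))
print(kn_discounts(full, modified=True))

# Counts thresholded at 40, as in the Google ngrams: n1..n4 are all zero.
thresholded = {"a b": 40, "b c": 57, "c d": 112}
try:
    kn_discounts(thresholded)
except ValueError as err:
    print("cannot estimate:", err)
```

With every count at 40 or above, none of n1..n4 is available, so the
discounts would have to come from somewhere else, for example fixed
values chosen on held-out data.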

5. Do I need to understand what the following messages mean in order
to understand the calculations?

warning: 7.65818e-10 backoff probability mass left for "" -- incrementing denominator
warning: distributing 0.000254455 left-over probability mass over all 124 words
discarded 254764 7-gram probs discounted to zero
inserted 2766 redundant 3-gram probs

best,
deniz


Last modified Nov 21, 2008