Re: FW: A simple question about SRILM

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Tue, 18 May 2004 20:03:39 PDT

In message <002701c43c4a$4f810b00$34284484 at ADDRESS HIDDEN> you wrote:
> Hi Andreas,
>
> Thanks for your super-fast reply!
>
> I tried it like you suggested:
> ngram-count -order 3 -gt1max 0 -gt1min 1 -gt2max 0 -gt2min 1 -gt3max 0
> -gt3min 1 -text corpus.tags -lm corpus.tags.lm2 -debug 1
>
> Many of the backoff weights indeed became 99 (which is good), but many
> remained non-zero (although small: -6,-7,-8...)
>
> Is there a way to make them all 99?

This might not be necessary.

If the left-over probability mass in some context is 0 (as it should
be when using ML estimates) AND the sum of the lower-order probabilities
for the occurring N-grams is also 0 (since those are also ML estimates),
the backoff weight is 0/0, and due to numerical inaccuracies this may turn
out to be one of the values you observed. (The code catches actual
0/0 divisions and generates -99 in those cases.)
However, this is not a problem because the backoff log prob value for one of
the non-observed ngrams would be -infinity, and the particular value of
the backoff weight that gets applied doesn't matter for the outcome
(-infinity plus any value is still -infinity).
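
As a rough illustration of the argument above, here is a minimal Python
sketch. It is not SRILM's code: the function name, the example counts, and
the 1e-7 residue are assumptions chosen only to land in the range of values
reported. The weight is taken to be the usual ratio of left-over higher-order
mass to left-over lower-order mass, with -99 standing in for a zero weight or
a 0/0 division.

```python
import math
from fractions import Fraction

NEG_INF = float("-inf")


def log10_backoff_weight(numerator, denominator):
    """Return log10(numerator / denominator) for one context.

    numerator:   left-over probability mass in the higher-order context
    denominator: left-over probability mass in the lower-order context
    A zero (or negative) value on either side is reported as -99, which is
    an assumption mirroring the convention described in the message.
    """
    if numerator <= 0.0 or denominator <= 0.0:
        return -99.0
    return math.log10(numerator) - math.log10(denominator)


# Hypothetical counts of the words observed after some context.
counts = [3, 2, 2]
total = sum(counts)

# In exact arithmetic the ML estimates use up all of the probability mass.
exact_leftover = 1 - sum(Fraction(c, total) for c in counts)

# In floating point the same subtraction may leave a tiny residue.
float_leftover = 1.0 - sum(c / total for c in counts)
print("exact left-over mass:", exact_leftover)
print("float left-over mass:", float_leftover)

# A hypothetical rounding residue of 1e-7 gives a small but finite weight.
odd_weight = log10_backoff_weight(1e-7, 0.5)
print("log10 backoff weight from a 1e-7 residue:", odd_weight)

# The lower-order ML log probability of an unobserved N-gram is -infinity,
# so whichever weight is added, the backed-off result is -infinity.
for weight in (odd_weight, -99.0, 0.0):
    print(weight, "+ -inf =", weight + NEG_INF)
```

The point of the last loop is only that the specific backoff weight is
irrelevant once the lower-order term is -infinity.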

To verify that this is the case, just feed some of those unobserved
ngrams to ngram -debug 2 -ppl and make sure the log probabilities are -infinity.
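
To pick contexts worth testing, one option is to scan the LM file for backoff
weights other than -99. The following is a hedged Python sketch that assumes
the model is in the ARPA text format (a "\N-grams:" header per order, then
lines of log probability, words, and an optional backoff weight). The
function name and the sample fragment are hypothetical, not taken from the
actual corpus.tags.lm2.

```python
import re


def odd_backoff_contexts(arpa_lines, sentinel=-99.0):
    """List (context, log10 backoff weight) pairs whose weight is not -99."""
    order = 0  # 0 means we have not reached an "\N-grams:" section yet
    found = []
    for raw in arpa_lines:
        line = raw.strip()
        header = re.fullmatch(r"\\(\d+)-grams:", line)
        if header:
            order = int(header.group(1))
            continue
        # Skip blank lines, "\data\" / "\end\" markers and the count header.
        if not line or line.startswith("\\") or order == 0:
            continue
        tokens = line.split()
        # logprob + N words + backoff weight means N + 2 tokens.
        if len(tokens) == order + 2:
            weight = float(tokens[-1])
            if weight != sentinel:
                found.append((" ".join(tokens[1:-1]), weight))
    return found


# Hypothetical ARPA fragment, for illustration only.
sample = """\\data\\
ngram 1=2
ngram 2=1

\\1-grams:
-0.30103\tAT\t-99
-0.30103\tSCLN\t-6.69897

\\2-grams:
-0.1\tAT SCLN

\\end\\
""".splitlines()

for context, weight in odd_backoff_contexts(sample):
    print(context, weight)
```

Each context listed this way can be extended with a word never seen after it
and passed to ngram -debug 2 -ppl as described above.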

--Andreas

>
> The debug messages I got are listed below.
>
> Thanks a lot,
> Roy.
> ---------------------------------------------------------------------------------------------
> corpus.tags: line 1892: 1892 sentences, 48332 words, 0 OOVs
> 0 zeroprobs, logprob= 0 ppl= 1 ppl1= 1
> Good-Turing discounting 1-grams
> GT-count [0] = 0
> GT-count [1] = 0
> warning: no singleton counts
> GT discounting disabled
> Good-Turing discounting 2-grams
> GT-count [0] = 0
> GT-count [1] = 126
> GT discounting disabled
> Good-Turing discounting 3-grams
> GT-count [0] = 0
> GT-count [1] = 2142
> GT discounting disabled
> discarded 1 2-gram contexts containing pseudo-events
> discarded 2 3-gram contexts containing pseudo-events
> writing 41 1-grams
> writing 800 2-grams
> writing 5145 3-grams
>
> > -----Original Message-----
> > From: Andreas Stolcke [mailto:stolcke at ADDRESS HIDDEN]
> > Sent: Monday, May 17, 2004 7:38 PM
> > To: Roy Bar Haim
> > Cc: srilm-user at ADDRESS HIDDEN
> > Subject: Re: FW: A simple question about SRILM
> >
> >
> >
> > In message
> > <001701c43c3c$65fc62c0$34284484 at ADDRESS HIDDEN> you wrote:
> > > Hi,
> > >
> > > I have the same problem. I want the LM to give maximum-likelihood
> > > estimates. That is, all the backoff weights should be zero.
> > >
> > > I applied the solution below, but still I get backoff weights.
> > >
> > > For example, when I build the lm like this:
> > > ngram-count -order 3 -gt1max 0 -gt2max 0 -gt3max 0 -text corpus.tags
> > > -lm corpus.tags.lm
> > >
> > > I found that the once-occurring trigrams DO NOT APPEAR in the lm, so
> > > probability mass is still discounted.
> >
> > the default minimum occurrence count for trigrams is 2.  set it to 1
> > to include all trigrams:
> >
> > -gt3min 1 etc.
> >
> > that's why you still get backoff.
> >
> > >
> > > When I turned on the debug messages, I saw many messages like:
> > > warning: 0 backoff probability mass left for "AT SCLN" -- incrementing denominator
> > >
> > > Does it mean that smoothing is enforced here?
> > >
> > > Is there a way to get a pure maximum-likelihood language model,
> > > without backoff weights at all, using ngram-count?
> >
> > see above.
> >
> > --Andreas
> >
> >
>
