Re: FW: A simple question about SRILM

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Tue, 18 May 2004 20:03:39 PDT

In message <002701c43c4a$4f810b00$34284484 at ADDRESS HIDDEN> you wrote:
> Hi Andreas,
>
> Thanks for your super-fast reply!
>
> I tried it like you suggested:
> ngram-count -order 3 -gt1max 0 -gt1min 1 -gt2max 0 -gt2min 1 -gt3max 0
> -gt3min 1 -text corpus.tags -lm corpus.tags.lm2 -debug 1
>
> Many of the backoff weights indeed became -99 (which is good), but many
> remained non-zero (although small: -6, -7, -8...)
>
> Is there a way to make them all -99?

This might not be necessary.

If the left-over probability mass in some context is 0 (as it should
be when using ML estimates) AND the left-over mass after summing the
lower-order probabilities for the occurring N-grams is also 0 (since those
are also ML estimates),
the backoff weight is 0/0, and due to numerical inaccuracies this may turn
out to be one of the values you observed. (The code catches actual
0/0 divisions and generates -99 in those cases.)
However, this is not a problem because the backoff log prob value for one of
the non-observed ngrams would be -infinity, and the particular value of
the backoff weight that gets applied doesn't matter for the outcome
(-infinity plus any value is still -infinity).
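
As a rough illustration of the argument above (a sketch in Python, not SRILM's
actual code), the backoff weight for one context can be written as the log of
(1 - sum of higher-order probabilities) / (1 - sum of lower-order
probabilities). The function name, the -99 sentinel handling and the toy
probabilities below are assumptions made for the example only.

```python
import math

LOG_ZERO = -99.0  # sentinel for log10(0), matching the -99 mentioned above


def backoff_weight(higher_probs, lower_probs):
    """log10 backoff weight for one context (illustrative only).

    higher_probs: p(w | context) for each word w observed after the context
    lower_probs:  p(w | shortened context) for the same words
    """
    numerator = 1.0 - sum(higher_probs)    # left-over mass in this context
    denominator = 1.0 - sum(lower_probs)   # left-over lower-order mass
    if numerator <= 0.0:
        # No mass left (this includes an exact 0/0): report log10(0).
        return LOG_ZERO
    if denominator <= 0.0:
        raise ValueError("lower-order probabilities leave no mass")
    return math.log10(numerator) - math.log10(denominator)


# Exact case: both sums are exactly 1.0, so the sentinel is returned.
exact = backoff_weight([0.5, 0.5], [0.5, 0.5])

# Rounding case: ten ML estimates of 0.1 do not sum to exactly 1.0 in
# floating point, so a tiny positive numerator survives and the weight
# comes out as a large negative number instead of -99.
residue = backoff_weight([0.1] * 10, [0.05] * 10)

# An unobserved n-gram whose lower-order ML probability is 0 has
# log probability -infinity; adding any finite backoff weight changes nothing.
unseen_logprob = residue + float("-inf")
```

Whatever finite value the rounding leaves in `residue`, `unseen_logprob` is
still -infinity, which is why the leftover weights are harmless.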

To verify that that's the case just feed some of those unobserved
ngrams to ngram -debug 2 -ppl and make sure the log probabilities are -infinity.
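
To pick contexts worth checking this way, one option is to scan the LM file
for backoff weights that are not -99. The sketch below assumes the usual ARPA
text layout (a "\N-grams:" header per order, then one line per n-gram with the
log probability, the words, and an optional backoff weight). The helper name
and the toy model are made up for illustration and are not taken from Roy's
corpus.

```python
import re

LOG_ZERO = -99.0  # value written for a zero backoff weight


def residual_backoff_contexts(arpa_lines):
    """Return (context, backoff_weight) pairs whose weight is not -99."""
    order = 0
    hits = []
    for raw in arpa_lines:
        line = raw.strip()
        header = re.fullmatch(r"\\(\d+)-grams:", line)
        if header:
            order = int(header.group(1))  # entering a new n-gram section
            continue
        # Skip blank lines, \data\ and \end\ markers, and the count header.
        if not line or line.startswith("\\") or order == 0:
            continue
        fields = line.split()
        # logprob + `order` words + optional backoff weight
        if len(fields) == order + 2:
            bow = float(fields[-1])
            if bow != LOG_ZERO:
                hits.append((" ".join(fields[1:-1]), bow))
    return hits


# Toy model with hypothetical values, only to exercise the parser.
toy_model = """\
\\data\\
ngram 1=2
ngram 2=2

\\1-grams:
-0.30103\tAT\t-99
-0.30103\tSCLN\t-7.2

\\2-grams:
-0.30103\tAT SCLN
-0.30103\tSCLN AT

\\end\\
""".splitlines()

leftovers = residual_backoff_contexts(toy_model)
```

Each context in `leftovers` can then be extended with a word never seen after
it and passed to ngram -debug 2 -ppl as described above.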

--Andreas

>
> The debug messages I got are listed below.
>
> Thanks a lot,
> Roy.
> ------------------------------------------------------------------------
> corpus.tags: line 1892: 1892 sentences, 48332 words, 0 OOVs
> 0 zeroprobs, logprob= 0 ppl= 1 ppl1= 1
> Good-Turing discounting 1-grams
> GT-count [0] = 0
> GT-count [1] = 0
> warning: no singleton counts
> GT discounting disabled
> Good-Turing discounting 2-grams
> GT-count [0] = 0
> GT-count [1] = 126
> GT discounting disabled
> Good-Turing discounting 3-grams
> GT-count [0] = 0
> GT-count [1] = 2142
> GT discounting disabled
> discarded 1 2-gram contexts containing pseudo-events
> discarded 2 3-gram contexts containing pseudo-events
> writing 41 1-grams
> writing 800 2-grams
> writing 5145 3-grams
>
> > -----Original Message-----
> > From: Andreas Stolcke [mailto:stolcke at ADDRESS HIDDEN]
> > Sent: Monday, May 17, 2004 7:38 PM
> > To: Roy Bar Haim
> > Cc: srilm-user at ADDRESS HIDDEN
> > Subject: Re: FW: A simple question about SRILM
> >
> >
> >
> > In message
> > <001701c43c3c$65fc62c0$34284484 at ADDRESS HIDDEN> you wrote:
> > > Hi,
> > >
> > > I have the same problem. I want the LM to give maximum-likelihood
> > > estimates. That is, all the backoff weights should be zero.
> > >
> > > I applied the solution below, but still I get backoff weights.
> > >
> > > For example, when I build the lm like this:
> > > ngram-count -order 3 -gt1max 0 -gt2max 0 -gt3max 0 -text corpus.tags
> > > -lm corpus.tags.lm
> > >
> > > I found that the once-occurring trigrams DO NOT APPEAR in the lm, so
> > > probability mass is still discounted.
> >
> > the default minimum occurrence count for trigrams is 2. set it to 1 to
> > include all trigrams:
> >
> > -gt3min 1 etc.
> >
> > that's why you still get backoff.
> >
> > >
> > > When I turned on the debug messages, I saw many messages like:
> > > warning: 0 backoff probability mass left for "AT SCLN" -- incrementing denominator
> > >
> > > Does it mean that smoothing is enforced here?
> > >
> > > Is there a way to get a pure maximum-likelihood language model,
> > > without backoff weights at all, using ngram-count?
> >
> > see above.
> >
> > --Andreas
> >
> >
>
