witten_bell discounting between SRI and CMU are incompatible?

Andreas Stolcke stolcke at speech.sri.com
Thu Oct 21 18:18:16 PDT 2004


In message <417859E3.3050309 at u.washington.edu> you wrote:
> 
> I use the following command to evaluate these LMs using ngram:
> 
> ngram -unk -lm edge_word_linear3/edge_word_linear3.bin.arpa -ppl /u/yangmei/srilm/src/normtext/edge.test.norm.check
> file /u/yangmei/srilm/src/normtext/edge.test.norm.check: 2576 sentences, 35920 words, 1372 OOVs
> 0 zeroprobs, logprob= -125966 ppl= 2472.37 ppl1= 4427.05

Mei Yang,

one small problem is that the CMU LM uses the word <UNK> for the unknown
word, whereas SRILM uses <unk> (lowercase).  You can fix that by running

 ngram  -map-unk '<UNK>'  ...

This replaces all unknown words with <UNK>.  

However, that is not the reason for the high perplexities.
The end-of-sentence symbol </s> in your CMU-generated LM has a unigram
probability that is essentially 0 (log10 = -98.9923).
So every time you back off to a unigram for predicting </s>, the perplexity
goes through the roof.  I would consider this a bug in the LM construction.

The problem is made worse by the fact that there are no ngrams containing
<UNK> (other than the unigram), so after an OOV word you always back off to
the unigram, and if the next token happens to be the end-of-sentence, the
whole sentence gets a probability of essentially 0.
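
To see how much a single such backoff hurts, here is a minimal Python sketch.
The 10-token sentence and the log10 probability of -2 per token are made-up
illustrative values; only -98.9923 comes from the LM discussed above.

```python
def perplexity(log10_probs):
    """Perplexity of a token sequence given per-token log10 probabilities."""
    return 10 ** (-sum(log10_probs) / len(log10_probs))

# Hypothetical 10-token sentence (9 words plus </s>), every token at log10 p = -2.
healthy = [-2.0] * 10

# Same sentence, but </s> is predicted from the near-zero unigram
# (log10 p = -98.9923, the value found in the CMU-generated LM).
broken = [-2.0] * 9 + [-98.9923]

ppl_healthy = perplexity(healthy)
ppl_broken = perplexity(broken)

print(f"healthy sentence perplexity: {ppl_healthy:.1f}")
print(f"broken  sentence perplexity: {ppl_broken:.3g}")
```

One bad </s> prediction raises the perplexity of this toy sentence by many
orders of magnitude, so a few of them are enough to dominate a whole test set.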

I suspect that the CMU perplexity computation either excludes </s> from
the computation, or excludes the word after an OOV.  I think both are
inappropriate, but the handling of OOVs in perplexity computation is 
not well standardized.

I suggest you run 

	ngram -debug 2 -ppl ....

to output all the conditional word probabilities, and do something equivalent
with the CMU tools, and compare the probabilities.  They should
match exactly.  The difference will probably come from (a) which words are
excluded in the overall perplexity, and (b) what the denominator in the
computation is.  For SRILM, the denominator is the number of non-OOV words
plus the number of end-of-sentence tokens, both of which can be read off
the output.
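
As a sanity check, the figures in the quoted ngram output can be reproduced
from that denominator.  This is a minimal Python sketch, not SRILM code; the
function name is hypothetical, and subtracting zeroprobs from the word count
is an assumption about how such tokens are treated (it makes no difference
here, since the output reports 0 zeroprobs).

```python
def srilm_style_ppl(logprob, words, sentences, oovs, zeroprobs=0):
    """Recompute ppl and ppl1 from the counts printed by `ngram -ppl`.

    logprob is the total log10 probability reported in the output.
    """
    # Words that actually received a probability (assumption: zeroprobs excluded).
    scored_words = words - oovs - zeroprobs
    # ppl counts one end-of-sentence token per sentence in the denominator.
    ppl = 10 ** (-logprob / (scored_words + sentences))
    # ppl1 leaves the end-of-sentence tokens out of the denominator.
    ppl1 = 10 ** (-logprob / scored_words)
    return ppl, ppl1

# Counts taken from the quoted output above.
ppl, ppl1 = srilm_style_ppl(logprob=-125966, words=35920,
                            sentences=2576, oovs=1372)
print(f"ppl= {ppl:.2f} ppl1= {ppl1:.2f}")
```

The results agree with the reported ppl= 2472.37 and ppl1= 4427.05 up to the
rounding of logprob, so any remaining gap to the CMU figure has to come from
the excluded words or from a different denominator.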

--Andreas 



