interpreting -order and -debug results

Alexy Khrabrov deliverable at gmail.com
Sun Nov 30 19:41:17 PST 2008


Greetings,

I've trained a Kneser-Ney model on a Russian corpus with -order 5
-kndiscount and started it as a server with -order 5.  To check that
5-grams are actually being used, I feed it two 5-word sentences: (a) one
whose first word occurs in the corpus, and (b) one whose first word is
made up and does not exist in Russian.  I run both sentences through
ngram -ppl in two ways: (1) -order 5 -debug 2 and (2) -order 0 -debug 3.
The results, which puzzle me, are below, followed by a description of
the puzzlement.

~ echo c этим заявлением он выступил | ngram -use-server <badbox> -order 5 -debug 2 -ppl -
server <badbox>: probserver ready
c этим заявлением он выступил
         p( c | <s> )    =  3.67342e-06 [ -5.43493 ]
         p( этим | c ...)        =  0.00102315 [ -2.99006 ]
         p( заявлением | этим ...)       =  0.00151464 [ -2.81969 ]
         p( он | заявлением ...)         =  0.0218172 [ -1.6612 ]
         p( выступил | он ...)   =  0.000925487 [ -3.03363 ]
         p( </s> | выступил ...)         =  0.00693155 [ -2.15917 ]
1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -18.0987 ppl= 1038.6 ppl1= 4166.16

file -: 1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -18.0987 ppl= 1038.6 ppl1= 4166.16
~ echo жуемотничая этим заявлением он выступил | ngram -use-server <badbox> -order 5 -debug 2 -ppl -
server <badbox>: probserver ready
жуемотничая этим заявлением он выступил
         p( жуемотничая | <s> )  =  0 [ -inf ]
         p( этим | жуемотничая ...)      =  0.00014788 [ -3.83009 ]
         p( заявлением | этим ...)       =  0.00151464 [ -2.81969 ]
         p( он | заявлением ...)         =  0.0218172 [ -1.6612 ]
         p( выступил | он ...)   =  0.000925487 [ -3.03363 ]
         p( </s> | выступил ...)         =  0.00693155 [ -2.15917 ]
1 sentences, 5 words, 0 OOVs
1 zeroprobs, logprob= -13.5038 ppl= 502.061 ppl1= 2376.54

file -: 1 sentences, 5 words, 0 OOVs
1 zeroprobs, logprob= -13.5038 ppl= 502.061 ppl1= 2376.54

== Notice that from the 3rd p(word | context ...) line onward, the
conditional probabilities are identical in the two runs, even though
this is a 5-gram model and the first word of the second sentence does
not exist!  The second run also reports 0 OOVs (?), although it does
count 1 zeroprob.
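
The summary lines above can be sanity-checked by hand.  The Python
sketch below assumes the convention described in the SRILM FAQ, namely
ppl = 10^(-logprob / (words - OOVs - zeroprobs + sentences)), with ppl1
using the same formula minus the sentence count.  The function name is
made up for illustration.  Under that assumption the zeroprob word is
simply left out of both the log probability and the token count, which
would account for the made-up sentence getting the lower perplexity in
the -order 5 run.

```python
def srilm_style_ppl(logprob, sentences, words, oovs=0, zeroprobs=0):
    """Return (ppl, ppl1) for one summary line, using the assumed formula."""
    scored = words - oovs - zeroprobs               # tokens that contributed to logprob
    ppl = 10 ** (-logprob / (scored + sentences))   # also counts one </s> per sentence
    ppl1 = 10 ** (-logprob / scored)                # leaves the </s> tokens out
    return ppl, ppl1


# Numbers copied from the two -order 5 runs above.
real = srilm_style_ppl(-18.0987, sentences=1, words=5)
fake = srilm_style_ppl(-13.5038, sentences=1, words=5, zeroprobs=1)

print("real: ppl=%.2f ppl1=%.2f" % real)
print("fake: ppl=%.2f ppl1=%.2f" % fake)
```

Because the reported logprob is rounded, the recomputed values should
agree with the printed ppl and ppl1 only to about four significant
digits.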

== Now, let's explore what "unlimited ngrams" means with -order 0, and
set -debug 3 too:

~ echo с этим заявлением он выступил | ngram -use-server <badbox> -order 0 -debug 3 -ppl -
server <badbox>: probserver ready
с этим заявлением он выступил

warning: word probs for this context sum to 0.00119158 != 1 : <s>
         p( с | <s> )    =  0.000113967 [ -3.94322 ] / 0.00119158

warning: word probs for this context sum to 0.0248594 != 1 : с <s>
         p( этим | с ...)        =  0.00614229 [ -2.21167 ] / 0.0248594

warning: word probs for this context sum to 0.0135057 != 1 : этим с <s>
         p( заявлением | этим ...)       =  0.0026996 [ -2.5687 ] / 0.0135057

warning: word probs for this context sum to 0.136629 != 1 : заявлением этим с <s>
         p( он | заявлением ...)         =  0.0191721 [ -1.71733 ] / 0.136629

warning: word probs for this context sum to 0.00931138 != 1 : он заявлением этим с <s>
         p( выступил | он ...)   =  0.000925487 [ -3.03363 ] / 0.00931138

warning: word probs for this context sum to 0.243228 != 1 : выступил он заявлением этим с <s>
         p( </s> | выступил ...)         =  0.00693155 [ -2.15917 ] / 0.243228
1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -15.6337 ppl= 403.293 ppl1= 1338.89

file -: 1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -15.6337 ppl= 403.293 ppl1= 1338.89

-----

~ echo жуемотничая этим заявлением он выступил | ngram -use-server <badbox> -order 0 -debug 3 -ppl -
server <badbox>: probserver ready
жуемотничая этим заявлением он выступил

warning: word probs for this context sum to 0.00107762 != 1 : <s>
         p( жуемотничая | <s> )  =  0 [ -inf ] / 0.00107762

warning: word probs for this context sum to 0.0136768 != 1 : жуемотничая <s>
         p( этим | жуемотничая ...)      =  0.00014788 [ -3.83009 ] / 0.0136768

warning: word probs for this context sum to 0.0105593 != 1 : этим жуемотничая <s>
         p( заявлением | этим ...)       =  0.00151464 [ -2.81969 ] / 0.0105593

warning: word probs for this context sum to 0.0891667 != 1 : заявлением этим жуемотничая <s>
         p( он | заявлением ...)         =  0.0218172 [ -1.6612 ] / 0.0891667

warning: word probs for this context sum to 0.00501918 != 1 : он заявлением этим жуемотничая <s>
         p( выступил | он ...)   =  0.000925487 [ -3.03363 ] / 0.00501918

warning: word probs for this context sum to 0.00712921 != 1 : выступил он заявлением этим жуемотничая <s>
         p( </s> | выступил ...)         =  0.00693155 [ -2.15917 ] / 0.00712921
1 sentences, 5 words, 0 OOVs
1 zeroprobs, logprob= -13.5038 ppl= 502.061 ppl1= 2376.54

file -: 1 sentences, 5 words, 0 OOVs
1 zeroprobs, logprob= -13.5038 ppl= 502.061 ppl1= 2376.54

== Now there are more differences.  The "real" first example differs
from the "fake" second one in the first 4 lines; the p( | ) values are
the same only in the last two lines, 5 and 6.  However, the 4th line of
the "real" case has a *lower* probability than the 4th line of the
"fake" case: p( он | заявлением ...) = 0.0191721 in the real case vs.
0.0218172 in the fake case!
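
The line-by-line comparison above can also be done mechanically.  The
sketch below is only an illustration: the regular expression and the
helper names are made up, and it assumes each p( ... ) entry sits on a
single line.  It parses the two -order 0 runs, marks which positions
agree, and adds up the finite log probabilities so they can be compared
with the reported logprob totals.

```python
import math
import re

# Matches entries such as "p( он | заявлением ...) = 0.0218172 [ -1.6612 ]".
ENTRY = re.compile(r"p\(\s*(\S+)\s*\|.*?\)\s*=\s*(\S+)\s*\[\s*(\S+)\s*\]")


def parse(debug_output):
    """Return (word, log10 probability) pairs found in ngram -debug output."""
    return [(m.group(1), float(m.group(3))) for m in ENTRY.finditer(debug_output)]


def total(pairs):
    """Sum the log probabilities, skipping -inf (zeroprob) entries."""
    return sum(lp for _, lp in pairs if math.isfinite(lp))


REAL = """
p( с | <s> ) = 0.000113967 [ -3.94322 ] / 0.00119158
p( этим | с ...) = 0.00614229 [ -2.21167 ] / 0.0248594
p( заявлением | этим ...) = 0.0026996 [ -2.5687 ] / 0.0135057
p( он | заявлением ...) = 0.0191721 [ -1.71733 ] / 0.136629
p( выступил | он ...) = 0.000925487 [ -3.03363 ] / 0.00931138
p( </s> | выступил ...) = 0.00693155 [ -2.15917 ] / 0.243228
"""

FAKE = """
p( жуемотничая | <s> ) = 0 [ -inf ] / 0.00107762
p( этим | жуемотничая ...) = 0.00014788 [ -3.83009 ] / 0.0136768
p( заявлением | этим ...) = 0.00151464 [ -2.81969 ] / 0.0105593
p( он | заявлением ...) = 0.0218172 [ -1.6612 ] / 0.0891667
p( выступил | он ...) = 0.000925487 [ -3.03363 ] / 0.00501918
p( </s> | выступил ...) = 0.00693155 [ -2.15917 ] / 0.00712921
"""

real, fake = parse(REAL), parse(FAKE)

# 1-based positions where both runs report the same log probability.
same_lines = [i for i, (r, f) in enumerate(zip(real, fake), start=1) if r[1] == f[1]]

print("identical lines:", same_lines)
print("sum of logprobs, real:", round(total(real), 4))
print("sum of logprobs, fake:", round(total(fake), 4))
```

Skipping the -inf entry in the sum mirrors the "1 zeroprobs" count in
the summary line; whether ngram does exactly this internally is an
assumption of the sketch.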

Again, we see 0 OOVs reported in both cases, despite "жуемотничая"
being a fake word with probability 0 [ -inf ].

The final perplexity is higher for the fake case in the -order 0 run
(502.061 vs. 403.293) but lower in the -order 5 run (502.061 vs.
1038.6), so I can't be certain from these results that the -order 5
option is being honored.  I'm also not sure what -order 0 does here, or
why a conditional probability can be higher after a fake word.  Finally,
what exactly does the -debug 3 "word probs for this context" warning
mean, why would it be triggered for a rather large real corpus, and how
should I interpret it?
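
For concreteness, here is what a "word probs for this context sum to X
!= 1" check amounts to, as far as I understand the ngram man page: at
-debug 3 the probabilities of all words are summed in each context and a
warning is issued if the sum differs noticeably from 1.  The Python
sketch below uses an invented toy distribution and invented helper
names.  It is not SRILM's code.  It only shows that such a sum depends
on which words the summation runs over; whether that has anything to do
with the -use-server runs above is not established here.

```python
import math

# Toy conditional distribution p(w | context); the numbers are invented.
TOY_PROBS = {"a": 0.5, "b": 0.25, "c": 0.125, "</s>": 0.125}


def context_sum(probs, vocab):
    """Sum p(w | context) over the words in vocab; unknown words count as 0."""
    return sum(probs.get(w, 0.0) for w in vocab)


def check(probs, vocab, tol=1e-6):
    """Print a warning in the style of the output above if the sum is not 1."""
    total = context_sum(probs, vocab)
    if not math.isclose(total, 1.0, abs_tol=tol):
        print("warning: word probs for this context sum to %g != 1" % total)
    return total


full = check(TOY_PROBS, list(TOY_PROBS))    # every word is summed: no warning
partial = check(TOY_PROBS, ["c", "</s>"])   # only two words are summed: warns
```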

For reference, here's the model building command I used:

time make-batch-counts list/list-stok 100000 cat counts/5g -order 5 > /dev/null 2>&1
time merge-batch-counts counts/5g
time make-big-lm -name lm-ko-kn5 -lm lm-ko-kn5 -max-per-file 100000000 -kndiscount -order 5 -read counts/5g/*.ngrams.gz

-- and here's how I launch the resulting LM server:

ngram -server-port <badport> -lm /data/rupress/lm-ko-kn5 -order 5

Cheers,
Alexy


