<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">Hi Siva,<br>
<br>
Thanks a lot! With these arguments the perplexity is very close to
the reported 141.2 (though still not exactly the same):<br>
<br>
&lt;jpeleman@spchcl23:~/exp/025&gt; ngram-count -order 5 -text
data/penn/ptb.train.txt -lm models/ptb.train_5-gram_kn.arpa7
-kndiscount -interpolate -unk -gt3min 1 -gt4min 1<br>
&lt;jpeleman@spchcl23:~/exp/025&gt; ngram -ppl
data/penn/ptb.test.txt -lm models/ptb.train_5-gram_kn.arpa7 -order
5 -unk<br>
file data/penn/ptb.test.txt: 3761 sentences, 78669 words, 0 OOVs<br>
0 zeroprobs, logprob= -177278 ppl= <b>141.464</b> ppl1= 179.251<br>
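As a side note, if I understand SRILM's output right, the printed ppl and ppl1 can be reproduced from the logprob and the token counts alone: perplexity is 10^(-logprob/N), where N counts words plus one sentence-end token per sentence, minus OOVs and zeroprobs, and ppl1 drops the sentence ends. A quick Python check against the numbers above (the printed logprob is rounded, so it only matches to a couple of decimals):

```python
# Reproduce SRILM's ppl/ppl1 from the "ngram -ppl" output above.
# Numbers are copied verbatim from the run; the formula is the one
# described in the SRILM documentation (assumption: logprob is log10).
logprob = -177278.0
sentences, words, oovs, zeroprobs = 3761, 78669, 0, 0

# ppl: each sentence contributes one </s> token to the denominator.
denom = words - oovs - zeroprobs + sentences
ppl = 10 ** (-logprob / denom)        # ~141.46, matching the printed 141.464

# ppl1: same logprob, but without the sentence-end tokens.
ppl1 = 10 ** (-logprob / (words - oovs - zeroprobs))   # ~179.25
```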
<br>
I wonder about the value of experiments that include &lt;unk&gt;
in the perplexity calculation. Doesn't it make the problem a lot
easier (predicting one huge class is not hard; imagine mapping all
words to &lt;unk&gt;) and thereby yield misleading results?<br>
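<br>
The effect is easy to see with a toy example (hypothetical numbers, plain Python, nothing to do with the actual PTB vocabulary): collapsing many rare types into a single &lt;unk&gt; class concentrates probability mass on that class, so every test token that falls into it looks trivially easy to predict:<br>

```python
# Toy illustration (hypothetical numbers): a uniform unigram model over
# V word types assigns each word p = 1/V, so the perplexity of any test
# token is V.  If half the types are merged into one <unk> class, that
# class receives their combined mass, and tokens in it get p = 0.5.
V = 10_000            # vocabulary size (made up for illustration)
merged = 5_000        # rare types collapsed into <unk>

p_word = 1 / V        # per-token probability, full vocabulary
p_unk = merged / V    # probability of the merged <unk> class

# For a unigram model, the perplexity of a single token is 1/p.
ppl_full = 1 / p_word   # 10000: every word is equally hard
ppl_unk = 1 / p_unk     # 2: <unk> tokens look almost free
```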
<br>
Joris<br>
<br>
<br>
On 07/09/14 16:24, Siva Reddy Gangireddy wrote:<br>
</div>
<blockquote
cite="mid:CAL6CX2n5VOWzoCYKf8gxXFXsiN-ia3eSWrH6-HpM9boT+gQycw@mail.gmail.com"
type="cite">
<div dir="ltr">Hi Joris,
<div><br>
</div>
      <div>Use the count cut-offs like this:</div>
<div><br>
</div>
<div>
<div>ngram-count -order 5 -text <span
style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">ptb.train.txt</span> -lm
templm -kndiscount -interpolate -unk -gt3min 1 -gt4min 1</div>
</div>
<div>ngram -ppl <span
style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">ptb.test.txt</span> -lm
templm -order 5 -unk<br>
</div>
<div><br>
</div>
<div>By default SRILM uses different count cut-offs.<br>
</div>
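<div>To make concrete what a count cut-off does (a minimal sketch in plain Python on a toy corpus, not SRILM internals): n-grams whose training count falls below the threshold are dropped from the model, whereas -gt3min 1 / -gt4min 1 keeps every observed n-gram:</div>

```python
# Sketch of a count cut-off on a toy corpus (not SRILM's actual code):
# trigrams below the threshold are discarded before smoothing.
from collections import Counter

corpus = "a b a b c a b a c".split()
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

cutoff = 2  # analogous to a -gt3min 2 setting
kept = {g: c for g, c in trigrams.items() if c >= cutoff}

# With the cut-off set to 1, every observed trigram is retained.
kept_all = {g: c for g, c in trigrams.items() if c >= 1}
```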
<div><br>
</div>
<div>---</div>
<div>Siva</div>
<div><br>
</div>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Wed, Jul 9, 2014 at 11:03 PM, Joris
Pelemans <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:Joris.Pelemans@esat.kuleuven.be"
target="_blank">Joris.Pelemans@esat.kuleuven.be</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> Hi all,<br>
<br>
            I'm trying to reproduce some reported N-gram perplexity
            results on the Penn Treebank with SRILM, but somehow my
            results are always off by a large margin. Since I will
            be interpolating with these models and comparing the
            interpolated model with others, I would really prefer to
            start on the same level :-).<br>
            <br>
            The data set I'm using is the one that comes with
            Mikolov's RNNLM toolkit and uses the same preprocessing
            as many LM papers, including "Empirical Evaluation and
            Combination of Advanced Language Modeling Techniques".
            In that paper, Mikolov et al. report a KN5 perplexity of
            141.2. It's not entirely clear (1) whether they ignore
            OOV words or simply use the &lt;unk&gt; probability; and
            (2) whether it's a back-off or an interpolated model,
            but I assume the latter, as interpolated KN has been
            reported as best many times. They do report using SRILM
            and no count cut-offs.<br>
<br>
I have tried building the same model in many ways:<br>
<br>
<b>regular:</b> ngram-count -order 5 -text
data/penn/ptb.train.txt -lm
models/ptb.train_5-gram_kn.arpa2 -kndiscount -interpolate<br>
<b>open vocab:</b> ngram-count -order 5 -text
data/penn/ptb.train.txt -lm
models/ptb.train_5-gram_kn.arpa3 -kndiscount -interpolate
-unk<br>
<b>no sentence markers:</b> ngram-count -order 5 -text
data/penn/ptb.train.txt -lm
models/ptb.train_5-gram_kn.arpa4 -kndiscount -interpolate
-no-sos -no-eos<br>
<b>open vocab + no sentence markers:</b> ngram-count
-order 5 -text data/penn/ptb.train.txt -lm
models/ptb.train_5-gram_kn.arpa5 -kndiscount -interpolate
-unk -no-sos -no-eos<br>
            <b>back-off (just in case):</b> ngram-count -order
            5 -text data/penn/ptb.train.txt -lm
            models/ptb.train_5-gram_kn.arpa5 -kndiscount -unk<br>
<br>
            None of them, however, gives me a perplexity of 141.2:<br>
<br>
            &lt;jpeleman@spchcl23:~/exp/025&gt; ngram -ppl
            data/penn/ptb.test.txt -lm
            models/ptb.train_5-gram_kn.arpa2 -order 5<br>
            file data/penn/ptb.test.txt: 3761 sentences, 78669 words,
            4794 OOVs<br>
            0 zeroprobs, logprob= -172723 ppl= 167.794 ppl1= 217.791<br>
            <br>
            &lt;jpeleman@spchcl23:~/exp/025&gt; ngram -ppl
            data/penn/ptb.test.txt -lm
            models/ptb.train_5-gram_kn.arpa3 -order 5 -unk<br>
            file data/penn/ptb.test.txt: 3761 sentences, 78669 words,
            0 OOVs<br>
            0 zeroprobs, logprob= -178859 ppl= 147.852 ppl1= 187.743<br>
            <br>
            &lt;jpeleman@spchcl23:~/exp/025&gt; ngram -ppl
            data/penn/ptb.test.txt -lm
            models/ptb.train_5-gram_kn.arpa4 -order 5<br>
            file data/penn/ptb.test.txt: 3761 sentences, 78669 words,
            4794 OOVs<br>
            0 zeroprobs, logprob= -179705 ppl= 206.4 ppl1= 270.74<br>
            <br>
            &lt;jpeleman@spchcl23:~/exp/025&gt; ngram -ppl
            data/penn/ptb.test.txt -lm
            models/ptb.train_5-gram_kn.arpa5 -order 5 -unk<br>
            file data/penn/ptb.test.txt: 3761 sentences, 78669 words,
            0 OOVs<br>
            0 zeroprobs, logprob= -186444 ppl= 182.746 ppl1= 234.414<br>
            <br>
            &lt;jpeleman@spchcl23:~/exp/025&gt; ngram -ppl
            data/penn/ptb.test.txt -lm
            models/ptb.train_5-gram_kn.arpa5 -order 5 -unk<br>
            file data/penn/ptb.test.txt: 3761 sentences, 78669 words,
            0 OOVs<br>
            0 zeroprobs, logprob= -181381 ppl= 158.645 ppl1= 202.127<br>
<br>
So... what am I missing here? 147.852 is close, but still
not quite 141.2.<span class="HOEnZb"><font color="#888888"><br>
<br>
Joris<br>
</font></span></div>
<br>
_______________________________________________<br>
SRILM-User site list<br>
<a moz-do-not-send="true"
href="mailto:SRILM-User@speech.sri.com">SRILM-User@speech.sri.com</a><br>
<a moz-do-not-send="true"
href="http://www.speech.sri.com/mailman/listinfo/srilm-user"
target="_blank">http://www.speech.sri.com/mailman/listinfo/srilm-user</a><br>
</blockquote>
</div>
<br>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
</pre>
</blockquote>
<br>
</body>
</html>