<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 3/18/2014 12:44 PM, Stefy D. wrote:<br>
</div>
<blockquote
cite="mid:1395171896.77554.YahooMailNeo@web163004.mail.bf1.yahoo.com"
type="cite">
<div style="color:#000; background-color:#fff;
font-family:HelveticaNeue, Helvetica Neue, Helvetica, Arial,
Lucida Grande, sans-serif;font-size:12pt">
<div>Dear all,</div>
<div><br>
</div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida
Grande', sans-serif; background-color: transparent;
font-style: normal;">I have some questions regarding
perplexity...I am very thankful for your time/ answers.</div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida
Grande', sans-serif; background-color: transparent;
font-style: normal;"><br>
</div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida
Grande', sans-serif; background-color: transparent;
font-style: normal;">Settings:</div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida
Grande', sans-serif; background-color: transparent;
font-style: normal;">- one language model LM_A estimated using
training corpus A </div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida
Grande', sans-serif; background-color: transparent;
font-style: normal;">- one language model LM_B estimated using
training corpus B (B = corpus_A + corpus_X)</div>
<div style="background-color: transparent;"><br
class="Apple-interchange-newline">
My intention is to prove that model B is better than model A,
so I thought I should show that the perplexity decreased (which
can be seen from the ppl files).</div>
<div><br>
</div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida
Grande', sans-serif; background-color: transparent;
font-style: normal;">Commands used to estimate ppl:</div>
<div style="background-color: transparent;">$NGRAM_FILE -order 3
-lm $WORKING_DIR"lm_A/lmodel.lm" -ppl
$WORKING_DIR"test.lowercased."$TARGET >
$WORKING_DIR"ppl_A.ppl"<br>
</div>
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;"><br>
</div>
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;">$NGRAM_FILE -order 3 -lm
$WORKING_DIR"lm_B/lmodel.lm" -ppl
$WORKING_DIR"test.lowercased."$TARGET >
$WORKING_DIR"ppl_B.ppl"<br>
</div>
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;"><br>
</div>
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;">This contents of the two ppl files is (A then B):</div>
<div style="background-color: transparent;">1000 sentences,
21450 words, 0 OOVs</div>
<div style="background-color: transparent;">0 zeroprobs,
logprob= -57849.4 ppl= 377.407 ppl1= 497.67</div>
<div style="background-color: transparent;">-------------------------------------------------------------------------------------------</div>
<div style="background-color: transparent;">1000 sentences,
21450 words, 0 OOVs</div>
<div style="background-color: transparent;">0 zeroprobs,
logprob= -55535.3 ppl= 297.67 ppl1= 388.204</div>
<div style="background-color: transparent;"><br>
</div>
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;">Questions:</div>
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;">1. Why do I get 0 OOVs? I checked using the
compute-oov-rate script how many OOV there are in the test
data compared to the training and it gave me the result "OOV
tokens: 393 / 21450 (1.83%) excluding fragments: 390 / 21442
(1.82%)".</div>
</div>
</blockquote>
You didn't say how you trained the LMs. Did you include an
unknown-word probability? The exact options used for LM training
matter here.<br>
<blockquote
cite="mid:1395171896.77554.YahooMailNeo@web163004.mail.bf1.yahoo.com"
type="cite">
<div style="color:#000; background-color:#fff;
font-family:HelveticaNeue, Helvetica Neue, Helvetica, Arial,
Lucida Grande, sans-serif;font-size:12pt">
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;"><br>
</div>
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;">2. I read on the srilm-faq that "<span
style="font-family: 'Times New Roman'; font-size: 16px;">Note
that perplexity comparisons are only ever meaningful if the
vocabularies of all LMs are the same." </span><span
style="font-size: 12pt;">Since I want to compare
perplexities of two LM I am wondering if I did the right
thing with my settings and commands used. The two LM were
estimated on different training corpora so the vocabularies
are not identical, right? Please tell me what am I doing
wrong.</span></div>
</div>
</blockquote>
Again, we don't know how you trained the LMs, hence we don't know
the vocabularies.<br>
The best way to make the perplexities comparable would be to extract
the vocabulary from corpus A + corpus X, and then specify that for
training LM_A (using -vocab).<br>
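Concretely, something along these lines (the file names are placeholders for your actual corpora; the relevant ngram-count options are -write-vocab and -vocab):

```shell
# Build the combined vocabulary from corpus A + corpus X
# (file names here are placeholders).
cat corpus_A.txt corpus_X.txt > corpus_AX.txt
ngram-count -order 1 -text corpus_AX.txt -write-vocab all.vocab

# Train both LMs over that same vocabulary, so that OOVs are
# determined identically when computing perplexity.
ngram-count -order 3 -vocab all.vocab -text corpus_A.txt  -lm lm_A/lmodel.lm
ngram-count -order 3 -vocab all.vocab -text corpus_AX.txt -lm lm_B/lmodel.lm
```

(If you instead want open-vocabulary models that assign probability to unknown words, add -unk to both training commands.)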
<br>
<blockquote
cite="mid:1395171896.77554.YahooMailNeo@web163004.mail.bf1.yahoo.com"
type="cite">
<div style="color:#000; background-color:#fff;
font-family:HelveticaNeue, Helvetica Neue, Helvetica, Arial,
Lucida Grande, sans-serif;font-size:12pt">
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 12pt; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;"><span style="font-size: 12pt;"><br>
</span></div>
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;"><span style="font-size: 12pt;">3. If those two
perplexities were computed correctly, then could you please
tell me if their difference means that the LM model has been
really improved and if there is a measure that says if this
improvement is significantly? <br>
</span></div>
</div>
</blockquote>
The perplexities look quite different. Differences of 10-20% are
usually considered non-negligible.<br>
For statistical significance there are a number of tests you can
apply, although none are built into SRILM.<br>
<br>
The most straightforward tests would be nonparametric ones that
compare the probabilities output by the two LMs for corresponding
words or sentences.<br>
Generate a table of word-level probabilities for LM_A and then LM_B,
on the same test set. Then ask, how many words had
lower/same/greater probability in LM_B?<br>
From those statistics you can apply either the <a
href="http://en.wikipedia.org/wiki/Sign_test">Sign test</a> or the
stronger <a
href="http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">Wilcoxon
test</a> (for the latter you need the differences of the
probabilities, not just their sign).<br>
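To get those counts, run ngram with -debug 2 on the same test set for each LM and tally the per-word probabilities. A minimal sketch in Python, assuming the per-word lines look like "p( word | history ) = [2gram] 0.0975 [ -1.011 ]" (verify the format against your SRILM version):

```python
import re

# Matches per-word lines of `ngram -ppl test.txt -debug 2` output
# (format assumed; check against your SRILM version).
PROB_LINE = re.compile(r'p\( (\S+) \| .*\)\s*=\s*\[\S+\]\s*([\d.eE+-]+)')

def word_probs(debug_output):
    """Extract (word, probability) pairs from ngram -debug 2 output."""
    pairs = []
    for line in debug_output.splitlines():
        m = PROB_LINE.search(line)
        if m:
            pairs.append((m.group(1), float(m.group(2))))
    return pairs

def compare(out_a, out_b):
    """Count words whose probability under LM_B is lower/same/higher
    than under LM_A."""
    lower = same = higher = 0
    for (w_a, p_a), (w_b, p_b) in zip(word_probs(out_a), word_probs(out_b)):
        assert w_a == w_b, "both runs must score the same test set"
        if p_a > p_b:
            lower += 1
        elif p_b > p_a:
            higher += 1
        else:
            same += 1
    return lower, same, higher
```

The lower/same/higher counts are exactly the statistics the Sign test needs.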
<br>
The Sign test is extremely simple and can be computed with a small
helper script included in SRILM. For example, if LM_B gives higher
probability for 1080 out of 2000 words (and there are no ties), then
the significance levels are computed by<br>
<br>
% $SRILM/bin/cumbin 2000 1080<br>
One-tailed: P(k >= 1080 | n=2000, p=0.5) = 0.00018750253721029<br>
Two-tailed: 2*P(k >= 1080 | n=2000, p=0.5) = 0.00037500507442058<br>
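If cumbin is not at hand, the same binomial tail probability can be computed exactly in a few lines (a sketch in Python; the counts 2000 and 1080 are those from the example above):

```python
from fractions import Fraction
from math import comb

def sign_test(n, k):
    """One-tailed P(K at least k) for K ~ Binomial(n, p=0.5), exact."""
    tail = sum(comb(n, i) for i in range(k, n + 1))
    return Fraction(tail, 2 ** n)

p = float(sign_test(2000, 1080))
print("One-tailed:", p)        # should match the cumbin output above
print("Two-tailed:", 2 * p)
```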
<br>
Doing this at the word level assumes that all the words in a
sentence are assigned probabilities independently, which is plainly
not true (the same word occurs in several ngrams). So a more
conservative approach would compare the sentence-level probabilities.<br>
<br>
Andreas<br>
<br>
</body>
</html>