<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 3/18/2014 12:44 PM, Stefy D. wrote:<br>
</div>
<blockquote
cite="mid:1395171896.77554.YahooMailNeo@web163004.mail.bf1.yahoo.com"
type="cite">
<div style="color:#000; background-color:#fff;
font-family:HelveticaNeue, Helvetica Neue, Helvetica, Arial,
Lucida Grande, sans-serif;font-size:12pt">
<div>Dear all,</div>
<div><br>
</div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida
Grande', sans-serif; background-color: transparent;
font-style: normal;">I have some questions regarding
perplexity...I am very thankful for your time/ answers.</div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida
Grande', sans-serif; background-color: transparent;
font-style: normal;"><br>
</div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida
Grande', sans-serif; background-color: transparent;
font-style: normal;">Settings:</div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida
Grande', sans-serif; background-color: transparent;
font-style: normal;">- one language model LM_A estimated using
training corpus A </div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida
Grande', sans-serif; background-color: transparent;
font-style: normal;">- one language model LM_B estimated using
training corpus B (B = corpus_A + corpus_X)</div>
<div style="background-color: transparent;"><br
class="Apple-interchange-newline">
My intention is to prove that model B is better than model A,
so I thought I should show that the perplexity decreased (which
can be seen from the ppl files).</div>
<div><br>
</div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida
Grande', sans-serif; background-color: transparent;
font-style: normal;">Commands used to estimate ppl:</div>
<div style="background-color: transparent;">$NGRAM_FILE -order 3
-lm $WORKING_DIR"lm_A/lmodel.lm" -ppl
$WORKING_DIR"test.lowercased."$TARGET >
$WORKING_DIR"ppl_A.ppl"<br>
</div>
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;"><br>
</div>
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;">$NGRAM_FILE -order 3 -lm
$WORKING_DIR"lm_B/lmodel.lm" -ppl
$WORKING_DIR"test.lowercased."$TARGET >
$WORKING_DIR"ppl_B.ppl"<br>
</div>
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;"><br>
</div>
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;">This contents of the two ppl files is (A then B):</div>
<div style="background-color: transparent;">1000 sentences,
21450 words, 0 OOVs</div>
<div style="background-color: transparent;">0 zeroprobs,
logprob= -57849.4 ppl= 377.407 ppl1= 497.67</div>
<div style="background-color: transparent;">-------------------------------------------------------------------------------------------</div>
<div style="background-color: transparent;">1000 sentences,
21450 words, 0 OOVs</div>
<div style="background-color: transparent;">0 zeroprobs,
logprob= -55535.3 ppl= 297.67 ppl1= 388.204</div>
<div style="background-color: transparent;"><br>
</div>
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;">Questions:</div>
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;">1. Why do I get 0 OOVs? I checked using the
compute-oov-rate script how many OOV there are in the test
data compared to the training and it gave me the result "OOV
tokens: 393 / 21450 (1.83%) excluding fragments: 390 / 21442
(1.82%)".</div>
</div>
</blockquote>
You didn't say how you trained the LMs. Did you include an
unknown-word probability? The exact options used for LM training
matter here.<br>
<blockquote
cite="mid:1395171896.77554.YahooMailNeo@web163004.mail.bf1.yahoo.com"
type="cite">
<div style="color:#000; background-color:#fff;
font-family:HelveticaNeue, Helvetica Neue, Helvetica, Arial,
Lucida Grande, sans-serif;font-size:12pt">
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;"><br>
</div>
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;">2. I read on the srilm-faq that "<span
style="font-family: 'Times New Roman'; font-size: 16px;">Note
that perplexity comparisons are only ever meaningful if the
vocabularies of all LMs are the same." </span><span
style="font-size: 12pt;">Since I want to compare
perplexities of two LM I am wondering if I did the right
thing with my settings and commands used. The two LM were
estimated on different training corpora so the vocabularies
are not identical, right? Please tell me what am I doing
wrong.</span></div>
</div>
</blockquote>
Again, we don't know how you trained the LMs, hence we don't know
the vocabularies.<br>
The best way to make the perplexities comparable would be to extract
the vocabulary from corpus A + corpus X, and then specify that for
training LM_A (using -vocab).<br>
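Concretely, something along these lines (the file names are placeholders for your actual corpora; the relevant ngram-count options are -write-vocab and -vocab):

```shell
# Build the combined vocabulary from corpus A + corpus X
# (file names here are placeholders).
cat corpus_A.txt corpus_X.txt > corpus_AX.txt
ngram-count -order 1 -text corpus_AX.txt -write-vocab all.vocab

# Train both LMs over that same vocabulary, so that OOVs are
# determined identically when computing perplexity.
ngram-count -order 3 -vocab all.vocab -text corpus_A.txt  -lm lm_A/lmodel.lm
ngram-count -order 3 -vocab all.vocab -text corpus_AX.txt -lm lm_B/lmodel.lm
```

(If you instead want open-vocabulary models that assign probability to unknown words, add -unk to both training commands.)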
<br>
<blockquote
cite="mid:1395171896.77554.YahooMailNeo@web163004.mail.bf1.yahoo.com"
type="cite">
<div style="color:#000; background-color:#fff;
font-family:HelveticaNeue, Helvetica Neue, Helvetica, Arial,
Lucida Grande, sans-serif;font-size:12pt">
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 12pt; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;"><span style="font-size: 12pt;"><br>
</span></div>
<div style="background-color: transparent; color: rgb(0, 0, 0);
font-size: 16px; font-family: HelveticaNeue, 'Helvetica Neue',
Helvetica, Arial, 'Lucida Grande', sans-serif; font-style:
normal;"><span style="font-size: 12pt;">3. If those two
perplexities were computed correctly, then could you please
tell me if their difference means that the LM model has been
really improved and if there is a measure that says if this
improvement is significantly? <br>
</span></div>
</div>
</blockquote>
The perplexities look quite different. Differences of 10-20% are
usually considered non-negligible.<br>
For statistical significance there are a number of tests you can
apply, although none are built into SRILM.<br>
<br>
The most straightforward tests would be nonparametric ones that
compare the probabilities output by the two LMs for corresponding
words or sentences.<br>
Generate a table of word-level probabilities for LM_A and then LM_B,
on the same test set. Then ask, how many words had
lower/same/greater probability in LM_B?<br>
From those statistics you can apply either the <a
href="http://en.wikipedia.org/wiki/Sign_test">Sign test</a> or the
stronger <a
href="http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">Wilcoxon
test</a> (for the latter you need the differences of the
probabilities, not just their sign).<br>
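To get those counts, run ngram with -debug 2 on the same test set for each LM and tally the per-word probabilities. A minimal sketch in Python, assuming the per-word lines look like "p( word | history ) = [2gram] 0.0975 [ -1.011 ]" (verify the format against your SRILM version):

```python
import re

# Matches per-word lines of `ngram -ppl test.txt -debug 2` output
# (format assumed; check against your SRILM version).
PROB_LINE = re.compile(r'p\( (\S+) \| .*\)\s*=\s*\[\S+\]\s*([\d.eE+-]+)')

def word_probs(debug_output):
    """Extract (word, probability) pairs from ngram -debug 2 output."""
    pairs = []
    for line in debug_output.splitlines():
        m = PROB_LINE.search(line)
        if m:
            pairs.append((m.group(1), float(m.group(2))))
    return pairs

def compare(out_a, out_b):
    """Count words whose probability under LM_B is lower/same/higher
    than under LM_A."""
    lower = same = higher = 0
    for (w_a, p_a), (w_b, p_b) in zip(word_probs(out_a), word_probs(out_b)):
        assert w_a == w_b, "both runs must score the same test set"
        if p_a > p_b:
            lower += 1
        elif p_b > p_a:
            higher += 1
        else:
            same += 1
    return lower, same, higher
```

The lower/same/higher counts are exactly the statistics the Sign test needs.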
<br>
The Sign test is extremely simple and can be computed with a small
helper script included in SRILM. For example, if LM_B gives higher
probability for 1080 out of 2000 words (and there are no ties), then
the significance levels are computed by<br>
<br>
% $SRILM/bin/cumbin 2000 1080<br>
One-tailed: P(k >= 1080 | n=2000, p=0.5) = 0.00018750253721029<br>
Two-tailed: 2*P(k >= 1080 | n=2000, p=0.5) = 0.00037500507442058<br>
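If cumbin is not at hand, the same binomial tail probability can be computed exactly in a few lines (a sketch in Python; the counts 2000 and 1080 are those from the example above):

```python
from fractions import Fraction
from math import comb

def sign_test(n, k):
    """One-tailed P(K at least k) for K ~ Binomial(n, p=0.5), exact."""
    tail = sum(comb(n, i) for i in range(k, n + 1))
    return Fraction(tail, 2 ** n)

p = float(sign_test(2000, 1080))
print("One-tailed:", p)        # should match the cumbin output above
print("Two-tailed:", 2 * p)
```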
<br>
Doing this at the word level assumes that all the words in a
sentence are assigned probabilities independently, which is plainly
not true (the same word occurs in several ngrams). So a more
conservative approach would compare the sentence-level probabilities.<br>
<br>
Andreas<br>
<br>
</body>
</html>