On 3/19/2014 10:57 AM, Stefy D. wrote:

> Dear Andreas,
>
> thank you very much for replying.
>
> I trained both LMs using the "-unk" option like this:
>
> $NGRAMCOUNT_FILE -order 3 -interpolate -kndiscount -unk -text $WORKING_DIR$OUT_CORPUS"lowercased."$TARGET -lm $WORKING_DIR"lm_a/lmodel.lm"

That explains why you are not getting OOVs reported in the ppl output. Unknown words are mapped to <unk>, and thus the LM has a probability for <unk>.
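
For reference, a minimal evaluation call of the kind in question (file names here are placeholders, not your actual paths) would be:

    ngram -order 3 -unk -lm lmodel.lm -ppl test.lowercased.txt

With -unk given at both training and evaluation time, unknown test words receive the <unk> probability instead of showing up in the "OOVs" count of the ppl summary line.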

> For the OOV rate I created a vocabulary list from the training data and used the unigram counts of the test set together with the compute-oov-rate script, like this:
>
> $NGRAMCOUNT_FILE -order 1 -write-vocab "vocabularyTargetUnigram.txt" -text $WORKING_DIR$OUT_CORPUS"lowercased."$TARGET -sort
>
> $NGRAMCOUNT_FILE -order 1 -text $WORKING_DIR"test.lowercased."$TARGET -write "unigramCounts_testdata.txt" -sort
>
> $OOVCOUNT_FILE vocabularyTargetUnigram.txt unigramCounts_testdata.txt
>
> This is how I got the OOV rate mentioned in the first mail. Could you please let me know if I used the right commands to compute it?

You did it right.
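
As a sanity check, you can also compute the OOV rate directly from those two files, e.g. with awk (a sketch assuming the formats that -write-vocab and -write produce: one word per line in the vocabulary file, and "word count" pairs in the counts file):

    awk 'NR == FNR { vocab[$1] = 1; next }
         { tokens += $2; if (!($1 in vocab)) oov += $2 }
         END { printf "OOV tokens: %d / %d = %.2f%%\n", oov, tokens, 100 * oov / tokens }' \
        vocabularyTargetUnigram.txt unigramCounts_testdata.txt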

> You said I should train LM_A using the vocabulary of corpus A + corpus X so that the perplexities can be compared. So I should train LM_A using only corpus A, but with the vocabulary of A + X? Sorry if I am confused, but I thought that for estimating an LM the vocabulary should come from the same corpus used for the estimation. I am using these LMs in SMT systems (a baseline and an adapted one). If I influence the baseline LM with vocabulary from the adaptation data, then the baseline is not really a baseline. Please tell me if I am thinking about this incorrectly.

You are right. What this illustrates is that perplexity alone is not a sufficient metric for comparing LMs. In your scenario (LM adaptation) the expansion of the vocabulary is a key component of the adaptation process, but LMs with different vocabularies are no longer comparable by ppl. My suggestion to unify the vocabularies was a workaround to allow you to still use a perplexity comparison.
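
Concretely, the unification could look like this (a sketch with placeholder file names): build one word list over the concatenation of corpus A and corpus X, then pass it to ngram-count via -vocab when training each LM:

    cat corpusA.txt corpusX.txt > corpusAX.txt
    ngram-count -order 1 -text corpusAX.txt -write-vocab vocab_AX.txt
    ngram-count -order 3 -interpolate -kndiscount -vocab vocab_AX.txt -text corpusA.txt -lm lm_a.lm

With -vocab, words from the list that never occur in corpus A still enter LM_A as unigrams (receiving probability mass through discounting), so both LMs end up defined over the same vocabulary.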

> Thank you for introducing me to statistical significance.
>
> To generate a table of word-level probabilities on the same test set, should I use get-unigram-probs? But where do I specify the test set?
>
> $UNIGRAMPROBS_FILE linear=1 $WORKING_DIR"lm_a/lmodel.arpa."$TARGET > table_A.out

No, you get the word probabilities from the output of ngram -debug 2 -ppl (you need to write a small Perl script or similar to extract the probabilities).
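
For example, in the -debug 2 output every scored word produces a line of the form "p( word | history ...) = [2gram] 0.1234 [ -0.9087 ]", where the bracketed number at the end is the log10 probability. A sketch that pulls those out, assuming exactly that line format:

    ngram -order 3 -lm lm_a.lm -ppl test.lowercased.txt -debug 2 \
      | awk '/^\tp\(/ { print $(NF-1) }' > probs_A.txt

Running the same command with the other LM gives probs_B.txt; since both runs score the same test set, line i of each file refers to the same word position.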

> To count how many words had a lower/equal/greater probability in LM_B, is using the compare-ppls script OK? For example, I get this output when applying it to my two LMs (ngram -debug 2 on the same test set as in the previous commands):
>
> $COMPARE_PPLS $WORKING_DIR"ppl_files/ppl_A_detail.ppl" $WORKING_DIR"ppl_files/ppl_B_detail.ppl"
>
> output: total 22450, equal 0, different 22450, greater 11447

Yes, it seems compare-ppls extracts exactly the statistics I was talking about. I had forgotten about it ...
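
If you want to cross-check those counts against the probabilities you extracted yourself, the same statistics fall out of the two per-word lists (probs_A.txt and probs_B.txt are the assumed names from the sketch above, one log10 probability per line, aligned by position):

    paste probs_A.txt probs_B.txt | awk '
      { total++; if ($1 == $2) equal++; else different++; if ($2 > $1) greater++ }
      END { printf "total %d, equal %d, different %d, greater %d\n", total, equal, different, greater }'

A sign test on "greater" out of "different" then tells you whether LM_B winning on 11447 of 22450 words is more than chance variation.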

Andreas