On 3/19/2014 10:57 AM, Stefy D. wrote:
> Dear Andreas,
>
> thank you very much for replying.
>
> I trained both LMs using the "-unk" option like this:
>
> $NGRAMCOUNT_FILE -order 3 -interpolate -kndiscount -unk -text $WORKING_DIR$OUT_CORPUS"lowercased."$TARGET -lm $WORKING_DIR"lm_a/lmodel.lm"

That explains why you are not getting OOVs reported in the ppl output: unknown words are mapped to <unk>, and thus the LM has a probability for <unk>.

> For the OOV rate I created a vocabulary list from the training data and used the unigram counts of the test set with the compute-oov-rate script, like this:
>
> $NGRAMCOUNT_FILE -order 1 -write-vocab "vocabularyTargetUnigram.txt" -text $WORKING_DIR$OUT_CORPUS"lowercased."$TARGET -sort
>
> $NGRAMCOUNT_FILE -order 1 -text $WORKING_DIR"test.lowercased."$TARGET -write "unigramCounts_testdata.txt" -sort
>
> $OOVCOUNT_FILE vocabularyTargetUnigram.txt unigramCounts_testdata.txt
>
> This is how I got the OOV rate mentioned in the first mail. Could you please let me know if I used the right commands to compute that?

You did it right.
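
In case it is useful, here is what that computation boils down to, as a small Python sketch. This is not the SRILM compute-oov-rate script itself, just the idea; whether sentence-boundary tokens should be excluded is an assumption you may want to adjust:

  # oov_rate.py -- OOV rate from a vocabulary file (one word per line,
  # as written by ngram-count -write-vocab) and a unigram count file
  # ("word count" per line, as written by ngram-count -write).
  import sys

  def read_vocab(path):
      with open(path) as f:
          return set(line.split()[0] for line in f if line.strip())

  def oov_rate(vocab_file, counts_file):
      vocab = read_vocab(vocab_file)
      oov_tokens = total_tokens = 0
      with open(counts_file) as f:
          for line in f:
              parts = line.split()
              if len(parts) != 2:
                  continue
              word, count = parts[0], int(parts[1])
              # assumption: sentence-boundary markers are not counted as words
              if word in ("<s>", "</s>"):
                  continue
              total_tokens += count
              if word not in vocab:
                  oov_tokens += count
      return oov_tokens / float(total_tokens)

  if __name__ == "__main__":
      print("OOV rate: %.4f" % oov_rate(sys.argv[1], sys.argv[2]))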

> You said I should train LM_A using the vocabulary of corpus A + corpus X so that the perplexities can be compared. So I should train LM_A using only corpus A but with the vocabulary of A + X? I am sorry to be confused, but I thought that for estimating an LM the vocabulary should come from the same corpus used for the estimation. I am using these LMs in SMT systems (a baseline and an adapted one). If I influence the baseline LM with vocabulary from the adaptation data, then the baseline is not really a baseline. Please tell me if I am thinking incorrectly.

You are right. What this illustrates is that perplexity alone is not a sufficient metric for comparing LMs. In your scenario (LM adaptation) the expansion of the vocabulary is a key component of the adaptation process, but LMs with different vocabularies are no longer comparable by ppl: the model with the smaller vocabulary maps more test words to <unk> and is scored on those tokens via the single <unk> probability, rather than on how well it predicts the actual words. My suggestion to unify the vocabularies was a workaround to allow you to still use a perplexity comparison.

> Thank you for introducing me to statistical significance.
>
> To generate a table of word-level probabilities on the same test set, should I use get-unigram-probs? But where do I specify the test set?
>
> $UNIGRAMPROBS_FILE linear=1 $WORKING_DIR"lm_a/lmodel.arpa."$TARGET > table_A.out

No. get-unigram-probs just dumps the unigram probabilities stored in the LM, so there is no test set to specify. You get the per-word probabilities from the output of ngram -debug 2 -ppl (you need to write a small Perl or similar script to extract them).
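
Something like this Python sketch would do the extraction; the regular expression is keyed to the -debug 2 line format ("p( the | <s> ...) = [2gram] 0.0980691 [ -0.99847 ]"), so check it against your actual output:

  # extract_word_probs.py -- pull per-word log10 probabilities out of
  # the output of "ngram -debug 2 -ppl test.txt > ppl_A_detail.ppl".
  import re
  import sys

  PROB = re.compile(r"p\( (\S+) \|.*= \[([^\]]+)\] \S+ \[ (\S+) \]")

  def word_logprobs(ppl_file):
      # yields (word, log10 probability) in test-set order,
      # including the </s> tokens
      with open(ppl_file) as f:
          for line in f:
              m = PROB.search(line)
              if m:
                  yield m.group(1), float(m.group(3))

  if __name__ == "__main__":
      for word, logp in word_logprobs(sys.argv[1]):
          print("%s\t%g" % (word, logp))

Run it on the -debug 2 output of both LMs; since both were produced on the same test set, line N of one table corresponds to line N of the other, and you can compare the two probabilities word by word.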

> To get how many words had a lower/same/greater probability in LM_B, is using the compare-ppls script ok? For example, I get this output when applying it to my two LMs (ngram -debug 2 on the same test set as in the previous commands):
>
> $COMPARE_PPLS $WORKING_DIR"ppl_files/ppl_A_detail.ppl" $WORKING_DIR"ppl_files/ppl_B_detail.ppl"
>
> output: total 22450, equal 0, different 22450, greater 11447

Yes, it seems compare-ppls extracts exactly the statistics I was talking about. I had forgotten about it ...
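
With those counts you can run the sign test directly. A minimal Python sketch (assuming "greater" counts the words to which the second LM assigned the higher probability; check the script to be sure of the direction):

  # sign_test.py -- sign test on the compare-ppls counts above
  from scipy.stats import binomtest

  total = 22450    # words whose probabilities differ between the two LMs
  greater = 11447  # words where (assumed) LM_B gave the higher probability

  # Under the null hypothesis that neither LM is better, each of the
  # differing words is equally likely to go either way, so "greater"
  # should follow Binomial(total, 0.5).
  result = binomtest(greater, total, p=0.5, alternative="two-sided")
  print("p-value = %g" % result.pvalue)

Here the p-value comes out around 0.003, so the difference between the two LMs on this test set is unlikely to be chance.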

Andreas