<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">On 11/04/13 01:01, Andreas Stolcke

      wrote:<br>

    </div>

    <blockquote cite="mid:5276E3E4.7010801@icsi.berkeley.edu"

      type="cite">


      <div class="moz-cite-prefix">On 11/3/2013 1:43 AM, Joris Pelemans

        wrote:<br>

      </div>

      <blockquote cite="mid:52761ADB.50906@esat.kuleuven.be" type="cite">


        I am investigating different techniques to introduce new words

        to the vocabulary. Say I have a vocabulary of 100,000 words and

        I want to introduce one new word X (for the sake of simplicity). I

        could use one of three options:<br>

        <ol>

          <li>use the contexts in which X appears in some training data

            (although X may not appear there often enough, or at all)</li>

          <li>estimate the probability of X by taking a fraction of the

            probability mass of a synonym of X (which I described earlier)</li>

          <li>estimate the probability of X by taking a fraction of the

            probability mass of the &lt;unk&gt; class (e.g. if no good synonym

            is at hand)</li>

        </ol>

        <p>I could then compare the perplexities of these 3 LMs with a

          vocabulary of size 100,001 words to see which technique is

          best for a given word/situation.<br>

        </p>

      </blockquote>

      And option 3 is effectively already implemented by the way unseen

      words are mapped to &lt;unk&gt;.  If you want to compute

      perplexity in a fair way you would take the LM containing

      &lt;unk&gt; and for every occurrence of X you add log p(X |

      &lt;unk&gt;)  (the share of unk-probability mass you want to give

      to X).  That way you don't need to add any ngrams to the LM.  What

      this effectively does is simulate a class-based Ngram model where

      &lt;unk&gt; is a class and X one of its members.<br>

    </blockquote>
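    The correction described above can be sketched in a few lines of

    Python. This is only an illustration under assumptions: the function

    name and its parameters are hypothetical, the total log probability

    is taken to be base 10 and to come from a run in which every X was

    scored as &lt;unk&gt;, and the token count is whatever denominator

    that perplexity computation normalizes by.<br>

```python
import math


def adjusted_perplexity(total_log10_prob, n_scored_tokens, n_occurrences, p_member):
    """Perplexity after charging log10 p(X | unk) for each occurrence of X.

    total_log10_prob: summed log10 probability of the test set, with X scored as unk
    n_scored_tokens:  number of tokens the perplexity is normalized by
    n_occurrences:    how often X occurs in the test set
    p_member:         assumed share of the unk mass given to X
    """
    if not 0.0 < p_member <= 1.0:
        raise ValueError("p_member must be in (0, 1]")
    if n_scored_tokens <= 0:
        raise ValueError("n_scored_tokens must be positive")
    # Each occurrence of X costs an extra log10 p(X | unk), which is <= 0.
    adjusted = total_log10_prob + n_occurrences * math.log10(p_member)
    return 10.0 ** (-adjusted / n_scored_tokens)
```

    With no occurrences of X the result equals the unadjusted

    perplexity; every occurrence lowers the total log probability and so

    raises the perplexity.<br>

    <br>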

    Yes, this is exactly what I meant when I asked for a "smart way in

    the SRILM toolkit", so I assume this is included. I looked up how to

    use class-based models and I think I found what I need to do. Is the

    following the correct way to calculate perplexity for these models?<br>

    <br>

    ngram -lm class_lm.arpa -ppl test.txt -order n -classes

    expansions.class<br>

    <br>

    where expansions.class contains lines like this:<br>

    <br>

    &lt;unk&gt; p(X | &lt;unk&gt;) X<br>

    &lt;unk&gt; p(Y | &lt;unk&gt;) Y<br>

    &lt;unk&gt; 1-p(X | &lt;unk&gt;)-p(Y | &lt;unk&gt;) not_mapped<br>

    <br>

    I assume the last line is necessary since the man page for

    "classes-format" says "All expansion probabilities for a given class

    should sum to one,

    although this is not necessarily enforced by the software and would

    lead to improper models."<br>
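    <br>

    <br>

    To make the sum-to-one bookkeeping concrete, here is a minimal Python

    sketch that builds such lines. The function name is hypothetical, the

    probabilities are supplied by the caller, and "not_mapped" is just

    the placeholder word from my example above; whether a placeholder

    member is the right way to hold the remaining mass is exactly what I

    am asking about.<br>

```python
def expansion_lines(class_name, member_probs, remainder_word="not_mapped"):
    """Build 'class probability word' lines whose probabilities sum to one.

    class_name:     class label used in the LM (the unknown-word token here)
    member_probs:   dict mapping each new word to its share of the class mass
    remainder_word: placeholder member that receives the leftover mass
    """
    total = sum(member_probs.values())
    if any(p <= 0.0 for p in member_probs.values()):
        raise ValueError("every member probability must be positive")
    if total >= 1.0:
        raise ValueError("member probabilities must sum to less than one")
    lines = [f"{class_name} {p:.6g} {word}" for word, p in member_probs.items()]
    # The leftover mass goes to the placeholder so the class sums to one.
    lines.append(f"{class_name} {1.0 - total:.6g} {remainder_word}")
    return lines
```

    Calling it with the &lt;unk&gt; token as class_name and the shares

    for X and Y returns three lines, one per member plus the remainder

    line, which can then be written to expansions.class.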

    <br>

    Joris<br>

  </body>

</html>