<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">On 11/4/2013 1:01 AM, Joris Pelemans

      wrote:<br>

    </div>

    <blockquote cite="mid:52776266.3020409@esat.kuleuven.be" type="cite">


      <div class="moz-cite-prefix">On 11/04/13 01:01, Andreas Stolcke

        wrote:<br>

      </div>

      <blockquote cite="mid:5276E3E4.7010801@icsi.berkeley.edu"

        type="cite">


        <div class="moz-cite-prefix">On 11/3/2013 1:43 AM, Joris

          Pelemans wrote:<br>

        </div>

        <blockquote cite="mid:52761ADB.50906@esat.kuleuven.be"

          type="cite">


          I am investigating different techniques for introducing new words
          into the vocabulary. Say I have a vocabulary of 100,000 words
          and, for the sake of simplicity, I want to introduce one new
          word X. I see three options:<br>

          <ol>

            <li>use the contexts in which X appears in some training
              data (though X may not appear there often enough, or at all)</li>

            <li>estimate the probability of X by taking a fraction of

              the prob mass of a synonym of X (which I described

              earlier)</li>

            <li>estimate the probability of X by taking a fraction of
              the prob mass of the &lt;unk&gt; class (if e.g. no good
              synonym is at hand)</li>

          </ol>

          <p>I could then compare the perplexities of these three LMs, each
            with a vocabulary of 100,001 words, to see which technique
            works best for a given word or situation.<br>

          </p>

        </blockquote>

        And option 3 is effectively already implemented by the way
        unseen words are mapped to &lt;unk&gt;. If you want to compute
        perplexity in a fair way you would take the LM containing
        &lt;unk&gt; and for every occurrence of X you add log p(X |
        &lt;unk&gt;) (the share of unk-probability mass you want to
        give to X). That way you don't need to add any ngrams to the
        LM. What this effectively does is simulate a class-based Ngram
        model where &lt;unk&gt; is a class and X one of its members.<br>
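        <p>The correction described above can be sketched in a few lines
          of Python. This is only an illustration under stated
          assumptions: the total log probability is taken to be in base
          10 and to have been computed with every occurrence of X scored
          as &lt;unk&gt;, and the function names and all numbers are
          hypothetical, not part of SRILM.</p>

```python
import math


def adjusted_log10_prob(base_log10_prob, occurrences_of_x, p_x_given_unk):
    """Add log10 p(X | <unk>) once for every occurrence of X in the test data."""
    if not 0.0 < p_x_given_unk <= 1.0:
        raise ValueError("p(X | <unk>) must lie in (0, 1]")
    return base_log10_prob + occurrences_of_x * math.log10(p_x_given_unk)


def perplexity(log10_prob, scored_tokens):
    """Perplexity from a total base-10 log probability over scored_tokens tokens."""
    return 10.0 ** (-log10_prob / scored_tokens)


# Hypothetical figures: total log10 probability of the test set with X scored
# as <unk>, the number of scored tokens, and the share of <unk> mass given to X.
base = -20.0
tokens = 10
adjusted = adjusted_log10_prob(base, occurrences_of_x=2, p_x_given_unk=0.01)

print(perplexity(base, tokens), perplexity(adjusted, tokens))
```

        <p>Because p(X | &lt;unk&gt;) is at most 1, each occurrence of X
          can only lower the total log probability, so the adjusted
          perplexity is never below the unadjusted one.</p>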

      </blockquote>

      Yes, this is exactly what I meant when I asked for a "smart way in

      the SRILM toolkit", so I assume this is included. I looked up how

      to use class-based models and I think I found what I need to do.

      Is the following the correct way to calculate perplexity for these

      models?<br>

      <br>

      ngram -lm class_lm.arpa -ppl test.txt -order n -classes

      expansions.class<br>

      <br>

      where expansions.class contains lines like this:<br>

      <br>

      &lt;unk&gt; p(X | &lt;unk&gt;) X<br>

      &lt;unk&gt; p(Y | &lt;unk&gt;) Y<br>

      &lt;unk&gt; 1-p(X | &lt;unk&gt;)-p(Y | &lt;unk&gt;) not_mapped<br>

    </blockquote>

    Yes, except you have to use a new class symbol, like UNKWORD, and
    replace the "not_mapped" with the standard &lt;unk&gt;.<br>
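    <p>As a rough sketch of what the corrected class file could look
      like, the Python snippet below builds its lines from a table of
      probability shares. The helper name, the example words X and Y and
      their shares are all hypothetical; the line layout (class symbol,
      probability, word) simply follows the lines quoted above.</p>

```python
def class_expansion_lines(shares, class_symbol="UNKWORD", unk="<unk>"):
    """Build class-expansion lines of the form '<class> <prob> <word>'.

    shares maps each new word to its assumed share of the unknown-word
    probability mass; whatever is left over stays with the standard <unk>.
    """
    total = sum(shares.values())
    if any(p <= 0.0 for p in shares.values()) or total >= 1.0:
        raise ValueError("shares must be positive and sum to less than 1")
    lines = [f"{class_symbol} {p:.6g} {word}" for word, p in shares.items()]
    # The remaining mass goes to the standard unknown-word token.
    lines.append(f"{class_symbol} {1.0 - total:.6g} {unk}")
    return lines


# Hypothetical shares for two new words X and Y.
lines = class_expansion_lines({"X": 0.1, "Y": 0.2})
print("\n".join(lines))
```

    <p>Writing the printed lines to expansions.class would give a file
      that can be passed to the -classes option shown in the quoted
      command.</p>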

    <br>

    Andreas<br>

    <br>

  </body>

</html>