<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">On 11/03/13 02:35, Andreas Stolcke

      wrote:<br>

    </div>

    <blockquote cite="mid:5275A84B.8060401@icsi.berkeley.edu"

      type="cite">On 11/2/2013 7:46 AM, Joris Pelemans wrote:

      <br>

      <blockquote type="cite">On 11/02/13 02:07, Andreas Stolcke wrote:

        <br>

        <blockquote type="cite">

          <br>

          For example, if you have p(c | a b) = x and d and c are synonyms, you

          set

          <br>

          <br>

          p(c | a b ) = x/2

          <br>

          p(d | a b) = x/2

          <br>

        </blockquote>

        <br>

        Another question regarding this problem. Say I don't know

        a good synonym for d, but I still want to include it by mapping

        it onto &lt;unk&gt; (what else, right?), giving it only a very

        small fraction of the &lt;unk&gt; probability, since &lt;unk&gt; is a

        class. The above technique would lead to gigantic LMs, since

        &lt;unk&gt; is all over the place. Is there a smart way in the

        SRILM toolkit to specify that some words should be

        modeled as &lt;unk&gt;?

        <br>

      </blockquote>

      <br>

      I'm not sure I understand what you mean. &lt;unk&gt; is a

      special word that all words not in the vocabulary are mapped to at

      test time. So the way you 'model' a word by &lt;unk&gt; is to

      not include it in the vocabulary of your LM.

      <br>

    </blockquote>

    I am investigating different techniques for introducing new words into

    the vocabulary. Say I have a vocabulary of 100,000 words and I want

    to introduce one new word X (for the sake of simplicity). I could

    choose one of three options:<br>

    <ol>

      <li>use the contexts in which X appears in some training data

        (although X may not appear often enough, or at all)</li>

      <li>estimate the probability of X by taking a fraction of the

        probability mass of a synonym of X (which I described earlier)</li>

      <li>estimate the probability of X by taking a fraction of the

        probability mass of the &lt;unk&gt; class (e.g. if no good synonym

        is at hand)</li>

    </ol>
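    <p>Options 2 and 3 can be sketched roughly as follows. This is only
      an illustration, not an SRILM feature: the function name
      <code>share_mass</code>, the dictionary layout and the toy
      probabilities are assumptions made for the example, and the
      backoff weights and log10 probabilities of a real ARPA-format LM
      are ignored. The donor can be a synonym (option 2) or the
      unknown-word token (option 3).</p>

```python
from collections import defaultdict


def share_mass(ngram_probs, donor, new_word, fraction):
    """Give new_word a fraction of donor's probability in every context.

    ngram_probs maps (context, word) to a plain probability, where
    context is a tuple of preceding words. This layout is hypothetical
    and chosen for illustration; backoff weights are not handled.
    """
    if fraction > 1.0 or 0.0 > fraction:
        raise ValueError("fraction must lie between 0 and 1")

    updated = defaultdict(float)
    for (context, word), prob in ngram_probs.items():
        if word == donor:
            # The donor keeps the remainder, so the context still sums to the same total.
            updated[(context, donor)] += prob * (1.0 - fraction)
            updated[(context, new_word)] += prob * fraction
        else:
            updated[(context, word)] += prob
    return dict(updated)


# Toy model: p(c | a b) = 0.4, split evenly between c and its synonym d.
toy = {(("a", "b"), "c"): 0.4, (("a", "b"), "e"): 0.6}
split = share_mass(toy, donor="c", new_word="d", fraction=0.5)
```

    <p>Every n-gram ending in the donor gains one extra entry, which is
      why applying this to &lt;unk&gt; can inflate the model so much.</p>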

    <p>I could then compare the perplexities of these three LMs, each

      with a vocabulary of 100,001 words, to see which technique is best

      for a given word/situation.<br>

    </p>
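    <p>The comparison itself could look something like the sketch
      below, assuming each LM has been scored on the same test text and
      yields one log10 probability per token. The helper names and the
      model labels are hypothetical, and the numbers are placeholders
      invented for the example, not measured results.</p>

```python
def perplexity(log10_probs):
    """Perplexity from per-token log10 probabilities: 10 ** (-mean log10 p)."""
    if not log10_probs:
        raise ValueError("need at least one token probability")
    return 10.0 ** (-sum(log10_probs) / len(log10_probs))


def rank_models(scores_by_model):
    """Return model names ordered from lowest to highest perplexity."""
    return sorted(scores_by_model, key=lambda name: perplexity(scores_by_model[name]))


# Placeholder log10 probabilities, invented for illustration only.
placeholder_scores = {
    "lm_a": [-1.0, -2.0],
    "lm_b": [-1.0, -1.0],
    "lm_c": [-2.0, -2.0],
}
ranking = rank_models(placeholder_scores)
```

    <p>Because all three LMs share the same vocabulary, their
      perplexities on the same text are directly comparable.</p>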

    <p>Joris<br>

    </p>

  </body>

</html>