<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    On 3/31/2012 8:00 PM, Meng Chen wrote:

    <blockquote

cite="mid:CA+bc0mppaw_F+gyCSkK2TW_W6y+2gTR=_YTtPNV6wRZyS++r1w@mail.gmail.com"

      type="cite"><font face="'trebuchet ms', sans-serif">Hi, I ran into a

        problem when training a class-based language model with the

        replace-words-with-classes command. My commands are as follows:</font>

      <div><font face="'trebuchet ms', sans-serif"><br>

        </font></div>

      <div>

        <ul>

          <li><span style="font-family:'trebuchet ms',sans-serif">ngram-class

              -vocab wlist -text training_set -numclasses 200

              -incremental -classes output.classes</span></li>

          <li><span style="font-family:'trebuchet ms',sans-serif">replace-words-with-classes

              classes=</span><span style="font-family:'trebuchet

              ms',sans-serif">output.classes</span><span

              style="font-family:'trebuchet ms',sans-serif">

              training_set > training_set_classes</span></li>

        </ul>

        <div><font face="'trebuchet ms', sans-serif">After these two

            steps, I found that training_set_classes contains both words

            and classes. The remaining words are OOVs with respect to wlist,

            and I don't need them at all. Shouldn't these words

            belong to &lt;unk&gt; in CLASS-00001? I would like to know

            how to handle this situation. Does SRILM provide a

            script to map these OOVs to CLASS-00001, or do I need to

            write one myself?</font></div>

      </div>

    </blockquote>

    <br>

    Your wlist evidently does not contain all the words in

    training_set, so output.classes does not cover the entire training

    vocabulary.<br>

    replace-words-with-classes only rewrites words that appear in the

    class definitions, so any word without a class is passed through unchanged.<br>

    <br>

    You can easily augment the class definitions to add an extra class

    that catches all your OOV words.  The format should be

    self-explanatory, or check the classes-format(5) man page.<br>
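    <br>
    As a rough sketch of that augmentation, and not an SRILM tool, a
    small Python helper could collect the words in the training text that
    no class covers and emit definitions for one extra class.  The class
    name CLASS-OOV, the function names, and the use of relative
    frequencies as membership probabilities are all assumptions made for
    illustration; the line layout follows the "class [p] word ..." form
    described in classes-format(5).<br>
    <pre>
```python
from collections import Counter


def is_number(token):
    """Return True if token parses as a float (used to spot the optional probability)."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def covered_words(class_lines):
    """Collect every word that appears in an expansion of the class definitions.

    Assumes lines of the form `class [p] word1 word2 ...`.
    """
    covered = set()
    for line in class_lines:
        fields = line.split()
        if not fields:
            continue
        expansion = fields[1:]
        # Drop the optional probability field if one is present.
        if len(expansion) >= 2 and is_number(expansion[0]):
            expansion = expansion[1:]
        covered.update(expansion)
    return covered


def oov_class_lines(training_lines, class_lines, class_name="CLASS-OOV"):
    """Build definitions for one extra class holding all uncovered words.

    class_name is a hypothetical label; the probabilities are relative
    frequencies of each uncovered word in the training text (an assumption).
    """
    covered = covered_words(class_lines)
    counts = Counter(
        word
        for line in training_lines
        for word in line.split()
        if word not in covered
    )
    total = sum(counts.values())
    # Counter keeps first-seen order, so the output order is deterministic.
    return [f"{class_name} {count / total:.6g} {word}"
            for word, count in counts.items()]
```
    </pre>
    Appending the returned lines to output.classes and rerunning
    replace-words-with-classes should then replace the remaining words as
    well.<br>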

    <br>

    Andreas<br>

    <br>

    <br>

  </body>

</html>