<html>
<head>
<meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Hi,<br>
<br>
I am sorry for such a long e-mail, but I have found some strange
behavior in the log probability calculation for unigrams.<br>
<br>
I have two language models trained on two text sets. In fact, each
set is just a single sentence repeated 100 times:<br>
<br>
ACTION_REJECT_003.train.txt:<br>
<font face="Courier New"><s> der gewünschte artikel ist nicht
im koffer enthalten </s> (x 100)</font><br>
<br>
ACTION_REJECT_004.train.txt:<br>
<font face="Courier New"><s> ihre aussage kann nicht
verarbeitet werden </s> (x 100)</font><br>
<br>
Also, I have defined a few specific categories to build a class-based
LM.<br>
One class contains numbers (ein, eine, eins, einundachtzig, etc.),
the second comprises names of specific items related to the task
domain (achselshirt, blusen), and the last consists of just two
words: 'wurde' and 'wurden'.<br>
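The class definitions file follows the usual SRILM classes format,
one expansion per line (class name, optional expansion probability,
then the expansion). The class names and probabilities below are only
an illustrative sketch, not my actual wizard.class.defs:<br>
<br>
<font face="Courier New">NUMBER 0.25 ein<br>
NUMBER 0.25 eine<br>
NUMBER 0.25 eins<br>
NUMBER 0.25 einundachtzig<br>
ITEM 0.5 achselshirt<br>
ITEM 0.5 blusen<br>
WURDE 0.5 wurde<br>
WURDE 0.5 wurden</font><br>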
<br>
So, I build two expanded class-based LMs using Witten-Bell
discounting (I also tried the default Good-Turing discounting,
with the same result):<br>
<br>
<font face="Courier New">replace-words-with-classes
classes=wizard.class.defs ACTION_REJECT_003.train.txt >
ACTION_REJECT_003.train.class.txt<br>
<br>
ngram-count -text ACTION_REJECT_003.train.class.txt -lm
ACTION_REJECT_003.lm -order 3 -wbdiscount1 -wbdiscount2
-wbdiscount3<br>
<br>
ngram -lm ACTION_REJECT_003.lm -write-lm
ACTION_REJECT_003.expanded.lm -order 3 -classes wizard.class.defs
-expand-classes 3 -expand-exact 3 -vocab wizard.wlist </font><br>
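As a sanity check on what I expect Witten-Bell discounting to do here, below is a toy reimplementation of the unigram case (my own sketch, not SRILM's code): even with a single sentence repeated 100 times, every unseen word should end up with a small but nonzero probability.

```python
# Toy sketch of Witten-Bell unigram discounting (my own illustration,
# not SRILM's actual implementation): seen words get c(w) / (N + T),
# and the held-out mass T / (N + T) is spread over the unseen words.
import math
from collections import Counter

def witten_bell_unigrams(tokens, vocab):
    counts = Counter(tokens)
    n = sum(counts.values())                      # N: total training tokens
    t = len(counts)                               # T: observed word types
    z = sum(1 for w in vocab if w not in counts)  # Z: unseen word types
    probs = {}
    for w in vocab:
        if w in counts:
            probs[w] = counts[w] / (n + t)        # discounted seen probability
        else:
            probs[w] = t / ((n + t) * z)          # held-out mass, shared uniformly
    return probs

# One 9-token sentence (8 words plus </s>) repeated 100 times, as in
# ACTION_REJECT_003.train.txt; "ab" and "abgeben" stand in for the
# large unseen vocabulary from wizard.wlist.
sent = "der gewünschte artikel ist nicht im koffer enthalten </s>".split()
p = witten_bell_unigrams(sent * 100, set(sent) | {"ab", "abgeben"})
print(round(math.log10(p["ab"]), 4))   # small, but definitely not -99
```

The absolute value depends on the vocabulary size, but the point is that the unseen-word log probability is finite and well above -99.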
<br>
<br>
The second LM (ACTION_REJECT_004) is built with exactly the same
pipeline, yet the two resulting models are quite different.<br>
<br>
ACTION_REJECT_003.expanded.lm has reasonable smoothed log
probabilities for the unseen unigrams:<br>
<br>
<font face="Courier New">\data\<br>
ngram 1=924<br>
ngram 2=9<br>
ngram 3=8<br>
<br>
\1-grams:<br>
-0.9542425 </s><br>
-10.34236 <BREAK><br>
-99 <s> -99<br>
-10.34236 ab<br>
-10.34236 abgeben<br>
<br>
[...]<br>
<br>
-10.34236 überschritten<br>
-10.34236 übertragung<br>
<br>
\2-grams:<br>
0 <s> der 0<br>
0 artikel ist 0<br>
0 der gewünschte 0<br>
0 enthalten </s><br>
0 gewünschte artikel 0<br>
0 im koffer 0<br>
0 ist nicht 0<br>
0 koffer enthalten 0<br>
0 nicht im 0<br>
<br>
\3-grams:<br>
0 gewünschte artikel ist<br>
0 <s> der gewünschte<br>
0 koffer enthalten </s><br>
0 der gewünschte artikel<br>
0 nicht im koffer<br>
0 artikel ist nicht<br>
0 im koffer enthalten<br>
0 ist nicht im<br>
<br>
\end\</font><br>
<br>
<br>
Whereas in ACTION_REJECT_004.expanded.lm all unseen unigrams get a
log probability of -99, i.e. effectively zero probability:<br>
<br>
<font face="Courier New">\data\<br>
ngram 1=924<br>
ngram 2=7<br>
ngram 3=6<br>
<br>
\1-grams:<br>
-0.845098 </s><br>
-99 <BREAK><br>
-99 <s> -99<br>
-99 ab<br>
-99 abgeben<br>
[...]<br>
-0.845098 aussage -99<br>
[...]<br>
-99 überschritten<br>
-99 übertragung<br>
<br>
\2-grams:<br>
0 <s> ihre 0<br>
0 aussage kann 0<br>
0 ihre aussage 0<br>
0 kann nicht 0<br>
0 nicht verarbeitet 0<br>
0 sagen </s><br>
0 verarbeitet sagen 0<br>
<br>
\3-grams:<br>
0 ihre aussage kann<br>
0 <s> ihre aussage<br>
0 aussage kann nicht<br>
0 kann nicht verarbeitet<br>
0 verarbeitet sagen </s><br>
0 nicht verarbeitet sagen<br>
<br>
\end\</font><br>
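To quantify the difference between the two expanded LMs, I count the unigrams carrying the -99 marker with a small helper (a hypothetical script of mine, assuming the ARPA layout shown in the excerpts above):

```python
# Diagnostic helper (not part of SRILM): count how many unigrams in an
# ARPA-format LM carry the -99 "zero probability" marker.
def count_zeroton_unigrams(arpa_text):
    in_unigrams = False
    total = zeroed = 0
    for line in arpa_text.splitlines():
        line = line.strip()
        if line == "\\1-grams:":
            in_unigrams = True           # start of the unigram section
            continue
        if in_unigrams and line.startswith("\\"):
            break                        # next section (\2-grams: or \end\)
        if in_unigrams and line:
            fields = line.split()        # logprob, word, [backoff weight]
            total += 1
            if float(fields[0]) <= -99:
                zeroed += 1
    return zeroed, total

# Abbreviated excerpt in the same shape as ACTION_REJECT_004.expanded.lm
sample = """\\data\\
ngram 1=4

\\1-grams:
-0.845098 </s>
-99 ab
-99 abgeben
-0.845098 aussage -99

\\end\\"""
print(count_zeroton_unigrams(sample))    # → (2, 4)
```

Running this over both .expanded.lm files makes the discrepancy between the two models easy to see.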
<br>
<br>
None of the words in either training sentence belongs to any of the
classes.<br>
<br>
Also, I found that removing the last word from the second training
sentence fixes the problem.
That is, with the sentence:<br>
<font face="Courier New"><br>
<s> ihre aussage kann nicht </s></font><br>
<br>
the corresponding LM has correctly discounted unigram probabilities
(also around -10). Replacing 'werden' with any other word (I tried
'sagen', 'abgeben' and 'beer') brings the problem back.<br>
<br>
Is this a bug, or am I doing something wrong?<br>
I would appreciate any advice. I can also provide all the necessary
data and scripts if needed.<br>
<br>
Sincerely yours,<br>
Dmytro Prylipko. <br>
<br>
</body>
</html>