The problem is that your final vocabulary is introduced as a surprise in the last step (the call to ngram). When the class expansion likelihoods sum to exactly 1.0, there is no room left for novelty in the backoff orders at this stage. <br><br>
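A rough arithmetic check on the unigram values quoted below illustrates the missing headroom. This is only a sketch: it assumes the seen unigrams ended up with plain relative-frequency estimates, 1/9 per token for ACTION_REJECT_003 (8 words plus the sentence-end tag) and 1/7 for ACTION_REJECT_004 (6 words plus the sentence-end tag). The helper name log10_uniform is made up for illustration.

```shell
# log10 of a uniform relative-frequency estimate 1/n (hypothetical helper)
log10_uniform() {
  awk -v n="$1" 'BEGIN { printf "%.7f\n", log(1 / n) / log(10) }'
}

log10_uniform 9   # ACTION_REJECT_003: prints -0.9542425
log10_uniform 7   # ACTION_REJECT_004: prints -0.8450980
```

These agree with the -0.9542425 and -0.845098 shown for the sentence-end tag in the two listings, which would mean the seen words already account for essentially all of the unigram mass.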


To get the correct behavior, you must prime the initial language model with a vocabulary consisting of either all the class tags or the individual words themselves. For example:<br><br><div style="margin-left:40px"><font><span style="font-family:courier new,monospace">awk '{print $1}' wizard.class.defs | sort -u > wizard.classnames.txt</span></font><br style="font-family:courier new,monospace">


<br style="font-family:courier new,monospace"><font style="font-family:courier new,monospace">cat $datafile \</font><br style="font-family:courier new,monospace"><font style="font-family:courier new,monospace">  | replace-words-with-classes classes=wizard.class.defs - \</font><br style="font-family:courier new,monospace">


<font style="font-family:courier new,monospace">  | ngram-count -text - -lm - -order 1 -wbdiscount -vocab wizard.classnames.txt</font><span style="font-family:courier new,monospace"> \</span><br style="font-family:courier new,monospace">


<span style="font-family:courier new,monospace">  > your-lm.1bo</span><br style="font-family:courier new,monospace"><br style="font-family:courier new,monospace"><span style="font-family:courier new,monospace"># Expanding classes in your-lm.1bo now will give you the desired behavior</span>.<br style="font-family:courier new,monospace">
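The commands above cover the class-tag variant. For the other variant, a word-level vocabulary could be pulled out of the class definitions along the following lines. This is a minimal sketch that assumes each line of the class file has the form "CLASS [probability] word ...". The sample file, its class names and probabilities, and the helper name words_from_classes are all hypothetical.

```shell
# Hypothetical sample in the assumed format: CLASS [probability] word ...
cat > sample.class.defs <<'EOF'
NUMBER 0.5 ein
NUMBER 0.5 eine
ITEM blusen
EOF

# Print every expansion word once, skipping the class name and,
# when present, a numeric probability in the second field.
words_from_classes() {
  awk '{
    start = 2
    if (NF > 2 && $2 ~ /^[0-9.]+$/) start = 3   # optional probability
    for (i = start; i <= NF; i++) print $i
  }' "$1" | sort -u
}

words_from_classes sample.class.defs > sample.words.txt
```

The resulting list could then be passed to -vocab in place of wizard.classnames.txt. Note that a purely numeric word in second position would be mistaken for a probability by this sketch.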


<br style="font-family:courier new,monospace"></div>HTH<br><br>&<br><br><div class="gmail_quote">On Tue, Oct 2, 2012 at 2:48 AM, Dmytro Prylipko <span dir="ltr"><<a href="mailto:dmytro.prylipko@ovgu.de" target="_blank">dmytro.prylipko@ovgu.de</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

  
  <div text="#000000" bgcolor="#FFFFFF">

    Hi,<br>

    <br>

    Thank you for the quick feedback.<br>

    <br>

    I found out something else remarkable: I tried to run the script on

    our cluster under CentOS (my workstation is running Kubuntu 12.04)

    and discovered that on the cluster all the LMs have zero

    probabilities for unseen 1-grams. No smoothing at all!<br>

    <br>

    The setup is of course different. Output of the <font face="Courier

      New">uname -a</font> on the cluster:<br>

    <br>

    <font face="Courier New">Linux frontend1.service 2.6.18-164.11.1.el5

      #1 SMP Wed Jan 20 07:32:21 EST 2010 x86_64 x86_64 x86_64 GNU/Linux</font><br>

    <br>

    On the workstation:<br>

    <br>

    <font face="Courier New">Linux KS-PC113 3.2.0-31-generic-pae

      #50-Ubuntu SMP Fri Sep 7 16:39:45 UTC 2012 i686 i686 i386

      GNU/Linux </font><br>

    <br>

    SRILM on the cluster was built with <font face="Courier New">MACHINE_TYPE=i686-m64</font>

    (with and without _C option, both give the same result), on the

    workstation with <font face="Courier New">MACHINE_TYPE=i686-gcc4</font><br>

    <br>

    LANG variable is en_US.UTF-8 on both machines. Replacing umlauts

    with regular characters gave no difference.<br>

    <br>

    What exactly do you mean by 'behavior of your local awk

    installation when it encounters extended chars'?<br>

    <br>

    So, I am sending you a minimal dataset for replicating it. The shell

    script buildtaglm.sh does the whole job.<br>

    <br>

    Yours,<br>

    Dmytro Prylipko.<div><br>

    <br>

    On Tue 02 Oct 2012 02:24:00 AM CEST, Anand Venkataraman wrote:<br>

    </div><blockquote type="cite"><div><br>

      On a first reading of your email I'm indeed surprised that the

      results <br>

      differ between the two texts. Have you tried replacing the umlaut

      in <br>

      the first corpus with a regular "u" and checked if you still get

      the <br>

      same behavior? Check the LANG environment variable and the

      behavior of <br>

      your local awk installation when it encounters extended chars.<br>

      <br>

      If the problem persists, please send me the two corpora, along

      with <br>

      the class file and I'll be glad to take a look for you.<br>

      <br>

      &<br>

      <br>

      On Mon, Oct 1, 2012 at 8:34 AM, Dmytro Prylipko <br></div><div>

      <<a href="mailto:dmytro.prylipko@ovgu.de" target="_blank">dmytro.prylipko@ovgu.de</a>> wrote:<br>

      <br>

      Hi,<br>

      <br>

      I am sorry for such a long e-mail, but I found some strange behavior<br>

      during the log probability calculation of the unigrams.<br>

      <br>

      I have two language models trained on two text sets. Actually,<br>

      those sets are just two different sentences, repeated 100 times

      each:<br>

      <br>

      ACTION_REJECT_003.train.txt:<br>

      <s> der gewünschte artikel ist nicht im koffer enthalten

      </s> (x<br>

      100)<br>

      <br>

      ACTION_REJECT_004.train.txt:<br>

      <s> ihre aussage kann nicht verarbeitet werden </s> (x

      100)<br>

      <br>

      Also, I have defined a few specific categories to build a<br>

      class-based LM.<br>

      One class is numbers (ein, eine, eins, einundachtzig etc.), the<br>

      second one comprises names of specific items related to the task<br>

      domain (achselshirt, blusen), and the last one consists just of<br>

      two words: 'wurde' and 'wurden'.<br>

      <br>

      So, I am building two expanded class-based LMs using Witten-Bell<br></div>

      discounting (I also tried the default Good-Turing, but with the<div><div><br>

      same result):<br>

      <br>

      replace-words-with-classes classes=wizard.class.defs<br>

      ACTION_REJECT_003.train.txt > ACTION_REJECT_003.train.class.txt<br>

      <br>

      ngram-count -text ACTION_REJECT_003.train.class.txt -lm<br>

      ACTION_REJECT_003.lm -order 3 -wbdiscount1 -wbdiscount2

      -wbdiscount3<br>

      <br>

      ngram -lm ACTION_REJECT_003.lm -write-lm<br>

      ACTION_REJECT_003.expanded.lm -order 3 -classes wizard.class.defs<br>

      -expand-classes 3 -expand-exact 3 -vocab wizard.wlist<br>

      <br>

      <br>

      The second LM (ACTION_REJECT_004) is built using the same<br>

      approach. But these two models are pretty different.<br>

      <br>

      ACTION_REJECT_003.expanded.lm has reasonable smoothed log<br>

      probabilities for the unseen unigrams:<br>

      <br>

      \data\<br>

      ngram 1=924<br>

      ngram 2=9<br>

      ngram 3=8<br>

      <br>

      \1-grams:<br>

      -0.9542425 </s><br>

      -10.34236 <BREAK><br>

      -99 <s> -99<br>

      -10.34236 ab<br>

      -10.34236 abgeben<br>

      <br>

      [...]<br>

      <br>

      -10.34236 überschritten<br>

      -10.34236 übertragung<br>

      <br>

      \2-grams:<br>

      0 <s> der 0<br>

      0 artikel ist 0<br>

      0 der gewünschte 0<br>

      0 enthalten </s><br>

      0 gewünschte artikel 0<br>

      0 im koffer 0<br>

      0 ist nicht 0<br>

      0 koffer enthalten 0<br>

      0 nicht im 0<br>

      <br>

      \3-grams:<br>

      0 gewünschte artikel ist<br>

      0 <s> der gewünschte<br>

      0 koffer enthalten </s><br>

      0 der gewünschte artikel<br>

      0 nicht im koffer<br>

      0 artikel ist nicht<br>

      0 im koffer enthalten<br>

      0 ist nicht im<br>

      <br>

      \end\<br>

      <br>

      <br>

      Whereas in ACTION_REJECT_004.expanded.lm all unseen unigrams have<br>

      a zero probability:<br>

      <br>

      \data\<br>

      ngram 1=924<br>

      ngram 2=7<br>

      ngram 3=6<br>

      <br>

      \1-grams:<br>

      -0.845098 </s><br>

      -99 <BREAK><br>

      -99 <s> -99<br>

      -99 ab<br>

      -99 abgeben<br>

      [...]<br>

      -0.845098 aussage -99<br>

      [...]<br>

      -99 überschritten<br>

      -99 übertragung<br>

      <br>

      \2-grams:<br>

      0 <s> ihre 0<br>

      0 aussage kann 0<br>

      0 ihre aussage 0<br>

      0 kann nicht 0<br>

      0 nicht verarbeitet 0<br>

      0 sagen </s><br>

      0 verarbeitet sagen 0<br>

      <br>

      \3-grams:<br>

      0 ihre aussage kann<br>

      0 <s> ihre aussage<br>

      0 aussage kann nicht<br>

      0 kann nicht verarbeitet<br>

      0 verarbeitet sagen </s><br>

      0 nicht verarbeitet sagen<br>

      <br>

      \end\<br>

      <br>

      <br>

      None of the words in either training sentence belongs to any

      class.<br>

      <br>

      Also, I found that removing the last word from the second training<br>

      sentence fixes the problem.<br>

      Thus, for the following sentence:<br>

      <br>

      <s> ihre aussage kann nicht </s><br>

      <br>

      the corresponding LM has correctly discounted probabilities (also<br>

      around -10). Replacing 'werden' with any other word (I tried<br>

      'sagen', 'abgeben' and 'beer') causes the same problem again.<br>

      <br>

      Is it a bug or am I doing something wrong?<br>

      I would appreciate any advice. I can also provide all<br>

      necessary data and scripts if needed.<br>

      <br>

      Sincerely yours,<br>

      Dmytro Prylipko.<br>

      <br>

      <br>

      _______________________________________________<br>

      SRILM-User site list<br>

      </div></div><a href="mailto:SRILM-User@speech.sri.com" target="_blank">SRILM-User@speech.sri.com</a><br>

      <a href="http://www.speech.sri.com/mailman/listinfo/srilm-user" target="_blank">http://www.speech.sri.com/mailman/listinfo/srilm-user</a><br>

      <br>

      <br>

    </blockquote>

    <br>

    <br>

  </div>


</blockquote></div><br>