The problem is that your final vocabulary is introduced as a surprise only in the last step (the call to ngram). When the class expansion likelihoods sum to exactly 1.0, there is no room left for novel words in the backoff orders at this stage. <br><br>
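One way to see how much room is actually left is to add up the unigram probabilities in the LM file. The following is only a rough sketch: it assumes the usual ARPA layout (log10 probability in the first column of the \1-grams: section) and treats entries at -99 as zero probability; the function name unigram_mass is made up for illustration.<br><br>

```shell
# Sketch: sum the unigram probabilities of an ARPA-format LM.
# Assumption: column 1 of the \1-grams: section holds log10 probabilities,
# and a value of -99 stands for zero probability.
# unigram_mass is a hypothetical helper name, not part of SRILM.
unigram_mass() {
  awk '
    /^\\1-grams:/ { in_uni = 1; next }   # start of the unigram section
    /^\\/         { in_uni = 0 }         # any other section header ends it
    in_uni && NF >= 2 {
      if ($1 > -99) sum += exp($1 * log(10))   # 10^logprob
      else zeros++                             # entry with zero probability
    }
    END { printf "mass=%.6f zero_prob_entries=%d\n", sum, zeros }
  ' "$1"
}
```

Running unigram_mass on the LM written by ngram-count, and again on the expanded LM, should show whether the seen words already account for essentially all of the mass before the new vocabulary arrives.<br><br>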
To get the correct behavior you must prime the initial language model with a vocabulary of either all the class tags or the individual words themselves. E.g.<br><br><div style="margin-left:40px"><font><span style="font-family:courier new,monospace">awk '{print $1}' </span></font><font><span style="font-family:courier new,monospace">wizard.class.defs </span></font><font><span style="font-family:courier new,monospace">| sort -u >wizard.classnames.txt</span></font><br style="font-family:courier new,monospace">
<br style="font-family:courier new,monospace"><font style="font-family:courier new,monospace">cat $datafile \</font><br style="font-family:courier new,monospace"><font style="font-family:courier new,monospace"> | replace-words-with-classes classes=wizard.class.defs - \</font><br style="font-family:courier new,monospace">
<font style="font-family:courier new,monospace"> | ngram-count -text - -lm - -order 1 -wbdiscount -vocab wizard.classnames.txt</font><span style="font-family:courier new,monospace"> \</span><br style="font-family:courier new,monospace">
<span style="font-family:courier new,monospace"> > your-lm.1bo</span><br style="font-family:courier new,monospace"><br style="font-family:courier new,monospace"><span style="font-family:courier new,monospace"># Expanding classes in your-lm.1bo now will give you the desired behavior</span>.<br style="font-family:courier new,monospace">
<br style="font-family:courier new,monospace"></div>HTH<br><br>&<br><br><div class="gmail_quote">On Tue, Oct 2, 2012 at 2:48 AM, Dmytro Prylipko <span dir="ltr"><<a href="mailto:dmytro.prylipko@ovgu.de" target="_blank">dmytro.prylipko@ovgu.de</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
Hi,<br>
<br>
Thank you for the quick feedback.<br>
<br>
I found out something else remarkable: I tried to run the script on
our cluster under CentOS (my workstation is running Kubuntu 12.04)
and discovered that on the cluster all the LMs have zero
probabilities for unseen 1-grams. No smoothing at all!<br>
<br>
The setup is of course different. Output of the <font face="Courier
New">uname -a</font> on the cluster:<br>
<br>
<font face="Courier New">Linux frontend1.service 2.6.18-164.11.1.el5
#1 SMP Wed Jan 20 07:32:21 EST 2010 x86_64 x86_64 x86_64 GNU/Linux</font><br>
<br>
On the workstation:<br>
<br>
<font face="Courier New">Linux KS-PC113 3.2.0-31-generic-pae
#50-Ubuntu SMP Fri Sep 7 16:39:45 UTC 2012 i686 i686 i386
GNU/Linux </font><br>
<br>
SRILM on the cluster was built with <font face="Courier New">MACHINE_TYPE=i686-m64</font>
(with and without _C option, both give the same result), on the
workstation with <font face="Courier New">MACHINE_TYPE=i686-gcc4</font><br>
<br>
LANG variable is en_US.UTF-8 on both machines. Replacing umlauts
with regular characters gave no difference.<br>
<br>
What exactly do you mean by 'behavior of your local awk
installation when it encounters extended chars'?<br>
<br>
So, I am sending you a minimal dataset for replicating it. The shell
script buildtaglm.sh does the whole job.<br>
<br>
Yours,<br>
Dmytro Prylipko.<div><br>
<br>
On Tue 02 Oct 2012 02:24:00 AM CEST, Anand Venkataraman wrote:<br>
</div><blockquote type="cite"><div><br>
On a first reading of your email I'm indeed surprised that the
results <br>
differ between the two texts. Have you tried replacing the umlaut
in <br>
the first corpus with a regular "u" and checked if you still get
the <br>
same behavior? Check the LANG environment variable and the
behavior of <br>
your local awk installation when it encounters extended chars.<br>
<br>
If the problem persists, please send me the two corpora, along
with <br>
the class file and I'll be glad to take a look for you.<br>
<br>
&<br>
<br>
On Mon, Oct 1, 2012 at 8:34 AM, Dmytro Prylipko <br></div><div>
<<a href="mailto:dmytro.prylipko@ovgu.de" target="_blank">dmytro.prylipko@ovgu.de</a>
<a href="mailto:dmytro.prylipko@ovgu.de" target="_blank"><mailto:dmytro.prylipko@ovgu.de></a>> wrote:<br>
<br>
Hi,<br>
<br>
I am sorry for such a long e-mail, but I found some strange behavior<br>
during the log probability calculation of the unigrams.<br>
<br>
I have two language models trained on two text sets. Actually,<br>
those sets are just two different sentences, repeated 100 times
each:<br>
<br>
ACTION_REJECT_003.train.txt:<br>
<s> der gewünschte artikel ist nicht im koffer enthalten
</s> (x<br>
100)<br>
<br>
ACTION_REJECT_004.train.txt:<br>
<s> ihre aussage kann nicht verarbeitet werden </s> (x
100)<br>
<br>
Also, I have defined a few specific categories to build a<br>
class-based LM.<br>
One class is numbers (ein, eine, eins, einundachtzig etc.), the<br>
second one comprises names of specific items related to the task<br>
domain (achselshirt, blusen), and the last one consists just of<br>
two words: 'wurde' and 'wurden'.<br>
<br>
So, I am building two expanded class-based LMs using Witten-Bell<br></div>
discounting (I also tried the default Good-Turing, but with the<div><div><br>
same result):<br>
<br>
replace-words-with-classes classes=wizard.class.defs<br>
ACTION_REJECT_003.train.txt > ACTION_REJECT_003.train.class.txt<br>
<br>
ngram-count -text ACTION_REJECT_003.train.class.txt -lm<br>
ACTION_REJECT_003.lm -order 3 -wbdiscount1 -wbdiscount2
-wbdiscount3<br>
<br>
ngram -lm ACTION_REJECT_003.lm -write-lm<br>
ACTION_REJECT_003.expanded.lm -order 3 -classes wizard.class.defs<br>
-expand-classes 3 -expand-exact 3 -vocab wizard.wlist<br>
<br>
<br>
The second LM (ACTION_REJECT_004) is built using the same<br>
approach. But these two models are pretty different.<br>
<br>
ACTION_REJECT_003.expanded.lm has reasonable smoothed log<br>
probabilities for the unseen unigrams:<br>
<br>
\data\<br>
ngram 1=924<br>
ngram 2=9<br>
ngram 3=8<br>
<br>
\1-grams:<br>
-0.9542425 </s><br>
-10.34236 <BREAK><br>
-99 <s> -99<br>
-10.34236 ab<br>
-10.34236 abgeben<br>
<br>
[...]<br>
<br>
-10.34236 überschritten<br>
-10.34236 übertragung<br>
<br>
\2-grams:<br>
0 <s> der 0<br>
0 artikel ist 0<br>
0 der gewünschte 0<br>
0 enthalten </s><br>
0 gewünschte artikel 0<br>
0 im koffer 0<br>
0 ist nicht 0<br>
0 koffer enthalten 0<br>
0 nicht im 0<br>
<br>
\3-grams:<br>
0 gewünschte artikel ist<br>
0 <s> der gewünschte<br>
0 koffer enthalten </s><br>
0 der gewünschte artikel<br>
0 nicht im koffer<br>
0 artikel ist nicht<br>
0 im koffer enthalten<br>
0 ist nicht im<br>
<br>
\end\<br>
<br>
<br>
Whereas in ACTION_REJECT_004.expanded.lm all unseen unigrams have<br>
a zero probability:<br>
<br>
\data\<br>
ngram 1=924<br>
ngram 2=7<br>
ngram 3=6<br>
<br>
\1-grams:<br>
-0.845098 </s><br>
-99 <BREAK><br>
-99 <s> -99<br>
-99 ab<br>
-99 abgeben<br>
[...]<br>
-0.845098 aussage -99<br>
[...]<br>
-99 überschritten<br>
-99 übertragung<br>
<br>
\2-grams:<br>
0 <s> ihre 0<br>
0 aussage kann 0<br>
0 ihre aussage 0<br>
0 kann nicht 0<br>
0 nicht verarbeitet 0<br>
0 sagen </s><br>
0 verarbeitet sagen 0<br>
<br>
\3-grams:<br>
0 ihre aussage kann<br>
0 <s> ihre aussage<br>
0 aussage kann nicht<br>
0 kann nicht verarbeitet<br>
0 verarbeitet sagen </s><br>
0 nicht verarbeitet sagen<br>
<br>
\end\<br>
<br>
<br>
None of the words in either training sentence belongs to any
class.<br>
<br>
Also, I found that removing the last word from the second training<br>
sentence fixes the problem.<br>
Thus, for the following sentence:<br>
<br>
<s> ihre aussage kann nicht </s><br>
<br>
the corresponding LM has correctly discounted probabilities (also<br>
around -10). Replacing 'werden' with any other word (I tried<br>
'sagen', 'abgeben' and 'beer') causes the same problem again.<br>
<br>
Is it a bug or am I doing something wrong?<br>
I would appreciate any advice. I can also provide all<br>
necessary data and scripts if needed.<br>
<br>
Sincerely yours,<br>
Dmytro Prylipko.<br>
<br>
<br>
_______________________________________________<br>
SRILM-User site list<br>
</div></div><a href="mailto:SRILM-User@speech.sri.com" target="_blank">SRILM-User@speech.sri.com</a> <a href="mailto:SRILM-User@speech.sri.com" target="_blank"><mailto:SRILM-User@speech.sri.com></a><br>
<a href="http://www.speech.sri.com/mailman/listinfo/srilm-user" target="_blank">http://www.speech.sri.com/mailman/listinfo/srilm-user</a><br>
<br>
<br>
</blockquote>
<br>
<br>
</div>
</blockquote></div><br>