<html>
<head>
<meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Hi,<br>
<br>
I am sorry for such a long e-mail, but I have found some strange
behavior in the log probability calculation for unigrams.<br>
<br>
I have two language models trained on two text sets. In fact, each
set is just a single sentence repeated 100 times:<br>
<br>
ACTION_REJECT_003.train.txt:<br>
<font face="Courier New"><s> der gewünschte artikel ist nicht
im koffer enthalten </s> (x 100)</font><br>
<br>
ACTION_REJECT_004.train.txt:<br>
<font face="Courier New"><s> ihre aussage kann nicht
verarbeitet werden </s> (x 100)</font><br>
<br>
Also, I have defined a few specific categories to build a class-based
LM.<br>
One class contains numbers (ein, eine, eins, einundachtzig, etc.),
the second comprises names of specific items related to the task
domain (achselshirt, blusen), and the last consists of just two
words: 'wurde' and 'wurden'.<br>
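The class definitions file follows the usual SRILM classes format,
one expansion per line (class name, optional expansion probability,
then the expansion). The class names and probabilities below are only
an illustrative sketch, not my actual wizard.class.defs:<br>
<br>
<font face="Courier New">NUMBER 0.25 ein<br>
NUMBER 0.25 eine<br>
NUMBER 0.25 eins<br>
NUMBER 0.25 einundachtzig<br>
ITEM 0.5 achselshirt<br>
ITEM 0.5 blusen<br>
WURDE 0.5 wurde<br>
WURDE 0.5 wurden</font><br>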
<br>
So, I build two expanded class-based LMs using Witten-Bell
discounting (I also tried the default Good-Turing discounting,
with the same result):<br>
<br>
<font face="Courier New">replace-words-with-classes
classes=wizard.class.defs ACTION_REJECT_003.train.txt >
ACTION_REJECT_003.train.class.txt<br>
<br>
ngram-count -text ACTION_REJECT_003.train.class.txt -lm
ACTION_REJECT_003.lm -order 3 -wbdiscount1 -wbdiscount2
-wbdiscount3<br>
<br>
ngram -lm ACTION_REJECT_003.lm -write-lm
ACTION_REJECT_003.expanded.lm -order 3 -classes wizard.class.defs
-expand-classes 3 -expand-exact 3 -vocab wizard.wlist </font><br>
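As a sanity check on what I expect Witten-Bell discounting to do here, below is a toy reimplementation of the unigram case (my own sketch, not SRILM's code): even with a single sentence repeated 100 times, every unseen word should end up with a small but nonzero probability.

```python
# Toy sketch of Witten-Bell unigram discounting (my own illustration,
# not SRILM's actual implementation): seen words get c(w) / (N + T),
# and the held-out mass T / (N + T) is spread over the unseen words.
import math
from collections import Counter

def witten_bell_unigrams(tokens, vocab):
    counts = Counter(tokens)
    n = sum(counts.values())                      # N: total training tokens
    t = len(counts)                               # T: observed word types
    z = sum(1 for w in vocab if w not in counts)  # Z: unseen word types
    probs = {}
    for w in vocab:
        if w in counts:
            probs[w] = counts[w] / (n + t)        # discounted seen probability
        else:
            probs[w] = t / ((n + t) * z)          # held-out mass, shared uniformly
    return probs

# One 9-token sentence (8 words plus </s>) repeated 100 times, as in
# ACTION_REJECT_003.train.txt; "ab" and "abgeben" stand in for the
# large unseen vocabulary from wizard.wlist.
sent = "der gewünschte artikel ist nicht im koffer enthalten </s>".split()
p = witten_bell_unigrams(sent * 100, set(sent) | {"ab", "abgeben"})
print(round(math.log10(p["ab"]), 4))   # small, but definitely not -99
```

The absolute value depends on the vocabulary size, but the point is that the unseen-word log probability is finite and well above -99.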
<br>
<br>
The second LM (ACTION_REJECT_004) is built with exactly the same
pipeline, yet the two resulting models are quite different.<br>
<br>
ACTION_REJECT_003.expanded.lm has reasonable smoothed log
probabilities for the unseen unigrams:<br>
<br>
<font face="Courier New">\data\<br>
ngram 1=924<br>
ngram 2=9<br>
ngram 3=8<br>
<br>
\1-grams:<br>
-0.9542425 </s><br>
-10.34236 <BREAK><br>
-99 <s> -99<br>
-10.34236 ab<br>
-10.34236 abgeben<br>
<br>
[...]<br>
<br>
-10.34236 überschritten<br>
-10.34236 übertragung<br>
<br>
\2-grams:<br>
0 <s> der 0<br>
0 artikel ist 0<br>
0 der gewünschte 0<br>
0 enthalten </s><br>
0 gewünschte artikel 0<br>
0 im koffer 0<br>
0 ist nicht 0<br>
0 koffer enthalten 0<br>
0 nicht im 0<br>
<br>
\3-grams:<br>
0 gewünschte artikel ist<br>
0 <s> der gewünschte<br>
0 koffer enthalten </s><br>
0 der gewünschte artikel<br>
0 nicht im koffer<br>
0 artikel ist nicht<br>
0 im koffer enthalten<br>
0 ist nicht im<br>
<br>
\end\</font><br>
<br>
<br>
Whereas in ACTION_REJECT_004.expanded.lm all unseen unigrams get a
log probability of -99, i.e. effectively zero probability:<br>
<br>
<font face="Courier New">\data\<br>
ngram 1=924<br>
ngram 2=7<br>
ngram 3=6<br>
<br>
\1-grams:<br>
-0.845098 </s><br>
-99 <BREAK><br>
-99 <s> -99<br>
-99 ab<br>
-99 abgeben<br>
[...]<br>
-0.845098 aussage -99<br>
[...]<br>
-99 überschritten<br>
-99 übertragung<br>
<br>
\2-grams:<br>
0 <s> ihre 0<br>
0 aussage kann 0<br>
0 ihre aussage 0<br>
0 kann nicht 0<br>
0 nicht verarbeitet 0<br>
0 sagen </s><br>
0 verarbeitet sagen 0<br>
<br>
\3-grams:<br>
0 ihre aussage kann<br>
0 <s> ihre aussage<br>
0 aussage kann nicht<br>
0 kann nicht verarbeitet<br>
0 verarbeitet sagen </s><br>
0 nicht verarbeitet sagen<br>
<br>
\end\</font><br>
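To quantify the difference between the two expanded LMs, I count the unigrams carrying the -99 marker with a small helper (a hypothetical script of mine, assuming the ARPA layout shown in the excerpts above):

```python
# Diagnostic helper (not part of SRILM): count how many unigrams in an
# ARPA-format LM carry the -99 "zero probability" marker.
def count_zeroton_unigrams(arpa_text):
    in_unigrams = False
    total = zeroed = 0
    for line in arpa_text.splitlines():
        line = line.strip()
        if line == "\\1-grams:":
            in_unigrams = True           # start of the unigram section
            continue
        if in_unigrams and line.startswith("\\"):
            break                        # next section (\2-grams: or \end\)
        if in_unigrams and line:
            fields = line.split()        # logprob, word, [backoff weight]
            total += 1
            if float(fields[0]) <= -99:
                zeroed += 1
    return zeroed, total

# Abbreviated excerpt in the same shape as ACTION_REJECT_004.expanded.lm
sample = """\\data\\
ngram 1=4

\\1-grams:
-0.845098 </s>
-99 ab
-99 abgeben
-0.845098 aussage -99

\\end\\"""
print(count_zeroton_unigrams(sample))    # → (2, 4)
```

Running this over both .expanded.lm files makes the discrepancy between the two models easy to see.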
<br>
<br>
None of the words in either training sentence belongs to any of the
classes.<br>
<br>
Also, I found that removing the last word from the second training
sentence fixes the problem.
That is, with the sentence:<br>
<font face="Courier New"><br>
<s> ihre aussage kann nicht </s></font><br>
<br>
the corresponding LM has correctly discounted unigram probabilities
(also around -10). Replacing 'werden' with any other word (I tried
'sagen', 'abgeben' and 'beer') brings the problem back.<br>
<br>
Is this a bug, or am I doing something wrong?<br>
I would appreciate any advice. I can also provide all the necessary
data and scripts if needed.<br>
<br>
Sincerely yours,<br>
Dmytro Prylipko. <br>
<br>
</body>
</html>