<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2900.2963" name=GENERATOR><LINK 
href="BLOCKQUOTE{margin-Top: 0px; margin-Bottom: 0px; margin-Left: 2em}" 
rel=stylesheet></HEAD>
<BODY style="FONT-SIZE: 10pt; MARGIN: 10px; FONT-FAMILY: verdana">
<DIV>Hi all,</DIV>
<DIV>&nbsp;</DIV>
<DIV>I ran into a problem when training a language model with the 
"-text-has-weights" option.</DIV>
<DIV>&nbsp;</DIV>
<DIV>The input text with&nbsp;fractional counts&nbsp;is as follows:</DIV>
<DIV>&nbsp;</DIV>
<DIV>======================================</DIV>
<DIV>&nbsp;</DIV>
<DIV>
<DIV>1&nbsp;china_H&nbsp;today</DIV>
<DIV>1&nbsp;on_H</DIV>
<DIV>1&nbsp;smuggling_H&nbsp;scale</DIV>
<DIV>0.000283545&nbsp;less_H&nbsp;or</DIV>
<DIV>1&nbsp;under_H</DIV>
<DIV>1&nbsp;'s_H&nbsp;last&nbsp;year</DIV>
<DIV>0.202422&nbsp;more_H</DIV>
<DIV>0.000283545&nbsp;more_H</DIV>
<DIV>1&nbsp;less_H&nbsp;more&nbsp;or</DIV>
<DIV>1&nbsp;brought_H</DIV>
<DIV>1&nbsp;crackdown_H&nbsp;the</DIV>
<DIV>1.41754e-05&nbsp;smuggling_H&nbsp;large&nbsp;-&nbsp;scale</DIV>
<DIV>1&nbsp;of_H</DIV>
<DIV>0.105263&nbsp;less_H&nbsp;more&nbsp;or</DIV>
<DIV>0.0021736&nbsp;brought_H&nbsp;more</DIV>
<DIV>1.02756e-05&nbsp;brought_H&nbsp;less</DIV>
<DIV>0.202422&nbsp;been_H</DIV>
<DIV>1&nbsp;been_H</DIV>
<DIV>0.105263&nbsp;been_H</DIV>
<DIV>0.0021736&nbsp;been_H</DIV></DIV>
<DIV>&nbsp;</DIV>
<DIV>=======================================</DIV>
<DIV>&nbsp;</DIV>
<DIV>On each line, the fractional count comes first and is separated from the sentence by a space.</DIV>
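<DIV>&nbsp;</DIV>
<DIV>To make the format concrete, here is a minimal Python sketch of how I understand these lines: the first field is the weight, and every word in the sentence is counted with that weight. The function names are hypothetical and this is only my reading of the format, not SRILM's implementation.</DIV>
<PRE>
```python
from collections import defaultdict


def parse_weighted_line(line):
    """Split one line into (weight, words); the first field is the weight."""
    fields = line.split()
    return float(fields[0]), fields[1:]


def unigram_counts(lines):
    """Add each sentence's weight to the count of every word it contains."""
    counts = defaultdict(float)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        weight, words = parse_weighted_line(line)
        for word in words:
            counts[word] += weight
    return dict(counts)


# A few lines copied from the input shown above.
sample = [
    "1 china_H today",
    "0.000283545 less_H or",
    "1 less_H more or",
    "0.105263 less_H more or",
]
counts = unigram_counts(sample)
```
</PRE>
<DIV>With this reading, "less_H" would receive a fractional unigram count equal to the sum of the three weights of the lines it appears in.</DIV>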
<DIV>&nbsp;</DIV>
<DIV>When I use Kneser-Ney discounting ("-kndiscount"), training fails. The command is:</DIV>
<DIV>&nbsp;</DIV>
<DIV><STRONG>./ngram-count&nbsp;-text-has-weights&nbsp;test&nbsp;-order&nbsp;3&nbsp;-lm&nbsp;test.o3.lm.gz&nbsp;-float-counts&nbsp;-unk&nbsp;-kndiscount 
</STRONG></DIV>
<DIV><STRONG></STRONG>&nbsp;</DIV>
<DIV>and the error message is:</DIV>
<DIV>&nbsp;</DIV>
<DIV>error&nbsp;in&nbsp;discount&nbsp;estimator&nbsp;for&nbsp;order&nbsp;1 
</DIV>
<DIV>&nbsp;</DIV>
<DIV>&nbsp;</DIV>
<DIV>&nbsp;</DIV>
<DIV>I then searched the Internet for more information and found that, 
with&nbsp;the "-float-counts" option, only certain discounting methods support non-integer counts 
(wbdiscount and cdiscount). So I switched to Witten-Bell discounting with the 
command:</DIV>
<DIV>&nbsp;</DIV>
<DIV><STRONG>./ngram-count&nbsp;-text-has-weights&nbsp;test&nbsp;-order&nbsp;3&nbsp;-lm&nbsp;test.o3.lm.gz&nbsp;-float-counts&nbsp;-unk&nbsp;-wbdiscount&nbsp;-debug&nbsp;3</STRONG></DIV>
<DIV>&nbsp;</DIV>
<DIV>and the output is:</DIV>
<DIV>&nbsp;</DIV>
<DIV>
<DIV>using&nbsp;WittenBell&nbsp;for&nbsp;1-grams</DIV>
<DIV>using&nbsp;WittenBell&nbsp;for&nbsp;2-grams</DIV>
<DIV>using&nbsp;WittenBell&nbsp;for&nbsp;3-grams</DIV>
<DIV>warning:&nbsp;distributing&nbsp;1&nbsp;left-over&nbsp;probability&nbsp;mass&nbsp;over&nbsp;2&nbsp;zeroton&nbsp;words</DIV>
<DIV>writing&nbsp;3&nbsp;1-grams</DIV>
<DIV>writing&nbsp;0&nbsp;2-grams</DIV>
<DIV>writing&nbsp;0&nbsp;3-grams</DIV></DIV>
<DIV>&nbsp;</DIV>
<DIV>&nbsp;</DIV>
<DIV>It looks as if everything went well; however, the LM file contains 
only:</DIV>
<DIV>&nbsp;</DIV>
<DIV>
<DIV>\data\</DIV>
<DIV>ngram&nbsp;1=3</DIV>
<DIV>ngram&nbsp;2=0</DIV>
<DIV>ngram&nbsp;3=0</DIV>
<DIV></DIV>
<DIV>\1-grams:</DIV>
<DIV>-0.30103&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;/s&gt;</DIV>
<DIV>-99&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;s&gt;</DIV>
<DIV>-0.30103&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;unk&gt;</DIV>
<DIV></DIV>
<DIV>\2-grams:</DIV>
<DIV></DIV>
<DIV>\3-grams:</DIV>
<DIV></DIV>
<DIV>\end\</DIV></DIV>
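<DIV>&nbsp;</DIV>
<DIV>As a sanity check on these numbers (my own arithmetic, not taken from the SRILM source, with variable names of my own): if the 1 unit of left-over probability mass in the warning is split evenly over the 2 zeroton words, each word gets probability 0.5, which matches the -0.30103 listed for &lt;/s&gt; and &lt;unk&gt;. So the model appears to contain none of the words from my input.</DIV>
<PRE>
```python
import math

leftover_mass = 1.0  # "distributing 1 left-over probability mass"
zeroton_words = 2    # "over 2 zeroton words"

# ARPA files store base-10 log probabilities.
logprob = math.log10(leftover_mass / zeroton_words)
print(round(logprob, 5))  # prints -0.30103
```
</PRE>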
<DIV>&nbsp;</DIV>
<DIV>&nbsp;</DIV>
<DIV>So what is the problem? Is there something wrong with the input file or the 
command line?</DIV>
<DIV>&nbsp;</DIV>
<DIV>Thanks and Regards</DIV>
<DIV>&nbsp;</DIV>
<DIV>Tu Zhaopeng</DIV>
<DIV>&nbsp;</DIV>
<DIV align=left><FONT color=#c0c0c0>2010-03-23 </FONT></DIV>
<HR style="WIDTH: 122px; HEIGHT: 2px" align=left SIZE=2>

<DIV><FONT color=#c0c0c0><SPAN>
<DIV>
<DIV>---------------------------------------------------<BR>&nbsp;Tu 
Zhaopeng<BR>&nbsp;Institute of Computing Technology,<BR>&nbsp;Chinese Academy of 
Sciences<BR>&nbsp;<A 
href="http://nlp.ict.ac.cn/~tuzhaopeng/">http://nlp.ict.ac.cn/~tuzhaopeng/</A><BR>---------------------------------------------------&nbsp;</DIV></DIV></SPAN> 
</FONT></DIV></BODY></HTML>