<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2900.2963" name=GENERATOR><LINK
href="BLOCKQUOTE{margin-Top: 0px; margin-Bottom: 0px; margin-Left: 2em}"
rel=stylesheet></HEAD>
<BODY style="FONT-SIZE: 10pt; MARGIN: 10px; FONT-FAMILY: verdana">
<DIV>Hi People,</DIV>
<DIV> </DIV>
<DIV>I ran into a problem when training a language model with the
"-text-has-weights" option.</DIV>
<DIV> </DIV>
<DIV>The input text with fractional counts is shown below:</DIV>
<DIV> </DIV>
<DIV>======================================</DIV>
<DIV> </DIV>
<DIV>
<DIV>1 china_H today</DIV>
<DIV>1 on_H</DIV>
<DIV>1 smuggling_H scale</DIV>
<DIV>0.000283545 less_H or</DIV>
<DIV>1 under_H</DIV>
<DIV>1 's_H last year</DIV>
<DIV>0.202422 more_H</DIV>
<DIV>0.000283545 more_H</DIV>
<DIV>1 less_H more or</DIV>
<DIV>1 brought_H</DIV>
<DIV>1 crackdown_H the</DIV>
<DIV>1.41754e-05 smuggling_H large - scale</DIV>
<DIV>1 of_H</DIV>
<DIV>0.105263 less_H more or</DIV>
<DIV>0.0021736 brought_H more</DIV>
<DIV>1.02756e-05 brought_H less</DIV>
<DIV>0.202422 been_H</DIV>
<DIV>1 been_H</DIV>
<DIV>0.105263 been_H</DIV>
<DIV>0.0021736 been_H</DIV></DIV>
<DIV> </DIV>
<DIV>=======================================</DIV>
<DIV> </DIV>
<DIV>On each line, the fractional count and the sentence are separated by a space.</DIV>
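<DIV>&nbsp;</DIV>
<DIV>As I understand the ngram-count documentation, the first field of each line
is treated as a weight that multiplies the N-gram counts for that line. Below is
a minimal Python sketch of that reading, for unigrams only. It is not SRILM code;
the function names are my own and are used here only for illustration.</DIV>
<PRE>
```python
from collections import defaultdict


def parse_weighted_line(line):
    """Split one input line into (weight, words); the first field is the weight."""
    fields = line.split()
    return float(fields[0]), fields[1:]


def weighted_unigram_counts(lines):
    """Add each line's weight to the fractional count of every word on that line."""
    counts = defaultdict(float)
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        weight, words = parse_weighted_line(line)
        for word in words:
            counts[word] += weight
    return dict(counts)


# Three lines taken from the input shown in this message.
sample = [
    "1 china_H today",
    "0.202422 more_H",
    "0.000283545 more_H",
]
counts = weighted_unigram_counts(sample)
for word in sorted(counts):
    print(word, counts[word])
```
</PRE>
<DIV>Under this reading, "more_H" should end up with a count of roughly 0.2027,
so I would expect non-integer unigram counts for this input.</DIV>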
<DIV> </DIV>
<DIV>When I use Kneser-Ney discounting (-kndiscount), training fails. The command is:</DIV>
<DIV> </DIV>
<DIV><STRONG>./ngram-count -text-has-weights test -order 3 -lm test.o3.lm.gz -float-counts -unk -kndiscount
</STRONG></DIV>
<DIV><STRONG></STRONG> </DIV>
<DIV>and the error message is:</DIV>
<DIV> </DIV>
<DIV>error in discount estimator for order 1
</DIV>
<DIV> </DIV>
<DIV> </DIV>
<DIV> </DIV>
<DIV>I then searched the Internet for more information and found that, with
the "-float-counts" option, <SPAN class=Apple-style-span
style="WORD-SPACING: 0px; FONT: medium Monaco; TEXT-TRANSFORM: none; COLOR: rgb(0,0,0); TEXT-INDENT: 0px; WHITE-SPACE: normal; LETTER-SPACING: normal; BORDER-COLLAPSE: separate; orphans: 2; widows: 2; webkit-border-horizontal-spacing: 0px; webkit-border-vertical-spacing: 0px; webkit-text-decorations-in-effect: none; webkit-text-size-adjust: auto; webkit-text-stroke-width: 0px"><FONT
face=Verdana size=2>only certain discounting methods support non-integer counts
(wbdiscount and cdiscount). So I switched to Witten-Bell discounting with the
command:</FONT></SPAN></DIV>
<DIV>&nbsp;</DIV>
<DIV><SPAN class=Apple-style-span
style="WORD-SPACING: 0px; FONT: medium Monaco; TEXT-TRANSFORM: none; COLOR: rgb(0,0,0); TEXT-INDENT: 0px; WHITE-SPACE: normal; LETTER-SPACING: normal; BORDER-COLLAPSE: separate; orphans: 2; widows: 2; webkit-border-horizontal-spacing: 0px; webkit-border-vertical-spacing: 0px; webkit-text-decorations-in-effect: none; webkit-text-size-adjust: auto; webkit-text-stroke-width: 0px"><STRONG><FONT
face=Verdana
size=2>./ngram-count -text-has-weights test -order 3 -lm test.o3.lm.gz -float-counts -unk -wbdiscount -debug 3</FONT></STRONG>
</SPAN></DIV>
<DIV> </DIV>
<DIV>and the output information is:</DIV>
<DIV> </DIV>
<DIV>
<DIV>using WittenBell for 1-grams</DIV>
<DIV>using WittenBell for 2-grams</DIV>
<DIV>using WittenBell for 3-grams</DIV>
<DIV>warning: distributing 1 left-over probability mass over 2 zeroton words</DIV>
<DIV>writing 3 1-grams</DIV>
<DIV>writing 0 2-grams</DIV>
<DIV>writing 0 3-grams</DIV></DIV>
<DIV> </DIV>
<DIV> </DIV>
<DIV>Everything appears to run without errors; however, the LM file contains
only:</DIV>
<DIV> </DIV>
<DIV>
<DIV>\data\</DIV>
<DIV>ngram 1=3</DIV>
<DIV>ngram 2=0</DIV>
<DIV>ngram 3=0</DIV>
<DIV></DIV>
<DIV>\1-grams:</DIV>
<DIV>-0.30103 &lt;/s&gt;</DIV>
<DIV>-99 &lt;s&gt;</DIV>
<DIV>-0.30103 &lt;unk&gt;</DIV>
<DIV></DIV>
<DIV>\2-grams:</DIV>
<DIV></DIV>
<DIV>\3-grams:</DIV>
<DIV></DIV>
<DIV>\end\</DIV></DIV>
<DIV> </DIV>
<DIV> </DIV>
<DIV>So what is the problem? Is there something wrong with the input file or the
command line?</DIV>
<DIV> </DIV>
<DIV>Thanks and Regards</DIV>
<DIV> </DIV>
<DIV>Tu Zhaopeng</DIV>
<DIV> </DIV>
<DIV align=left><FONT color=#c0c0c0>2010-03-23 </FONT></DIV>
<HR style="WIDTH: 122px; HEIGHT: 2px" align=left SIZE=2>
<DIV><FONT color=#c0c0c0><SPAN>
<DIV>
<DIV>---------------------------------------------------<BR> Tu
Zhaopeng<BR> Institute of Computing Technology,<BR> Chinese Academy of
Sciences<BR> <A
href="http://nlp.ict.ac.cn/~tuzhaopeng/">http://nlp.ict.ac.cn/~tuzhaopeng/</A><BR>--------------------------------------------------- </DIV></DIV></SPAN>
</FONT></DIV></BODY></HTML>