[SRILM User List] A problem with ngram-count with option "-text-has-weights"

tuzhaopeng tuzhaopeng at ict.ac.cn
Mon Mar 22 23:33:29 PDT 2010


Hi all,

I ran into a problem when training a language model with the option "-text-has-weights".

The input text with fractional counts is as follows:

======================================

1 china_H today
1 on_H
1 smuggling_H scale
0.000283545 less_H or
1 under_H
1 's_H last year
0.202422 more_H
0.000283545 more_H
1 less_H more or
1 brought_H
1 crackdown_H the
1.41754e-05 smuggling_H large - scale
1 of_H
0.105263 less_H more or
0.0021736 brought_H more
1.02756e-05 brought_H less
0.202422 been_H
1 been_H
0.105263 been_H
0.0021736 been_H

=======================================

The fractional count and the sentence are separated by a space.

When I use Kneser-Ney discounting (-kndiscount), training fails. The command is:
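
To make the input format concrete, here is a minimal Python sketch of how I understand the weighted lines: the first field is the weight, and every n-gram in the rest of the line is counted with that weight. This is only an illustration, not SRILM code; the function name `weighted_ngram_counts` and the handling of the `<s>` and `</s>` markers are my own assumptions.

```python
from collections import defaultdict

def weighted_ngram_counts(lines, order=3):
    """Accumulate fractional n-gram counts from 'weight sentence' lines.

    Hypothetical helper for illustration; SRILM's internal counting
    may differ in details (e.g. how <s> is counted).
    """
    counts = defaultdict(float)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        weight = float(fields[0])  # first field is the sentence weight
        words = ["<s>"] + fields[1:] + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += weight
    return counts

sample = [
    "1 china_H today",
    "0.202422 more_H",
    "0.000283545 more_H",
]
counts = weighted_ngram_counts(sample, order=2)
print(counts[("more_H",)])          # sum of the two "more_H" weights
print(counts[("<s>", "china_H")])   # bigram counted with weight 1
```

Under this reading, the two "more_H" lines contribute their weights additively to the same unigram.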

./ngram-count -text-has-weights test -order 3 -lm test.o3.lm.gz -float-counts -unk -kndiscount 

and the error message is:

error in discount estimator for order 1 



I then searched online and found that, with the option "-float-counts", only certain discounting methods support non-integer counts (wbdiscount and cdiscount). So I switched to Witten-Bell discounting with the command:

./ngram-count -text-has-weights test -order 3 -lm test.o3.lm.gz -float-counts -unk -wbdiscount -debug 3 

and the output is:

using WittenBell for 1-grams
using WittenBell for 2-grams
using WittenBell for 3-grams
warning: distributing 1 left-over probability mass over 2 zeroton words
writing 3 1-grams
writing 0 2-grams
writing 0 3-grams


Everything seems to go well; however, the LM file contains only:

\data\
ngram 1=3
ngram 2=0
ngram 3=0
\1-grams:
-0.30103        </s>
-99     <s>
-0.30103        <unk>
\2-grams:
\3-grams:
\end\
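
For reference, the values in the LM file are base-10 log probabilities, so -0.30103 corresponds to a probability of about 0.5. That matches the warning above about distributing 1 unit of left-over mass over 2 words. The following Python sketch reads the file contents shown above and prints the declared n-gram counts and the unigram probabilities. The function name `summarize_arpa` is hypothetical, and the parser only handles the simple layout shown here.

```python
import re

def summarize_arpa(text):
    """Return (declared n-gram counts, unigram probabilities) from ARPA text.

    Hypothetical helper; handles only the minimal layout shown above.
    """
    declared = {}
    unigram_probs = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = re.match(r"ngram\s+(\d+)\s*=\s*(\d+)$", line)
        if m:
            declared[int(m.group(1))] = int(m.group(2))
        elif line.startswith("\\"):
            section = line  # e.g. \data\, \1-grams:, \end\
        elif section == "\\1-grams:":
            fields = line.split()
            # ARPA files store log10 probabilities
            unigram_probs[fields[1]] = 10.0 ** float(fields[0])
    return declared, unigram_probs

arpa_text = r"""\data\
ngram 1=3
ngram 2=0
ngram 3=0
\1-grams:
-0.30103        </s>
-99     <s>
-0.30103        <unk>
\2-grams:
\3-grams:
\end\
"""

declared, unigram_probs = summarize_arpa(arpa_text)
print(declared)
print(unigram_probs)
```

So none of the words from the input text appear in the model; only </s>, <s> and <unk> are written.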


So what is the problem? Is there something wrong with the input file or the command line?

Thanks and Regards

Tu Zhaopeng

2010-03-23 



---------------------------------------------------
 Tu Zhaopeng
 Institute of Computing Technology,
 Chinese Academy of Sciences
 http://nlp.ict.ac.cn/~tuzhaopeng/
--------------------------------------------------- 
 