<html><body><div style="color:#000; background-color:#fff; font-family:verdana, helvetica, sans-serif;font-size:12pt"><div><span>Thank you, </span></div><div><span>-cdiscount 0 works perfectly, but now that </span><span>I have read about smoothing and the different methods of discounting, I have </span><span>another question:</span><span><br></span></div><div><span><br></span></div><div><span>I would like your thoughts on this problem:</span></div><div><span>I want to build a model from a text and then predict what the user is typing (a word-prediction approach): at any moment I will predict the next character according to my bigrams.</span></div><div><span>Do you think discounting and smoothing methods are useful in treating the training data,</span></div><div><span>or would it be more appropriate to simply disable them?</span></div><div><br><span></span></div><div><span>Thank
you</span></div><div><span>Saman<br></span></div><div><span><br></span></div><div><br></div> <div style="font-family: verdana,helvetica,sans-serif; font-size: 12pt;"> <div style="font-family: times new roman,new york,times,serif; font-size: 12pt;"> <div dir="ltr"> <font size="2" face="Arial"> <hr size="1"> <b><span style="font-weight: bold;">From:</span></b> Andreas Stolcke <stolcke@icsi.berkeley.edu><br> <b><span style="font-weight: bold;">To:</span></b> Saman Noorzadeh <saman_2004@yahoo.com> <br><b><span style="font-weight: bold;">Cc:</span></b> Srilm group <srilm-user@speech.sri.com> <br> <b><span style="font-weight: bold;">Sent:</span></b> Wednesday, April 11, 2012 1:46 AM<br> <b><span style="font-weight: bold;">Subject:</span></b> Re: [SRILM User List] how are the probabilities computed in ngram-count<br> </font> </div> <br>
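The approach described above (predicting the next character from raw bigram counts) can be sketched in a few lines. This is only an illustration of the idea, not SRILM code; the function names and the toy training string are invented for the example, and it deliberately uses no smoothing, so an unseen history yields no prediction at all:

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Collect character-bigram counts: counts[a][b] = #times b follows a."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def predict_next(counts, prev_char):
    """Return the most frequent follower of prev_char, or None if unseen."""
    followers = counts.get(prev_char)
    if not followers:
        return None  # unseen history: exactly where smoothing would matter
    return followers.most_common(1)[0][0]

counts = train_bigrams("the cook was there and the cook was here")
print(predict_next(counts, "t"))  # → h
```

The `None` branch is the practical answer to the question: without smoothing, the predictor is simply silent on histories absent from the training text, which may be acceptable for a prediction UI even though it would be fatal for scoring unseen sequences.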
<div id="yiv53509230">
<div>
On 4/10/2012 1:29 AM, Saman Noorzadeh wrote:
<blockquote type="cite">
<div style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); font-family: verdana,helvetica,sans-serif; font-size: 12pt;">
<div>Hello</div>
<div>I am getting confused about the models that ngram-count
make:</div>
<div>ngram-count -order 2 -write-vocab vocabulary.voc -text
mytext.txt -write <a target="_blank" href="http://model1.bo">model1.bo</a><br>
ngram-count -order 2 -read model1.bo -lm <a target="_blank" href="http://model2.BO">model2.BO</a></div>
<div><br>
</div>
<div>for example: (the text is very large and these words are
just a sample)<br>
</div>
<div><br>
</div>
<div>in model1.bo:</div>
<div>cook 14 <br>
</div>
<div>cook was 1</div>
<div><br>
</div>
<div>in model2.BO:</div>
<div>-1.904738 cook was </div>
<div><br>
</div>
<div>my question is that the probability of the 'cook was' bigram
should be log10(1/14), but the ngram-count result shows
log10(1/80) == -1.9047</div>
<div>how are these probabilities computed?</div>
</div>
</blockquote>
<br>
It's called "smoothing" or "discounting" and ensures that word
sequences (ngrams) never seen in the training data receive nonzero
probability.<br>
Please consult any of the basic LM tutorial sources listed at
http://www.speech.sri.com/projects/srilm/manpages/, or specifically
http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html
.<br>
<br>
To obtain the unsmoothed probability estimates that you are
expecting, you need to change the parameters. Try ngram-count
-addsmooth 0 .... <br>
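The arithmetic behind the discrepancy in the thread can be sketched as follows. With c(cook) = 14 and c(cook was) = 1, the unsmoothed maximum-likelihood estimate is 1/14, while the default discounted model reserves probability mass for unseen bigrams and therefore reports a smaller value (here about 1/80). This is an illustrative sketch, not SRILM internals; the `addsmooth` helper and the vocabulary size of 10000 are assumptions for the example:

```python
import math

# Counts from the thread: c(cook) = 14, c(cook was) = 1.
c_cook, c_cook_was = 14, 1

# Unsmoothed (maximum-likelihood) estimate: 1/14.
mle = c_cook_was / c_cook
print(round(math.log10(mle), 4))  # → -1.1461

# Discounting lowers this to roughly log10(1/80) == -1.9047, because
# some probability mass is redistributed to bigrams never seen in training.

# Additive smoothing adds k to every count; with k = 0 (ngram-count
# -addsmooth 0) it reduces to the raw relative frequency:
def addsmooth(c_bigram, c_history, k, vocab_size):
    return (c_bigram + k) / (c_history + k * vocab_size)

print(addsmooth(c_cook_was, c_cook, 0, 10000) == mle)  # → True
```

With k = 0 the vocabulary size drops out entirely, which is why `-addsmooth 0` recovers exactly the log10(1/14) estimate the original question expected.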
<br>
Andreas<br>
<br>
</div>
</div><br><br> </div> </div> </div></body></html>