<html><body><div style="color:#000; background-color:#fff; font-family:verdana, helvetica, sans-serif;font-size:12pt"><div><span>Thank you, </span></div><div><span>-cdiscount 0 works perfectly, but now that </span><span>I have read about smoothing and the different discounting methods, I have </span><span>another question:</span><span><br></span></div><div><span><br></span></div><div><span>I would like to know your thoughts on this problem:</span></div><div><span>I want to build a model from a text and then predict what the user is typing (a word prediction approach). At any moment I will predict the next character according to my bigrams.</span></div><div><span>Do you think discounting and smoothing are useful for this kind of training data?</span></div><div><span>Or is it more appropriate to just disable them?</span></div><div><br><span></span></div><div><span>Thank

 you</span></div><div><span>Saman<br></span></div><div><span><br></span></div><div><br></div>  <div style="font-family: verdana,helvetica,sans-serif; font-size: 12pt;"> <div style="font-family: times new roman,new york,times,serif; font-size: 12pt;"> <div dir="ltr"> <font size="2" face="Arial"> <hr size="1">  <b><span style="font-weight: bold;">From:</span></b> Andreas Stolcke <stolcke@icsi.berkeley.edu><br> <b><span style="font-weight: bold;">To:</span></b> Saman Noorzadeh <saman_2004@yahoo.com> <br><b><span style="font-weight: bold;">Cc:</span></b> Srilm group <srilm-user@speech.sri.com> <br> <b><span style="font-weight: bold;">Sent:</span></b> Wednesday, April 11, 2012 1:46 AM<br> <b><span style="font-weight: bold;">Subject:</span></b> Re: [SRILM User List] how are the probabilities computed in ngram-count<br> </font> </div> <br>

<div id="yiv53509230">

  
  <div>

    On 4/10/2012 1:29 AM, Saman Noorzadeh wrote:

    <blockquote type="cite">

      <div style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); font-family: verdana,helvetica,sans-serif; font-size: 12pt;">

        <div>Hello</div>

        <div>I am getting confused about the models that ngram-count

          makes:</div>

        <div>ngram-count -order 2  -write-vocab vocabulary.voc -text

          mytext.txt   -write model1.bo<br>

          ngram-count -order 2  -read model1.bo -lm model2.BO</div>

        <div><br>

        </div>

        <div>For example (the text is very large and these words are

          just a sample)<br>

        </div>

        <div><br>

        </div>

        <div>in model1.bo:</div>

        <div>cook   14 <br>

        </div>

        <div>cook was 1</div>

        <div><br>

        </div>

        <div>in model2.BO:</div>

        <div>-1.904738  cook was </div>

        <div><br>

        </div>

        <div>My question is that the probability of the 'cook was' bigram

          should be log10(1/14), but the ngram-count result is -1.9047,

          which is roughly log10(1/80).</div>

        <div>How are these probabilities computed?</div>

      </div>

    </blockquote>

    <br>

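    It's called "smoothing" or "discounting" and ensures that ngrams

    (word sequences) never seen in the training data receive nonzero

    probability. To do so, it takes some probability mass away from the

    ngrams that were seen, which is why the estimate for 'cook was' is

    lower than 1/14.<br>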

    Please consult any of the basic LM tutorial sources listed at

    http://www.speech.sri.com/projects/srilm/manpages/, or specifically

    http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html

    .<br>

    <br>

    To obtain the unsmoothed probability estimates that you are

    expecting you need to change the parameters.  Try ngram-count 

    -addsmooth 0 .... <br>
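    <br>
    As a rough illustration of the difference, here is a minimal Python
    sketch of additive (add-delta) smoothing using the counts from the
    question. It is not what ngram-count does by default (the default is
    Good-Turing discounting with Katz backoff), the function name is made
    up for this example, and the vocabulary size of 1000 is an arbitrary
    assumption. With delta set to 0 the formula reduces to the plain
    relative frequency 1/14, which is, as far as I understand it, the
    effect of -addsmooth 0.<br>
<pre>
```python
import math

def addsmooth_prob(bigram_count, history_count, vocab_size, delta):
    """Additive smoothing: (c(h,w) + delta) / (c(h) + delta * V)."""
    return (bigram_count + delta) / (history_count + delta * vocab_size)

# Counts taken from the question above.
C_COOK = 14      # c("cook")
C_COOK_WAS = 1   # c("cook was")
V = 1000         # hypothetical vocabulary size, not from the thread

# delta = 0 gives the unsmoothed relative frequency 1/14.
mle = addsmooth_prob(C_COOK_WAS, C_COOK, V, delta=0.0)
print(round(math.log10(mle), 4))  # -1.1461

# A small positive delta lowers the seen bigram and lifts unseen ones above zero.
seen = addsmooth_prob(C_COOK_WAS, C_COOK, V, delta=0.01)
unseen = addsmooth_prob(0, C_COOK, V, delta=0.01)
print(math.log10(seen), math.log10(unseen))
```
</pre>
    The same trade-off applies to any discounting method: the seen bigram
    ends up below 1/14 so that unseen continuations of "cook" can receive
    a nonzero share.<br>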

    <br>

    Andreas<br>

    <br>

  </div>

</div><br><br> </div> </div>  </div></body></html>