<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    On 4/10/2012 1:29 AM, Saman Noorzadeh wrote:
    <blockquote
      cite="mid:1334046577.74819.YahooMailNeo@web162005.mail.bf1.yahoo.com"
      type="cite">
      <div style="color: rgb(0, 0, 0); background-color: rgb(255, 255,
        255); font-family: verdana,helvetica,sans-serif; font-size:
        12pt;">
        <div>Hello</div>
        <div>I am confused about the models that ngram-count
          makes:</div>
        <div>ngram-count -order 2  -write-vocab vocabulary.voc -text
          mytext.txt   -write model1.bo<br>
          ngram-count -order 2  -read model1.bo -lm model2.BO</div>
        <div><br>
        </div>
        <div>For example (the text is very large and these words are
          just a sample):<br>
        </div>
        <div><br>
        </div>
        <div>in model1.bo:</div>
        <div>cook   14 <br>
        </div>
        <div>cook was 1</div>
        <div><br>
        </div>
        <div>in model2.BO:</div>
        <div>-1.904738  cook was </div>
        <div><br>
        </div>
        <div>My question: the probability of the 'cook was' bigram
          should be log10(1/14), but the ngram-count result is -1.9047,
          which is roughly log10(1/80).</div>
        <div>How are these probabilities computed?</div>
      </div>
    </blockquote>
    <br>
    This is the effect of "smoothing" (also called "discounting"), which
    lowers the probability estimates of observed ngrams so that ngrams
    never seen in the training data receive nonzero probability.<br>
    Please consult any of the basic LM tutorial sources listed at
    <a class="moz-txt-link-freetext" href="http://www.speech.sri.com/projects/srilm/manpages/">http://www.speech.sri.com/projects/srilm/manpages/</a>, or specifically
    <a class="moz-txt-link-freetext" href="http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html">http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html</a>.<br>
    <br>
    To obtain the unsmoothed probability estimates that you are
    expecting, you need to change the smoothing parameters.  Try
    ngram-count -addsmooth 0 .... <br>
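    <br>
    To make the difference concrete, here is a rough Python sketch of
    the arithmetic, using only the numbers quoted in the question. It
    computes the unsmoothed estimate and the discounted count implied by
    the reported value. The variable names are illustrative, and the
    sketch assumes the denominator of the smoothed estimate is still the
    raw count of "cook" (14); it does not reproduce the actual
    discounting formula ngram-count applies.<br>
<pre>
```python
import math

# Counts quoted from model1.bo in the question.
context_count = 14   # count of "cook"
bigram_count = 1     # count of "cook was"

# Log probability reported in model2.BO for "cook was".
reported_log10 = -1.904738

# Unsmoothed (maximum-likelihood) estimate: c(cook was) / c(cook).
mle_prob = bigram_count / context_count
mle_log10 = math.log10(mle_prob)

# Effective (discounted) bigram count implied by the reported value,
# ASSUMING the denominator is still the raw context count of 14.
reported_prob = 10 ** reported_log10
implied_count = reported_prob * context_count

print(f"MLE log10 P(was | cook)  = {mle_log10:.4f}")
print(f"reported P(was | cook)   = {reported_prob:.5f}")
print(f"implied discounted count = {implied_count:.3f}")
```
</pre>
    Under that assumption, the single observation of "cook was" is
    being treated as roughly 0.17 of an occurrence, and the probability
    mass removed this way is what gets redistributed to unseen
    bigrams.<br>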
    <br>
    Andreas<br>
    <br>
  </body>
</html>