<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
On 4/10/2012 1:29 AM, Saman Noorzadeh wrote:
<blockquote
cite="mid:1334046577.74819.YahooMailNeo@web162005.mail.bf1.yahoo.com"
type="cite">
<div style="color: rgb(0, 0, 0); background-color: rgb(255, 255,
255); font-family: verdana,helvetica,sans-serif; font-size:
12pt;">
<div>Hello</div>
<div>I am getting confused about the models that ngram-count
make:</div>
<div>ngram-count -order 2 -write-vocab vocabulary.voc -text
mytext.txt -write model1.bo<br>
ngram-count -order 2 -read model1.bo -lm model2.BO</div>
<div><br>
</div>
<div>forexample: (the text is very large and these words are
just a sample)<br>
</div>
<div><br>
</div>
<div>in model1.bo:</div>
<div>cook 14 <br>
</div>
<div>cook was 1</div>
<div><br>
</div>
<div>in model2.BO:</div>
<div>-1.904738 cook was </div>
<div><br>
</div>
<div>my question: shouldn't the probability of the bigram
'cook was' be log10(1/14)? But the ngram-count result shows
log10(1/80) == -1.9047</div>
<div>How are these probabilities computed?</div>
</div>
</blockquote>
<br>
It's called "smoothing" or "discounting", and it ensures that
ngrams never seen in the training data receive nonzero
probability.<br>
Please consult any of the basic LM tutorial sources listed at
<a class="moz-txt-link-freetext" href="http://www.speech.sri.com/projects/srilm/manpages/">http://www.speech.sri.com/projects/srilm/manpages/</a>, or specifically
<a class="moz-txt-link-freetext" href="http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html">http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html</a>
.<br>
<br>
To obtain the unsmoothed probability estimates that you are
expecting, you need to change the smoothing parameters. Try ngram-count
-addsmooth 0 .... <br>
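For illustration, here is a minimal sketch of the add-delta (Lidstone) estimate that -addsmooth controls. The counts c("cook") = 14 and c("cook was") = 1 are taken from your example; the vocabulary size is a made-up placeholder. With delta = 0 the estimate reduces to the maximum-likelihood value 1/14 that you expected, while the default discounting is what yields the smaller -1.9047 value.<br>

```python
import math

# Counts from the question: c("cook") = 14, c("cook was") = 1.
# VOCAB_SIZE is a hypothetical placeholder, not from the data.
VOCAB_SIZE = 5000

def addsmooth_prob(bigram_count, history_count, vocab_size, delta):
    """Add-delta (Lidstone) estimate: (c(h,w) + delta) / (c(h) + delta * V)."""
    return (bigram_count + delta) / (history_count + delta * vocab_size)

# With delta = 0 this is the plain MLE c(h,w)/c(h) = 1/14:
p = addsmooth_prob(1, 14, VOCAB_SIZE, 0.0)
print(round(math.log10(p), 4))  # -1.1461, i.e. log10(1/14)

# With any delta > 0, even an unseen bigram gets nonzero probability:
p_unseen = addsmooth_prob(0, 14, VOCAB_SIZE, 1.0)
print(p_unseen > 0.0)  # True
```

<br>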
<br>
Andreas<br>
<br>
</body>
</html>