addsmooth on unigrams

Wed Oct 15 04:02:12 PDT 2008

Great toolkit and thanks for the recent update. SRILM has been really useful
for some computational phonology problems I've been working on.

I know add-one smoothing is not recommended, but at the moment I'm trying
to replicate someone else's results. I have a question about the unigram
case, because the results aren't what I was expecting. These results are
from version 1.5.6; as far as I am aware, addsmooth has not changed since
then.

Following the equation in the ngram-discount manual page:
p(a_z) = (c(a_z) + D) / (c(a_) + D n(*))

I assume that, in the unigram case, c(a_) simplifies to the total number of
word tokens (Jurafsky and Martin, 2000). When D=0, this appears to be the
case. For example, for the test data below:
p(</s>) = 1/18
log10(1/18) = -1.255273
which matches the output shown below.
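
As a quick sanity check of the D=0 arithmetic, here is a minimal Python
sketch (not SRILM code). The function name addsmooth_prob and the counts
dictionary are illustrative names only. It assumes, as above, that the
denominator is the 18 tokens (17 "a" plus 1 "</s>") and that the LM file
stores base-10 logs.

```python
import math

# Counts from the test run in this post. Assumption: <s> is not part of
# the denominator, so the total is 17 + 1 = 18 tokens.
counts = {"a": 17, "</s>": 1}
total = sum(counts.values())


def addsmooth_prob(count, total, d, n_types):
    """Additive smoothing as written in the manual-page formula:
    (c(a_z) + D) / (c(a_) + D * n(*))."""
    return (count + d) / (total + d * n_types)


# With D = 0 the formula reduces to the relative frequency.
for word, c in counts.items():
    p = addsmooth_prob(c, total, d=0, n_types=len(counts))
    print(f"{word}\t{math.log10(p):.6f}")
```

The two printed values can be compared directly with the -addsmooth 0
output in the test run.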

With D=1, I was expecting (taking n(*) = 2):
p(</s>) = (1+1) / (18+2)
log10(2/20) = -1

However, the result below, -0.9242793, corresponds to a raw probability
very close to 5/42, and I'm not sure where this comes from. If anyone could
explain how this is calculated, I'd be very grateful.
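
To make the mismatch concrete, the following Python sketch converts the
reported base-10 log values back to raw probabilities and sets them next to
the value I expected. It only redoes the arithmetic from this post and does
not attempt to reproduce what ngram-count does internally; the variable
names are illustrative.

```python
import math

# Expected value for D = 1, using 18 tokens and n(*) = 2 as assumed above.
expected_end = (1 + 1) / (18 + 1 * 2)

# Values reported by ngram-count with -addsmooth 1 (base-10 logs).
observed_end = 10 ** -0.9242793   # entry for </s>
observed_a = 10 ** -0.05504756    # entry for a

print(f"expected p(</s>) = {expected_end:.6f} (log10 = {math.log10(expected_end):.6f})")
print(f"observed p(</s>) = {observed_end:.6f} (5/42 = {5 / 42:.6f})")
print(f"observed p(a)    = {observed_a:.6f} (x 42 = {observed_a * 42:.4f})")
print(f"observed sum     = {observed_end + observed_a:.6f}")
```

Running this shows how far the reported value is from the expected 2/20,
and expresses the reported "a" entry as a fraction over the same
denominator of 42 for comparison.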

Thanks, 

Tim Kempton
PhD student
University of Sheffield, UK

Here is the test run:

[tim@trill i686]$ echo "a a a a a a a a a a a a a a a a a" | ./ngram-count \
    -order 1 -text -
<s>     1
a       17
</s>    1
[tim@trill i686]$ echo "a a a a a a a a a a a a a a a a a" | ./ngram-count \
    -order 1 -text - -addsmooth 0 -lm -

\data\
ngram 1=3

\1-grams:
-1.255273       </s>
-99     <s>
-0.02482358     a

\end\
[tim@trill i686]$ echo "a a a a a a a a a a a a a a a a a" | ./ngram-count \
    -order 1 -text - -addsmooth 1 -lm -

\data\
ngram 1=3

\1-grams:
-0.9242793      </s>
-99     <s>
-0.05504756     a

\end\