Re: question about SRILM

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Fri, 12 Mar 2004 09:03:09 PST

In message <4051DAA8.5080700 at ADDRESS HIDDEN> you wrote:
> Hi.
> I have one question about SRILM. I don't understand how is computed the
> log-probability of an unigram.
> Isn't it log[P(w)] = log[c(w)] - log[|V|], where c(w) is the frequency
> of the word w in the training set and |V| the size of the vocabulary ?
> And, if this formula is used, are the tokens <s> and </s> considered to
> be part of the vocabulary or not (i.e. are they counted in |V| ?) ?
>
> Thank you for answering.
> Solen Quiniou.
>

The formula for unigram probabilities (modulo smoothing) is

log[P(w)] = log[c(w)] - log[N]

where N is the number of word TOKENS in the training corpus (not the
size of the vocabulary).

End-of-sentence tags (</s>) are included in the count, since they are among
the events that are predicted by the LM, but beginning-of-sentence tags (<s>)
are not. You will notice that the log probability of <s> is set to -99 (a
stand-in for minus infinity).
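
As a rough illustration of the counting rule above, here is a minimal
Python sketch of the unsmoothed estimate. It is not SRILM code: the function
name unigram_logprobs and the toy corpus are made up for this example, the
input is assumed to be whitespace-tokenized sentences without <s> or </s>
tags, and base-10 logs are used because that is the convention in ARPA-format
LM files. Real SRILM output will differ because of smoothing.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"
MINUS_INF_STANDIN = -99.0  # stand-in for minus infinity, as described above


def unigram_logprobs(sentences):
    """Unsmoothed unigram log10 probabilities (hypothetical helper).

    Each sentence contributes its word tokens plus one </s> token to N.
    <s> is never predicted, so it is not counted and gets -99.
    """
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())  # word tokens
        counts[EOS] += 1                 # one end-of-sentence event per sentence
    n_tokens = sum(counts.values())      # N: total tokens, including </s>
    logprobs = {
        word: math.log10(count) - math.log10(n_tokens)
        for word, count in counts.items()
    }
    logprobs[BOS] = MINUS_INF_STANDIN
    return logprobs


corpus = ["a b a", "b a"]  # toy data: c(a)=3, c(b)=2, c(</s>)=2, so N=7
lp = unigram_logprobs(corpus)
for word in sorted(lp):
    print(word, round(lp[word], 4))
```

With this toy corpus, P(a) = 3/7, and the probabilities of all predicted
events (a, b and </s>) sum to 1; <s> is left out of that sum.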

--Andreas

PS. Please send your questions to "srilm-user at ADDRESS HIDDEN" in the
future.