Re: KN discounting and zeroton words

From: Tanel Alumäe <tanel.alumae at ADDRESS HIDDEN>
Date: Mon, 06 Jun 2005 19:38:59 +0300

A small correction: with KN discounting too, zeroton words get the
same unigram probability as words discounted to zero (using -gt1min).
What I don't understand is why this probability can be higher than for
words that are not discounted to zero.

E.g. for a very small test set, and using '-gt1min 2', zeroton and
singleton words get a log probability of -0.7323937, but a word
occurring twice gets a log probability of -1.556303.
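
To make the size of the gap concrete, the two values can be converted to
linear probabilities. This is only a rough sketch: it assumes the numbers
are base-10 log probabilities as written in the ARPA-format LM file, and
the variable and function names are made up for illustration. Under that
assumption the zeroton/singleton value comes out near 0.185 and the
twice-seen word near 0.028.

```python
# Values quoted above, ASSUMED to be base-10 log probabilities
# as they appear in an ARPA-format LM file.
logprob_zeroton = -0.7323937   # zeroton and singleton words
logprob_twice = -1.556303      # word occurring twice


def to_prob(log10_prob: float) -> float:
    """Convert a base-10 log probability to a linear probability."""
    return 10.0 ** log10_prob


p_zeroton = to_prob(logprob_zeroton)
p_twice = to_prob(logprob_twice)

# How many times more probable the unseen/singleton words are
# than the word that was actually seen twice.
ratio = p_zeroton / p_twice

print(f"zeroton/singleton: {p_zeroton:.4f}")
print(f"occurring twice:   {p_twice:.4f}")
print(f"ratio:             {ratio:.2f}")
```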

I believe this is some magic property of KN discounting, in which case I
apologize for polluting the list and will go back to reading the
description of the algorithm.

Regards,
Tanel A.

On Mon, 2005-06-06 at 19:03 +0300, Tanel Alumäe wrote:
> Hello,
>
> I've noticed that when using -kndiscount, the zeroton words (words that
> are in the vocabulary but not in the training corpus) get a higher
> unigram LM probability than words that actually occur (rarely) in the
> training corpus. Shouldn't the zeroton words get the same unigram
> probability as the words that are discounted to 0 using the -gt1min
> option?
>
> With GT, WB and natural discounting, everything works as expected:
> zeroton words get the same unigram probability as the words discounted
> to 0.
>
> Regards,
> Tanel A.
>
>