KN discounting and zeroton words

Tanel Alumäe tanel.alumae at aqris.com
Mon Jun 6 09:38:59 PDT 2005


A little correction: with KN discounting too, zeroton words get the
same unigram probability as words discounted to zero (using -gt1min).
What I don't understand is why this probability can be higher than the
probability of words that are not discounted to zero.

E.g. for a very small test set, using '-gt1min 2', zeroton and
singleton words get a log probability of -0.7323937, while a word
occurring twice gets only -1.556303.

I suspect this is some magic property of KN discounting, in which case
I apologize for polluting the list and will go back to reading the
description of the algorithm.
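For what it's worth, here is a toy sketch of one way this can happen with backoff-style absolute discounting (this is my own hand-rolled illustration, not SRILM's actual code, and the counts, vocabulary, and the discount D = 0.75 are made up): the mass removed from the observed words is redistributed over the zeroton words, and when there are only a few zerotons, each one's share can exceed the discounted probability of a rarely observed word.

```python
# Toy sketch (NOT SRILM's implementation): absolute discounting at the
# unigram level, with the leftover mass split among zeroton words.

D = 0.75  # assumed absolute discount; SRILM estimates it from count-of-counts

# Unigram counts for observed words; "z" is in the vocabulary but
# never occurs in the training data (a zeroton).
counts = {"c": 2, "d": 20}
vocab = ["c", "d", "z"]

total = sum(counts.values())  # 22 tokens

# Discounted maximum-likelihood estimates for the observed words.
probs = {w: (counts[w] - D) / total for w in counts}

# The leftover probability mass is divided evenly among the zerotons.
zerotons = [w for w in vocab if w not in counts]
leftover = 1.0 - sum(probs.values())
for w in zerotons:
    probs[w] = leftover / len(zerotons)

print(probs)
```

Here the lone zeroton "z" collects the full discount mass 2*D/22 ≈ 0.068, which is larger than the discounted probability of "c", (2 - D)/22 ≈ 0.057, even though "c" was actually observed twice. So the inversion is not specific to KN; it just needs enough discount mass and few enough zerotons.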

Regards,
Tanel A.


On Mon, 2005-06-06 at 19:03 +0300, Tanel Alumäe wrote:
> Hello,
> 
> I've noticed that when using -kndiscount, the zeroton words (words that
> are in the vocabulary but not in the training corpus) get a higher
> unigram LM probability than words that actually occur (rarely) in the
> training corpus. Shouldn't the zeroton words get the same unigram
> probability as the words that are discounted to 0 using the -gt1min
> option? 
> 
> With GT, WB and natural discounting, everything works as expected:
> zeroton words get the same unigram probability as the words discounted
> to 0.
> 
> Regards,
> Tanel A.
> 
> 
More information about the SRILM-User mailing list