Hello,
I've noticed that when using -kndiscount, the zeroton words (words that
are in the vocabulary but not in the training corpus) get a higher
unigram LM probability than words that actually occur (rarely) in the
training corpus. Shouldn't the zeroton words get the same unigram
probability as the words that are discounted to 0 using the -gt1min
option?
With Good-Turing (GT), Witten-Bell (WB) and natural discounting, everything works as expected:
zeroton words get the same unigram probability as the words discounted
to 0.
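To make the expectation concrete, here is a toy sketch in Python. It is not SRILM's code and does not reproduce its Kneser-Ney estimator. It assumes a simple absolute discount, and the names expected_unigram_probs, min_count and discount are made up for illustration (min_count stands in for the effect of -gt1min). It only shows the outcome I would expect: the leftover probability mass is split evenly over every vocabulary word whose probability is 0, whether it is a zeroton or was discounted to 0.

```python
from collections import Counter


def expected_unigram_probs(counts, vocab, min_count=1, discount=0.5):
    """Toy unigram estimate with absolute discounting (illustration only)."""
    total = sum(counts.values())
    probs = {}
    for w in vocab:
        c = counts.get(w, 0)
        # Words seen fewer than min_count times are treated as discounted
        # to 0, exactly like zerotons (count 0).
        if c >= max(min_count, 1):
            probs[w] = (c - discount) / total
        else:
            probs[w] = 0.0

    # Spread the mass freed by discounting uniformly over all zero-probability
    # words, so zerotons and discounted-to-0 words end up equal.
    zero_words = [w for w in vocab if probs[w] == 0.0]
    leftover = 1.0 - sum(probs.values())
    if zero_words:
        for w in zero_words:
            probs[w] = leftover / len(zero_words)
    return probs


counts = Counter("a a a a b b b c".split())
vocab = ["a", "b", "c", "d"]  # "d" is a zeroton; "c" falls below min_count
probs = expected_unigram_probs(counts, vocab, min_count=2)
for w in vocab:
    print(w, probs[w])
```

In this sketch "c" (discounted to 0) and "d" (zeroton) receive the same probability, and both stay below the words that were kept. With -kndiscount I see the zeroton ending up above the rarely occurring words instead.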
Regards,
Tanel A.