Re: SRILM 1.3.2

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Tue, 05 Nov 2002 16:30:04 PST

In message <3DC8153D.E75FFED2 at ADDRESS HIDDEN>, you wrote:
>
> Hi,
>
> I ran many tests to find the best-suited language model for a given text
> using the "ngram" program with the -prune option, and I may have
> discovered a bug in the OOV count displayed by ngram.
>
> With a command like:
> jarjar jfbeaumo/mlf> ngram -order 3 -vocab vocab20k.txt -unk -lm transtalk10.arpa -ppl test.txt
> file test.txt: 635 sentences, 9448 words, 0 OOVs
> 0 zeroprobs, logprob= -17926.9 ppl= 59.9706 ppl1= 78.9647
>
> I always end up with 0 OOVs. The language model does contain the <unk>
> token. I assumed that with a sufficiently large value for -prune I would
> begin to get OOV words, but the count stays at 0. If I specify an empty
> vocabulary file, there are again 0 OOVs, and I suppose this isn't correct.
> Maybe ngram is taking its vocabulary from the LM, but then there would be
> no use for the -vocab switch.
>
> Can you help me? Did I miss something?
>
> Best regards,
>
> JF
> --
> Jean-François Beaumont - Agent de recherche (jfbeaumont at ADDRESS HIDDEN)
> CRIM - 550, rue Sherbrooke Ouest Bureau 100 (www.crim.ca)
> Montréal (Québec) H3A 1B9  Tél.: 514.840-1235 #3625

Dear JF,

It is actually a feature (not a bug) that ngram -unk counts OOVs as regular
words. They would only be counted as OOVs in the ppl output if the
LM did not contain the <unk> token, or if <unk> had probability 0.
Of course, whether this is what you expect is debatable.
You can get the OOV count you want by grepping the ngram -ppl output,
run with -debug 2, for "p( <unk> | ".

--Andreas
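
The grep suggested above can be sketched as a small shell helper. This is
only an illustration, not part of the original reply: the function name
count_unk is hypothetical, and it assumes that the per-word lines printed
at debug level 2 contain the literal text "p( <unk> | " whenever a test
word was mapped to <unk>.

```shell
# Count the lines on stdin that report a probability for <unk>.
# -F treats the pattern as a fixed string, so "(" and "|" match literally;
# -c prints the number of matching lines instead of the lines themselves.
# "|| true" keeps the exit status at 0 when there are no matches
# (grep itself exits with status 1 in that case, after printing 0).
count_unk() {
    grep -c -F 'p( <unk> | ' || true
}

# Hypothetical usage, reusing the file names from the original question:
#   ngram -order 3 -vocab vocab20k.txt -unk -lm transtalk10.arpa \
#         -ppl test.txt -debug 2 | count_unk
```

Because grep -c counts lines, this gives the number of OOV tokens in the
test text (one per-word line for each token), not the number of distinct
OOV word types.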
