Re: Unexpected "ngram-count -recompute" result

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Fri, 20 Dec 2002 02:03:18 PST

In message <20021217143852.A7495 at ADDRESS HIDDEN> you wrote:
> Hello,
>
> We just noticed the following when using the -recompute flag of ngram-count.
> We're just trying to generate uni- and bigram counts from trigram counts but
> some are missing:
>
> [1 - directly summing uni-, bi- and trigram counts of a simple text file]
>
> melis@luistervink:/local/export/melis/lm> cat t
> <s> this is a test </s>
>
> melis@luistervink:/local/export/melis/lm> ngram-count -text t -sort
> </s>    1
> <s>     1
> <s> this        1
> <s> this is     1
> a       1
> a test  1
> a test </s>     1
> is      1
> is a    1
> is a test       1
> test    1
> test </s>       1
> this    1
> this is 1
> this is a       1
>
> [2 - only summing trigram counts]
>
> melis@luistervink:/local/export/melis/lm> ngram-count -text t -write-order 3 -sort
> <s> this is     1
> a test </s>     1
> is a test       1
> this is a       1
>
> [3 - using the previous trigram counts to generate uni- and bigram counts]
>
> melis@luistervink:/local/export/melis/lm> ngram-count -text t -write-order 3 -sort | ngram-count -recompute -sort -read -
> <s>     1
> <s> this        1
> <s> this is     1
> a       1
> a test  1
> a test </s>     1
> is      1
> is a    1
> is a test       1
> this    1
> this is 1
> this is a       1
>
> We expected the output of 1 and 3 to be the same, but notice the missing
> unigrams "</s>" and "test". Also, the bigram "test </s>" is missing.
> Is this a bug, or is there something we're missing here? It seems to be
> related to the end of sentence symbol.
> This is with SRILM 1.3.2, BTW.
>
> Regards,
> Paul
>

It's a bug of sorts, or a feature depending on your point of view.

Because </s> is not followed by anything, discarding unigrams and bigrams
ending in </s> will in fact discard information that is not contained
in the trigrams.  I'm not sure why you are doing what you describe,
but a quick solution would be to introduce "dummy" N-grams that
complete the ngrams ending in </s> to the full length of the counts
you want to keep.  The little script below does that.
If you call it "complete-eos-ngrams" then

ngram-count -text t -write - | \
complete-eos-ngrams | \
ngram-count -read - -write-order 3 | \
ngram-count -recompute -sort -read -

will produce the output you expect.
Alternatively you could tack dummy words onto the end of your input
sentences.  In either case you have to delete the dummy ngrams from the
final output.
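
One way to do that deletion is a small awk filter, sketched below.  The
function name strip_dummy_ngrams is hypothetical (it is not part of SRILM),
and the sketch assumes the placeholder word is the literal token DUMMY, as
in the script further down.  If DUMMY can occur as a real word in your
data, pick a different placeholder in both places.

```shell
# Hypothetical helper, not part of SRILM: drop every count line in which
# one of the words is the placeholder token DUMMY.
strip_dummy_ngrams() {
    awk '{
        # fields 1..NF-1 are the words, $NF is the count
        for (i = 1; i < NF; i++)
            if ($i == "DUMMY") next
        print
    }'
}
```

It would go at the very end of the pipeline above, e.g.

    ... | ngram-count -recompute -sort -read - | strip_dummy_ngrams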

--Andreas

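#!/usr/local/bin/gawk -f
#
# complete-eos-ngrams: pad count-file N-grams that end in </s> with
# DUMMY words so they survive as full-length N-grams.

BEGIN {
	order = 3;	# highest N-gram order to keep
}

# pass every input line through unchanged
{
	print;
}

# the last field is the count, so $(NF - 1) is the final word
$(NF - 1) == "</s>" {
	count = $NF;

	# overwrite the count field with DUMMY, then append one more DUMMY
	# per pass, printing each padded N-gram up to the full order
	for (i = NF; i <= order; i ++) {
		$i = "DUMMY";
		print $0, count;
	}
}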
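
To see why the padding helps, here is a rough stand-in for the recompute
step.  It assumes that -recompute rebuilds each lower-order count by summing
the highest-order counts over every N-gram prefix, which is consistent with
the output shown in the question.  The function recompute_from_trigrams is
hypothetical and is not SRILM code; it only illustrates the prefix
summation.

```shell
# Hypothetical stand-in for the recompute step: sum trigram counts over
# every prefix of each trigram.  For illustration only, not SRILM code.
recompute_from_trigrams() {
    awk '{
        count = $NF
        prefix = ""
        for (i = 1; i < NF; i++) {      # words are fields 1..NF-1
            prefix = (i == 1) ? $i : prefix " " $i
            total[prefix] += count
        }
    }
    END {
        for (p in total) print p "\t" total[p]
    }' | LC_ALL=C sort
}

# The four trigrams from the question.  Neither "test" nor "</s>" ever
# starts a trigram, so no prefix sum can yield their unigram counts.
printf '%s\n' '<s> this is 1' 'this is a 1' 'is a test 1' 'a test </s> 1' |
    recompute_from_trigrams
```

Feeding it the same trigrams plus the padded ones ("test </s> DUMMY 1" and
"</s> DUMMY DUMMY 1") makes "test", "test </s>" and "</s>" appear as
prefixes, which is the effect the script above relies on.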
