ngram-class is too time consuming

Andreas Stolcke stolcke at speech.sri.com
Tue Oct 28 11:24:45 PDT 2008


??? wrote:
> I want to use a class-based bigram model, like this:
> P(w2 | w1) = lambda * Pw(w2 | w1) + (1 - lambda) * P(w2 | G2) * Pc(G2 | G1)
> where wi belongs to class Gi, for i = 1, 2.
> So I used the "ngram-class" program to generate a set of classes from a
> corpus (282,360 unique words), with the number of output classes set
> to 2,000.
> But the program takes too long to run, maybe 10 days. My computer is a
> Core 2, 1.8 GHz.
> Here is my command:
> ngram-class -text <word-corpus> -numclasses 2000 -classes <cls> -incremental
>
> Is there a problem with this, or is it normal?
It's probably normal. 282k is quite a large vocabulary. You might want 
to experiment with different vocabulary sizes, especially by excluding 
words with very low counts (such as singletons), because their statistics 
are not reliable and they won't be clustered properly. It might be best to 
group all those words into a special class ahead of time.
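
As a rough illustration of grouping low-count words ahead of time, here is 
a minimal Python sketch. It assumes a whitespace-tokenized corpus with one 
sentence per line. The function name map_rare_words, the min_count 
threshold, and the <rare> placeholder token are all hypothetical and not 
part of SRILM.

```python
from collections import Counter


def map_rare_words(lines, min_count=2, rare_token="<rare>"):
    """Replace every word seen fewer than min_count times with rare_token.

    Returns the rewritten lines and the sorted list of words that were kept.
    rare_token is an arbitrary placeholder chosen for this sketch.
    """
    tokenized = [line.split() for line in lines]

    # Count word occurrences over the whole corpus.
    counts = Counter(word for sent in tokenized for word in sent)

    # Rewrite each sentence, collapsing low-count words into one token.
    mapped = [
        " ".join(w if counts[w] >= min_count else rare_token for w in sent)
        for sent in tokenized
    ]

    # Words that survive the threshold form the reduced vocabulary.
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    return mapped, vocab


if __name__ == "__main__":
    corpus = ["a b a c", "b a d"]
    mapped, vocab = map_rare_words(corpus)
    print("\n".join(mapped))
    print(vocab)
```

The rewritten text could then be passed to ngram-class with -text, so that 
all low-count words share a single token and the vocabulary to be clustered 
shrinks accordingly. With min_count=2 this removes exactly the singletons.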

For comparison, running the small test in $SRILM/test

make TEST=class-ngram

should take about 0.15 seconds of CPU time on a 2.6 GHz Opteron machine.

Andreas




