question about vocabulary

Andreas Stolcke stolcke at speech.sri.com
Tue May 4 08:57:23 PDT 2004


In message <4097A623.1E8E3B88 at loria.fr> you wrote:
> Hello everybody,
> 
> I would like to know if it's possible with the SRILM toolkit to generate
> a vocabulary with the 20000 most frequent words of a corpus for example. 
> 
> I know that with -write-vocab  in the ngram-count function I can
> generate a vocabulary but only with all the words of the corpus.

How about this:

ngram-count -order 1 -text CORPUS -write - | \
sort -k2,2rn | awk 'NR <= 20000 { print $1 }' > top20000.vocab
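
If you need this for several cutoffs, the same idea can be wrapped in a small
shell function. This is only a sketch: top_n_vocab is a made-up name, not an
SRILM tool, and it assumes the counts arrive on stdin as "word count" lines,
one unigram per line. It also assumes you want to leave out the
sentence-boundary tags <s> and </s> if they show up in the counts; delete the
first awk stage if you would rather keep them. Ties are broken alphabetically
so that the output is reproducible.

```shell
# top_n_vocab N
# Hypothetical helper (not part of SRILM): reads "word<whitespace>count"
# lines on stdin and prints the N most frequent words, one per line.
top_n_vocab() {
    n="$1"
    # 1. Drop sentence-boundary tags (assumption: they are not wanted).
    # 2. Sort by count, descending; break ties by word for stable output.
    # 3. Keep only the word column of the first n lines.
    awk '$1 != "<s>" && $1 != "</s>"' |
        LC_ALL=C sort -k2,2nr -k1,1 |
        awk -v n="$n" 'NR <= n { print $1 }'
}
```

It would then be used like this:

ngram-count -order 1 -text CORPUS -write - | top_n_vocab 20000 > top20000.vocab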


--Andreas 



