Re: question about vocabulary

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Tue, 04 May 2004 08:57:23 PDT

In message <4097A623.1E8E3B88 at ADDRESS HIDDEN> you wrote:
> Hello everybody,
>
> I would like to know if it's possible with the SRILM toolkit to generate
> a vocabulary with the 20000 most frequent words of a corpus for example.
>
> I know that with -write-vocab in the ngram-count function I can
> generate a vocabulary but only with all the words of the corpus.

How about this:

ngram-count -order 1 -text CORPUS -write - | \
sort -k2,2rn | awk 'NR <= 20000 { print $1 }' > top20000.vocab
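
The sort-and-cut stage of that pipeline can be tried without SRILM installed. The following is only a sketch: it assumes the counts arrive as tab-separated "word count" lines, and it additionally drops the sentence-boundary tags <s> and </s>, which may show up among the unigram counts. The name top_vocab and the sample data are made up for illustration.

```shell
# Hypothetical helper: read "word<TAB>count" lines on stdin and
# print the N most frequent words, one per line.
top_vocab() {
    n="$1"
    tab="$(printf '\t')"
    # 1. Drop sentence-boundary tags (assumed to be <s> and </s>).
    # 2. Sort numerically on the count field, largest first.
    # 3. Keep only the word column of the first n lines.
    awk -F'\t' '$1 != "<s>" && $1 != "</s>"' |
        sort -t "$tab" -k2,2nr |
        awk -F'\t' -v n="$n" 'NR <= n { print $1 }'
}

# Made-up sample counts, standing in for the output of ngram-count.
printf 'the\t5\n<s>\t3\ncat\t2\n</s>\t3\nsat\t1\n' | top_vocab 2
```

With real data, the output of "ngram-count -order 1 -text CORPUS -write -" would be piped into top_vocab 20000 in place of the printf line. Words with equal counts come out in whatever order sort leaves them, so the cutoff at N is arbitrary among ties.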

--Andreas