ngram-class - induce word classes from N-gram statistics


ngram-class [ -help ] option ...


ngram-class induces word classes from distributional statistics, so as to minimize perplexity of a class-based N-gram model given the provided word N-gram counts. Presently, only bigram statistics are used, i.e., the induced classes are best suited for a class-bigram language model.
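The objective described above can be illustrated with a minimal sketch. It assumes the usual class-bigram factorization P(w2 | w1) = P(c2 | c1) * P(w2 | c2) with maximum-likelihood estimates taken from the bigram counts, and it normalizes perplexity per bigram token. The function names and data layout are hypothetical and are not part of ngram-class.

```python
import math
from collections import Counter

def class_bigram_log_likelihood(bigrams, word_class):
    """Log-likelihood of bigram counts under a class-bigram model.

    bigrams:    dict mapping (w1, w2) -> positive count
    word_class: dict mapping word -> class label
    Model (assumed): P(w2 | w1) = P(c2 | c1) * P(w2 | c2).
    """
    class_bigrams = Counter()  # N(c1, c2)
    left_class = Counter()     # N(c1) as history class
    right_class = Counter()    # N(c2) as predicted class
    right_word = Counter()     # N(w2) as predicted word
    for (w1, w2), n in bigrams.items():
        c1, c2 = word_class[w1], word_class[w2]
        class_bigrams[c1, c2] += n
        left_class[c1] += n
        right_class[c2] += n
        right_word[w2] += n

    ll = 0.0
    for (w1, w2), n in bigrams.items():
        c1, c2 = word_class[w1], word_class[w2]
        p_class = class_bigrams[c1, c2] / left_class[c1]
        p_word = right_word[w2] / right_class[c2]
        ll += n * math.log(p_class * p_word)
    return ll

def perplexity(bigrams, word_class):
    """Perplexity per bigram token (normalization is an assumption)."""
    total = sum(bigrams.values())
    return math.exp(-class_bigram_log_likelihood(bigrams, word_class) / total)
```

With one class per word the model reduces to the word bigram model, so any merge can only keep the perplexity equal or raise it; class induction looks for the merges that raise it least.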

The program generates the class N-gram counts and class expansions needed by ngram-count(1) and ngram(1) to train and to apply the class N-gram model, respectively.


Each filename argument can be an ASCII file, or a compressed file (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

-help
Print option summary.
-version
Print version information.
-debug level
Set debugging output at level. Level 0 means no debugging. Debugging messages are written to stderr. A useful level to trace the formation of classes is 2.

Input Options

-vocab file
Read a vocabulary from file. Subsequently, out-of-vocabulary words in both counts and text are replaced with the unknown-word token. If this option is not specified, all words found are implicitly added to the vocabulary.
-tolower
Map the vocabulary to lowercase.
-counts file
Read N-gram counts from a file. Each line contains an N-gram of words, followed by an integer count, all separated by whitespace. Repeated counts for the same N-gram are added. Counts collected by -text and -counts are additive as well.
Note that the input should contain consistent lower- and higher-order counts (i.e., unigrams and bigrams), as would be generated by ngram-count(1).
-text textfile
Generate N-gram counts from text file. textfile should contain one sentence unit per line. Begin/end sentence tokens are added if not already present. Empty lines are ignored.
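The way counts from -counts and -text accumulate can be sketched as follows. This is an illustration only: the helper names are hypothetical, only unigrams and bigrams are collected from text, and skipping malformed count lines is an assumption of the sketch rather than documented behavior.

```python
from collections import Counter

def add_count_lines(counts, lines):
    """Accumulate 'w1 ... wN count' lines; repeated N-grams are summed."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption: skip blank or malformed lines
        *words, count = fields
        counts[tuple(words)] += int(count)
    return counts

def add_text_lines(counts, lines, bos="<s>", eos="</s>"):
    """Accumulate unigram and bigram counts from one-sentence-per-line text."""
    for line in lines:
        words = line.split()
        if not words:
            continue  # empty lines are ignored
        if words[0] != bos:
            words.insert(0, bos)   # add begin-of-sentence token if missing
        if words[-1] != eos:
            words.append(eos)      # add end-of-sentence token if missing
        for w in words:
            counts[(w,)] += 1
        for pair in zip(words, words[1:]):
            counts[pair] += 1
    return counts
```

Calling both helpers on the same `Counter` mirrors the additive behavior: an N-gram seen in a counts file and again in text ends up with the sum of both contributions.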

Class Merging

-numclasses C
The target number of classes to induce. A zero argument suppresses automatic class merging altogether (e.g., for use with -interact).
-full
Perform full greedy merging over all classes, starting with one class per word. This is the O(V^3) algorithm described in Brown et al. (1992).
-incremental
Perform incremental greedy merging, starting with one class each for the C most frequent words, and then adding one word at a time. This is the O(V*C^2) algorithm described in Brown et al. (1992); it is the default.
-maxwordsperclass M
Limits the number of words in a class to M in incremental merging. By default there is no such limit.
-interact
Enter a primitive interactive interface when done with automatic class induction, allowing manual specification of additional merging steps.
-noclass-vocab file
Read a list of vocabulary items from file that are to be excluded from classes. These words or tags do not undergo class merging, but their N-gram counts still affect the optimization of model perplexity.
The default is to exclude the sentence begin/end tags (<s> and </s>) from class merging; this can be suppressed by specifying -noclass-vocab /dev/null.
-read file
Read initial class memberships from file. Class memberships need to be stored in classes-format(5) with the additional condition that probabilities are obligatory and that each membership definition must specify exactly one word.
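The idea behind full greedy merging can be sketched naively as below. This is not the O(V^3) procedure of Brown et al. (1992), which updates the objective incrementally; the sketch recomputes the objective for every candidate merge and is only practical for toy data. It assumes that the target number counts only mergeable classes and that excluded words each keep their own class; both the function names and these conventions are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def class_objective(bigrams, word_class):
    """Assignment-dependent part of the class-bigram log-likelihood:
    sum over class pairs of N(c1,c2) * log(N(c1,c2) / (N(c1) * N(c2)))."""
    pair, left, right = Counter(), Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        c1, c2 = word_class[w1], word_class[w2]
        pair[c1, c2] += n
        left[c1] += n
        right[c2] += n
    return sum(n * math.log(n / (left[c1] * right[c2]))
               for (c1, c2), n in pair.items() if n > 0)

def greedy_merge(bigrams, num_classes, excluded=("<s>", "</s>")):
    """Start with one class per word; repeatedly apply the merge that
    leaves the objective highest until num_classes classes remain."""
    words = sorted({w for ngram in bigrams for w in ngram})
    word_class = {w: w for w in words}              # one class per word
    classes = {w for w in words if w not in excluded}
    if num_classes <= 0:
        return word_class                           # zero suppresses merging
    while len(classes) > num_classes:
        best = None
        for a, b in combinations(sorted(classes), 2):
            # Tentatively move every member of class b into class a.
            trial = {w: (a if c == b else c) for w, c in word_class.items()}
            score = class_objective(bigrams, trial)
            if best is None or score > best[0]:
                best = (score, b, trial)
        _, merged_away, word_class = best
        classes.discard(merged_away)
    return word_class
```

Words in `excluded` never appear as merge candidates, yet their bigram counts still enter `class_objective`, which matches the description of -noclass-vocab above.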

Output Options

-class-counts file
Write class N-gram counts to file when done. The format is the same as for word N-gram counts, and can be read by ngram-count(1) to estimate a class-N-gram model.
-classes file
Write class definitions (member words and their probabilities) to file when done. The output format is the same as required by the -classes option of ngram(1).
-save S
Save the class counts and/or class definitions every S iterations during induction. The filenames are obtained from the -class-counts and -classes options, respectively, by appending the iteration number. This is convenient for producing sets of classes at different granularities during the same run. The saved class memberships can also be used with the -read option to restart class merging at a later time. S=0 (the default) suppresses the saving actions.
-save-maxclasses K
Modifies the action of -save so as to only start saving once the number of classes reaches K. (The iteration numbers embedded in filenames will start at 0 from that point.)
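As an illustration of the class definitions written by -classes, the sketch below derives membership probabilities from unigram counts and prints one word per line. It assumes a `class probability word` line layout in the spirit of classes-format(5) and estimates P(word | class) as the word's share of its class's total count; the class labels, the number formatting, and the function name are hypothetical.

```python
from collections import Counter

def class_definition_lines(unigrams, word_class):
    """Return 'class probability word' lines, one member word per line."""
    class_totals = Counter()
    for word, n in unigrams.items():
        class_totals[word_class[word]] += n
    lines = []
    # Sort by class label, then by word, for a stable listing.
    for word in sorted(unigrams, key=lambda w: (word_class[w], w)):
        c = word_class[word]
        prob = unigrams[word] / class_totals[c]
        lines.append(f"{c} {prob:.6g} {word}")
    return lines
```

For example, with counts `{"cat": 3, "dog": 1, "runs": 2}` and both animals in one class, the two animal lines carry probabilities 0.75 and 0.25, and the probabilities within each class sum to one.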


ngram-count(1), ngram(1), classes-format(5).
P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L. Mercer, ``Class-Based n-gram Models of Natural Language,'' Computational Linguistics 18(4), 467-479, 1992.


Classes are optimized only for bigram models at present.


Andreas Stolcke
Seppo Enarvi
Copyright (c) 1999-2010 SRI International
Copyright (c) 2012-2014 Microsoft Corp.
Copyright (c) 2013-2014 Seppo Enarvi