multi-ngram - build multiword N-gram models
multi-ngram [ -help ] option ...
multi-ngram builds N-gram language models that contain multiwords, i.e., compound words
that are a concatenation of words from some prior given model.
It will optionally generate multiword N-grams and insert them into
an existing reference N-gram model, so as to cover multiwords occurring
in a specified vocabulary.
It will then assign probabilities to the multiword N-grams so that word
strings containing multiwords have the same probabilities as the strings
of component words in the reference model.
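The probability assignment described above can be illustrated with a small sketch. It is not taken from the multi-ngram source; the toy table REF_LOGPROB and the function multiword_logprob are hypothetical names, and the sketch assumes a reference model with no backoff and with histories that already consist of regular words only. It shows the chain rule that makes a multiword as probable as its string of component words.

```python
import math

# Hypothetical toy reference model: log10 P(word | history).
# A real N-gram model would also use backoff weights, omitted here.
REF_LOGPROB = {
    (("<s>",), "new"): math.log10(0.1),
    (("<s>", "new"), "york"): math.log10(0.5),
}

def multiword_logprob(history, multiword, ref=REF_LOGPROB, delim="_"):
    """Sum the reference log probabilities of the component words.

    Each component word is conditioned on the original history plus
    the components that precede it (chain rule). Histories are not
    truncated to a model order in this sketch.
    """
    context = tuple(history)
    total = 0.0
    for word in multiword.split(delim):
        total += ref[(context, word)]
        context = context + (word,)
    return total

# P(new_york | <s>) = P(new | <s>) * P(york | <s> new)
p = 10 ** multiword_logprob(["<s>"], "new_york")
```

With these toy values, the multiword "new_york" after "<s>" receives the product of the two component probabilities.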
Note that the inverse operation (expanding a multiword N-gram to contain
only regular words) is subsumed by the
Each filename argument can be an ASCII file, or a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
Print option summary.
Print version information.
- -order n
Set the maximal N-gram order to be used from the reference model.
NOTE: The order of the model is not set automatically when a model
file is read, so the same file can be used at various orders.
To use models of order higher than 3 it is always necessary to specify this
- -multi-order n
The maximal N-gram order in the multiword-based model.
- -debug level
Set the debugging output level (0 means no debugging output).
- -vocab file
Words to be added to the model.
In particular, this should include all the multiwords to be added.
- -multi-char C
Character used to delimit component words in multiwords
(an underscore character by default).
- -lm file
Reference N-gram model.
- -multi-lm file
Model containing multiwords; the N-grams in this model will be assigned
new probabilities based on the reference model.
If this option is not
given then the multiword model will be generated by adding multiword
N-grams to the reference model.
This option prevents the insertion of multiword N-grams whose component
N-grams are not contained in the reference model.
For example, for a multiword bigram "a_b c_d" to be inserted, a trigram
reference model must contain the trigrams "a b c" and "b c d".
If the reference model were a bigram LM, it would have to contain
"a b", "b c", and "c d".
This option is important to control the size of the multiword LM for
- -write-lm file
Output location of the generated multiword model.
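The component N-gram check described above for the "a_b c_d" example can be sketched as follows. This is an illustration only, not the program's actual algorithm; component_ngrams and can_insert are hypothetical helper names, and the handling of strings shorter than the reference order is an assumption.

```python
def component_ngrams(multiword_ngram, order, delim="_"):
    """List the reference-order N-grams spanned by a multiword N-gram.

    The multiword N-gram is first expanded into its component words;
    every window of `order` consecutive words is then returned.
    Assumption: if the expansion is shorter than `order`, the whole
    expansion is used as a single N-gram.
    """
    words = [w for token in multiword_ngram for w in token.split(delim)]
    n = min(order, len(words))
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def can_insert(multiword_ngram, order, reference_ngrams, delim="_"):
    """True if every component N-gram occurs in the reference model."""
    return all(
        ngram in reference_ngrams
        for ngram in component_ngrams(multiword_ngram, order, delim)
    )

# The example from the text: bigram "a_b c_d" against a trigram model.
trigram_model = {("a", "b", "c"), ("b", "c", "d")}
ok = can_insert(["a_b", "c_d"], 3, trigram_model)
```

For a bigram reference model the same call with order 2 requires "a b", "b c", and "c d", matching the example in the option description.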
This program is a hack for cases where the original training data is
not available and a multiword model has to be generated from an existing
The resulting model is no longer properly normalized, since the
same word string can potentially be represented with or without multiwords.
The generation of multiword N-grams uses a heuristic algorithm that
works well for bigrams and trigrams, but is not exhaustive.
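The normalization problem noted above arises because one word string can have several token sequences. The following sketch enumerates them for a given multiword vocabulary; representations is a hypothetical helper written for this illustration and is not part of the program.

```python
def representations(words, multiwords, delim="_"):
    """Return every way to write `words` with or without multiwords.

    A prefix of k words may be emitted as one token if k == 1 (a
    regular word) or if the joined prefix is a listed multiword.
    """
    if not words:
        return [[]]
    results = []
    for k in range(1, len(words) + 1):
        token = delim.join(words[:k])
        if k == 1 or token in multiwords:
            for rest in representations(words[k:], multiwords, delim):
                results.append([token] + rest)
    return results

variants = representations(["new", "york", "city"], {"new_york"})
```

Here the string has two representations, "new york city" and "new_york city". Since each is assigned the probability of the underlying word string, the total probability mass over token sequences can exceed one.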
Andreas Stolcke <email@example.com>.
Copyright 2000-2004 SRI International