ngram-format - File format for ARPA backoff N-gram models
p w [bow]
p w1 w2 [bow]
p w1 ... wN
The so-called ARPA (or Doug Paul) format for N-gram backoff models
starts with a header, introduced by the keyword \data\,
listing the number of N-grams of each length.
Following that, N-grams are listed one per line, grouped into sections
by length, each section starting with the keyword \N-grams:, where N
is the length of the N-grams to follow.
Each N-gram line starts with the logarithm (base 10) of the conditional probability p
of that N-gram, followed by the words w1 ... wN
making up the N-gram.
These are optionally followed by the logarithm (base 10) of the
backoff weight for the N-gram.
The keyword \end\
concludes the model representation.
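The layout described above can be read with a few lines of code. The following is a minimal sketch, not the SRILM reader: the function name parse_arpa and the toy model in SAMPLE are made up for illustration, and the "ngram N=count" header lines and "\N-grams:" section keywords are assumed to look the way they commonly do in ARPA files.

```python
import re

# Hypothetical toy model used only for illustration.
SAMPLE = "\n".join([
    "\\data\\",
    "ngram 1=3",
    "ngram 2=2",
    "",
    "\\1-grams:",
    "-99 <s> -0.4",
    "-0.5 a -0.3",
    "-0.7 b",
    "",
    "\\2-grams:",
    "-0.2 <s> a",
    "-0.4 a b",
    "",
    "\\end\\",
])

def parse_arpa(text):
    """Return (header, ngrams); ngrams[n][words] = (log10 prob, log10 bow or None)."""
    header = {}       # declared counts from the \data\ header
    ngrams = {}       # one dict of entries per N-gram length
    order = None      # length of the section being read; None while in the header
    in_model = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "\\data\\":
            in_model = True
            continue
        if not in_model:
            continue  # text before the model is ignored
        if line == "\\end\\":
            break
        section = re.fullmatch(r"\\(\d+)-grams:", line)
        if section:
            order = int(section.group(1))
            ngrams[order] = {}
            continue
        if order is None:
            count = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
            if count:
                header[int(count.group(1))] = int(count.group(2))
            continue
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The backoff weight is optional and, if present, follows the words.
        bow = float(fields[1 + order]) if len(fields) > 1 + order else None
        ngrams[order][words] = (logprob, bow)
    return header, ngrams
```

Calling parse_arpa(SAMPLE) returns the declared counts and, for each length, a mapping from word tuples to their log probability and optional backoff weight.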
Backoff weights are required only for those N-grams
that form a prefix of longer N-grams in the model.
The highest-order N-grams in particular will not need backoff weights
(they would be useless).
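To see why only prefixes of longer N-grams need backoff weights, consider how a backoff model is usually evaluated: if the full N-gram is listed, its probability is used directly; otherwise the backoff weight of the history is added (in log space) to the probability under a shortened history. The sketch below assumes this standard backoff recursion and a hypothetical toy table MODEL; it is not taken from SRILM code. A missing backoff weight is treated as log10(1) = 0.

```python
import math

# Hypothetical toy model: MODEL[words] = (log10 prob, log10 backoff weight or None)
MODEL = {
    ("<s>",): (-99.0, -0.4),
    ("a",): (-0.5, -0.3),
    ("b",): (-0.7, None),
    ("<s>", "a"): (-0.2, None),
    ("a", "b"): (-0.4, None),
}

def log10_prob(word, history):
    """log10 P(word | history) under the usual backoff recursion (an assumption here)."""
    history = tuple(history)
    ngram = history + (word,)
    if ngram in MODEL:
        return MODEL[ngram][0]
    if not history:
        return -math.inf  # word not in the model at all
    # Back off: use the weight stored with the history, if any.
    bow = MODEL.get(history, (None, None))[1]
    weight = bow if bow is not None else 0.0
    return weight + log10_prob(word, history[1:])
```

In this sketch the backoff weight of an N-gram is consulted only when that N-gram serves as the history of a longer one, which is why the highest-order entries never need one.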
Since log(0) (minus infinity) has no portable representation, such values
are mapped to a large negative number.
However, the designated dummy value (-99 in SRILM) is interpreted as log(0)
when read back from file into memory.
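The round trip for log(0) can be illustrated as follows. This is a sketch under the assumption that the dummy value is matched exactly; the helper names read_logprob and write_logprob are hypothetical and do not correspond to SRILM functions.

```python
import math

LOG_ZERO_DUMMY = -99.0  # dummy value the text attributes to SRILM

def read_logprob(field):
    """Convert a text field to a log10 value, mapping the dummy value to -infinity."""
    value = float(field)
    return -math.inf if value == LOG_ZERO_DUMMY else value

def write_logprob(value):
    """Convert a log10 value to text, mapping -infinity to the dummy value."""
    return "-99" if value == -math.inf else f"{value:g}"
```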
The correctness of the N-gram counts
in the header is not enforced by SRILM software when reading
models (although a warning is printed when an inconsistency is encountered).
This allows easy textual insertion or deletion of parameters in a model file.
The proper format can be recovered by passing the model through the command
ngram -order N -lm input -write-lm output
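The kind of consistency check mentioned above can be sketched as a comparison between the declared counts and the N-grams actually listed. The functions check_counts and corrected_header are hypothetical illustrations, not SRILM's implementation, and they do not replace the ngram command; they assume the header is available as a dict from length to count and the N-grams as a dict from length to entries.

```python
def check_counts(header, ngrams):
    """Return warning strings for header counts that disagree with the listed N-grams."""
    warnings = []
    for n in sorted(set(header) | set(ngrams)):
        declared = header.get(n, 0)
        actual = len(ngrams.get(n, ()))
        if declared != actual:
            warnings.append(f"{n}-grams: header says {declared}, found {actual}")
    return warnings

def corrected_header(ngrams):
    """Recompute the header counts from the N-grams that are actually present."""
    return {n: len(entries) for n, entries in sorted(ngrams.items())}
```

For example, after deleting one bigram by hand from a file whose header still declares two, check_counts would report the mismatch and corrected_header would give the count to write back.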
Note that the format is self-delimiting, allowing multiple models to
be stored in one file, or to be surrounded by ancillary information.
Some extensions of N-gram models in SRILM store additional parameters
after a basic N-gram section in the standard format.
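Because each model is bracketed by \data\ and \end\, a file holding several models or surrounding text can be split without any further markers. The sketch below shows one way to do this; split_models is a hypothetical helper, and it simply discards everything outside the delimited spans.

```python
def split_models(text):
    """Return one list of lines per span from a \\data\\ line to the next \\end\\ line."""
    models = []
    current = None  # lines of the model being collected, or None outside a model
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "\\data\\":
            current = [line]
        elif current is not None:
            current.append(line)
            if stripped == "\\end\\":
                models.append(current)
                current = None
    return models
```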
See also: ngram(1), ngram-count(1), lm-scripts(1), pfsg-scripts(1).
The ARPA format does not allow N-grams that have only a backoff weight
associated with them, but no conditional probability.
This makes the format less general than would otherwise be useful
(e.g., to support pruned models, or ones containing a mix of words and
classes). The ngram-count(1) tool satisfies this constraint by inserting dummy probabilities where necessary.
For simplicity, an N-gram model containing N-grams up to length N
is referred to in the SRILM programs as an N-th
order model, although technically it represents a Markov model of order N-1.
There is no way to specify words with embedded whitespace.
The ARPA backoff format was developed by Doug Paul at MIT Lincoln Labs
for research sponsored by the U.S. Department of Defense
Advanced Research Projects Agency (ARPA).
Man page by Andreas Stolcke <firstname.lastname@example.org>.
Copyright 1999, 2004 SRI International