lm-scripts

NAME

lm-scripts, add-dummy-bows, change-lm-vocab, empty-sentence-lm, get-unigram-probs, make-hiddens-lm, make-lm-subset, make-sub-lm, remove-lowprob-ngrams, reverse-lm, sort-lm - manipulate N-gram language models

SYNOPSIS

add-dummy-bows [ lm-file ] > new-lm-file
change-lm-vocab -vocab vocab -lm lm-file -write-lm new-lm-file \
	[ -tolower ] [ -subset ] [ ngram-options ... ]
empty-sentence-lm -prob p -lm lm-file -write-lm new-lm-file \
	[ ngram-options ... ]
get-unigram-probs [ linear=1 ] [ lm-file ]
make-hiddens-lm [ lm-file ] > hiddens-lm-file
make-lm-subset count-file|- [ lm-file|- ] > new-lm-file
make-sub-lm [ maxorder=N ] [ lm-file ] > new-lm-file
remove-lowprob-ngrams [ lm-file ] > new-lm-file
reverse-lm [ lm-file ] > new-lm-file
sort-lm [ lm-file ] > sorted-lm-file

DESCRIPTION

These scripts perform various useful manipulations on N-gram models in their textual representation. Most operate on backoff N-grams in ARPA ngram-format(5).

Since these tools are implemented as scripts, they do not automatically read or write compressed model files, unlike the main SRILM tools. However, since most scripts read from standard input and write to standard output (when the file argument is omitted or given as ``-''), it is easy to combine them with gunzip(1) or gzip(1) on the command line.
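As an illustration of the pipe pattern just described, the following shell sketch wraps a filter between gunzip and gzip. The function name lm_filter_gz is hypothetical and not part of SRILM; the sketch assumes the chosen filter reads the model on standard input and writes it to standard output when no file argument is given.

```shell
# lm_filter_gz SRC.gz DST.gz FILTER [ARGS...]
# Hypothetical helper (not part of SRILM): decompress SRC.gz, pass the
# text model through FILTER via stdin/stdout, and recompress to DST.gz.
lm_filter_gz() {
    src=$1
    dst=$2
    shift 2
    gunzip -c "$src" | "$@" | gzip > "$dst"
}

# Example (requires the SRILM scripts on PATH):
#   lm_filter_gz big.lm.gz sorted.lm.gz sort-lm
```

Passing the filter as arguments keeps the helper independent of any particular script, so the same wrapper could serve add-dummy-bows, reverse-lm, and the other stdin/stdout filters.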

Also note that many of the scripts take their options with the gawk(1) syntax option=value instead of the more common -option value.
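The option=value form is the ordinary awk convention for command-line variable assignments: an operand of the form name=value sets a variable rather than naming a file. The small sketch below shows the mechanism in isolation; show_linear is a hypothetical name used only for this demonstration.

```shell
# show_linear [linear=1]
# Prints the value awk sees for the variable "linear". An operand such as
# linear=1 is treated by awk as an assignment, not as an input file; the
# trailing "-" tells awk to read standard input.
show_linear() {
    echo | awk '{ print "linear=" linear }' "$@" -
}

# With the scripts described here, the same convention looks like:
#   get-unigram-probs linear=1 model.lm
```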

add-dummy-bows adds dummy backoff weights to N-grams, even where they are not required, to satisfy some broken software that expects backoff weights on all N-grams (except those of highest order).
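To make the transformation concrete, here is a simplified awk sketch of the idea, not the script itself. It assumes a well-formed ARPA file with whitespace-separated fields, and it assumes that a dummy weight of 0 on the log scale (a backoff factor of 1) is acceptable to the consuming software. The name add_bows_sketch is hypothetical.

```shell
# add_bows_sketch: read an ARPA-format LM on stdin and write it to stdout
# with a backoff weight of 0 appended to every N-gram below the highest
# order that lacks one. Simplified illustration only.
add_bows_sketch() {
    awk '
    # Header lines "ngram N=count": remember the highest order present.
    /^ngram [0-9]+=/ {
        split($2, kv, "=")
        if (kv[1] + 0 > maxorder) maxorder = kv[1] + 0
    }
    # Section headers "\N-grams:" set the order of the lines that follow.
    /^\\[0-9]+-grams:/ { order = substr($0, 2) + 0; print; next }
    # The end marker closes the last section.
    $0 == "\\end\\" { order = 0 }
    # An N-gram line has 1 + order fields, or 2 + order with a weight.
    order > 0 && order < maxorder && NF == order + 1 {
        print $0 "\t0"
        next
    }
    { print }
    '
}
```

N-grams of the highest order are left untouched, matching the exception noted in the description.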

change-lm-vocab modifies the vocabulary of an LM to be that in vocab. Any N-grams containing out-of-vocabulary words are removed, new words receive a unigram probability, and the model is renormalized. The -tolower option causes case distinctions to be ignored. -subset only removes words from the LM vocabulary, without adding any. Any remaining ngram-options are passed to ngram(1), and can be used to set debugging level, N-gram order, etc.

empty-sentence-lm modifies an LM so that it allows the empty sentence with probability p. This is useful to modify existing LMs that are trained on non-empty sentences only. ngram-options are passed to ngram(1), and can be used to set debugging level, N-gram order, etc.

make-hiddens-lm constructs an N-gram model that can be used with the ngram -hiddens option. The new model contains intra-utterance sentence boundary tags <#s> with the same probability as the original model had final sentence tags </s>. Also, utterance-initial words are not conditioned on <s> and there is no penalty associated with utterance-final </s>. Such a model might work better if the test corpus is segmented at places other than proper <s> boundaries.

make-lm-subset forms a new LM containing only the N-grams found in the count-file, in ngram-count(1) format. The result still needs to be renormalized with ngram -renorm (which will also adjust the N-gram counts in the header).

make-sub-lm removes N-grams of order exceeding N. This function is now redundant, since all SRILM tools can do this implicitly (without using extra memory and with very small time overhead) when reading N-gram models with the appropriate -order parameter.

remove-lowprob-ngrams eliminates N-grams whose probability is lower than that which they would receive through backoff. This is useful when building finite-state networks for N-gram models. However, this function is now performed much faster by ngram(1) with the -prune-lowprobs option.

reverse-lm produces a new LM that generates sentences with probabilities equal to those of the reversed sentences in the input model.

sort-lm sorts the N-grams in an LM in lexicographic order (left-most words being the most significant). This is not a requirement for SRILM, but might be necessary for some other LM software. (The LMs output by SRILM are sorted somewhat differently, reflecting the internal data structures used; that is also the order that should give best cache utilization when using SRILM to read models.)
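The ordering itself can be illustrated with sort(1) on the lines of a single N-gram section. This is only a sketch of the ordering, not of sort-lm: it assumes tab-separated columns (log probability, space-separated words, optional backoff weight), and the name sort_section_sketch is hypothetical.

```shell
# sort_section_sketch: sort the N-gram lines of ONE section (read from
# stdin) by their word sequence, left-most word most significant.
# Assumes columns: logprob <TAB> words [<TAB> backoff weight].
sort_section_sketch() {
    # In the C locale a space sorts before other printable characters,
    # so comparing the space-joined word field byte by byte orders the
    # N-grams word by word.
    LC_ALL=C sort -t "$(printf '\t')" -k 2,2
}
```

Headers, section markers, and the \end\ line are not handled here; a complete tool would have to sort each section separately and leave those lines in place.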

get-unigram-probs extracts the unigram probabilities in a simple table format from a backoff language model. The linear=1 option causes probabilities to be output on a linear (instead of log) scale.
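The extraction can be sketched in a few lines of awk. This is an illustration under stated assumptions, not the script itself: the output layout (word followed by probability) is assumed, the input is taken to be a well-formed ARPA file storing base-10 log probabilities, and the name unigram_table_sketch is hypothetical.

```shell
# unigram_table_sketch [linear=1] [lm-file]
# Print "word probability" for every unigram of an ARPA-format LM.
# With linear=1, log10 probabilities are converted to a linear scale.
unigram_table_sketch() {
    awk '
    # Start collecting at the unigram section header.
    /^\\1-grams:/ { in_unigrams = 1; next }
    # Any other line starting with a backslash ends the section.
    /^\\/         { in_unigrams = 0 }
    in_unigrams && NF >= 2 {
        # 10^x computed as exp(x * log(10)).
        if (linear) print $2, exp($1 * log(10))
        else        print $2, $1
    }
    ' "$@"
}
```

Because the arguments are handed straight to awk, linear=1 is given in the same option=value form that the scripts themselves use.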

SEE ALSO

ngram-format(5), ngram(1).

BUGS

These are quick-and-dirty scripts, what do you expect?
reverse-lm supports only bigram LMs, and can produce improper probability estimates as a result of inconsistent marginals in the input model.

AUTHOR

Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 1995-2006 SRI International