anti-ngram

NAME

anti-ngram - count posterior-weighted N-grams in N-best lists

SYNOPSIS

anti-ngram [ -help ] option ...

DESCRIPTION

anti-ngram counts the N-grams in a set of N-best hypothesis lists. The N-gram counts are weighted by the posterior probabilities of the hypotheses in which they occur. Thus, anti-ngram can be used to construct language models of word sequences that are acoustically confusable with the correct hypotheses. The output counts should be processed with ngram-count -float-counts to estimate a language model.
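The weighting described above can be illustrated with a small Python sketch. It is not the anti-ngram implementation: the function names are hypothetical, hypothesis scores are assumed to be natural-log values that have already been combined into one number per hypothesis, and sentence-boundary tokens are ignored.

```python
import math
from collections import defaultdict

def ngrams(words, max_order):
    """Yield every N-gram (as a tuple) of length 1..max_order in words."""
    for n in range(1, max_order + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

def posteriors(log_scores):
    """Normalize log scores into probabilities that sum to 1."""
    top = max(log_scores)  # shift by the maximum for numerical stability
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]

def anti_ngram_counts(reference, nbest, max_order=3, all_ngrams=False):
    """Accumulate posterior-weighted N-gram counts for one utterance.

    reference: list of reference words (assumed input shape)
    nbest:     list of (log_score, words) pairs (assumed input shape)
    """
    ref_ngrams = set(ngrams(reference, max_order))
    counts = defaultdict(float)
    probs = posteriors([score for score, _ in nbest])
    for prob, (_, words) in zip(probs, nbest):
        for gram in ngrams(words, max_order):
            # Skip N-grams shared with the reference unless asked otherwise.
            if all_ngrams or gram not in ref_ngrams:
                counts[gram] += prob
    return dict(counts)
```

Calling anti_ngram_counts(ref_words, hyps, max_order=3) returns a dictionary mapping N-gram tuples to fractional counts, which is why the output needs to be treated as floating-point counts downstream.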

OPTIONS

Each filename argument can be an ASCII file, a compressed file (name ending in .Z or .gz), or "-" to indicate stdin/stdout.

-help
Print option summary.
-version
Print version information.
-refs file
Read the reference transcripts from file. Each line should contain an utterance ID followed by the transcript words.
-nbest-files file
Read the list of N-best files to process from file. The base components of the filenames must correspond to the utterance IDs found in the reference file.
-max-nbest n
Limits the number of hypotheses read from each N-best list to the first n.
-order n
Set the maximal order (length) of N-grams to count. The default order is 3.
-lm file
Reads an ARPA language model from file and rescores the N-best lists with it prior to counting N-grams.
-classes file
Interpret the LM as a class-based N-gram and read class definitions in classes-format(5) from file.
-tolower
Map vocabulary to lowercase, eliminating case distinctions.
-multiwords
Split multiwords (words joined by '_') into their components when reading N-best lists.
-multi-char C
Character used to delimit component words in multiwords (an underscore character by default).
-rescore-lmw lmw
Sets the language model weight used in combining the language model log probabilities with acoustic log probabilities (only relevant if separate scores are given in the N-best input).
-rescore-wtw wtw
Sets the word transition weight used to weight the number of words relative to the acoustic log probabilities (only relevant if separate scores are given in the N-best input).
-posterior-scale scale
Divide the total weighted log score by scale when computing normalized posterior probabilities. This controls the peakedness of the posterior distribution. The default value is whatever was chosen for -rescore-lmw, so that language model scores are scaled to have weight 1, and acoustic scores have weight 1/lmw.
-all-ngrams
Count all N-grams, including those that occur in the reference string. By default, N-grams from the N-best lists that also occur in the corresponding reference are excluded.
-min-count C
Prune from the output all N-grams whose count is less than C.
-read-counts countsfile
Read N-gram counts from a file. Each line contains an N-gram of words, followed by an integer or fractional count, all separated by whitespace. Repeated counts for the same N-gram are added. N-grams from N-best lists are added to those read with this option.
-write-counts countsfile
Writes total N-gram counts to countsfile. The default is to write to stdout.
-sort
Output counts in lexicographic order, as required for ngram-merge(1).
-debug level
Set debugging output level. Level 0 means no debugging. Debugging messages are written to stderr.
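
How -rescore-lmw, -rescore-wtw and -posterior-scale interact can be sketched in Python. The linear combination below is an assumption that matches the option descriptions above, not code taken from anti-ngram. The function names are hypothetical, and scores are treated as natural-log values.

```python
import math

def combined_score(acoustic, lm, num_words, lmw, wtw):
    """Weighted log score of one hypothesis (assumed linear combination)."""
    return acoustic + lmw * lm + wtw * num_words

def scaled_posteriors(hyps, lmw, wtw, posterior_scale=None):
    """Posterior probabilities for a list of hypotheses.

    hyps: list of (acoustic_logprob, lm_logprob, num_words) triples
    posterior_scale: divisor for the total score; falls back to lmw,
                     mirroring the default described for -posterior-scale
    """
    scale = lmw if posterior_scale is None else posterior_scale
    scaled = [combined_score(a, l, n, lmw, wtw) / scale for a, l, n in hyps]
    top = max(scaled)  # shift by the maximum for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]
```

With the scale equal to lmw, the language model term enters with weight 1 and the acoustic term with weight 1/lmw. A larger scale flattens the distribution, and a smaller one concentrates probability on the top hypothesis.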

SEE ALSO

ngram(1), ngram-merge(1), ngram-count(1), nbest-scripts(1), classes-format(5),
A. Stolcke et al., "The SRI March 2000 Hub-5 Conversational Speech Transcription System", Proc. NIST Speech Transcription Workshop, College Park, MD, 2000.

BUGS

There is no -vocab option to limit the vocabulary.
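
One possible workaround is to filter the counts after they have been written. The Python sketch below assumes the written counts use the same layout described for -read-counts (an N-gram of words followed by a count, separated by whitespace). The function names are hypothetical and not part of any SRILM tool.

```python
from collections import defaultdict

def parse_counts(lines):
    """Parse 'w1 ... wN count' lines, adding up repeated N-grams."""
    counts = defaultdict(float)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[tuple(fields[:-1])] += float(fields[-1])
    return dict(counts)

def restrict_to_vocab(counts, vocab):
    """Keep only N-grams whose words all belong to vocab."""
    return {gram: c for gram, c in counts.items()
            if all(word in vocab for word in gram)}
```

The filtered dictionary can then be written back out, one N-gram and count per line, before a language model is estimated from it.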

AUTHOR

Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 2000-2008 SRI International