anti-ngram

NAME

anti-ngram - count posterior-weighted N-grams in N-best lists

SYNOPSIS

anti-ngram [ -help ] option ...

DESCRIPTION

anti-ngram counts the N-grams in a set of N-best hypothesis lists. Each N-gram count is weighted by the posterior probability of the hypothesis in which the N-gram occurs. Thus, anti-ngram can be used to construct language models of word sequences that are acoustically confusable with the correct hypotheses. The output counts should be processed with ngram-count -float-counts to estimate a language model.
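The counting scheme described above can be sketched in Python. This is only an illustration under stated assumptions, not the tool's actual implementation: the function names (ngrams, anti_ngram_counts) are hypothetical, the posteriors are assumed to be already normalized, and sentence-boundary handling is ignored.

```python
from collections import defaultdict

def ngrams(words, max_order):
    """Yield every N-gram (as a tuple) of length 1..max_order."""
    for n in range(1, max_order + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

def anti_ngram_counts(hyps, reference, max_order=3, all_ngrams=False):
    """Accumulate posterior-weighted N-gram counts for one utterance.

    hyps       -- list of (words, posterior) pairs (hypothetical layout)
    reference  -- list of reference transcript words
    all_ngrams -- if True, also count N-grams found in the reference
    """
    ref_ngrams = set(ngrams(reference, max_order))
    counts = defaultdict(float)
    for words, posterior in hyps:
        for gram in ngrams(words, max_order):
            # Skip N-grams that also occur in the reference unless asked not to.
            if all_ngrams or gram not in ref_ngrams:
                counts[gram] += posterior
    return dict(counts)
```

With the reference "a b" and hypotheses "a b" (posterior 0.75) and "a c" (posterior 0.25), only the N-grams "c" and "a c" survive the reference filter, each with a fractional count of 0.25. Fractional counts of this kind are why the output is meant for ngram-count -float-counts.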

OPTIONS

Each filename argument can be an ASCII file, a compressed file (name ending in .Z or .gz), or "-" to indicate stdin/stdout.

-help: Print option summary.
-version: Print version information.
-refs file: Read the reference transcripts from file. Each line should contain an utterance ID followed by the transcript words.
-nbest-files file: List of N-best files. The base components of filenames must correspond to the utterance IDs found in the reference file.
-max-nbest n: Limit the number of hypotheses read from each N-best list to the first n.
-order n: Set the maximal order (length) of N-grams to count. The default order is 3.
-lm file: Read an ARPA language model from file and rescore the N-best lists with it before counting N-grams.
-classes file: Interpret the LM as a class-based N-gram and read class definitions in classes-format(5) from file.
-tolower: Map vocabulary to lowercase, eliminating case distinctions.
-multiwords: Split multiwords (words joined by '_') into their components when reading N-best lists.
-multi-char C: Character used to delimit component words in multiwords (an underscore character by default).
-rescore-lmw lmw: Set the language model weight used to combine the language model log probabilities with the acoustic log probabilities (only relevant if separate scores are given in the N-best input).
-rescore-wtw wtw: Set the word transition weight used to weight the number of words relative to the acoustic log probabilities (only relevant if separate scores are given in the N-best input).
-posterior-scale scale: Divide the total weighted log score by scale when computing normalized posterior probabilities. This controls the peakedness of the posterior distribution. The default is the value given for -rescore-lmw, so that language model scores have weight 1 and acoustic scores have weight 1/lmw.
-all-ngrams: Count all N-grams, including those that occur in the reference string. By default, N-grams from the N-best lists that also occur in the corresponding reference are excluded.
-min-count C: Prune all N-grams from the output that have counts less than C.
-read-counts countsfile: Read N-gram counts from a file. Each line contains an N-gram of words, followed by an integer or fractional count, all separated by whitespace. Repeated counts for the same N-gram are added. N-grams from N-best lists are added to those read with this option.
-write-counts countsfile: Write the total N-gram counts to countsfile. The default is to write to stdout.
-sort: Output counts in lexicographic order, as required for ngram-merge(1).
-debug level: Set debugging output level. Level 0 means no debugging. Debugging messages are written to stderr.
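
The interaction of -rescore-lmw, -rescore-wtw and -posterior-scale can be illustrated with a short Python sketch. It is a reading of the option descriptions above, not the tool's source: the function name posteriors and the (acoustic, lm, num_words) tuple layout are hypothetical, and natural-log scores are assumed (the log base actually used by the tool is not stated here).

```python
import math

def posteriors(hyps, lmw, wtw, scale=None):
    """Return normalized posterior probabilities for one N-best list.

    hyps  -- list of (acoustic_logprob, lm_logprob, num_words) tuples
    lmw   -- language model weight (as for -rescore-lmw)
    wtw   -- word transition weight (as for -rescore-wtw)
    scale -- posterior scale (as for -posterior-scale); defaults to lmw
    """
    if scale is None:
        scale = lmw
    # Total weighted log score of each hypothesis, divided by the scale.
    scores = [(ac + lmw * lm + wtw * nw) / scale for ac, lm, nw in hyps]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

A larger scale flattens the distribution and a smaller one sharpens it, which is the "peakedness" referred to under -posterior-scale. With scale left at its default of lmw, each score becomes acoustic/lmw + lm + (wtw/lmw) * num_words, matching the weights of 1/lmw and 1 described above.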

BUGS

There is no -vocab option to limit the vocabulary.