anti-ngram
NAME
anti-ngram - count posterior-weighted N-grams in N-best lists
SYNOPSIS
anti-ngram [ -help ] option ...
DESCRIPTION
anti-ngram
counts the N-grams in a set of N-best hypothesis lists.
The N-gram counts are weighted by the posterior probabilities of the
hypotheses they occur in.
Thus,
anti-ngram
can be used to construct language models of word sequences
that are acoustically confusable with correct hypotheses.
The resulting counts are fractional, so they should be processed with
ngram-count -float-counts
to estimate a language model.
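The weighting described above can be illustrated with a short Python sketch.
It is not the SRILM implementation: the function names, the input layout
(one list of (posterior, words) pairs per utterance), and the omission of
sentence-boundary tokens are all assumptions made for illustration. The
max_order and all_ngrams parameters mirror the -order and -all-ngrams options.

```python
from collections import defaultdict


def ngrams(words, max_order):
    """Yield every N-gram (as a tuple) of length 1..max_order in words."""
    for n in range(1, max_order + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])


def weighted_counts(hyps, reference, max_order=3, all_ngrams=False):
    """Accumulate posterior-weighted N-gram counts for one utterance.

    hyps is a list of (posterior, words) pairs (hypothetical layout).
    Each N-gram occurrence adds the posterior of its hypothesis to the
    count. Unless all_ngrams is set, N-grams that also occur in the
    reference are skipped.
    """
    ref_ngrams = set(ngrams(reference, max_order))
    counts = defaultdict(float)
    for posterior, words in hyps:
        for gram in ngrams(words, max_order):
            if all_ngrams or gram not in ref_ngrams:
                counts[gram] += posterior
    return dict(counts)


# Toy N-best list with two hypotheses; the reference is "a b".
hyps = [(0.75, ["a", "b"]), (0.25, ["a", "c"])]
anti = weighted_counts(hyps, ["a", "b"], max_order=2)
```

With this toy input only the N-grams introduced by the incorrect hypothesis
("c" and "a c") receive counts, each weighted by that hypothesis's posterior.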
OPTIONS
Each filename argument can be an ASCII file, a
compressed file (name ending in .Z or .gz), or "-" to indicate
stdin/stdout.
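Scripts that prepare input for or consume output from the tool may want to
follow the same filename convention. The Python sketch below is one possible
way to do so; open_text is a hypothetical helper, not part of SRILM, and it
leaves .Z files unsupported because the Python standard library has no
decompressor for that format.

```python
import gzip
import sys


def open_text(name, mode="r"):
    """Open name for text I/O; mode is "r" (read) or "w" (write)."""
    if name == "-":
        # "-" selects stdin when reading and stdout when writing.
        return sys.stdin if mode == "r" else sys.stdout
    if name.endswith(".gz"):
        return gzip.open(name, mode + "t", encoding="ascii")
    if name.endswith(".Z"):
        # Assumption of this sketch: compress(1) files are not handled.
        raise NotImplementedError(".Z files need an external uncompress step")
    return open(name, mode, encoding="ascii")
```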
- -help
-
Print option summary.
- -version
-
Print version information.
- -refs file
-
Read the reference transcripts from
file.
Each line should contain an utterance ID followed by the transcript words.
- -nbest-files file
-
List of N-best files.
The base components of filenames must correspond to the utterance IDs found
in the reference file.
- -max-nbest n
-
Limit the number of hypotheses read from each N-best list to the first
n.
- -order n
-
Set the maximal order (length) of N-grams to count.
The default order is 3.
- -lm file
-
Read an ARPA language model from
file
and rescore the N-best lists with it prior to counting N-grams.
- -classes file
-
Interpret the LM as a class-based N-gram and read class definitions
in
classes-format(5)
from
file.
- -tolower
-
Map vocabulary to lowercase, eliminating case distinctions.
- -multiwords
-
Split multiwords (words joined by '_') into their components when
reading N-best lists.
- -multi-char C
-
Character used to delimit component words in multiwords
(an underscore character by default).
- -rescore-lmw lmw
-
Set the language model weight used in combining the language model log
probabilities with acoustic log probabilities
(only relevant if separate scores are given in the N-best input).
- -rescore-wtw wtw
-
Set the word transition weight used to weight the number of words relative to
the acoustic log probabilities
(only relevant if separate scores are given in the N-best input).
- -posterior-scale scale
-
Divide the total weighted log score by
scale
when computing normalized posterior probabilities.
This controls the peakedness of the posterior distribution.
The default value is whatever was chosen for
-rescore-lmw,
so that language model scores are scaled to have weight 1,
and acoustic scores have weight 1/lmw.
- -all-ngrams
-
Count all N-grams, including those that occur in the reference string.
By default, N-grams from the N-best list that also occur in the corresponding
reference are excluded.
- -min-count C
-
Prune all N-grams from the output that have counts less than
C.
- -read-counts countsfile
-
Read N-gram counts from a file.
Each line contains an N-gram of
words, followed by an integer or fractional count, all separated by whitespace.
Repeated counts for the same N-gram are added.
N-grams from N-best lists are added to those read with this option.
- -write-counts countsfile
-
Write total N-gram counts to
countsfile.
The default is to write to stdout.
- -sort
-
Output counts in lexicographic order, as required for
ngram-merge(1).
- -debug level
-
Set debugging output level.
Level 0 means no debugging.
Debugging messages are written to stderr.
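The interaction of -rescore-lmw, -rescore-wtw and -posterior-scale described
above can be sketched in Python as follows. This is an illustration under
stated assumptions, not the SRILM code: the function name, the input layout
(acoustic log probability, language model log probability, word count per
hypothesis), the use of natural logarithms, and the default weights shown
here are all placeholders chosen for the example.

```python
import math


def posteriors(hyps, lmw=1.0, wtw=0.0, scale=None):
    """Return normalized posteriors for a list of hypotheses.

    hyps is a list of (acoustic_logprob, lm_logprob, num_words) tuples.
    The combined score is acoustic + lmw * lm + wtw * num_words, which is
    divided by scale before normalization. As in the option description,
    scale defaults to lmw (so lmw must be nonzero in that case).
    """
    if scale is None:
        scale = lmw
    scaled = [(ac + lmw * lm + wtw * n) / scale for ac, lm, n in hyps]
    top = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


nbest = [(-10.0, -2.0, 3), (-14.0, -2.0, 3)]
sharp = posteriors(nbest, lmw=2.0, scale=1.0)
default = posteriors(nbest, lmw=2.0)       # scale defaults to lmw
flat = posteriors(nbest, lmw=2.0, scale=10.0)
```

A larger scale flattens the distribution, so more weight reaches
lower-ranked hypotheses and hence the N-grams they contain.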
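The counts handling described for -read-counts, -min-count and -sort can be
sketched in Python as follows. The helper names are hypothetical, the tab
before the count in the output is an assumption, and Python's default string
ordering only stands in for whatever lexicographic order ngram-merge(1)
expects.

```python
from collections import defaultdict


def read_counts(lines, counts=None):
    """Accumulate counts from lines of the form 'w1 ... wN count'."""
    if counts is None:
        counts = defaultdict(float)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        *words, count = fields
        # Repeated entries for the same N-gram are added together.
        counts[tuple(words)] += float(count)
    return counts


def write_counts(counts, min_count=0.0, sort=False):
    """Return output lines, dropping N-grams with counts below min_count."""
    items = [(gram, c) for gram, c in counts.items() if c >= min_count]
    if sort:
        items.sort(key=lambda item: item[0])
    return [" ".join(gram) + "\t" + format(c, "g") for gram, c in items]


counts = read_counts(["b a 0.5", "a b 1", "b a 0.25", "c 0.1"])
lines = write_counts(counts, min_count=0.5, sort=True)
```

Counts accumulated from N-best lists could be added to the same dictionary
before writing, matching the additive behavior described for -read-counts.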
SEE ALSO
ngram(1), ngram-merge(1), ngram-count(1), nbest-scripts(1),
classes-format(5),
A. Stolcke et al., "The SRI March 2000 Hub-5 Conversational Speech
Transcription System",
Proc. NIST Speech Transcription Workshop, College Park, MD, 2000.
BUGS
There is no
-vocab
option to limit the vocabulary.
AUTHOR
Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 2000-2008 SRI International