LM

NAME

LM - Generic language model

SYNOPSIS

 #include <LM.h>

DESCRIPTION

The LM class specifies a minimal language model interface and provides some generic utilities.

LM inherits from Debug, and the debugging level of an LM object determines whether and how much verbose information is printed by various functions.

CLASS MEMBERS

LM(Vocab &vocab): Initializing an LM object requires specifying the vocabulary over which the LM is defined. The vocab object can be shared among different LM instances. The LM object can modify vocab as a side-effect, e.g., as a result of reading an LM from a file.
LogP wordProb(VocabIndex word, const VocabIndex *context)
LogP wordProb(VocabString word, const VocabString *context): Returns the conditional log probability of word given a history. The history is given in reversed order (most recent word first) in context, and terminated by Vocab_None. Word or history can be specified either by strings or indices. All functional LM subclasses have to implement at least the first version.
LogP wordProbRecompute(VocabIndex word, const VocabIndex *context): Returns the same conditional log probability as wordProb(), but on the promise that context is identical to the one passed in the last call to wordProb(). This often allows for an efficient implementation that speeds up repeated lookups in the same context.
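The reversed, Vocab_None-terminated context layout expected by wordProb() can be sketched as follows. This is a minimal illustration, not code from the library: reversedContext() is a hypothetical helper, and the VocabIndex and Vocab_None definitions are stand-ins so that the sketch compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-ins for the SRILM types, defined here only so the sketch compiles
// on its own; real code would get them from the SRILM headers.
typedef unsigned VocabIndex;
const VocabIndex Vocab_None = static_cast<VocabIndex>(-1);

// Hypothetical helper (not part of LM): builds the context for the word at
// position pos, i.e. the preceding words, most recent first, followed by
// the Vocab_None terminator.
std::vector<VocabIndex>
reversedContext(const std::vector<VocabIndex> &sentence, std::size_t pos)
{
    assert(pos <= sentence.size());
    std::vector<VocabIndex> context;
    for (std::size_t i = pos; i > 0; i--) {
        context.push_back(sentence[i - 1]);
    }
    context.push_back(Vocab_None);
    return context;
}
```

For a sentence with indices 10, 20, 30, the context of the third word is 20, 10, Vocab_None. A pointer to the first element of such a vector could then be passed as the context argument.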
LogP sentenceProb(const VocabIndex *sentence, TextStats &stats)
LogP sentenceProb(const VocabString *sentence, TextStats &stats): Returns the total log probability of a string of words (a sentence). The data in the stats object is incremented to reflect the statistics of the sentence.
unsigned pplFile(File &file, TextStats &stats, const char *escapeString = 0): Reads sentences from file, computing their probabilities and aggregate perplexity, and updating the stats. The debugging state of the LM object determines how much information is printed to stderr. debuglevel 0: total statistics only; debuglevel 1: per-sentence statistics; debuglevel 2: word probabilities; debuglevel 3 and greater: LM specific information.
Lines in file that start with escapeString are copied to the output. This allows extra information in the input file to be passed through unchanged.
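How an aggregate perplexity follows from accumulated statistics can be sketched as below. The sketch assumes base-10 log probabilities and one end-of-sentence token per sentence; ToyStats and its field names are hypothetical and do not describe the actual TextStats class.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the aggregate statistics; the field names are
// assumptions made for this sketch, not a description of TextStats.
struct ToyStats {
    double logProb;        // total log probability, assumed base 10
    unsigned numSentences; // sentences scored
    unsigned numWords;     // words scored, excluding sentence boundaries
};

// Perplexity per predicted token, counting one end-of-sentence token per
// sentence: 10^(-logProb / (numWords + numSentences)).
double perplexity(const ToyStats &stats)
{
    unsigned tokens = stats.numWords + stats.numSentences;
    assert(tokens > 0);
    return std::pow(10.0, -stats.logProb / tokens);
}
```

A single two-word sentence with total log probability -6 gives three predicted tokens and hence a perplexity of 100.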
unsigned rescoreFile(File &file, double lmScale, double wtScale, LM &oldLM, double oldLmScale, double oldWtScale, const char *escapeString = 0): Reads N-best hypotheses and scores from file, replaces the LM scores with new ones computed from the current model, and prints the new scores (including hypotheses) to stdout. lmScale and wtScale are the LM and word transition weights, respectively. oldLM is the LM whose scores are included in the aggregate scores read from the input (provided so that they can be subtracted out), and oldLmScale and oldWtScale are the old LM and word transition weights, respectively.
Lines in file that start with escapeString are copied to the output.
void setState(const char *state): This is a generic interface to change the internal ``state'' of an LM. The default implementation of this function does nothing, but certain LM subclass implementations may interpret the state string to assume different internal configurations.
Prob wordProbSum(const VocabIndex *context): Returns the sum of all word probabilities in context. Useful for checking the well-definedness of a model.
VocabIndex generateWord(const VocabIndex *context): Returns a word index from the vocabulary, randomly generated according to the conditional probabilities in context.
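Random generation from a conditional distribution can be sketched as inverse-CDF sampling. This is one possible approach under stated assumptions, not necessarily how generateWord() is implemented: sampleWord() is a hypothetical helper, probs[w] stands for the probability of word w in the given context, and u is a uniform random number in [0, 1) supplied by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-ins for the SRILM types, defined here only so the sketch compiles.
typedef unsigned VocabIndex;
typedef double Prob;

// Hypothetical helper (not part of LM): returns the first word whose
// cumulative probability exceeds u.
VocabIndex sampleWord(const std::vector<Prob> &probs, double u)
{
    assert(!probs.empty());
    Prob cumulative = 0.0;
    for (std::size_t w = 0; w < probs.size(); w++) {
        cumulative += probs[w];
        if (u < cumulative) {
            return static_cast<VocabIndex>(w);
        }
    }
    // Rounding can leave the total slightly below 1; fall back to the
    // last word in that case.
    return static_cast<VocabIndex>(probs.size() - 1);
}
```

The sketch relies on the probabilities summing to one, which is the property that wordProbSum() allows one to check.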
VocabIndex *generateSentence(unsigned maxWords = maxWordsPerLine, VocabIndex *sentence = 0)
VocabString *generateSentence(unsigned maxWords = maxWordsPerLine, VocabString *sentence = 0): Generates a random sentence of length up to maxWords. The result is placed in sentence if specified, or in a static buffer otherwise.
void *contextID(const VocabIndex *context): Returns an implementation-dependent value that identifies the word context used to compute a conditional probability. (The context actually used may be shorter than what is specified in context.)
Boolean isNonWord(VocabIndex word): Returns true if word is not a regular word in the LM, i.e., not one that the LM computes probabilities for (for example, a non-event tag such as sentence-start).
Boolean read(File &file, Boolean limitVocab = false): Reads an LM from file. Returns true if the file contents were formatted correctly and an internal LM representation could be successfully constructed from them. The optional second argument controls whether words not already in the vocabulary are to be added automatically.
void write(File &file): Writes the LM to file in a format that can be read back by read().
Vocab &vocab: The vocabulary object associated with LM (set at initialization).
VocabIndex noiseIndex: The index of the noise tag, i.e., a word that is skipped when computing probabilities.
const char *stateTag: A string introducing ``state'' information that should be passed to the LM. Input lines starting with this tag are handed to setState() by pplFile() and rescoreFile().
Boolean reverseWords: If set to true, the LM reverses word order before computing sentence probabilities. This means wordProb() is expected to compute conditional probabilities based on right contexts.
