Vocab

NAME

Vocab - Vocabulary indexing for SRILM

SYNOPSIS

 #include <Vocab.h>

DESCRIPTION

The Vocab class represents sets of string tokens, as typically used for vocabularies, word class names, etc. Additionally, Vocab provides a mapping from such string tokens (type VocabString) to integers (type VocabIndex). VocabIndex values are typically used to index words in language models, which conserves space and speeds up comparisons. Thus, Vocab essentially implements a symbol table into which strings can be ``interned.''
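To illustrate what interning means here, the following is a minimal sketch of a symbol table written with the C++ standard library. It is not SRILM's implementation: ToyVocab, ToyIndex, and Toy_None are hypothetical stand-ins for Vocab, VocabIndex, and Vocab_None, and only the behavior described in this page is imitated.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins; the real types are declared in Vocab.h.
typedef unsigned int ToyIndex;
const ToyIndex Toy_None = (ToyIndex)-1;

// Minimal symbol table: each distinct string is stored once and is
// identified by a small integer afterwards.
class ToyVocab {
public:
    // Return the index of name, adding it if it is not yet known.
    ToyIndex addWord(const std::string &name) {
        auto it = byName.find(name);
        if (it != byName.end()) return it->second;
        ToyIndex idx = (ToyIndex)byIndex.size();
        byIndex.push_back(name);
        byName[name] = idx;
        return idx;
    }
    // Look up a word without extending the table.
    ToyIndex getIndex(const std::string &name) const {
        auto it = byName.find(name);
        return it == byName.end() ? Toy_None : it->second;
    }
    // Map an index back to its string, or 0 if the index is undefined.
    // In this toy the pointer is only valid until the next addWord().
    const char *getWord(ToyIndex idx) const {
        return idx < byIndex.size() ? byIndex[idx].c_str() : 0;
    }
private:
    std::unordered_map<std::string, ToyIndex> byName;
    std::vector<std::string> byIndex;
};
```

Once two strings have been interned, testing them for equality reduces to comparing two integers, which is the saving the description refers to.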

TYPES

VocabIndex: A non-negative integer for representing a string internally.
VocabString: A character array representing a vocabulary item (e.g., a word).

CONSTANTS

maxWordLength: Maximum number of characters in a VocabString.
Vocab_None: A special VocabIndex used to denote no vocabulary item and to terminate VocabIndex arrays.
Vocab_Unknown
Vocab_SentStart
Vocab_SentEnd
Vocab_Pause: Default VocabString values for some common, predefined vocabulary items: unknown word, sentence begin, sentence end, and pause, respectively.

CLASS MEMBERS

Vocab(VocabIndex start = 0, VocabIndex end = 0x7fffffff): When initializing a Vocab object, start and end optionally set the minimum and maximum VocabIndex values assigned by the vocabulary. Indices are allocated in increasing order starting at start.
VocabIndex addWord(VocabString name): Looks up the index of a word string name, adding the word if not already part of the vocabulary.
VocabString getWord(VocabIndex index): Returns the VocabString for index, or 0 if the index isn't defined.
VocabIndex getIndex(VocabString name): Returns the VocabIndex for word name, or Vocab_None if the word isn't defined. (Unlike addWord(), this will not extend the vocabulary if the word is undefined.)
void remove(VocabString name)
void remove(VocabIndex index): Deletes a vocabulary item, either by name or by index.
unsigned int numWords(): Returns the number of current vocabulary entries.
VocabIndex highIndex(): Returns the highest VocabIndex value assigned so far. The next word added will receive an index that is one greater. When meaningful vocabulary subsets are allocated into contiguous index ranges, this function can be used to record the boundaries of each range in VocabIndex space; subset membership can then be tested by comparing an index against those boundaries.
VocabIndex unkIndex: The index of the unknown word (by default assigned to Vocab_Unknown).
VocabIndex ssIndex: The index of the sentence-start tag (by default assigned to Vocab_SentStart).
VocabIndex seIndex: The index of the sentence-end tag (by default assigned to Vocab_SentEnd).
VocabIndex pauseIndex: The index of the pause tag (by default assigned to Vocab_Pause).
Boolean unkIsWord: When true, the unknown word is considered a regular word (default false).
Boolean toLower: When true, all word strings are mapped to lowercase. This is convenient to combine vocabularies, language models, etc., whose vocabularies differ only in the case convention (default false).
Boolean isNonEvent(VocabString word)
Boolean isNonEvent(VocabIndex word): Tests a word string or index for being a ``non-event'', i.e., a token that is not assigned probability in a language model. By default, sentence-start, pauses, and unknown words are non-events.
unsigned read(File &file): Reads word strings from a file and adds them to the vocabulary. For convenience, only the first word on each line is significant (so extra information could be contained in such a file). Returns the number of words read.
void write(File &file, Boolean sorted = true): Writes the vocabulary strings to a file in a format compatible with read(). The sorted argument controls whether the output is lexicographically sorted.
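The use of highIndex() to delimit vocabulary subsets can be sketched as follows. This is a toy model, assuming only what is stated above (indices are handed out in increasing order); RangeVocab and IndexRange are hypothetical names and do not appear in SRILM.

```cpp
#include <string>
#include <unordered_map>

// Toy allocator that hands out indices in increasing order, imitating
// the allocation behavior described above (not SRILM's implementation).
struct RangeVocab {
    std::unordered_map<std::string, unsigned> index;
    unsigned next = 0;

    unsigned addWord(const std::string &w) {
        auto it = index.find(w);
        if (it != index.end()) return it->second;
        index[w] = next;
        return next++;
    }
    // Highest index assigned so far; only meaningful once a word exists.
    unsigned highIndex() const { return next - 1; }
};

// A subset that was allocated contiguously is described by its bounds.
struct IndexRange {
    unsigned first, last;   // inclusive bounds
    bool contains(unsigned i) const { return first <= i && i <= last; }
};
```

A caller would note highIndex() + 1 before adding the members of a subset and highIndex() after adding them, then keep those two values as an IndexRange. Membership tests then need no string lookups at all.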

Often one wants to manipulate not single vocabulary items, but strings of them, e.g., to represent sentences. Word strings are represented as self-delimiting arrays of type VocabString * or VocabIndex *. The last element in a string is 0 or Vocab_None, respectively.

unsigned getWords(const VocabIndex *wids, VocabString *words, unsigned max): Extends getWord() to strings of words. The result is placed in words, which must have room for at least max words. Returns the actual number of indices in wids.
unsigned addWords(const VocabString *words, VocabIndex *wids, unsigned max): Extends addWord() to strings of indices. The result is placed in wids, which must have room for at least max indices. Returns the actual number of words in words.
unsigned getIndices(const VocabString *words, VocabIndex *wids, unsigned max): Extends getIndex() to strings of indices. The result is placed in wids, which must have room for at least max indices. Returns the actual number of words in words.

FUNCTIONS

The following static member functions are utilities to manipulate strings of vocabulary items, independent of a particular vocabulary.

unsigned parseWords(char *line, VocabString *words, unsigned max): Parses a character string line into whitespace-delimited words. On return, words contains pointers to null-terminated substrings of line (whose contents are modified in the process). words must have room for at least max pointers. Returns the actual number of words parsed.
unsigned length(const VocabIndex *words)
unsigned length(const VocabString *words): Returns the number of items in a word string.
Boolean contains(const VocabIndex *words, VocabIndex word): Returns true if the word occurs among words.
VocabIndex *reverse(VocabIndex *words)
VocabString *reverse(VocabString *words): Reverses a string of words in place (and returns it as a result).
void write(File &file, const VocabString *words): Writes a string of space-delimited words to a file.
int compare(VocabIndex word1, VocabIndex word2)
int compare(VocabString word1, VocabString word2): Compares two vocabulary items lexicographically. Returns -1, 0, +1 for less than, equal, or greater than, respectively.
int compare(const VocabIndex *words1, const VocabIndex *words2)
int compare(const VocabString *words1, const VocabString *words2): Extends the order of compare() to strings of words.

For compatibility with the C library calling conventions, compare() cannot be a member function of a Vocab object. For index-based comparisons the associated vocabulary needs to be set globally. This is achieved by calling the compareIndex() member function of a Vocab object.
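The reason a global setting is needed can be seen in a small sketch built on the C library's qsort(), whose comparator receives only the two elements being compared. All names below (toyCompareTable, toySetCompareTable, toyCompareIndex, toySortIndices) are hypothetical, and a plain string table stands in for the vocabulary; how SRILM stores its global state is not shown in this page.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical global: the table the comparator consults. A qsort()
// comparator gets no context argument, so the context must live here.
static const char *const *toyCompareTable = 0;

// Select which table subsequent index comparisons will use.
void toySetCompareTable(const char *const *table) {
    toyCompareTable = table;
}

// C-style comparator that orders indices by the strings they denote.
int toyCompareIndex(const void *a, const void *b) {
    unsigned ia = *(const unsigned *)a;
    unsigned ib = *(const unsigned *)b;
    int c = std::strcmp(toyCompareTable[ia], toyCompareTable[ib]);
    return (c > 0) - (c < 0);   // normalize to -1, 0, +1
}

// Sort an array of indices lexicographically by their strings.
void toySortIndices(unsigned *wids, std::size_t n) {
    std::qsort(wids, n, sizeof wids[0], toyCompareIndex);
}
```

A consequence of this design is that only one table can be active at a time, so code that sorts against several vocabularies has to reset the global before each sort.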

ostream &operator<< (ostream &, const VocabString *words)
ostream &operator<< (ostream &, const VocabIndex *words): These operators output strings of words to a stream. For the second variant, the Vocab object used for interpreting indices needs to be identified globally by calling the use() member function on the object.

ITERATORS

The VocabIter class provides iteration over vocabularies. An iteration returns the elements of a Vocab in some unspecified, but deterministic order.

When copied or used in initialization of other objects, VocabIter objects retain the current ``position'' in an iteration. This allows nested iterations that enumerate all pairs of distinct elements, etc.

NOTE: While an iteration over a Vocab object is ongoing, no modifications are allowed to the object, except removal of the ``current'' vocabulary item.

VocabIter(Vocab &vocab, Boolean sorted = false): Creates an iteration over vocab. If sorted is set to true the vocabulary items will be enumerated in lexicographic order.
void init(): Reinitializes the iteration to its beginning.
VocabString next()
VocabString next(VocabIndex &index): Steps the iteration and returns the next word string. Optionally, the associated word index is returned in index. Returns 0 if the vocabulary is exhausted.
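The remark about copied iterators retaining their position can be sketched with a toy iterator over a std::vector. ToyIter and toyPairs are hypothetical and only imitate the next() convention described above (return 0 when exhausted); VocabIter itself is not used here.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy iterator with next()-style stepping. The default copy constructor
// copies pos, so a copy resumes where the original currently stands.
class ToyIter {
public:
    explicit ToyIter(const std::vector<std::string> &v)
        : items(&v), pos(0) {}
    // Return the next item, or 0 when the sequence is exhausted.
    const char *next() {
        return pos < items->size() ? (*items)[pos++].c_str() : 0;
    }
private:
    const std::vector<std::string> *items;
    std::size_t pos;
};

// Enumerate each pair of distinct elements once: the inner iteration
// starts from a copy of the outer one, just past the outer's element.
std::vector<std::pair<std::string, std::string> >
toyPairs(const std::vector<std::string> &v) {
    std::vector<std::pair<std::string, std::string> > out;
    ToyIter outer(v);
    while (const char *a = outer.next()) {
        ToyIter inner(outer);   // the copy retains the current position
        while (const char *b = inner.next()) {
            out.push_back(std::make_pair(std::string(a), std::string(b)));
        }
    }
    return out;
}
```

For n elements this visits n(n-1)/2 pairs, rather than the n*n that two independent iterations from the beginning would produce.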

BUGS

There is no good way to synchronize VocabIndex values across multiple Vocab objects.