select-vocab

NAME

select-vocab - Select a maximum-likelihood vocabulary from a mixture of corpora.

SYNOPSIS

select-vocab [ option ... ] -heldout file f1 f2 ... 

DESCRIPTION

select-vocab picks a vocabulary from the union of the vocabularies of files f1 through fn so as to maximize the likelihood of the heldout file. When invoked as above, the program prints the unsorted list of all words in the input corpora together with their weights. The list can then be sorted in decreasing order of weight, and a vocabulary chosen by picking a suitable threshold weight and discarding words whose weight falls below it.

The format of each input file f1 through fn is detected automatically. A file may be a count file, characterized by each line ending in a number; an ARPA language model in ngram-format(5); or plain text. If a text file has a name ending in ".sentid", the first field of each line is assumed to be a sentence identifier and is discarded. Any input file may also be compressed, provided gzip is installed and available on the system.
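
The format rules above can be sketched in a few lines of Python. This is only an illustration of the description, not the logic of the Perl script itself: the function names, the 100-line sample, the gzip magic-byte check, the "\data\" header test for ARPA models, and the handling of a trailing ".gz" suffix are all assumptions.

```python
import gzip

def open_maybe_gzip(path):
    """Open a plain or gzip-compressed file as text (gzip detected by magic bytes)."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")

def is_number(token):
    """True if the token parses as a number."""
    try:
        float(token)
        return True
    except ValueError:
        return False

def guess_format(path, sample_size=100):
    """Return 'arpa', 'counts', 'sentid' or 'text' for a corpus file.

    Hypothetical heuristic: only the first sample_size lines are inspected.
    """
    with open_maybe_gzip(path) as fh:
        lines = [ln.strip() for _, ln in zip(range(sample_size), fh)]
    lines = [ln for ln in lines if ln]
    # Assumption: an ARPA model is recognized by its "\data\" header line.
    if any(ln == "\\data\\" for ln in lines):
        return "arpa"
    # Count files: every sampled line ends in a number.
    if lines and all(is_number(ln.split()[-1]) for ln in lines):
        return "counts"
    # Assumption: a trailing ".gz" is ignored when checking for ".sentid".
    name = path[:-3] if path.endswith(".gz") else path
    if name.endswith(".sentid"):
        return "sentid"
    return "text"
```

A heuristic of this kind would misclassify a plain text file in which every line happens to end in a number, so such files may need to be renamed or preprocessed.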

OPTIONS

-help
Prints a short help message.
-heldout file
Likelihood maximization is performed on the contents of file. This file may also be in any of the formats supported for the input corpora, namely: text, counts, sentid, or ARPA-lm.
-quiet
Suppresses printing of progress and other informative messages during execution. By default the script writes these to the standard error stream.
-scale n
The combined final counts are scaled by n before being written out. This makes it possible to sort the output list numerically with sort(1). The default scale is 1e6.
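
The sort-and-threshold step described in DESCRIPTION can be done with sort(1) or with a short script such as the sketch below. It assumes that each output line holds a word followed by its scaled weight, separated by whitespace; the exact output layout is not specified here, and the sample words, weights, and threshold are made up for illustration.

```python
def rank_vocabulary(lines):
    """Parse 'word weight' lines and return (word, weight) pairs, heaviest first."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Assumption: the word is the first field and the weight the last.
        pairs.append((fields[0], float(fields[-1])))
    return sorted(pairs, key=lambda p: p[1], reverse=True)

def select_vocabulary(ranked, threshold):
    """Keep the words whose weight is at least the threshold."""
    return [word for word, weight in ranked if weight >= threshold]

# Hypothetical output lines, invented for this example.
sample = ["the 41250.3", "zyzzyva 0.8", "of 22011.7", "speech 310.5"]
ranked = rank_vocabulary(sample)
print(select_vocabulary(ranked, threshold=100.0))  # prints ['the', 'of', 'speech']
```

A suitable threshold depends on the scale given with -scale, so the two should be chosen together.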

NOTES

The program is written in perl(1), which must be installed for it to run.

This implementation corrects a minor error in the algorithm specification in [1]. The paper describes corpus-level interpolation, whereas the script performs word-level interpolation.
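
To make the term concrete, corpus-level interpolation can be pictured as estimating one mixture weight per corpus so that the interpolated unigram distribution maximizes the likelihood of the held-out data. The sketch below uses expectation-maximization, a standard way to fit such weights. It is an assumption that this matches the procedure in [1], and it does not reproduce the word-level variant used by the script, whose details are not given here. All function and parameter names are hypothetical.

```python
def em_mixture_weights(corpus_probs, heldout_counts, iterations=50):
    """Estimate one weight per corpus that maximizes held-out unigram likelihood.

    corpus_probs:   list of dicts, word -> unigram probability in that corpus
    heldout_counts: dict, word -> count in the held-out data
    """
    k = len(corpus_probs)
    weights = [1.0 / k] * k  # start from uniform weights
    for _ in range(iterations):
        expected = [0.0] * k
        total = 0.0
        for word, count in heldout_counts.items():
            parts = [weights[i] * corpus_probs[i].get(word, 0.0) for i in range(k)]
            mix = sum(parts)
            if mix == 0.0:
                continue  # word unseen in every corpus: it cannot inform the weights
            # E-step: share this word's count among corpora by posterior responsibility.
            for i in range(k):
                expected[i] += count * parts[i] / mix
            total += count
        if total == 0.0:
            break  # no held-out word is covered by any corpus
        # M-step: renormalize the expected counts into new weights.
        weights = [e / total for e in expected]
    return weights

def word_weights(corpus_probs, weights):
    """Combine per-corpus probabilities into a single weight per word."""
    combined = {}
    for lam, probs in zip(weights, corpus_probs):
        for word, p in probs.items():
            combined[word] = combined.get(word, 0.0) + lam * p
    return combined
```

Given the fitted weights, word_weights returns one interpolated value per word in the union of the corpus vocabularies, which plays the same role as the weights that select-vocab prints.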

SEE ALSO

ngram-count(1), ngram-format(5), training-scripts(1).
[1] A. Venkataraman and W. Wang, "Techniques for effective vocabulary selection", in Proceedings of Eurospeech, Geneva, 2003.

BUGS

Probably.

SOURCE

Download as part of the SRILM toolkit, or stand-alone from http://www.speech.sri.com/people/anand/downloads/selvoc-v1.tar.gz

AUTHORS

Anand Venkataraman <anand@speech.sri.com>
Wen Wang <wwang@speech.sri.com>
Copyright 2003 SRI International