[SRILM User List] vocab size from make-batch-counts

Andreas Stolcke stolcke at speech.sri.com
Mon Sep 21 01:07:36 PDT 2009


Ilana Heintz wrote:
> Hello,
>
> I am wondering about what type of ngram pruning is done in the 
> make-batch-counts training script, and if it can be handled with 
> flags. I've looked through the code and man pages but I'm not sure 
> whether I can pass the right argument.  I discovered that the pruning 
> happens because, when I vary the batch size, the resulting vocabulary 
> size changes.  For instance, on a small development corpus:
>
>> make-batch-counts files.list 10 xmlfilter.sh counts_10perbatch
>> merge-batch-counts counts_10perbatch
>> ngram-count -read counts_10perbatch/files.list-1.ngrams.gz -write-vocab
This does not do what you intend.  make-batch-counts passes the
-write-vocab option to ngram-count, but each ngram-count invocation
dumps only the vocabulary of the batch it sees, which is why the
vocabulary size you observe depends on the batch size.

To get the combined vocab of your data, run

ngram-count -order 1 -read COUNTS -write-vocab VOCAB

on the final count file.
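
As a rough illustration of the difference (a toy Python sketch, not
SRILM code): count_unigrams and merge_counts below, and the two small
batches, are hypothetical stand-ins for a per-batch ngram-count run and
for merge-batch-counts.  The vocabulary seen by any single batch can be
smaller than the vocabulary recovered from the merged counts.

```python
from collections import Counter


def count_unigrams(sentences):
    """Toy stand-in for one per-batch counting run: word counts for one batch."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts


def merge_counts(batches):
    """Toy stand-in for merging batch counts: sum the counts across batches."""
    total = Counter()
    for counts in batches:
        total.update(counts)
    return total


# Hypothetical two-batch corpus (made up for this example).
batch_a = ["the cat sat", "the dog sat"]
batch_b = ["a bird sang", "the bird sat"]

counts_a = count_unigrams(batch_a)
counts_b = count_unigrams(batch_b)
merged = merge_counts([counts_a, counts_b])

vocab_a = set(counts_a)    # what a vocabulary dump of batch A alone would hold
vocab_b = set(counts_b)    # what a vocabulary dump of batch B alone would hold
vocab_all = set(merged)    # vocabulary read back from the merged counts

print(len(vocab_a), len(vocab_b), len(vocab_all))
```

Here the two batches see 4 and 5 distinct words, while the merged counts
contain 7, so only the merged counts give the vocabulary of the whole
data set.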

Andreas

> 10perbatch.vocab
>> wc 10perbatch.vocab
>   2763  2763 32999 10perbatch.vocab
>
>> make-batch-counts files.list 1 xmlfilter.sh counts_1perbatch
>> merge-batch-counts counts_1perbatch
>> ngram-count -read counts_1perbatch/merge-iter2-1.ngrams.gz -write-vocab 
> 1perbatch.vocab
>> wc 1perbatch.vocab
>   5923  5923 72237 1perbatch.vocab
>
> Same sort of result when I use a larger corpus or other batch sizes; 
> the vocab decreases with an increase in the size of the batch.  I have 
> tried experimenting with -gtmin to change the output, without 
> success.  I'm confused as to why batch size would make a difference here.
>
> I am using version 1.5.5.
>
> Thanks,
> Ilana
>
>
> Ilana Heintz
> Department of Linguistics
> Ohio State University
> http://www.ling.ohio-state.edu/~bromberg
>
