ngram-merge

Andreas Stolcke stolcke at speech.sri.com
Tue Mar 11 11:39:01 PST 2003


> 
> Hi.
> 
> I have problems with ngram-merge when I want to merge 2 huge sorted
> 6-gram count files (the first is about 2G and contains 61M counts, the
> second is 700M and contains 21M counts).
> At some point ngram-merge gets stuck. The output file stops growing, but
> ngram-merge is still doing something. When I look at the info of the
> output file, I see that the time of the last modification keeps changing,
> and there is still space on the disk.
> When I split both input files at the critical 6-gram and merge the top
> parts and the bottom parts of both files separately, it works well, but
> I don't think this is how it should work. I would have to do the merging
> many times :-(

Some systems have problems with files exceeding 2GB in size.  (This is because
on older systems file offsets are stored as signed 32-bit integers.)
There is nothing SRILM can do about it because files are accessed through
the stdio library that comes with the system.

However, it could help to compress or gzip the files, even if the
compressed file size stays above 2GB.  This is because the I/O routines then
effectively read from a pipe, and there should be no limit on the amount of
data read that way.
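
For example, assuming your sorted count files are named counts1 and counts2
(the names here are just placeholders) and your SRILM build can read
gzip-compressed files, the merge might look something like this:

    # compress the inputs so they are read through a gzip pipe
    gzip counts1 counts2
    # merge the compressed files; a .gz output name keeps the result compressed
    # (check the ngram-merge man page for the exact options in your version)
    ngram-merge -write merged-counts.gz counts1.gz counts2.gz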

> 
> One more question. If my count file contains 4-grams and 6-grams and I
> use the -recompute option in ngram-count, are the 5-grams in this case
> recomputed from the 6-grams and the 3-grams from the 4-grams?

No.  All the 1-gram through 5-gram counts will be recomputed from the
highest-order counts.  The exception is 4-grams that are not a prefix of
any of your 6-grams.
For example, if you have a 6-gram "a b c d e f", then the counts for

a
a b
a b c
a b c d
a b c d e

will all be recomputed.
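
To illustrate, a minimal command sketch (the file names mixed.counts and
recomputed.counts are just placeholders) would be:

    # read counts of mixed order (4-grams and 6-grams) and regenerate
    # the lower-order counts from the highest-order counts present
    ngram-count -order 6 -read mixed.counts -recompute -write recomputed.counts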

--Andreas 



