line length limit

Andreas Stolcke stolcke at speech.sri.com
Tue Nov 25 11:29:06 PST 2008


In message <16429778.81227576703241.JavaMail.alexyk at De-Divinatione.local> you wrote:
> 
> Andreas -- a couple questions... I now use sensor data which has no real
> "sentence" meaning, thus 2.5 million observations are all on the same line.
> And ngram-count complains that the line is too long, not surprisingly. Is
> there a way to break it into several lines but teach ngram-count to ignore
> sentence boundaries? In the worst case I can envision manipulating the
> margins by appending/prepending (n-1) stitching chunks, but managing it is
> a nightmare...

That's what the continuous-ngram-count filter is for.
Please see the training-scripts(1) man page.  For example, you could pass
continuous-ngram-count as a filter to make-batch-counts (the optional 3rd
argument).
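
To illustrate the idea behind such a filter, here is a minimal Python sketch
that counts N-grams across line breaks, so a long token stream can be split
into lines of any length without changing the counts.  This is not the SRILM
script itself: the function name continuous_ngram_counts is hypothetical, it
counts only N-grams of the full order, and the output format of the real
filter may differ.

```python
from collections import Counter, deque


def continuous_ngram_counts(lines, order=3):
    """Count N-grams of exactly `order` tokens, treating line breaks as
    ordinary whitespace rather than sentence boundaries.

    Hypothetical helper for illustration only; not part of SRILM.
    """
    counts = Counter()
    # Sliding window over the token stream; it persists across lines,
    # which is what lets N-grams span line breaks.
    window = deque(maxlen=order)
    for line in lines:
        for token in line.split():
            window.append(token)
            if len(window) == order:
                counts[tuple(window)] += 1
    return counts


# The same token stream, once on one line and once broken into chunks.
single_line = ["a b c d e f"]
chunked = ["a b c", "d e", "f"]
same = continuous_ngram_counts(single_line) == continuous_ngram_counts(chunked)
```

Because the window carries over from one line to the next, no manual
stitching of (n-1)-token margins is needed.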

> 
> Also, I'm now building a full KN model for about 2 billion Russian words.
> I see that in a week of running it the RAM usage gradually grew to about
> 32 GB, with my 16 GB real RAM. Is there any way to estimate how much longer
> I can justify using the box? :)

Yes, you run it on smaller amounts of data, measure time and memory, and
extrapolate.  However, it sounds like you don't have enough memory and should
consult the FAQ items on how to reduce memory requirements for ngram-count.
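
As a rough sketch of the extrapolation step, the Python below fits a straight
line to a few (data size, peak memory) measurements and evaluates it at the
full data size.  The measurement values are hypothetical placeholders, not
real results, and the linear growth model is only an assumption; actual
memory growth may well deviate from it, so treat the output as a ballpark
figure.  The same approach applies to running time.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate(xs, ys, target):
    """Predict y at `target` from the fitted line (assumes linear growth)."""
    slope, intercept = fit_line(xs, ys)
    return slope * target + intercept


# Hypothetical placeholder measurements from small trial runs:
# training-set size in millions of words, and peak memory in GB.
words_millions = [10, 20, 40]
peak_mem_gb = [0.5, 0.9, 1.7]

# Estimated peak memory for the full 2 billion (2000 million) words.
estimate_gb = extrapolate(words_millions, peak_mem_gb, 2000)
print(f"estimated peak memory: {estimate_gb:.1f} GB")
```

If the estimate comes out well above the physical RAM, the run will spend
most of its time paging and is unlikely to be worth continuing.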

Andreas 



