Thank you Andreas. This was very helpful. I will make use of the SRI mailing list from now on.<br><br>Ryan Roth<br>CCLS<br>Columbia University<br><br><div class="gmail_quote">On Tue, Aug 24, 2010 at 5:49 PM, Andreas Stolcke <span dir="ltr">&lt;<a href="mailto:stolcke@speech.sri.com">stolcke@speech.sri.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">Ryan,<br>

<br>

I suggest you use the -limit-vocab option with ngram, and write out your LM in binary format.<br>

Reading a binary LM with -limit-vocab is very efficient because it processes only the portions of the LM parameters that pertain to your test set vocabulary.<br>

You can generate the vocabulary used by your test data using<br>

<br>

ngram-count -text DATA -write-vocab VOCAB<br>

<br>

There is a tradeoff between processing small batches of data (hence small vocabularies, hence fast loading of the LM) and processing large batches (larger vocabularies, but loading the LM fewer times), so you might want to tune the batch size empirically for best overall throughput.<br>


<br>
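As a rough sketch of the workflow described above: the file names (LM.arpa, LM.bin, batch1.txt, batch2.txt) are placeholders, and the RUN and ORDER variables are assumptions made for illustration. The script only prints the commands unless RUN is set to empty. With -debug 2, ngram -ppl should report a conditional probability for each word of the batch.<br>

```shell
#!/bin/sh
# Sketch only: LM.arpa, LM.bin and the batch file names are hypothetical.
# RUN defaults to "echo" (dry run); invoke with RUN= to execute the commands
# for real on a machine where SRILM is installed.
RUN=${RUN-echo}
ORDER=${ORDER-3}

# One-time step: convert the text-format LM to SRILM's binary format.
convert_lm() {
    $RUN ngram -order "$ORDER" -lm "$1" -write-bin-lm "$2"
}

# Per-batch step: write out the batch vocabulary, then score the batch while
# reading only the LM parameters that pertain to that vocabulary.
score_batch() {
    $RUN ngram-count -text "$2" -write-vocab "$2.vocab"
    $RUN ngram -order "$ORDER" -lm "$1" -vocab "$2.vocab" -limit-vocab -debug 2 -ppl "$2"
}

convert_lm LM.arpa LM.bin
for batch in batch1.txt batch2.txt; do
    score_batch LM.bin "$batch"
done
```

Varying how many sentences go into each batch file is one way to tune the batch size empirically.<br>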

If LM load time is still a limiting factor with this approach, you should use an LM server (see the ngram -use-server option), which effectively means you load the LM into memory only once.<br>

<br>
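A minimal sketch of the server setup, assuming the server is started with ngram -server-port and runs on the same machine as the client. The port number, LM.bin, and batch1.txt are placeholders, and the RUN variable is again only an illustrative dry-run switch.<br>

```shell
#!/bin/sh
# Sketch only: LM.bin, the port number and batch1.txt are hypothetical.
# RUN defaults to "echo" (dry run); invoke with RUN= to execute for real.
RUN=${RUN-echo}
PORT=${PORT-2525}

# Server side: load the LM once and answer probability queries on PORT.
# When executed for real, this command keeps running until it is stopped.
start_server() {
    $RUN ngram -lm "$1" -server-port "$PORT"
}

# Client side: score one batch by querying the server instead of loading the LM.
score_with_server() {
    $RUN ngram -use-server "$PORT@localhost" -debug 2 -ppl "$1"
}

# Dry run: print both commands. For real use, call start_server in one
# terminal and score_with_server in another once the LM has loaded.
start_server LM.bin
score_with_server batch1.txt
```

Each client call then skips the LM load entirely, at the cost of a network round trip for its queries.<br>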

I suggest you join the srilm-user list and direct future questions there.<br><font color="#888888">

<br>

Andreas</font><div><div></div><div class="h5"><br>

<br>

Ryan Roth wrote:<br>

<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

Hello:<br>

<br>

My name is Ryan Roth and I work at Columbia University's Center for Computational Learning Systems. My research focus currently is on Arabic Natural Language Processing.<br>

<br>

I have a question about the SRILM toolkit that I hope you'll be able to help me with.<br>

<br>

My problem is the following. I have a large N-gram LM file (non-binary) that I built from a collection of about 200 million words. I want to be able to read a given input text file (containing one sentence per line), and for every N-gram that I find there, extract the probability for that N-gram from the LM file.<br>


<br>

Currently, I am solving this problem by reading the entire LM file into memory first, and then reading the N-grams from the input text file and referencing the memory structure to get the probability for that N-gram.  This works fine, but is very slow and memory intensive.  I can reduce the memory issues by reading the input text file into memory instead, and reading the LM file line-by-line, but this is somewhat less convenient due to the other processing I need to perform on the input file.<br>


<br>

I've looked through the SRILM toolkit, and another option would seem to be to filter the large LM file first using the "make-lm-subset" script and a counts file built from the input text file. I would then use the filtered output LM in place of the larger LM and proceed as before. This method would seem to avoid the large memory requirements. My initial tests, however, show that the filtering step is still a bit slower than I'd like.<br>


<br>

I was wondering if there is another, more time-efficient way of solving this particular problem (that is, extracting a specific subset of N-gram probabilities from a large LM file) using the other tools in the SRILM toolkit.  Is there some option combination for "ngram", for example, that would work? I don't currently see a direct solution.<br>


<br>

<br>

Thank you very much,<br>

<br>

Ryan Roth<br>

CCLS<br>

Columbia University<br>

<br>

</blockquote>

<br>

</div></div></blockquote></div><br>