<div dir="ltr">What Andreas suggests is probably the best approach. Depending on the exact application you have in mind, though, another option is to pre-process your input corpus and either delete all out-of-vocabulary words or replace them (or runs of them) with a special meta-word of your choice, e.g. @reject@. There may be an option in the ngram* tools to do this in-process; I must check the docs. Otherwise, a simple pre-processing filter in awk, perl or python should do the trick.<div>
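<div>A minimal sketch of such a filter in Python, assuming whitespace-tokenized text with one sentence per line. The function name <code>filter_oov</code>, its parameters, and the toy vocabulary are illustrative only and are not part of SRILM:</div>

```python
def filter_oov(lines, vocab, meta="@reject@", collapse_runs=True):
    """Yield each line with out-of-vocabulary (OOV) words replaced by `meta`.

    If `meta` is None, OOV words are deleted instead. With `collapse_runs`,
    a run of consecutive OOV words becomes a single `meta` token.
    Assumes `meta` itself does not occur as a regular word in the corpus.
    """
    for line in lines:
        out = []
        for word in line.split():
            if word in vocab:
                out.append(word)
            elif meta is None:
                continue  # delete the OOV word
            elif collapse_runs and out and out[-1] == meta:
                continue  # meta-word already emitted for this run
            else:
                out.append(meta)
        yield " ".join(out)


# Toy usage (hypothetical vocabulary and corpus)
vocab = {"the", "cat", "sat"}
corpus = ["the cat sat on a mat", "a dog sat"]
replaced = list(filter_oov(corpus, vocab))
deleted = list(filter_oov(corpus, vocab, meta=None))
print(replaced)
print(deleted)
```

<div>Collapsing runs keeps a long stretch of unknown words from dominating the counts of the meta-word; pass <code>collapse_runs=False</code> if each unknown token should be counted separately.</div>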


<div><br></div><div>&amp;</div></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Jun 17, 2013 at 10:16 AM, Andreas Stolcke <span dir="ltr">&lt;<a href="mailto:stolcke@icsi.berkeley.edu" target="_blank">stolcke@icsi.berkeley.edu</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On 6/17/2013 1:03 AM, Joris Pelemans wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hello,<br>

<br>

I am trying to build a unigram model with only the 400k most frequent words (this is essential) out of a training set of 4M tokens. The language model has to be open, i.e. include the &lt;unk&gt; tag, because I want to assign probabilities to unseen words. However, I don't want the probability of &lt;unk&gt; to be based on the part of the training data that falls outside the 400k-word vocabulary, because then &lt;unk&gt; would get far too much probability mass (there is a lot of data that I do not include in my LM). I simply want to ignore the other words and build an &lt;unk&gt; model based on the Good-Turing intuition of count-of-counts. However, since I limit the training data to the 400k words, it does not contain any words with a frequency of 1 (i.e. N_1 = 0).<br>


<br>

How should I go about building this language model?<br>

</blockquote>

<br></div>

To work around the problem of the missing N_1 when estimating the GT parameters, run ngram-count twice. First, run it without the vocabulary restriction and save the GT parameters to a file (with -gt1 FILE and no -lm option). Then run ngram-count again with the -vocab option, -lm, and -gt1 FILE; this reads the smoothing parameters from FILE. (The make-big-lm wrapper script automates this two-step process.)<br>
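<br>
To illustrate why the unrestricted pass matters, here is a minimal Python sketch of the count-of-counts statistics on a toy corpus. It is not SRILM's estimator, which may differ in details; the function names and the toy data are hypothetical. It only shows that N_1 disappears once rare words are excluded, so the Good-Turing adjusted count r* = (r + 1) N_{r+1} / N_r cannot be computed for r = 1 from the restricted data:<br>

```python
from collections import Counter


def count_of_counts(tokens, vocab=None):
    """Return {r: N_r}, the number of word types seen exactly r times.

    If `vocab` is given, only tokens in the vocabulary are counted.
    """
    counts = Counter(t for t in tokens if vocab is None or t in vocab)
    return Counter(counts.values())


def gt_adjusted_count(r, n):
    """Good-Turing adjusted count r* = (r + 1) * N_{r+1} / N_r."""
    if n.get(r, 0) == 0:
        raise ValueError(f"N_{r} is zero; cannot estimate the adjusted count")
    return (r + 1) * n.get(r + 1, 0) / n[r]


# Toy corpus: a x3, b x2, c x2, and three singletons d, e, f
tokens = "a a a b b c c d e f".split()

full = count_of_counts(tokens)                               # N_1 = 3
restricted = count_of_counts(tokens, vocab={"a", "b", "c"})  # N_1 = 0

print(dict(full), dict(restricted))
print(gt_adjusted_count(1, full))
```

Estimating the parameters on the full counts and then applying them to the restricted vocabulary mirrors the two-pass procedure described above.<br>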


<br>

I don't have a good solution for setting the &lt;unk&gt; unigram probability directly based on GT smoothing. I would recommend one of two practical solutions.<br>

1) Replace rare words in your training data with &lt;unk&gt; ahead of running ngram-count (this also gives you ngrams that predict unseen words).<br>

2) Interpolate your LM with an LM containing only &lt;unk&gt; and optimize the interpolation weight on a held-out set.<br>
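<br>
Option 2 can be sketched in Python under one simplifying assumption: the base LM is closed, i.e. it assigns zero probability to out-of-vocabulary words. The held-out log-likelihood is then n_in log(lam) + n_oov log(1 - lam) plus a constant, which is maximized at lam = n_in / (n_in + n_oov). The function names and toy data are hypothetical; if the base LM already reserves mass for the unknown word, the weight has to be found numerically instead (e.g. with EM):<br>

```python
def best_base_weight(heldout_tokens, vocab):
    """Return lam, the maximum-likelihood weight of the (closed) base LM.

    The unknown-word-only LM receives the remaining weight 1 - lam.
    """
    if not heldout_tokens:
        raise ValueError("held-out set is empty")
    n_in = sum(1 for t in heldout_tokens if t in vocab)
    return n_in / len(heldout_tokens)


def interpolated_prob(word, base_probs, lam):
    """Mixture probability; every OOV word maps to the unknown-word class."""
    if word in base_probs:
        return lam * base_probs[word]
    return 1.0 - lam


# Toy base unigram LM and held-out data (illustrative values)
base_probs = {"a": 0.5, "b": 0.3, "c": 0.2}
heldout = "a b a c z a b y".split()

lam = best_base_weight(heldout, base_probs)
print(lam, interpolated_prob("a", base_probs, lam), interpolated_prob("z", base_probs, lam))
```

In this simplified setting the unknown-word probability is just the out-of-vocabulary token rate on the held-out set, so it reflects how often unseen words actually occur rather than how much training data was discarded.<br>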

<br>

Of course, you can always edit the LM file to insert &lt;unk&gt; with whatever probability you want (and possibly use ngram -renorm to renormalize the model).<span class="HOEnZb"><font color="#888888"><br>

<br>

Andreas</font></span><div class="HOEnZb"><div class="h5"><br>

<br>

_______________________________________________<br>

SRILM-User site list<br>

<a href="mailto:SRILM-User@speech.sri.com" target="_blank">SRILM-User@speech.sri.com</a><br>

<a href="http://www.speech.sri.com/mailman/listinfo/srilm-user" target="_blank">http://www.speech.sri.com/mailman/listinfo/srilm-user</a><br>

</div></div></blockquote></div><br></div>