<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 1/6/2014 7:45 AM, DUGAST Loic wrote:<br>
</div>
<blockquote
cite="mid:2BFF85983CF61146A76CD006ABA3835BFE88F1@SSAMBX02.systranssa.lan"
type="cite">
<style id="owaParaStyle" type="text/css">P {margin-top:0;margin-bottom:0;}</style>
<div style="direction: ltr;font-family: Tahoma;color:
#000000;font-size: 10pt;">Hi<br>
<br>
In the FAQ
(<a class="moz-txt-link-freetext" href="http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html">http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html</a>)<br>
<br>
You advise to ...<br>
<dl>
<dt>c) </dt>
<dd>Lower the minimum counts for N-grams included in the LM,
i.e., the values of the options
<b>-gt2min</b>, <b>-gt3min</b>, <b>-gt4min</b>, etc. The
higher order N-grams typically get higher minimum counts.
</dd>
</dl>
<p><br>
</p>
<p>Do you not mean *raise* the minimum counts (...) instead?<br>
</p>
</div>
</blockquote>
<br>
You are correct. It should say "raise" the minimum counts. We'll fix
the documentation ASAP.<br>
<br>
<blockquote
cite="mid:2BFF85983CF61146A76CD006ABA3835BFE88F1@SSAMBX02.systranssa.lan"
type="cite">
<div style="direction: ltr;font-family: Tahoma;color:
#000000;font-size: 10pt;">
Also, I am not sure I understand why gt2min should be set higher
than gt1min, and so on.<br>
Higher-order ngrams are naturally less frequent. Therefore the
same cutoff value (gt2min equal to gt1min) will be harsher on
bigrams than on unigrams... Can you explain?<br>
</div>
</blockquote>
<br>
The minimum counts are a crude way to trade off performance for
space, and since there are a lot more long ngrams than short ngrams,
you get more space savings by cutting higher-order ngrams. It is
typically not worth it to eliminate unigrams and bigrams, but it is a
decent tradeoff to remove singleton trigrams and fourgrams. The
default values were chosen based on historical practice (I think they
might even have been inherited from the CMU LM toolkit).<br>
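<br>
To illustrate the point about space savings, here is a minimal Python
sketch. It is not SRILM code: the toy sentence and the function names
are made up for illustration, and a real corpus will give different
proportions. It counts the distinct ngrams of each order and how many
of them are singletons, i.e. what a minimum count of 2 would remove.<br>
<pre>
```python
from collections import Counter

def ngram_counts(tokens, order):
    """Count every ngram of the given order in a token list."""
    last_start = len(tokens) - order + 1
    return Counter(tuple(tokens[i:i + order]) for i in range(last_start))

def singleton_stats(tokens, order):
    """Return (distinct ngrams, ngrams seen exactly once) for one order."""
    counts = ngram_counts(tokens, order)
    singletons = sum(1 for c in counts.values() if c == 1)
    return len(counts), singletons

# Toy data, invented for this example only.
text = "the cat sat on the mat and the cat ate the fish on the mat"
tokens = text.split()

for order in (1, 2, 3):
    distinct, singletons = singleton_stats(tokens, order)
    print(f"{order}-grams: {distinct} distinct, {singletons} singletons")
```
</pre>
Even on this tiny input, the number of distinct ngrams and the share of
singletons both grow with the order, which is why a cutoff on trigrams
saves more space than the same cutoff on unigrams.<br>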
<br>
The better and more principled way to remove ngrams is entropy-based
pruning (the -prune option of ngram and ngram-count). So the best
strategy given limited memory is to make the gtmin values as low as you
can afford to fit into memory, then use -prune (you can do this in the
same invocation of ngram-count or make-big-lm).<br>
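<br>
As a rough sketch of what such a single invocation could look like,
the Python helper below only assembles an ngram-count command line and
prints it; nothing is executed. The helper name, the file names
(corpus.txt, pruned.lm), the cutoff of 1 for trigrams and the pruning
threshold of 1e-8 are all assumptions chosen for illustration, so tune
them to your own data and memory budget.<br>
<pre>
```python
import shlex

def ngram_count_command(text, lm, order=3, gtmin=None, prune=None):
    """Build an ngram-count argument list (hypothetical helper, runs nothing)."""
    args = ["ngram-count", "-order", str(order), "-text", text, "-lm", lm]
    # One -gtNmin option per ngram order that has an explicit minimum count.
    for n, cutoff in sorted((gtmin or {}).items()):
        args += [f"-gt{n}min", str(cutoff)]
    # Entropy-based pruning threshold, applied in the same invocation.
    if prune is not None:
        args += ["-prune", str(prune)]
    return args

# Illustrative values only: keep singleton trigrams, then prune.
cmd = ngram_count_command("corpus.txt", "pruned.lm",
                          order=3, gtmin={3: 1}, prune=1e-8)
print(shlex.join(cmd))
```
</pre>
The printed line can be pasted into a shell once the values have been
adjusted.<br>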
<br>
Andreas<br>
<br>
<br>
</body>
</html>