<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">On 1/6/2014 7:45 AM, DUGAST Loic wrote:<br>

    </div>

    <blockquote

cite="mid:2BFF85983CF61146A76CD006ABA3835BFE88F1@SSAMBX02.systranssa.lan"

      type="cite">

      <meta http-equiv="Content-Type" content="text/html;

        charset=ISO-8859-1">

      <style id="owaParaStyle" type="text/css">P {margin-top:0;margin-bottom:0;}</style>

      <div style="direction: ltr;font-family: Tahoma;color:

        #000000;font-size: 10pt;">Hi<br>

        <br>

        In the FAQ

        (<a class="moz-txt-link-freetext" href="http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html">http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html</a>)<br>

        <br>

        You advise to ...<br>

        <dl>

          <dt>c) </dt>

          <dd>Lower the minimum counts for N-grams included in the LM,

            i.e., the values of the options

            <b>-gt2min</b>, <b>-gt3min</b>, <b>-gt4min</b>, etc. The

            higher order N-grams typically get higher minimum counts.

          </dd>

        </dl>

        <p><br>

        </p>

        <p>Do you not mean: *raise* the minimum counts (...) instead?<br>

        </p>

      </div>

    </blockquote>

    <br>

    You are correct.   It should say to raise the minimum counts.  We'll fix

    the documentation ASAP.<br>

    <br>

    <blockquote

cite="mid:2BFF85983CF61146A76CD006ABA3835BFE88F1@SSAMBX02.systranssa.lan"

      type="cite">

      <div style="direction: ltr;font-family: Tahoma;color:

        #000000;font-size: 10pt;">

        <p>

        </p>

        Also, I am not sure I understand why gt2min should be set higher

        than gt1min, etc.<br>

        Higher-order ngrams are naturally less frequent. Therefore the

        same cutoff value (gt2min equal to gt1min) will be harsher on

        bigrams than on unigrams... Can you explain?<br>

      </div>

    </blockquote>

    <br>

    The minimum counts are a crude way to trade off performance for

    space, and since there are a lot more long ngrams than short ngrams

    you get more space savings with higher order ngrams.  It is

    typically not worth it to eliminate unigrams and bigrams, but a

    decent tradeoff to remove singleton trigrams and fourgrams.  The

    default values were chosen based on historical practice (I think they

    might have even been inherited from the CMU LM toolkit).<br>

    <br>
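    As a rough illustration of why cutoffs save more space at higher orders, here is a toy Python sketch. It is not SRILM code: the corpus and the helper names ngram_counts and surviving are invented for this example. It counts the distinct ngrams of each order and how many of them a minimum count of 2 would drop:<br>

```python
from collections import Counter


def ngram_counts(tokens, order):
    """Count every n-gram of the given order in a list of tokens."""
    return Counter(
        tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)
    )


def surviving(counts, min_count):
    """Number of distinct n-grams whose count is at least min_count."""
    return sum(1 for c in counts.values() if c >= min_count)


# Toy corpus, invented purely for illustration.
tokens = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat saw the dog . the dog saw the cat ."
).split()

for order in (1, 2, 3):
    counts = ngram_counts(tokens, order)
    kept_all = surviving(counts, 1)  # min count 1: keep every n-gram
    kept_cut = surviving(counts, 2)  # min count 2: drop singletons
    print(order, kept_all, kept_all - kept_cut)
```

    On this toy corpus, dropping singletons removes 2 of 9 distinct unigrams but 20 of 22 distinct trigrams. A corpus this small exaggerates the effect, so treat the numbers only as a picture of the trend.<br>

    <br>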

    The better and more principled way to remove ngrams is entropy-based

    pruning (the -prune option of ngram and ngram-count).   So the best strategy

    given limited memory is to make the gtmin values as low as you can

    afford to fit into memory, then use -prune (you can do this in the

    same invocation of ngram-count or make-big-lm).<br>

    <br>

    Andreas<br>

    <br>

    <br>

  </body>

</html>