On 8/8/2012 3:31 AM, Meng Chen wrote:
> Hi, the -prune-lowprobs option in ngram will "prune N-gram
> probabilities that are lower than the corresponding backed-off
> estimates". This option would be especially useful when the
> back-off weight (bow) value is positive. However, I want to ask
> whether I could simply replace the positive bow value with 0
> instead of using -prune-lowprobs. Is there any difference, or is
> simply replacing it incorrect?
It's not correct. If you modify the backoff weight you end up with
an LM that is no longer normalized (the word probabilities for a
given context no longer sum to 1).
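
To see why, here is a minimal sketch of a toy backoff bigram model
(made-up numbers, not SRILM code; probabilities are in linear space,
and "positive bow" refers to the log10 backoff weight stored in ARPA
files). The correctly computed bow makes the context's probabilities
sum to 1; overwriting a positive log10 bow with 0 forces bow = 1 and
breaks that.

from math import log10

# Toy unigram distribution P(w); sums to 1.
unigram = {"a": 0.5, "b": 0.3, "c": 0.2}

# Explicit bigram probabilities P(w | "a"), listed for some words only.
bigram_a = {"a": 0.2, "b": 0.1}

def backoff_weight(explicit, lower):
    # The bow that normalizes P(. | context):
    # (1 - explicit mass) / (1 - lower-order mass of the listed words)
    return (1.0 - sum(explicit.values())) / \
           (1.0 - sum(lower[w] for w in explicit))

def prob(w, explicit, bow):
    # Standard backoff: use the explicit prob if present, else back off.
    return explicit[w] if w in explicit else bow * unigram[w]

bow = backoff_weight(bigram_a, unigram)
print(log10(bow))                                    # ~0.54, a positive log10 bow
print(sum(prob(w, bigram_a, bow) for w in unigram))  # 1.0: normalized

# Replacing the positive log10 bow with 0 means bow = 10**0 = 1:
print(sum(prob(w, bigram_a, 1.0) for w in unigram))  # 0.5: no longer sums to 1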

> Another question: when training an LM, we can use the
> -text-has-weights option for a corpus with sentence frequencies.
> I want to ask what we should do with duplicated sentences in a
> large corpus. Should I delete the duplicated sentences, compute
> the sentence frequencies first and use the -text-has-weights
> option instead, or do nothing and just feed the whole corpus into
> training?
You can do either. Having a duplicated sentence

1.0 a b c
1.0 a b c

is equivalent to having the sentence once with the weights added:

2.0 a b c
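
A quick sketch of why the two are equivalent (a toy counter, not
SRILM's ngram-count; the sentence markers and the order are
assumptions for illustration): the weighted n-gram counts come out
identical either way, so the estimated LM is the same.

from collections import Counter

def ngram_counts(weighted_sentences, order=2):
    # Accumulate weighted n-gram counts up to the given order.
    counts = Counter()
    for weight, sentence in weighted_sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += weight
    return counts

duplicated = [(1.0, "a b c"), (1.0, "a b c")]  # sentence listed twice
weighted   = [(2.0, "a b c")]                  # once, with the weights added

assert ngram_counts(duplicated) == ngram_counts(weighted)
print("same counts either way")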
Andreas