<div class="moz-cite-prefix">On 9/2/2012 10:10 PM, hic et nunc
wrote:<br>
</div>
<blockquote cite="mid:BLU168-W2006515BC68A2F8601BB9FC9AB0@phx.gbl"
type="cite">
<style><!--
.hmmessage P
{
margin:0px;
padding:0px
}
body.hmmessage
{
font-size: 10pt;
font-family:Tahoma
}
--></style>
<div dir="ltr">
<!--StartFragment-->hello again. i have a new question about lm
ngram probs. <br>
as you know well, in lm file, the log probs are calculated like
this: log [(count[n-gram]*d/count[(n-1)-gram] -
count[(n-1)-gram_<unk>]] <br>
sometimes 1 is added to denominator, but sometimes not. what is
the reason of this? <!--EndFragment--><br>
</div>
</blockquote>

One is added to the denominator only as a last resort, when the
smoothing results in n-gram probabilities that sum to 1 and leave no
probability mass for backoff.  The following comment in NgramLM.cc
explains why:

<blockquote type="cite"> /*<br>
* This is a hack credited to Doug Paul (by Roni
Rosenfeld in<br>
* his CMU tools). It may happen that no probability
mass<br>
* is left after totalling all the explicit probs,
typically<br>
* because the discount coefficients were out of range
and<br>
* forced to 1.0. Unless we have seen all vocabulary
words in<br>
* this context, to arrive at some non-zero backoff
mass,<br>
* we try incrementing the denominator in the
estimator by 1.<br>
* Another hack: If the discounting method uses
interpolation<br>
* we first try disabling that because interpolation
removes<br>
* probability mass.<br>
*/<br>
</blockquote>
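
To make the control flow concrete, here is a minimal sketch of that
retry logic.  This is not the actual SRILM code: the function names,
the Counts container, and the single discount factor are simplified
placeholders, and the interpolation-disabling step is only noted in a
comment.

    // Simplified sketch of the retry logic described in the comment above.
    // NOT the actual SRILM implementation; data structures are placeholders.
    #include <cstddef>
    #include <map>
    #include <string>

    using Counts = std::map<std::string, double>;

    // Estimate explicit probabilities for one (n-1)-gram context and return
    // the total probability mass they use.  'extra' is the Doug Paul hack:
    // 0 normally, 1 when we need to free up some backoff mass.
    double estimateContext(const Counts &ngramCounts, double contextCount,
                           double discount, unsigned extra,
                           std::map<std::string, double> &probs)
    {
        double total = 0.0;
        for (const auto &entry : ngramCounts) {
            double p = (entry.second * discount) / (contextCount + extra);
            probs[entry.first] = p;
            total += p;
        }
        return total;
    }

    void estimateWithHack(const Counts &ngramCounts, double contextCount,
                          double discount, std::size_t vocabSize,
                          std::map<std::string, double> &probs)
    {
        double total =
            estimateContext(ngramCounts, contextCount, discount, 0, probs);

        // If the explicit probs already sum to 1 (e.g., because the discount
        // was forced to 1.0) and we have not seen the whole vocabulary in
        // this context, retry with the denominator incremented by 1 so that
        // some probability mass is left over for backoff.  (The real code
        // first tries disabling interpolation, if applicable.)
        if (total >= 1.0 && ngramCounts.size() < vocabSize) {
            probs.clear();
            estimateContext(ngramCounts, contextCount, discount, 1, probs);
        }
    }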

This happens occasionally with GT smoothing due to degenerate
count-of-counts statistics.
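
For a made-up illustration (not from any particular model): with basic
Good-Turing the discount for singletons is d_1 = 2*n_2 / (1*n_1), where
n_r is the number of n-grams seen exactly r times.  If the data happens
to give n_1 = 10 and n_2 = 8, then d_1 = 1.6 > 1, so the coefficient is
out of range and gets forced to 1.0, which is exactly the situation the
comment above describes.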

Andreas