<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    On 4/10/2012 12:21 AM, bulusheva wrote:

    <blockquote cite="mid:4F83DF60.2000601@speechpro.com" type="cite">


      Hi, I have two questions:<br>

      <br>

      1. If I generate the language model with Kneser-Ney smoothing (or

      Modified Kneser-Ney), why does the "-gtnmin" parameter apply to

      the already modified counts? <br>

      <blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex;

        border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

        <div>For example, if the 2-gram "markov model" occurs in the

          training data only in the context "hidden markov model" and

          gt2min = 2, then the modified count for "markov model" is

          n(* markov model) = 1 &lt; gt2min, so <br>

          prob("markov model") = bow("markov")*prob("model"), <br>

          instead of prob("markov model") = (n(* markov model) - D) /

          n(* markov *).<br>

        </div>

      </blockquote>

    </blockquote>

    That's how it is currently implemented.   It is debatable how the

    minimum count should be applied in the case of the lower-order

    distributions in KN models.<br>

    The way it currently works is natural from an implementation

    perspective,  because the lower-order counts are physically modified

    before applying the discounting (you can examine them by adding

    -write COUNTS).<br>
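    <br>
    As a rough illustration of that order of operations, here is a
    minimal Python sketch. It is not SRILM code: the function names are
    made up for this example, and sentence-boundary tokens are ignored.
    It replaces the raw lower-order counts by the number of distinct
    left contexts and only then applies the minimum count:<br>
<pre>
```python
from collections import defaultdict


def ngrams(tokens, n):
    """Yield every n-gram (as a tuple) in a list of tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def kn_lower_order_counts(sentences, n):
    """Return KN-style modified counts for n-grams of order n.

    The modified count of an n-gram is the number of distinct words
    that precede it in the (n+1)-grams of the training data.
    """
    left_contexts = defaultdict(set)
    for tokens in sentences:
        for gram in ngrams(tokens, n + 1):
            left_contexts[gram[1:]].add(gram[0])
    return {gram: len(ctx) for gram, ctx in left_contexts.items()}


def apply_min_count(counts, min_count):
    """Keep only n-grams whose (already modified) count reaches min_count."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy data (hypothetical): "markov model" occurs three times,
# but always after the same word, "hidden".
sentences = [
    "a hidden markov model".split(),
    "the hidden markov model".split(),
    "one hidden markov model".split(),
]

modified = kn_lower_order_counts(sentences, 2)
kept = apply_min_count(modified, 2)  # plays the role of gt2min = 2

print(modified[("markov", "model")])
print(("markov", "model") in kept)
```
</pre>
    In this toy data the raw count of "markov model" is 3, but its
    modified count is 1, so it falls below the cutoff of 2 and is
    dropped, while "hidden markov" (three distinct left contexts) is
    kept.<br>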

    <br>

    But you are raising a good point.  It might make more sense to have

    the -gtXmin values be interpreted independently of the discounting

    method.<br>

    <br>

    <blockquote cite="mid:4F83DF60.2000601@speechpro.com" type="cite">

      <blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex;

        border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

        <div> <br>

          2. Let's say I use ngram-count to generate the language model as

          follows: <br>

          ngram-count -text text.txt -vocab vocab.txt -gt1min 5 -lm

          sri.lm<br>

          Suppose the word "hello" exists in "vocab.txt" and occurs 4 times

          in "text.txt". Then the probability of "hello" is calculated as the

          probability of a zeroton. Is that correct?<br>

        </div>

      </blockquote>

    </blockquote>

    That is correct, but the ARPA format doesn't allow you to prune

    unigrams, so the unigrams will always appear explicitly listed in

    the LM, even if their probabilities might be obtained by backing off

    to a uniform distribution.<br>
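    <br>
    To make the effect concrete, here is a small Python sketch of the
    idea, not of the actual SRILM code. The function name, the fixed
    absolute discount of 0.5 (a stand-in for the real discounting
    method), and the choice to keep the cut-off counts in the
    denominator are all assumptions made for illustration. Words below
    the minimum count get no estimate of their own and share the
    leftover mass evenly with the words that never occurred:<br>
<pre>
```python
def unigram_probs(counts, vocab, min_count, discount=0.5):
    """Toy unigram estimates with a minimum-count cutoff.

    Words whose count reaches min_count get a discounted estimate.
    All other vocabulary words split the leftover probability mass
    uniformly, whether their count is zero or merely below the cutoff.
    """
    # Assumption: counts below the cutoff still contribute to the total.
    total = sum(counts.get(w, 0) for w in vocab)
    probs = {}
    for w in vocab:
        c = counts.get(w, 0)
        if c >= min_count:
            probs[w] = (c - discount) / total

    # Everything without an explicit estimate is treated like a zeroton.
    unseen = [w for w in vocab if w not in probs]
    leftover = 1.0 - sum(probs.values())
    for w in unseen:
        probs[w] = leftover / len(unseen)
    return probs


# Hypothetical data: "hello" occurs 4 times, "rare" never occurs.
vocab = ["hello", "world", "rare"]
counts = {"hello": 4, "world": 16}

probs = unigram_probs(counts, vocab, min_count=5)  # like -gt1min 5
print(probs["hello"] == probs["rare"])
```
</pre>
    With a cutoff of 5, "hello" (4 occurrences) ends up with the same
    probability as "rare" (0 occurrences), yet both would still be
    listed as explicit unigram entries.<br>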

    <br>

    Andreas<br>

    <br>

  </body>

</html>