<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">Thanks for the answer, Andreas.<br>

      <br>

      According to the paper by<br>

      Chen and Goodman (1999), they used held-out data<br>

      to optimize the parameters of the language model. How do I<br>

      do this in SRILM? Does SRILM optimize the parameters<br>

      when I use -kndiscount? I tried using -kn to save the<br>

      parameters in a file and included this file<br>

      when building the LM, but my perplexity<br>

      got bigger.<br>

      <br>

      One more question:<br>

      do you have a link to a good tutorial on using<br>

      class-based models with SRILM?<br>

      <br>

      Ismail <br>

      <br>


      On 04/29/2014 06:20 AM, Andreas Stolcke wrote:<br>

    </div>

    <blockquote cite="mid:535EE23A.1080400@icsi.berkeley.edu"

      type="cite">


      <div class="moz-cite-prefix">On 4/28/2014 3:01 AM, Ismail Rusli

        wrote:<br>

      </div>

      <blockquote cite="mid:535E26EA.30804@gmail.com" type="cite">


        <font face="Liberation Mono">Dear all,<br>

          <br>

          I attempted to build an n-gram LM from Wikipedia text. I have<br>

          cleaned up all unwanted lines and have approximately 36M words.<br>

          I split the text 90:10. Then, from the 90% portion,<br>

          I created 4 joint training sets of increasing<br>

          size (the largest is about 1M sentences).<br>

          <br>

          The commands I used are the following:<br>

          <br>

          1. Count n-grams and write the vocabulary:<br>

          ngram-count -text 1M -order 3 -write count.1M -write-vocab

          vocab.1M -unk<br>

          <br>

          2. Build LM with ModKN:<br>

          ngram-count -vocab vocab.1M -read count.1M -order 3 -lm kn.lm

          -kndiscount<br>

        </font></blockquote>

      <br>

      <font face="Liberation Mono">There is no need to specify -vocab if

        you are getting it from the same training data as the counts.<br>

        The use of -vocab is to specify a vocabulary that differs from

        that of the training data.<br>

        In fact you can combine 1 and 2 into one equivalent

        command:<br>

        <br>

        ngram-count -text 1M -order 3 -unk -lm kn.lm -kndiscount<br>

        <br>

        Also, if you do use two steps, be sure to include the -unk

        option in the second step.<br>

        <br>

      </font>

      <blockquote cite="mid:535E26EA.30804@gmail.com" type="cite"><font

          face="Liberation Mono"> <br>

          3. Calculate perplexity:<br>

          ngram -ppl test -order 3 -lm kn.lm<br>

          <br>

          My questions are:<br>

          1. Did I do it right?<br>

        </font></blockquote>

      <font face="Liberation Mono">It looks like you did.<br>

        <br>

      </font>

      <blockquote cite="mid:535E26EA.30804@gmail.com" type="cite"><font

          face="Liberation Mono"> 2. Is there any optimization I can do

          when building the LM?<br>

        </font></blockquote>

      <font face="Liberation Mono">a. Try different -order values.<br>

        b. Try different smoothing methods.<br>

        c. Possibly try class-based models (interpolated with word-based ones).<br>

        d. If you want to increase the training data size significantly,

        check the methods for conserving memory on the FAQ page.<br>

      </font>
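      <font face="Liberation Mono"><br>
        Points a and b could be explored with a small script along these
        lines. This is only a sketch: it builds the ngram-count and ngram
        command lines for a few orders and smoothing options and prints
        them. The choice of orders, the particular smoothing flags
        (-kndiscount, -kndiscount with -interpolate, -wbdiscount) and the
        output file names are assumptions made for illustration; the file
        names 1M and test are taken from the commands quoted above.<br>
      </font>
      <pre>
```python
import itertools

# Smoothing options to compare. The selection is illustrative, not a
# recommendation; each value is a list of ngram-count flags.
SMOOTHING = {
    "modkn": ["-kndiscount"],
    "modkn-interp": ["-kndiscount", "-interpolate"],
    "wb": ["-wbdiscount"],
}


def build_commands(train, test, order, name, flags):
    """Return (training argv, evaluation argv) for one configuration."""
    lm = f"{name}.o{order}.lm"  # hypothetical output file name
    train_cmd = ["ngram-count", "-text", train, "-order", str(order),
                 "-unk", "-lm", lm] + flags
    # Mirrors the evaluation command used in this thread.
    eval_cmd = ["ngram", "-ppl", test, "-order", str(order), "-lm", lm]
    return train_cmd, eval_cmd


# One (train, eval) pair per combination of order and smoothing option.
configs = [
    build_commands("1M", "test", order, name, flags)
    for order, (name, flags) in itertools.product([2, 3, 4], SMOOTHING.items())
]

for train_cmd, eval_cmd in configs:
    print(" ".join(train_cmd))
    print(" ".join(eval_cmd))
```
      </pre>
      <font face="Liberation Mono">
        The printed lines can be pasted into a shell, or each list can be
        passed to subprocess.run(cmd, check=True) if SRILM is on the PATH.
        Compare the ppl values on the same test file to pick a
        configuration.<br>
        <br>
      </font>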

      <blockquote cite="mid:535E26EA.30804@gmail.com" type="cite"><font

          face="Liberation Mono"> 3. How do I calculate perplexity using log

          base 2 instead of log base 10?<br>

        </font></blockquote>

      <font face="Liberation Mono">Perplexity does not depend on the

        base of the logarithm: the base used for the log probabilities is

        the same number you exponentiate to get the ppl, so the two cancel.</font><br>
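      <font face="Liberation Mono"><br>
        A small numeric check may make this concrete. The sketch below
        uses made-up per-token probabilities (not output from any real
        model) and computes the perplexity once from base-10 logs and once
        from base-2 logs. It also shows that a base-10 log probability can
        be converted to base 2 by dividing by log10(2), in case the base-2
        figure itself is what is wanted.<br>
      </font>
      <pre>
```python
import math


def perplexity(total_logprob, num_tokens, base):
    """Perplexity from a summed log probability taken in the given base."""
    return base ** (-total_logprob / num_tokens)


# Hypothetical per-token probabilities, for illustration only.
probs = [0.1, 0.02, 0.5, 0.25]
n = len(probs)

# The same sequence probability, expressed in two different log bases.
logprob10 = sum(math.log10(p) for p in probs)
logprob2 = sum(math.log2(p) for p in probs)

ppl10 = perplexity(logprob10, n, 10)
ppl2 = perplexity(logprob2, n, 2)

# Converting a base-10 log probability to base 2.
logprob2_converted = logprob10 / math.log10(2)

print("ppl from log10:", ppl10)
print("ppl from log2: ", ppl2)
print("log2 prob from log10 prob:", logprob2_converted)
```
      </pre>
      <font face="Liberation Mono">
        Both perplexity values agree up to floating-point rounding; only
        the intermediate log probability changes with the base.<br>
      </font>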

      <br>

      Andreas<br>

      <br>

    </blockquote>

    <br>

  </body>

</html>