<html>

  <head>

    <meta http-equiv="content-type" content="text/html;

      charset=ISO-8859-1">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <font face="Liberation Mono">Dear all,<br>

      <br>

      I am trying to build an n-gram LM from Wikipedia text. I have<br>

      cleaned up all unwanted lines, which leaves approximately 36M words.<br>

      I split the text in a 90:10 proportion. Then, from the 90% part,<br>

      I split again into 4 joint training sets of increasing<br>

      size (the largest is about 1M sentences).<br>

      <br>
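      A minimal Python sketch of the split described above, assuming the<br>
      sentences are already loaded into a list, one sentence per item. The<br>
      name split_corpus and the example sizes are hypothetical, there is no<br>
      shuffling, and "joint" is taken to mean that each smaller set is a<br>
      prefix of the next larger one:<br>
<pre>
```python
def split_corpus(sentences, train_sizes, held_out_ratio=0.1):
    """Hold out the last 10% as test data, then cut nested training sets."""
    n_test = int(len(sentences) * held_out_ratio)
    cut = len(sentences) - n_test
    train, test = sentences[:cut], sentences[cut:]
    # Each training set is a prefix of the 90% part, so the sets overlap
    # ("joint") and grow in size. Sizes beyond len(train) are simply capped
    # by the slice.
    train_sets = [train[:size] for size in sorted(train_sizes)]
    return train_sets, test


if __name__ == "__main__":
    # Toy corpus; the sizes below are assumed for illustration only.
    toy = ["sentence %d" % i for i in range(100)]
    train_sets, test = split_corpus(toy, [10, 20, 40, 80])
    print([len(s) for s in train_sets], len(test))
```
</pre>
      Each training set would then be written to its own file and passed to<br>
      ngram-count with -text.<br>
      <br>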

      The commands I used are the following:<br>

      <br>

      1. Count n-grams and write the vocabulary:<br>

      ngram-count -text 1M -order 3 -write count.1M -write-vocab

      vocab.1M -unk<br>

      <br>

      2. Build the LM with modified Kneser-Ney (ModKN) discounting:<br>

      ngram-count -vocab vocab.1M -read count.1M -order 3 -lm kn.lm

      -kndiscount<br>

      <br>

      3. Calculate perplexity:<br>

      ngram -ppl test -order 3 -lm kn.lm<br>

      <br>

      My questions are:<br>

      1. Did I do it right?<br>

      2. Is there any optimization I can do when building the LM?<br>

      3. How can I calculate perplexity using log base 2 instead of log base 10?<br>

      <br>

      Thanks in advance.<br>

      <br>

      Ismail<br>

    </font>

  </body>

</html>