<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Thanks for the answer, Andreas.<br>
<br>
As I read the paper by<br>
Chen and Goodman (1999), they used held-out data<br>
to optimize the parameters of the language model. How do I<br>
do this in SRILM? Does SRILM optimize parameters<br>
when I use -kndiscount? I tried -kn to save the <br>
parameters to a file and included this file <br>
when building the LM, but it turned out<br>
my perplexity got larger.<br>
<br>
And just one more question:<br>
do you have a link to a good tutorial on using<br>
class-based models with SRILM?<br>
<br>
Ismail <br>
<br>
On 04/29/2014 06:20 AM, Andreas Stolcke wrote:<br>
</div>
<blockquote cite="mid:535EE23A.1080400@icsi.berkeley.edu"
type="cite">
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
<div class="moz-cite-prefix">On 4/28/2014 3:01 AM, Ismail Rusli
wrote:<br>
</div>
<blockquote cite="mid:535E26EA.30804@gmail.com" type="cite">
<meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1">
<font face="Liberation Mono">Dear all,<br>
<br>
I attempted to build an n-gram LM from Wikipedia text. I have<br>
cleaned up all unwanted lines and have approximately 36M words.<br>
I split the text into 90:10 proportions. Then, from the 90,<br>
I split again into 4 joint training sets of increasing<br>
size (the largest is about 1M sentences).<br>
<br>
The commands I used are the following:<br>
<br>
1. Count n-gram and vocabulary:<br>
ngram-count -text 1M -order 3 -write count.1M -write-vocab
vocab.1M -unk<br>
<br>
2. Build LM with ModKN:<br>
ngram-count -vocab vocab.1M -read count.1M -order 3 -lm kn.lm
-kndiscount<br>
</font></blockquote>
<br>
<font face="Liberation Mono">There is no need to specify -vocab if
you are getting it from the same training data as the counts.<br>
The use of -vocab is to specify a vocabulary that differs from
that of the training data.<br>
In fact you can combine 1 and 2 in one command that is
equivalent:<br>
<br>
ngram-count -text 1M -order 3 -unk -lm kn.lm -kndiscount<br>
<br>
Also, if you do use two steps, be sure to include the -unk
option in the second step.<br>
<br>
</font>
<blockquote cite="mid:535E26EA.30804@gmail.com" type="cite"><font
face="Liberation Mono"> <br>
3. Calculate perplexity:<br>
ngram -ppl test -order 3 -lm kn.lm<br>
<br>
My questions are:<br>
1. Did i do it right?<br>
</font></blockquote>
<font face="Liberation Mono">It looks like you did.<br>
<br>
</font>
<blockquote cite="mid:535E26EA.30804@gmail.com" type="cite"><font
face="Liberation Mono"> 2. Is there any optimization i can do
in building LM?<br>
</font></blockquote>
<font face="Liberation Mono">a. Try different -order values<br>
b. Different smoothing methods.<br>
c. Possibly class-based models (interpolated with word-based)<br>
d. If you want to increase the training data size significantly,
check the methods for conserving memory on the FAQ page.<br>
</font>
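<font face="Liberation Mono">Suggestions (a) and (b) can be tried with small variations of the
same command line from earlier in this thread. A hedged sketch (the file names
<tt>1M</tt> and <tt>test</tt> and the output LM names are assumed from the commands above,
not SRILM conventions):<br>
</font>
<pre>
```shell
# a. Try different -order values (bigram through 4-gram) and
#    compare perplexity on the same held-out test set.
for n in 2 3 4; do
    ngram-count -text 1M -order $n -unk -lm kn.$n.lm -kndiscount
    ngram -ppl test -order $n -lm kn.$n.lm
done

# b. Try a different smoothing method, e.g. Witten-Bell
#    (-wbdiscount) instead of modified Kneser-Ney (-kndiscount).
ngram-count -text 1M -order 3 -unk -lm wb.lm -wbdiscount
ngram -ppl test -order 3 -lm wb.lm
```
</pre>
<font face="Liberation Mono">Pick whichever order/smoothing combination gives the lowest
perplexity on held-out data, not on the training text itself.<br>
<br>
</font>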
<blockquote cite="mid:535E26EA.30804@gmail.com" type="cite"><font
face="Liberation Mono"> 3. How to calculate perplexity in log
2-based instead of log 10?<br>
</font></blockquote>
<font face="Liberation Mono">Perplexity is not dependent on the
base of the logarithm (the log base is matched by the number you
exponentiate to get the ppl).</font><br>
<br>
Andreas<br>
<br>
</blockquote>
<br>
</body>
</html>