<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 10/2/2013 1:16 AM, E wrote:<br>
</div>
<blockquote
cite="mid:8D08D5EC52F045F-1094-3E82E@webmail-d268.sysops.aol.com"
type="cite"><font color="black" face="arial" size="2"><font
style="color: black;" face="arial" size="2">Thanks for the
pointers! Three questions - </font>
<div style="color: black; font-family: arial; font-size: 10pt;"><br>
</div>
<div style="color: black; font-family: arial; font-size: 10pt;">1.
The same number of bins is used for all n-grams, even though the
number of n-grams differs for each N. In Web 1T, </div>
<div style="color: black; font-family: arial; font-size: 10pt;">
<pre style="word-wrap: break-word; white-space: pre-wrap;">Number of unigrams: 13,588,391
Number of fivegrams: 1,176,470,663</pre>
</div>
<div>
<div id="AOLMsgPart_1_f25a2f23-e6ce-45eb-ac97-16c58b5fe90e">
<div class="aolReplacedBody" bgcolor="#FFFFFF"
text="#000000" style="color: black; font-family: arial,
helvetica; font-size: 10pt;"><font size="2"><font
face="arial">Would it improve anything if
fivegrams were binned into more bins than
unigrams?</font></font></div>
</div>
</div>
</font></blockquote>
<font size="2"><font face="arial">That's a good idea, but I haven't
tried it, so I cannot say how much it would help.<br>
It might also help to just have more bins for lower-order n-grams,
since there are more samples of them (more data, hence more
parameters can be estimated).<br>
<br>
</font></font>
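To make the binning discussion concrete, here is a minimal sketch of mapping n-gram training counts to weight bins, assuming log-spaced bin boundaries and a per-order bin count (both are illustrative choices, not necessarily how SRILM's countlm bins):

```python
import math

def count_to_bin(count, num_bins):
    # Map a training count to a bin index using log2-spaced
    # boundaries: bin 0 for unseen, then counts 1, 2-3, 4-7, ...
    # (an illustrative scheme, not necessarily SRILM's).
    if count <= 0:
        return 0
    return min(int(math.log2(count)) + 1, num_bins - 1)

# Hypothetical per-order bin counts: more bins for lower-order
# n-grams, which have more samples per type, as suggested above.
bins_per_order = {1: 8, 2: 7, 3: 6, 4: 5, 5: 4}

print(count_to_bin(0, 8))      # unseen -> bin 0
print(count_to_bin(5, 8))      # count 5 -> bin 3
print(count_to_bin(10**6, 4))  # very large count capped at last bin
```

With this kind of scheme, giving fivegrams more bins just means a finer partition of their count range.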
<blockquote
cite="mid:8D08D5EC52F045F-1094-3E82E@webmail-d268.sysops.aol.com"
type="cite"><font color="black" face="arial" size="2">
<div>
<div id="AOLMsgPart_1_f25a2f23-e6ce-45eb-ac97-16c58b5fe90e">
<div class="aolReplacedBody" bgcolor="#FFFFFF"
text="#000000" style="color: black; font-family: arial,
helvetica; font-size: 10pt;"><br>
</div>
<div class="aolReplacedBody" bgcolor="#FFFFFF"
text="#000000" style="color: black;"><font face="arial"
size="2">2. For a particular n-gram in the test data, the
algorithm decides which bin's Wij weights to use based on
how many times that n-gram occurred in the training data.
Is this right?</font></div>
</div>
</div>
</font></blockquote>
<font size="2"><font face="arial">Right.<br>
<br>
</font></font>
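As a sketch of that lookup: at test time, the n-gram's training count selects a bin, and that bin's weight mixes the higher-order estimate with the lower-order one (the names and the linear-interpolation form here are assumptions for illustration, not SRILM's API):

```python
import math

def count_to_bin(count, num_bins):
    # Illustrative log2-spaced binning of training counts.
    if count <= 0:
        return 0
    return min(int(math.log2(count)) + 1, num_bins - 1)

def interpolated_prob(train_count, ml_prob, backoff_prob, bin_weights):
    # Pick the weight for this n-gram's count bin and linearly
    # interpolate the maximum-likelihood estimate with the
    # lower-order (backoff) probability.
    w = bin_weights[count_to_bin(train_count, len(bin_weights))]
    return w * ml_prob + (1.0 - w) * backoff_prob

# A frequent n-gram falls in a high-count bin, so its ML
# estimate gets more trust than the backoff estimate.
weights = [0.0, 0.2, 0.5, 0.9]   # one weight per bin
print(interpolated_prob(100, 0.4, 0.1, weights))  # bin 3: 0.9*0.4 + 0.1*0.1
```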
<blockquote
cite="mid:8D08D5EC52F045F-1094-3E82E@webmail-d268.sysops.aol.com"
type="cite"><font color="black" face="arial" size="2">
<div>
<div id="AOLMsgPart_1_f25a2f23-e6ce-45eb-ac97-16c58b5fe90e">
<div class="aolReplacedBody" bgcolor="#FFFFFF"
text="#000000" style="color: black;"><font face="arial"
size="2"><br>
</font></div>
<div class="aolReplacedBody" bgcolor="#FFFFFF"
text="#000000" style="color: black;"><font face="arial"
size="2">3. What does it mean when some weights are zero
after tuning? I used just 10 sentences (5
repeated) in tune.txt and got the google.countlm shown at
the bottom.</font></div>
</div>
</div>
</font></blockquote>
<blockquote
cite="mid:8D08D5EC52F045F-1094-3E82E@webmail-d268.sysops.aol.com"
type="cite"><font color="black" face="arial" size="2">
<div>
<div id="AOLMsgPart_1_f25a2f23-e6ce-45eb-ac97-16c58b5fe90e">
<div class="aolReplacedBody" bgcolor="#FFFFFF"
text="#000000" style="color: black;"><font face="arial"
size="2"><br>
</font></div>
<div class="aolReplacedBody" bgcolor="#FFFFFF"
text="#000000" style="color: black;"><font face="arial"
size="2">For example, w01 and w02 are non-zero but w03 is zero.
Does this mean that in the development set there were
no trigrams whose counts fell in bin 0?</font></div>
</div>
</div>
</font></blockquote>
<br>
<font size="2"><font face="arial">Correct.<br>
<br>
Andreas<br>
</font></font><br>
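One way to see why an unvisited bin's weight can stay at zero: tuning only updates the weights of bins that some development-set n-gram actually lands in. A toy tally (illustrative only, not SRILM's EM procedure):

```python
import math

def count_to_bin(count, num_bins):
    # Same illustrative log2-spaced binning as in the sketches above.
    if count <= 0:
        return 0
    return min(int(math.log2(count)) + 1, num_bins - 1)

def visited_bins(dev_counts, num_bins):
    # Tally which bins the development-set n-grams land in.
    # Bins with no samples receive no updates during tuning,
    # so a weight initialized to zero stays zero there.
    hits = [0] * num_bins
    for c in dev_counts:
        hits[count_to_bin(c, num_bins)] += 1
    return hits

# With only a handful of tuning sentences, some bins go unvisited.
print(visited_bins([0, 0, 3, 120], 4))  # bin 1 never visited
```

A tune.txt of only 10 sentences (5 of them repeated) covers very few distinct count bins, which is consistent with several weights remaining zero.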
</body>
</html>