<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">This happened because the binary LM
file contains a record of the full vocabulary at the time the LM
was created, not just the words that appear as unigrams (as in the
ARPA format). You must have done ngram -renorm or something
similar later, which causes unigrams to be created for all words
in the vocabulary.<br>
<br>
Attached is a patch that prevents the _meta_ tokens from being
included in that vocabulary. Check that it fixes your problem.<br>
(You can also grab the beta version off the web site.)<br>
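<br>
If patching is not an option, the stray entries can also be filtered out of the ARPA file after conversion. What follows is a minimal Python sketch and not part of SRILM. The function name strip_meta_ngrams and the token pattern are assumptions made for illustration: it treats any word of the form _meta_N or __meta__N (in either case) as a meta token, drops every n-gram that contains one, and adjusts the counts in the \data\ header. It does not renormalize the remaining probabilities.<br>
<pre>
```python
import re

# Assumption: meta tokens look like "_meta_1" or "__meta__1", in either case.
META_TOKEN = re.compile(r"^_+meta_+\d+$", re.IGNORECASE)


def strip_meta_ngrams(arpa_text):
    """Return ARPA text with every n-gram containing a meta token removed."""
    kept = []      # output lines; header counts are patched afterwards
    removed = {}   # maps n-gram order to the number of dropped entries
    order = 0      # 0 while outside any "\N-grams:" section
    for line in arpa_text.splitlines():
        section = re.match(r"^\\(\d+)-grams:\s*$", line)
        if section:
            order = int(section.group(1))
        elif line.startswith("\\end\\"):
            order = 0
        elif order and line.strip():
            fields = line.split()
            # fields[0] is the log probability; the next `order` fields are words
            words = fields[1:1 + order]
            if any(META_TOKEN.match(w) for w in words):
                removed[order] = removed.get(order, 0) + 1
                continue
        kept.append(line)

    def fix_count(match):
        # Subtract the dropped entries from the "ngram N=count" header line
        n = int(match.group(1))
        return "ngram %d=%d" % (n, int(match.group(2)) - removed.get(n, 0))

    header = re.compile(r"^ngram (\d+)=(\d+)\s*$")
    return "\n".join(header.sub(fix_count, l) for l in kept) + "\n"
```
</pre>
To use it, read the uncompressed ARPA file written by ngram -write-lm into a string, pass it to strip_meta_ngrams, and write the returned text to a new file.<br>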
<br>
Andreas<br>
<br>
<br>
On 12/2/2012 8:06 PM, Meng Chen wrote:<br>
</div>
<blockquote
cite="mid:CA+bc0mpEE+fBwy7_QuAgPHnfb33iTpne=fOwAgM9cDv=_OLkAA@mail.gmail.com"
type="cite">I have checked the make-big-lm shell script and found
that the "_meta_" tag is supposed to be lowercase.
<div>Line 56 of the make-big-lm script says:<br>
<div>metatag=__meta__ #lowercase so it works with ngram-count
-tolower</div>
</div>
<div><br>
</div>
<div>In fact, when I used make-big-lm to train the LM without
write-binary-lm, there was no "__meta__1" in the final ARPA LM. So I
guess it is possibly related to the binary format.</div>
<div class="gmail_extra">
<br>
<br>
<div class="gmail_quote">2012/12/2 Andreas Stolcke <span
dir="ltr">&lt;<a moz-do-not-send="true"
href="mailto:stolcke@icsi.berkeley.edu" target="_blank">stolcke@icsi.berkeley.edu</a>&gt;</span><br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="HOEnZb">
<div class="h5">On 12/1/2012 7:37 AM, Meng CHEN wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi, I trained LMs with the write-binary-lm option.
However, when I converted the binary-format LM into
ARPA format, I found 4 extra 1-grams in the
ARPA LM, as follows:<br>
-8.988857 _meta_1<br>
-8.988857 _meta_2<br>
-9.201852 _meta_3<br>
-9.201852 _meta_4<br>
In fact, these four words do not exist in my vocab.
So where do they come from? What should I do to
remove them?<br>
Thanks!<br>
</blockquote>
<br>
</div>
</div>
Counts for _META_1 etc. (note the uppercase) are used by
ngram-count to keep track of counts-of-counts required for
smoothing. They should never appear in the LM.<br>
<br>
I suspect you lowercased the strings in the counts file
somewhere in your processing, causing these special tokens
to no longer be recognized.<span class="HOEnZb"><font
color="#888888"><br>
<br>
Andreas<br>
<br>
</font></span></blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</body>
</html>