<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">This happened because the binary LM

      file contains a record of the full vocabulary at the time the LM

      was created, not just the words that appear as unigrams (as in the

      ARPA format). You must have run ngram -renorm or something

      similar afterwards, which causes unigrams to be created for all words

      in the vocabulary.<br>

      <br>

      Attached is a patch that prevents the _meta_  tokens from being

      included in that vocabulary.  Check that it fixes your problem.<br>

      (You can also grab the beta version off the web site.)<br>
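      <br>
      One way to run that check is sketched below in Python. It assumes the LM has been converted back to ARPA text and that the stray tokens look like a tag followed by digits, as in the entries you posted. The function name and the default tag are illustrative only and are not part of SRILM.<br>

```python
import re


def find_meta_unigrams(arpa_lines, metatag="_meta_"):
    """Return unigram words in an ARPA LM that look like metatag + digits.

    The default metatag is an assumption based on the tokens quoted in this
    thread; pass whatever tag your make-big-lm script actually uses.
    """
    pattern = re.compile(re.escape(metatag) + r"\d+", re.IGNORECASE)
    in_unigrams = False
    hits = []
    for line in arpa_lines:
        line = line.strip()
        if line.startswith("\\"):
            # Section headers such as \data\, \1-grams:, \2-grams:, \end\
            in_unigrams = (line == "\\1-grams:")
            continue
        if not in_unigrams or not line:
            continue
        fields = line.split()
        # Unigram entry layout: logprob word [backoff-weight]
        if len(fields) >= 2 and pattern.fullmatch(fields[1]):
            hits.append(fields[1])
    return hits


# Tiny made-up ARPA fragment, for illustration only.
sample = """\\data\\
ngram 1=3

\\1-grams:
-8.988857 _meta_1
-1.2 hello -0.3
-9.201852 _meta_3

\\end\\
""".splitlines()

print(find_meta_unigrams(sample))
```

      An empty list for the re-converted LM would suggest the patch took effect.<br>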

      <br>

      Andreas<br>

      <br>

      <br>

      On 12/2/2012 8:06 PM, Meng Chen wrote:<br>

    </div>

    <blockquote

cite="mid:CA+bc0mpEE+fBwy7_QuAgPHnfb33iTpne=fOwAgM9cDv=_OLkAA@mail.gmail.com"

      type="cite">I have checked the make-big-lm shell script and found

      that the "_meta_" tag is meant to be lowercase.

      <div>Line 56 of the make-big-lm script says:<br>

        <div>metatag=__meta__   #lowercase so it works with ngram-count

          -tolower</div>

      </div>

      <div><br>

      </div>

      <div>In fact, when I use make-big-lm to train an LM without

        write-binary-lm, there is no "__meta__1" in the final ARPA LM. So I

        guess the problem is related to the binary format.</div>

      <div class="gmail_extra">

        <br>

        <br>

        <div class="gmail_quote">2012/12/2 Andreas Stolcke <span

            dir="ltr">&lt;<a moz-do-not-send="true"

              href="mailto:stolcke@icsi.berkeley.edu" target="_blank">stolcke@icsi.berkeley.edu</a>&gt;</span><br>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">

            <div class="HOEnZb">

              <div class="h5">On 12/1/2012 7:37 AM, Meng CHEN wrote:<br>

                <blockquote class="gmail_quote" style="margin:0 0 0

                  .8ex;border-left:1px #ccc solid;padding-left:1ex">

                  Hi, I trained LMs with the write-binary-lm option.

                  However, when I converted the binary-format LM into

                  ARPA format, I found 4 extra 1-grams in the

                  ARPA LM, as follows:<br>

                  -8.988857 _meta_1<br>

                  -8.988857 _meta_2<br>

                  -9.201852 _meta_3<br>

                  -9.201852 _meta_4<br>

                  In fact, these four words do not exist in my vocab.

                  So where do they come from? What should I do to

                  remove them?<br>

                  Thanks!<br>

                </blockquote>

                <br>

              </div>

            </div>

            Counts for _META_1 etc. (note the uppercase) are used by

            ngram-count to keep track of counts-of-counts required for

            smoothing.   They should never appear in the LM.<br>

            <br>
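            To illustrate what "counts-of-counts" means here: for each r, it is the number of distinct N-grams that occur exactly r times. The Python sketch below computes it for a made-up count table. It is not how ngram-count implements this, the function name is hypothetical, and reading the four tokens in the question as r = 1 to 4 is an assumption.<br>

```python
from collections import Counter


def counts_of_counts(ngram_counts, max_r=4):
    """Map r -> number of distinct N-grams seen exactly r times, r = 1..max_r.

    max_r=4 is an assumption chosen to mirror the four tokens in the question.
    """
    freq = Counter(ngram_counts.values())
    return {r: freq.get(r, 0) for r in range(1, max_r + 1)}


# Made-up bigram counts, for illustration only.
toy = {
    ("a", "b"): 1,
    ("b", "c"): 1,
    ("c", "d"): 2,
    ("d", "e"): 4,
    ("e", "f"): 1,
}

print(counts_of_counts(toy))
```

            Smoothing methods use these numbers to estimate discounts, which is why they are bookkeeping entries rather than real words.<br>
            <br>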

            I suspect you lowercased the strings in the counts file

            somewhere in your processing, causing these special tokens

            to no longer be recognized.<span class="HOEnZb"><font

                color="#888888"><br>

                <br>

                Andreas<br>

                <br>

              </font></span></blockquote>

        </div>

        <br>

      </div>

    </blockquote>

    <br>

  </body>

</html>