<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">On 9/5/2012 1:05 PM, Anand Venkataraman

      wrote:<br>

    </div>

    <blockquote

cite="mid:CAF6FMTUyWr8DPvYBWu92Urqg1V8oF=LBxMFCkV+EghM_1k+hiA@mail.gmail.com"

      type="cite">I realized I was off the list and just rejoined

      (thanks Andreas).<br>

      <br>

      Meng - In response to your questions about select-vocab:<br>

      <ol>

        <li>Yes, you're right about the PPL. The program trains separate

          unigram LMs for the given corpora (A &amp; B) and the

          diagnostic output prints the PPL of the held-out set according

          to the _best_ word-level mixture of A.1bo and B.1bo.</li>

        <li>Hard to say how big the held-out set ought to be for given A

          and B sizes. My only suggestion is to ensure that the held-out

          set contains a representative sample of words that you expect

          to see in the domain. If in doubt, you can always extract the

          domain vocabulary and ensure that the held-out set covers the

          top N% (by frequency) of the domain words, for some suitable N.</li>

      </ol>

      <p>Hope this helps.</p>

      <p>&amp;</p>

    </blockquote>
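    <p>The coverage check in point 2 could be sketched roughly as follows.
      This is only an illustration, not part of select-vocab: the function
      names are made up, and "top N%" is assumed to mean the top N% of
      distinct word types ranked by frequency in a tokenized domain sample.</p>

<pre>
```python
import math
from collections import Counter

def top_words(domain_tokens, percent):
    # Rank distinct word types by frequency (most frequent first) and keep
    # the top `percent` of that ranked list (assumed reading of "top N%").
    ranked = [w for w, _ in Counter(domain_tokens).most_common()]
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:keep]

def missing_from_heldout(domain_tokens, heldout_tokens, percent):
    # Frequent domain words that never occur in the held-out set.
    seen = set(heldout_tokens)
    return [w for w in top_words(domain_tokens, percent) if w not in seen]

# Toy data, purely illustrative.
domain = "stocks stocks stocks bonds bonds yield rally".split()
heldout = "stocks yield".split()
missing = missing_from_heldout(domain, heldout, 50)
print(missing)
```
</pre>

    <p>An empty result would suggest the held-out set covers the chosen
      slice of the domain vocabulary; anything listed is a candidate for
      adding more held-out text.</p>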

    Thanks Anand.  Good to have you back on the list.<br>

    <br>

    Meng:  in case this wasn't clear, "PPL" is short for "perplexity". <br>
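    <p>For a concrete picture, perplexity is the exponential of the negative
      average log-probability per held-out word. Below is a minimal sketch
      of the PPL under the best mixture of two unigram models. It is not
      select-vocab's actual code: add-one smoothing stands in for the
      backoff unigram models (A.1bo, B.1bo), the mixture weight is fitted
      with a plain EM loop, and all names are hypothetical.</p>

<pre>
```python
import math
from collections import Counter

def unigram(tokens, vocab):
    # Add-one smoothed unigram probabilities over a shared vocabulary
    # (an assumption made here so every held-out word has nonzero mass).
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def best_mixture(p_a, p_b, heldout, iters=50):
    # EM for the single weight lam in lam * p_a + (1 - lam) * p_b,
    # chosen to maximize the likelihood of the held-out words.
    lam = 0.5
    for _ in range(iters):
        post = [lam * p_a[w] / (lam * p_a[w] + (1 - lam) * p_b[w])
                for w in heldout]
        lam = sum(post) / len(post)
    return lam

def perplexity(p_a, p_b, lam, heldout):
    # exp of the negative mean log-probability under the mixture.
    log_prob = sum(math.log(lam * p_a[w] + (1 - lam) * p_b[w])
                   for w in heldout)
    return math.exp(-log_prob / len(heldout))

# Toy corpora, purely illustrative.
corpus_a = "the cat sat on the mat".split()
corpus_b = "stocks fell on the news".split()
heldout = "the cat sat on the news".split()
vocab = set(corpus_a) | set(corpus_b) | set(heldout)

p_a = unigram(corpus_a, vocab)
p_b = unigram(corpus_b, vocab)
lam = best_mixture(p_a, p_b, heldout)
ppl = perplexity(p_a, p_b, lam, heldout)
print(f"lambda={lam:.3f} PPL={ppl:.2f}")
```
</pre>

    <p>Lower PPL means the mixture predicts the held-out words better.</p>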

    <br>

    Andreas<br>

    <br>

  </body>

</html>