<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 9/5/2012 1:05 PM, Anand Venkataraman
wrote:<br>
</div>
<blockquote
cite="mid:CAF6FMTUyWr8DPvYBWu92Urqg1V8oF=LBxMFCkV+EghM_1k+hiA@mail.gmail.com"
type="cite">I realized I was off the list and just rejoined
(thanks Andreas).<br>
<br>
Meng - In response to your questions about select-vocab:<br>
<ol>
<li>Yes, you're right about the PPL. The program trains separate
unigram LMs for the given corpora (A &amp; B) and the
diagnostic output prints the PPL of the held-out set according
to the _best_ word-level mixture of A.1bo and B.1bo.</li>
<li>Hard to say how big the held-out set ought to be for given A
and B sizes. My only suggestion is to ensure that the held-out
set contains a representative sample of words that you expect
to see in the domain. If in doubt, you can always extract the
domain vocabulary and ensure that the held-out set covers the
top N% (by frequency) of the domain words (for some suitable N).</li>
</ol>
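<p>For illustration only, the two points above can be sketched in a few lines of Python. The toy LMs, held-out words, and weight grid below are made up, and a simple grid search stands in for however select-vocab actually estimates the best mixture weight:</p>

```python
import math

# Toy unigram LMs for corpora A and B (word -> probability); made up for illustration.
lm_a = {"the": 0.5, "cat": 0.3, "sat": 0.2}
lm_b = {"the": 0.4, "dog": 0.4, "ran": 0.2}
heldout = ["the", "cat", "the", "dog"]

def mixture_ppl(lam, words, lm_a, lm_b, floor=1e-10):
    """Perplexity of the held-out words under the word-level mixture
    lam * P_A(w) + (1 - lam) * P_B(w); unseen words are floored."""
    logprob = 0.0
    for w in words:
        p = lam * lm_a.get(w, 0.0) + (1.0 - lam) * lm_b.get(w, 0.0)
        logprob += math.log(max(p, floor))
    return math.exp(-logprob / len(words))

# Grid-search the weight that minimizes held-out perplexity, standing in
# for the "best mixture" reported in the diagnostic output (point 1).
best_lam = min((k / 10.0 for k in range(11)),
               key=lambda lam: mixture_ppl(lam, heldout, lm_a, lm_b))

def topn_coverage(domain_counts, words, frac=0.1):
    """Fraction of the top frac (by frequency) of domain words that
    appear in the held-out text (the sanity check from point 2)."""
    ranked = sorted(domain_counts, key=domain_counts.get, reverse=True)
    top = ranked[:max(1, int(len(ranked) * frac))]
    present = set(words)
    return sum(1 for w in top if w in present) / len(top)
```

<p>If topn_coverage comes back low for the N you care about, that suggests the held-out set is not a representative sample of the domain vocabulary.</p>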
<p>Hope this helps.</p>
<p>&amp;</p>
</blockquote>
Thanks Anand. Good to have you back on the list.<br>
<br>
Meng: in case this wasn't clear, "PPL" is short for "perplexity". <br>
<br>
Andreas<br>
<br>
</body>
</html>