I realized I was off the list and just rejoined (thanks Andreas).<br><br>Meng, in response to your questions about select-vocab:<br><ol><li>Yes, you're right about the PPL. The program trains a separate unigram LM for each of the given corpora (A and B), and the diagnostic output prints the perplexity (PPL) of the held-out set under the _best_ word-level mixture of A.1bo and B.1bo.</li>


<li>It is hard to say how big the held-out set ought to be for given sizes of A and B. My only suggestion is to make sure the held-out set contains a representative sample of the words you expect to see in the domain. If in doubt, you can always extract the domain vocabulary and check that the held-out set covers the top N% (by frequency) of the domain words, for some suitable N.</li>


</ol><p>Hope this helps.</p>
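<p>In case it is useful, the coverage check from point 2 could be sketched roughly as below. This is only an illustration and not part of select-vocab: the function names are hypothetical, both inputs are assumed to be already tokenized into lists of words, and "top N%" is read here as the top N% of distinct word types ranked by frequency (one possible interpretation).</p>

```python
from collections import Counter


def top_fraction_vocab(domain_tokens, fraction):
    """Return the most frequent word types, keeping the top `fraction`
    (0 < fraction <= 1) of the distinct types in the domain data.

    Hypothetical helper: "top N%" is interpreted over word types,
    not over token mass.
    """
    counts = Counter(domain_tokens)
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    if not ranked:
        return []
    k = max(1, int(fraction * len(ranked)))
    return ranked[:k]


def heldout_coverage(domain_tokens, heldout_tokens, fraction):
    """Return (coverage, missing): the share of the top domain word types
    that occur in the held-out set, and the types that do not occur."""
    target = top_fraction_vocab(domain_tokens, fraction)
    if not target:
        return 1.0, []
    heldout_types = set(heldout_tokens)
    missing = [word for word in target if word not in heldout_types]
    coverage = 1.0 - len(missing) / len(target)
    return coverage, missing


if __name__ == "__main__":
    # Toy data, purely illustrative.
    domain = "a a a b b c d".split()
    heldout = "a c".split()
    coverage, missing = heldout_coverage(domain, heldout, 0.5)
    print(f"coverage={coverage:.2f} missing={missing}")
```

<p>If the missing list is long for the N you care about, that would suggest adding more in-domain text to the held-out set before trusting the reported PPL.</p>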