<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 11/04/13 01:01, Andreas Stolcke
wrote:<br>
</div>
<blockquote cite="mid:5276E3E4.7010801@icsi.berkeley.edu"
type="cite">
<div class="moz-cite-prefix">On 11/3/2013 1:43 AM, Joris Pelemans
wrote:<br>
</div>
<blockquote cite="mid:52761ADB.50906@esat.kuleuven.be" type="cite">
I am investigating different techniques for introducing new words
into the vocabulary. Say I have a vocabulary of 100,000 words and
I want to introduce one new word X (for the sake of simplicity). I
could choose one of three options:<br>
<ol>
<li>use the contexts in which X appears in some training data
(but sometimes X may not appear, or not often enough)</li>
<li>estimate the probability of X by taking a fraction of the
prob mass of a synonym of X (which I described earlier)</li>
<li>estimate the probability of X by taking a fraction of the
prob mass of the &lt;unk&gt; class (if, e.g., no good synonym
is at hand)</li>
</ol>
<p>I could then compare the perplexities of these three LMs,
each with a vocabulary of 100,001 words, to see which technique
is best for a given word/situation.<br>
</p>
</blockquote>
And option 3 is effectively already implemented by the way unseen
words are mapped to &lt;unk&gt;. To compute perplexity in a fair
way, you would take the LM containing &lt;unk&gt; and, for every
occurrence of X, add log p(X | &lt;unk&gt;) (the share of the
&lt;unk&gt; probability mass you want to give to X). That way you
don't need to add any N-grams to the LM. What this effectively
does is simulate a class-based N-gram model where &lt;unk&gt; is
a class and X is one of its members.<br>
</blockquote>
Yes, this is exactly what I meant when I asked for a "smart way in
the SRILM toolkit", so I take it this is supported. I looked up how
to use class-based models and I think I found what I need to do. Is
the following the correct way to calculate perplexity for these
models?<br>
<br>
ngram -lm class_lm.arpa -ppl test.txt -order n -classes
expansions.class<br>
<br>
where expansions.class contains lines like this:<br>
<br>
&lt;unk&gt; p(X | &lt;unk&gt;) X<br>
&lt;unk&gt; p(Y | &lt;unk&gt;) Y<br>
&lt;unk&gt; 1 - p(X | &lt;unk&gt;) - p(Y | &lt;unk&gt;) not_mapped<br>
<br>
I assume the last line is necessary since the man page for
"classes-format" says "All expansion probabilities for a given class
should sum to one,
although this is not necessarily enforced by the software and would
lead to improper models."<br>
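<br>
(To avoid getting that sum wrong by hand, the expansion file can be
generated programmatically; a small Python sketch, where the
probability shares for X and Y are hypothetical placeholders:)<br>
<br>
```python
# Hypothetical shares of the <unk> probability mass for the new words.
expansions = {"X": 0.10, "Y": 0.05}

# The remaining mass goes to a placeholder expansion so that all
# expansion probabilities for the <unk> class sum to one, as the
# classes-format man page requires.
remainder = 1.0 - sum(expansions.values())
assert remainder > 0.0

lines = [f"<unk> {p:.6f} {word}" for word, p in expansions.items()]
lines.append(f"<unk> {remainder:.6f} not_mapped")

with open("expansions.class", "w") as f:
    f.write("\n".join(lines) + "\n")
```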
<br>
Joris<br>
</body>
</html>