<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 11/03/13 02:35, Andreas Stolcke
wrote:<br>
</div>
<blockquote cite="mid:5275A84B.8060401@icsi.berkeley.edu"
type="cite">On 11/2/2013 7:46 AM, Joris Pelemans wrote:
<br>
<blockquote type="cite">On 11/02/13 02:07, Andreas Stolcke wrote:
<br>
<blockquote type="cite">
<br>
For example, if you have p(c | a b) = x and d and c are synonyms,
you set
<br>
<br>
p(c | a b) = x/2
<br>
p(d | a b) = x/2
<br>
</blockquote>
<br>
Another question regarding this problem. Say I don't know
a good synonym for d, but I still want to include it by mapping
it onto &lt;unk&gt; (what else, right?), giving it only a very
small fraction of the &lt;unk&gt; probability, since &lt;unk&gt;
is a class. The above technique would lead to gigantic LMs, since
&lt;unk&gt; is all over the place. Is there a smart way in the
SRILM toolkit to specify that some words should be
modeled as &lt;unk&gt;?
<br>
</blockquote>
<br>
I'm not sure I understand what you mean. &lt;unk&gt; is a
special word that all words not in the vocabulary are mapped to at
test time. So the way you 'model' a word by &lt;unk&gt; is to
not include it in the vocabulary of your LM.
<br>
</blockquote>
I am investigating different techniques for introducing new words
into the vocabulary. Say I have a vocabulary of 100,000 words and,
for the sake of simplicity, I want to introduce one new word X. I
could take one of three approaches:<br>
<ol>
<li>use the contexts in which X appears in some training data
(though X may not appear there at all, or not often enough)</li>
<li>estimate the probability of X by taking a fraction of the
probability mass of a synonym of X (as I described earlier)</li>
<li>estimate the probability of X by taking a fraction of the
probability mass of the &lt;unk&gt; class (e.g. if no good
synonym is at hand)</li>
</ol>
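<p>To make options 2 and 3 concrete, here is a minimal sketch of
moving a fraction of one word's probability mass to a new word
within a single context. The function name <code>share_mass</code>,
the dictionary representation and the toy numbers are assumptions
made up for illustration; this is not an SRILM API, and it ignores
backoff weights entirely.</p>

```python
def share_mass(dist, donor, new_word, fraction):
    """Return a copy of `dist` in which `new_word` receives `fraction`
    of the probability currently assigned to `donor`.

    `dist` is a hypothetical stand-in for one conditional distribution,
    e.g. p(. | a b), stored as {word: probability}.
    """
    if not (1.0 > fraction > 0.0):
        raise ValueError("fraction must lie strictly between 0 and 1")
    if donor not in dist:
        raise KeyError(donor)
    if new_word in dist:
        raise ValueError("new_word is already in the distribution")

    out = dict(dist)                 # leave the input untouched
    moved = out[donor] * fraction    # mass taken away from the donor
    out[donor] -= moved
    out[new_word] = moved            # total mass is unchanged
    return out


# Toy p(. | a b): the synonym c has probability x = 0.4.
p_ab = {"c": 0.4, "e": 0.6}

# Option 2 with an even split: p(c | a b) = p(d | a b) = x/2.
p_new = share_mass(p_ab, donor="c", new_word="d", fraction=0.5)
```

<p>For option 3 the donor would be the &lt;unk&gt; entry and the
fraction would be very small. Applying this to every context in
which the donor occurs is what makes the LM grow.</p>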
<p>I could then compare the perplexities of these three LMs, each
with a vocabulary of 100,001 words, on the same test data to see
which technique is best for a given word/situation.<br>
</p>
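<p>For reference, the comparison I have in mind can be sketched as
follows. This assumes each LM has already produced one base-10 log
probability per test token; the function name and the toy input are
made up for illustration, and details such as how sentence-end
tokens are counted are left out.</p>

```python
import math


def perplexity(log10_probs):
    """Perplexity of a test set, given one log10 probability per token."""
    if not log10_probs:
        raise ValueError("need at least one token")
    avg_log10 = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-avg_log10)


# Sanity check: a model that gives every token probability 1/4
# should have a perplexity of about 4.
uniform_scores = [math.log10(0.25)] * 8
ppl = perplexity(uniform_scores)
```

<p>The numbers are only comparable because all three LMs share the
same vocabulary and are scored on the same test text.</p>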
<p>Joris<br>
</p>
</body>
</html>