<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
On 3/31/2012 8:00 PM, Meng Chen wrote:
<blockquote
cite="mid:CA+bc0mppaw_F+gyCSkK2TW_W6y+2gTR=_YTtPNV6wRZyS++r1w@mail.gmail.com"
type="cite"><font face="'trebuchet ms', sans-serif">Hi, I ran into a
problem when training a class-based language model with the
replace-words-with-classes command. My commands are as follows:</font>
<div><font face="'trebuchet ms', sans-serif"><br>
</font></div>
<div>
<ul>
<li><span style="font-family:'trebuchet ms',sans-serif">ngram-class
-vocab wlist -text training_set -numclasses 200
-incremental -classes output.classes</span></li>
<li><span style="font-family:'trebuchet ms',sans-serif">replace-words-with-classes
classes=</span><span style="font-family:'trebuchet
ms',sans-serif">output.classes</span><span
style="font-family:'trebuchet ms',sans-serif">
training_set > training_set_classes</span></li>
</ul>
<div><font face="'trebuchet ms', sans-serif">After these two
steps, I found that training_set_classes contains both words
and classes. The remaining words are OOVs with respect to wlist,
and I don't need them at all. Shouldn't these words be mapped
to &lt;unk&gt; in CLASS-00001? How should I handle this
situation? Does SRILM provide a script to map these OOVs to
CLASS-00001, or do I need to write one
myself?</font></div>
</div>
</blockquote>
<br>
It must be the case that wlist does not contain all the words in
training_set, and therefore output.classes does not cover the entire
vocabulary.<br>
In that case replace-words-with-classes will only operate on words
contained in the class definitions, so any other words are left
unchanged in its output.<br>
<br>
You can easily augment the class definitions to add an extra class
that catches all your OOV words. The format should be
self-explanatory, or check the classes-format(5) man page.<br>
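<br>
As a rough sketch of that augmentation step, the Python below collects
every word in the training text that no class definition covers and
emits one extra definition line per word. It is only an illustration
under some assumptions: the class name CLASS-OOV and the helper names
are made up for this example, each definition line is taken to be
"class [p] word" with a single-word expansion (as ngram-class writes
them), and the uniform probabilities are placeholders, not estimates.<br>
<pre>
```python
def is_number(token):
    """Return True if token parses as a float (used to spot the optional probability)."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def covered_words(class_lines):
    """Collect the words that already appear in some class definition.

    Assumes each line looks like 'CLASS [p] word', with p optional.
    """
    covered = set()
    for line in class_lines:
        fields = line.split()
        expansion = fields[1:]          # everything after the class name
        if not expansion:
            continue                    # blank line or bare class name
        # Drop the optional probability when something follows it.
        if len(expansion) != 1 and is_number(expansion[0]):
            expansion = expansion[1:]
        covered.update(expansion)
    return covered


def oov_class_lines(class_lines, text_lines, class_name="CLASS-OOV"):
    """Build definition lines mapping every uncovered word to class_name.

    class_name is a hypothetical label; the probabilities are uniform
    placeholders (1/N for N uncovered words).
    """
    covered = covered_words(class_lines)
    text_vocab = {word for line in text_lines for word in line.split()}
    oov = sorted(text_vocab - covered)
    if not oov:
        return []
    prob = 1.0 / len(oov)
    return ["%s %.6g %s" % (class_name, prob, word) for word in oov]
```
</pre>
To use it, read output.classes and training_set as lists of lines,
append the returned lines to the class file, and then rerun
replace-words-with-classes with the augmented file.<br>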
<br>
Andreas<br>
<br>
<br>
</body>
</html>