<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 7/20/2012 5:04 AM, Nouf Al-Harbi
wrote:<br>
</div>
<blockquote
cite="mid:1342785895.7010.YahooMailNeo@web171304.mail.ir2.yahoo.com"
type="cite">
<div style="color:#000; background-color:#fff; font-family:arial,
helvetica, sans-serif;font-size:12pt">
<div>Hello,</div>
<div><br>
</div>
<div>I am new to language modeling and was hoping that someone
can help me with the following.<br>
<br>
I am trying to predict a word given an input sentence. For example,
I would like to find the word with the highest probability to
replace the ... in sentences such as 'A man is ...' (e.g.
sitting).<br>
<br>
I tried to use the disambig tool, but I couldn't find any example
illustrating how to use it, especially how to create the
map file and what type of file it is (e.g. txt, arpa,
...).<br>
</div>
</div>
</blockquote>
<br>
Indeed, you can use disambig to solve this problem, at least in
theory.<br>
<br>
1. prepare a map file of the form:<br>
<br>
a a<br>
man man<br>
... [for all words occurring in your data]<br>
UNKNOWN_WORD word1 word2 .... [list all words in the
vocabulary here]<br>
<br>
2. train an LM of word sequences.<br>
<br>
3. prepare disambig input of the form<br>
<br>
a man is sitting UNKNOWN_WORD <br>
<br>
You can also add known words to the right of UNKNOWN_WORD if you
have that information (see the note about -fw-only below).<br>
<br>
4. run disambig<br>
<br>
disambig -map MAPFILE -lm LMFILE -text INPUTFILE<br>
<br>
If you want to use only the left context of UNKNOWN_WORD, use the
-fw-only option.<br>
<br>
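As a rough sketch of steps 1 and 3 (not something from the SRILM
distribution), a small Python script along these lines could generate
the map file lines and the input line. The function names and the
sample vocabulary are made up for illustration, and like the rest of
this recipe it has not been tried against disambig itself:<br>
<pre>
```python
# Hypothetical helper, not part of SRILM: builds the map-file lines
# (step 1) and the disambig input line (step 3) described above.
PLACEHOLDER = "UNKNOWN_WORD"


def build_map_lines(vocab):
    """Map every word to itself, and the placeholder to all words."""
    words = sorted(set(vocab))
    lines = [f"{w} {w}" for w in words]
    # The last line lists every vocabulary word as a candidate.
    lines.append(" ".join([PLACEHOLDER] + words))
    return lines


def build_input_line(left_context, right_context=()):
    """Put the placeholder at the position of the word to predict."""
    return " ".join(list(left_context) + [PLACEHOLDER] + list(right_context))


if __name__ == "__main__":
    # Assumed toy vocabulary; in practice, read it from the training data.
    vocab = ["a", "man", "is", "sitting", "standing"]
    print("\n".join(build_map_lines(vocab)))
    print(build_input_line(["a", "man", "is"]))
```
</pre>
Save the map lines as MAPFILE and the input line as INPUTFILE before
running the disambig command from step 4.<br>
<br>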
That is the theory. Since UNKNOWN_WORD expands to every word in the
vocabulary, a large vocabulary may make this very slow and use too
much memory. I haven't tried it, so let me know if it works for you.<br>
<br>
Andreas<br>
<br>
</body>
</html>