<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 10/28/2012 5:24 PM, Melvin Jose
wrote:<br>
</div>
<blockquote
cite="mid:1351470248.78427.YahooMailNeo@web125501.mail.ne1.yahoo.com"
type="cite">
<div style="color:#000; background-color:#fff; font-family:times
new roman, new york, times, serif;font-size:12pt"><br>
<div><br>
</div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
times new roman,new york,times,serif; background-color:
transparent; font-style: normal;">Hey,</div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
times new roman,new york,times,serif; background-color:
transparent; font-style: normal;"><br>
</div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
times new roman,new york,times,serif; background-color:
transparent; font-style: normal;"><span class="tab"> </span><span
class="tab">I am presently working with Tamil, a
morphologically rich language. I am trying to build an FLM
with approximately 3 million entries, but it has been running
for more than a day and a half now. The FLM specification is</span></div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
times new roman,new york,times,serif; background-color:
transparent; font-style: normal;"><br>
<span class="tab"></span></div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
times new roman,new york,times,serif; background-color:
transparent; font-style: normal;"><span class="tab">W : W(-1)
W(-2) B(-1) S(-1) using generalized backoff, where B is the
word base and S is the suffix.<br>
</span></div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
times new roman,new york,times,serif; background-color:
transparent; font-style: normal;"><br>
</div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
times new roman,new york,times,serif; background-color:
transparent; font-style: normal;">Below is the output of
-debug 2<br>
</div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
times new roman,new york,times,serif; background-color:
transparent; font-style: normal;"><br>
<span class="tab"></span></div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
times new roman,new york,times,serif; background-color:
transparent; font-style: normal;"><span class="tab">warning:
distributing 0.0989813 left-over probability mass over all
577519 words<br>
discarded 1 0x4-gram probs predicting pseudo-events<br>
discarded 1587186 0x4-gram probs discounted to zero<br>
discarded 1 0x8-gram probs predicting pseudo-events<br>
discarded 1 0xc-gram probs predicting pseudo-events<br>
discarded 4721615 0xc-gram probs discounted to zero<br>
Starting estimation of general graph-backoff node: LM 0 Node
0xC, children: 0x8 0x4<br>
Finished estimation of multi-child graph-backoff node: LM 0
Node 0xC</span></div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
times new roman,new york,times,serif; background-color:
transparent; font-style: normal;"><br>
<span class="tab"></span></div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
times new roman,new york,times,serif; background-color:
transparent; font-style: normal;"><span class="tab">This was
the last message I received, a day and a half ago. Is it
normal for it to take so long? I read that Katrin had no
problem training on 5 million entries. Did it take this long?
I am using a cluster in my lab to do the computation, so
there shouldn't be a problem with memory or computational
power.</span></div>
</div>
</blockquote>
<br>
I have no first-hand experience that would tell you how long it should take.<br>
However, in cases like this I would run some experiments, increasing
the amount of data from, say, 10k to 100k, to see how the runtime
grows as a function of input size. Then you can extrapolate to
the full data set instead of just waiting.<br>
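<br>
A minimal sketch of that extrapolation, assuming the runtime roughly
follows a power law t = a * n^b over the range you measure (this may
not hold, for example once the job starts swapping). The function names
are made up for illustration, and you would supply your own measured
sizes and wall-clock times:<br>
<pre>
```python
import math


def fit_power_law(sizes, seconds):
    """Least-squares fit of log(t) = log(a) + b * log(n).

    sizes   -- number of training entries used in each timed run
    seconds -- measured wall-clock time of each run
    Returns (a, b) for the assumed model t = a * n ** b.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in seconds]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Ordinary least squares on the log-log points.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict_seconds(a, b, n):
    """Extrapolate the fitted model to a run with n entries."""
    return a * n ** b
```
</pre>
Time three or four runs between 10k and 100k entries, pass the sizes
and times to fit_power_law, and call predict_seconds with 3000000. An
exponent b well above 1 would suggest that the full run could take far
longer than a linear guess.<br>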
<br>
<blockquote
cite="mid:1351470248.78427.YahooMailNeo@web125501.mail.ne1.yahoo.com"
type="cite">
<div style="color:#000; background-color:#fff; font-family:times
new roman, new york, times, serif;font-size:12pt">
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
times new roman,new york,times,serif; background-color:
transparent; font-style: normal;"><br>
<span class="tab"></span></div>
<div style="color: rgb(0, 0, 0); font-size: 16px; font-family:
times new roman,new york,times,serif; background-color:
transparent; font-style: normal;"><span class="tab">Is there
any way I can tell fngram-count to use as much memory
as it wants, or to parallelize the computation?</span></div>
</div>
</blockquote>
It will take as much memory as it needs to, and there is no easy way
to parallelize.<br>
<br>
Andreas<br>
<br>
<br>
</body>
</html>