On 8/8/2012 3:31 AM, Meng Chen wrote:
> Hi, the -prune-lowprobs option in ngram will "prune N-gram
> probabilities that are lower than the corresponding backed-off
> estimates". This option would be especially useful when the
> back-off weight (bow) value is positive. However, I want to ask
> whether I could simply replace the positive bow value with 0
> instead of using -prune-lowprobs. Is there any difference, or is
> simply replacing it incorrect?
It's not correct. If you modify the backoff weight you end up with
an LM that is no longer normalized (the word probabilities for a
given context no longer sum to 1).
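
To see why, here is a minimal sketch of a toy backoff bigram model
(made-up numbers, not SRILM code; probabilities are in linear space,
and "positive bow" refers to the log10 backoff weight stored in ARPA
files). The correctly computed bow makes the context's probabilities
sum to 1; overwriting a positive log10 bow with 0 forces bow = 1 and
breaks that.

from math import log10

# Toy unigram distribution P(w); sums to 1.
unigram = {"a": 0.5, "b": 0.3, "c": 0.2}

# Explicit bigram probabilities P(w | "a"), listed for some words only.
bigram_a = {"a": 0.2, "b": 0.1}

def backoff_weight(explicit, lower):
    # The bow that normalizes P(. | context):
    # (1 - explicit mass) / (1 - lower-order mass of the listed words)
    return (1.0 - sum(explicit.values())) / \
           (1.0 - sum(lower[w] for w in explicit))

def prob(w, explicit, bow):
    # Standard backoff: use the explicit prob if present, else back off.
    return explicit[w] if w in explicit else bow * unigram[w]

bow = backoff_weight(bigram_a, unigram)
print(log10(bow))                                    # ~0.54, a positive log10 bow
print(sum(prob(w, bigram_a, bow) for w in unigram))  # 1.0: normalized

# Replacing the positive log10 bow with 0 means bow = 10**0 = 1:
print(sum(prob(w, bigram_a, 1.0) for w in unigram))  # 0.5: no longer sums to 1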

> Another question: when training an LM, we can use the
> -text-has-weights option for a corpus with sentence frequencies.
> I want to ask what we should do with duplicated sentences in a
> large corpus. Should I delete the duplicated sentences, compute
> the sentence frequencies first and use the -text-has-weights
> option instead, or do nothing and just feed the whole corpus into
> training?
You can do either. Having a duplicated sentence

1.0 a b c
1.0 a b c

is equivalent to having the sentence once with the weights added:

2.0 a b c
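
A quick sketch of why the two are equivalent (a toy counter, not
SRILM's ngram-count; the sentence markers and the order are
assumptions for illustration): the weighted n-gram counts come out
identical either way, so the estimated LM is the same.

from collections import Counter

def ngram_counts(weighted_sentences, order=2):
    # Accumulate weighted n-gram counts up to the given order.
    counts = Counter()
    for weight, sentence in weighted_sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += weight
    return counts

duplicated = [(1.0, "a b c"), (1.0, "a b c")]  # sentence listed twice
weighted   = [(2.0, "a b c")]                  # once, with the weights added

assert ngram_counts(duplicated) == ngram_counts(weighted)
print("same counts either way")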
Andreas