<div class="moz-cite-prefix">On 9/2/2012 10:10 PM, hic et nunc
wrote:<br>
</div>
<blockquote cite="mid:BLU168-W2006515BC68A2F8601BB9FC9AB0@phx.gbl"
type="cite">
<style><!--
.hmmessage P
{
margin:0px;
padding:0px
}
body.hmmessage
{
font-size: 10pt;
font-family:Tahoma
}
--></style>
<div dir="ltr">
<!--StartFragment-->hello again. i have a new question about lm
ngram probs. <br>
as you know well, in lm file, the log probs are calculated like
this: log [(count[n-gram]*d/count[(n-1)-gram] -
count[(n-1)-gram_<unk>]] <br>
sometimes 1 is added to denominator, but sometimes not. what is
the reason of this? <!--EndFragment--><br>
</div>
</blockquote>

One is added to the denominator only as a last resort, when the
smoothing results in n-gram probabilities that sum to 1 and leave no
probability mass for backoff.  The following comment in NgramLM.cc
explains why:

<blockquote type="cite"> /*<br>
* This is a hack credited to Doug Paul (by Roni
Rosenfeld in<br>
* his CMU tools). It may happen that no probability
mass<br>
* is left after totalling all the explicit probs,
typically<br>
* because the discount coefficients were out of range
and<br>
* forced to 1.0. Unless we have seen all vocabulary
words in<br>
* this context, to arrive at some non-zero backoff
mass,<br>
* we try incrementing the denominator in the
estimator by 1.<br>
* Another hack: If the discounting method uses
interpolation<br>
* we first try disabling that because interpolation
removes<br>
* probability mass.<br>
*/<br>
</blockquote>
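
To make the control flow concrete, here is a minimal sketch of that
retry logic.  This is not the actual SRILM code: the function names,
the Counts container, and the single discount factor are simplified
placeholders, and the interpolation-disabling step is only noted in a
comment.

    // Simplified sketch of the retry logic described in the comment above.
    // NOT the actual SRILM implementation; data structures are placeholders.
    #include <cstddef>
    #include <map>
    #include <string>

    using Counts = std::map<std::string, double>;

    // Estimate explicit probabilities for one (n-1)-gram context and return
    // the total probability mass they use.  'extra' is the Doug Paul hack:
    // 0 normally, 1 when we need to free up some backoff mass.
    double estimateContext(const Counts &ngramCounts, double contextCount,
                           double discount, unsigned extra,
                           std::map<std::string, double> &probs)
    {
        double total = 0.0;
        for (const auto &entry : ngramCounts) {
            double p = (entry.second * discount) / (contextCount + extra);
            probs[entry.first] = p;
            total += p;
        }
        return total;
    }

    void estimateWithHack(const Counts &ngramCounts, double contextCount,
                          double discount, std::size_t vocabSize,
                          std::map<std::string, double> &probs)
    {
        double total =
            estimateContext(ngramCounts, contextCount, discount, 0, probs);

        // If the explicit probs already sum to 1 (e.g., because the discount
        // was forced to 1.0) and we have not seen the whole vocabulary in
        // this context, retry with the denominator incremented by 1 so that
        // some probability mass is left over for backoff.  (The real code
        // first tries disabling interpolation, if applicable.)
        if (total >= 1.0 && ngramCounts.size() < vocabSize) {
            probs.clear();
            estimateContext(ngramCounts, contextCount, discount, 1, probs);
        }
    }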

This happens occasionally with GT smoothing due to degenerate
count-of-counts statistics.
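
For a made-up illustration (not from any particular model): with basic
Good-Turing the discount for singletons is d_1 = 2*n_2 / (1*n_1), where
n_r is the number of n-grams seen exactly r times.  If the data happens
to give n_1 = 10 and n_2 = 8, then d_1 = 1.6 > 1, so the coefficient is
out of range and gets forced to 1.0, which is exactly the situation the
comment above describes.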

Andreas