<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
On 4/10/2012 12:21 AM, bulusheva wrote:
<blockquote cite="mid:4F83DF60.2000601@speechpro.com" type="cite">
Hi, I have two questions:<br>
<br>
1. If I generate the language model with Kneser-Ney smoothing (or
Modified Kneser-Ney), why does the "-gtnmin" parameter apply to
the already modified counts? <br>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex;
border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div>For example, if in the training data the 2-gram "markov model"
occurs only in the context "hidden markov model" and gt2min =
2, then the modified count for "markov model" = n(* markov
model) = 1 &lt; gt2min, so <br>
prob("markov model") = bow("markov")*prob("model") <br>
instead of prob("markov model") = ( n(* markov model) - D)/
n(* markov *) ;<br>
</div>
</blockquote>
</blockquote>
That's how it is currently implemented. It is debatable how the
minimum count should be applied in the case of the lower-order
distributions in KN models.<br>
The way it currently works is natural from an implementation
perspective, because the lower-order counts are physically modified
before applying the discounting (you can examine them by adding
-write COUNTS).<br>
<br>
But you are raising a good point. It might make more sense to have
the -gtXmin values be interpreted independently of the discounting
method.<br>
<br>
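As a rough illustration of the behavior described above, here is a small
Python sketch. It is not SRILM's code: the function names and the toy
counts are made up for this example, and sentence-boundary handling is
ignored. It only shows why a bigram that is frequent in the raw data can
still fall below gt2min once its count has been replaced by the number of
distinct left contexts.<br>
<pre>
```python
from collections import Counter

def kn_modified_bigram_counts(trigram_counts):
    """Hypothetical helper: replace each bigram count by the number of
    distinct words that precede it, i.e. n(* w2 w3)."""
    modified = Counter()
    for (w1, w2, w3), count in trigram_counts.items():
        if count > 0:
            # One distinct left context seen for the bigram (w2, w3).
            modified[(w2, w3)] += 1
    return modified

def surviving_bigrams(modified_counts, gt2min):
    """Hypothetical helper: keep only bigrams whose modified count
    reaches the minimum count (assumed to mimic -gt2min)."""
    return {bg for bg, n in modified_counts.items() if n >= gt2min}

# Toy data (assumed): "markov model" occurs 7 times in the raw text,
# but only ever after "hidden".
trigrams = Counter({
    ("hidden", "markov", "model"): 7,
    ("a", "markov", "chain"): 1,
    ("the", "markov", "chain"): 2,
})

modified = kn_modified_bigram_counts(trigrams)
kept = surviving_bigrams(modified, gt2min=2)

print(modified[("markov", "model")])   # modified count n(* markov model)
print(("markov", "model") in kept)     # whether it survives the cutoff
```
</pre>
With these toy counts, "markov model" has a single left context, so it is
dropped when the cutoff is 2 even though it occurs 7 times in the raw text,
while "markov chain" (two left contexts) is kept.<br>
<br>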
<blockquote cite="mid:4F83DF60.2000601@speechpro.com" type="cite">
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex;
border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div> <br>
2. Let's say I use ngram-count to generate the language model as
follows: <br>
ngram-count -text text.txt -vocab vocab.txt -gt1min 5 -lm
sri.lm<br>
Suppose the word "hello" exists in "vocab.txt" and occurs 4 times
in "text.txt". Then the probability of "hello" is calculated as
the probability of a zeroton. Is that correct?<br>
</div>
</blockquote>
</blockquote>
That is correct, but the ARPA format doesn't allow you to prune
unigrams, so the unigrams will always appear explicitly listed in
the LM, even if their probabilities might be obtained by backing off
to a uniform distribution.<br>
<br>
Andreas<br>
<br>
</body>
</html>