<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 8/14/2013 2:41 PM,
<a class="moz-txt-link-abbreviated" href="mailto:tm-oleary@comcast.net">tm-oleary@comcast.net</a> wrote:<br>
</div>
<blockquote
cite="mid:690089649.2148876.1376516479475.JavaMail.root@sz0105a.emeryville.ca.mail.comcast.net"
type="cite">
<style type="text/css">p { margin: 0; }</style>
<div style="font-family: Arial; font-size: 12pt; color: #000000">I
would like to get a good understanding of what the values in
.arpa files represent so I can do a better job on a project I am
working on. I have found some documentation about .arpa files on
the SRILM web site as well as in some other places that describe
the values in the first column of the "\n-grams" sections of the
file as conditional probabilities.
<div><br>
</div>
<div>I assumed from this that if I had an .arpa file containing
all of the unigrams and bigrams of a corpus, that [1] for all
unigrams, the sum of 10^unigram_value would equal 1.0 and [2]
for all bigrams, the sum of (10^bigram_value *
10^unigram_value_of_first_term_in_bigram) would also equal
1.0, since the joint probability p(a, b) = p(b|a) * p(a). It
turns out that [1] is true, but for the .arpa file I have been
working with, the [2] sum is about .68. I was expecting that
[2] might sum to something less than 1.0 due to probability
mass redistributed for smoothing purposes, but that wouldn't
account for .32 of the total, would it?</div>
</div>
</blockquote>
You are assuming that the LM file lists all possible N-grams of a
given order (in your case, all bigrams). It does not. It lists
only the N-grams that occur in the training data, and that occur
frequently enough (subject to the -gtNmin parameters). The
probabilities of unlisted N-grams are computed by backoff: for an
unlisted bigram "a b", p(b|a) = bow(a) * p(b), where bow(a) is the
backoff weight in the last column of the unigram entry for "a". For
a full explanation search for "backoff computation language model". <br>
<br>
So if you sum p(a) * p(b|a) over all possible bigrams, including the
unlisted ones whose probabilities come from backoff, you should get
a sum of 1, as you expect.<br>
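<br>
As a rough illustration of the point above, here is a minimal Python
sketch of a toy bigram backoff model. The vocabulary, the
probabilities, and the helper names (backoff_weight, cond_prob,
joint_prob) are made up for this example and do not come from SRILM
or from your file. It also ignores the sentence-boundary tokens that
a real LM contains, and it computes the backoff weights itself
instead of reading them from an .arpa file.<br>
<pre>
```python
import math
from itertools import product

# Toy model (hypothetical numbers, not from any real .arpa file).
unigram_p = {"a": 0.5, "b": 0.3, "c": 0.2}
# Only some bigrams are listed explicitly, as in a real backoff LM.
listed_bigram_p = {("a", "b"): 0.6, ("a", "c"): 0.1, ("b", "a"): 0.7}

# Store values the way an ARPA file does: as log10 probabilities.
uni_log = {w: math.log10(p) for w, p in unigram_p.items()}
bi_log = {bg: math.log10(p) for bg, p in listed_bigram_p.items()}


def backoff_weight(ctx):
    """Weight that makes p(.|ctx) sum to 1 over the whole vocabulary."""
    seen = [w for (c, w) in listed_bigram_p if c == ctx]
    left_over = 1.0 - sum(listed_bigram_p[(ctx, w)] for w in seen)
    unseen_unigram_mass = 1.0 - sum(unigram_p[w] for w in seen)
    return left_over / unseen_unigram_mass


bow_log = {w: math.log10(backoff_weight(w)) for w in unigram_p}


def cond_prob(ctx, w):
    """p(w | ctx): the listed value if present, else bow(ctx) * p(w)."""
    if (ctx, w) in bi_log:
        return 10 ** bi_log[(ctx, w)]
    return 10 ** (bow_log[ctx] + uni_log[w])


def joint_prob(ctx, w):
    """p(ctx, w) = p(ctx) * p(w | ctx)."""
    return 10 ** uni_log[ctx] * cond_prob(ctx, w)


# Sum over the listed bigrams only (what you get from the file alone).
listed_sum = sum(joint_prob(c, w) for (c, w) in bi_log)
# Sum over every possible bigram, with backoff for the unlisted ones.
full_sum = sum(joint_prob(c, w) for c, w in product(unigram_p, repeat=2))

print(f"listed bigrams only:  {listed_sum:.3f}")
print(f"all possible bigrams: {full_sum:.3f}")
```
</pre>
With these toy numbers the sum over the three listed bigrams comes to
0.56, while the sum over all nine possible bigrams comes to 1.<br>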
<blockquote
cite="mid:690089649.2148876.1376516479475.JavaMail.root@sz0105a.emeryville.ca.mail.comcast.net"
type="cite">
<div style="font-family: Arial; font-size: 12pt; color: #000000">
<div><br>
</div>
<div>I think it's more likely that I don't understand what the
values in the left column represent in the "\n-grams" sections
for n >= 2. Is there a way to use the values in an .arpa
file to reconstruct joint probabilities for bigrams (and other
higher order n-grams) in order to verify that they actually do
sum to 1.0 for each "\n-grams" section in the file?</div>
</div>
</blockquote>
Your assumption above is correct: the first column contains
conditional N-gram log probabilities (base 10).<br>
<br>
Andreas <br>
<br>
</body>
</html>