<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">On 8/14/2013 2:41 PM,

      <a class="moz-txt-link-abbreviated" href="mailto:tm-oleary@comcast.net">tm-oleary@comcast.net</a> wrote:<br>

    </div>

    <blockquote

cite="mid:690089649.2148876.1376516479475.JavaMail.root@sz0105a.emeryville.ca.mail.comcast.net"

      type="cite">

      <style type="text/css">p { margin: 0; }</style>

      <div style="font-family: Arial; font-size: 12pt; color: #000000">I

        would like to get a good understanding of what the values in

        .arpa files represent so I can do a better job on a project I am

        working on. I have found some documentation about .arpa files on

        the SRILM web site as well as in some other places that describe

        the values in the first column of the "\n-grams" sections of the

        file as conditional probabilities.

        <div><br>

        </div>

        <div>I assumed from this that if I had an .arpa file containing

          all of the unigrams and bigrams of a corpus, that [1] for all

          unigrams, the sum of 10^unigram_value would equal 1.0 and [2]

          for all bigrams, the sum of (10^bigram_value *

          10^unigram_value_of_first_term_in_bigram) would also equal

          1.0, since the joint probability p(a, b) = p(b|a) * p(a). It

          turns out that [1] is true, but for the .arpa file I have been

          working with, the [2] sum is about .68. I was expecting that

          [2] might sum to something less than 1.0 due to probability

          mass redistributed for smoothing purposes, but that wouldn't

          account for .32 of the total, would it?</div>

      </div>

    </blockquote>

    You are assuming that the LM lists every possible N-gram of a given
    order (in your case, all bigrams). It does not. It lists only the
    N-grams that occur in the training data, and only those that occur
    frequently enough (subject to the -gtNmin parameters). The
    probabilities of unlisted N-grams are computed by backoff. For an
    explanation, search for "backoff computation language model". <br>

    <br>

    So if you sum over all possible bigrams, including the unlisted
    ones whose probabilities come from backoff, you should get a sum
    of 1, as you expect.<br>
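    <br>
    A minimal sketch of that check in Python, on a toy model rather than
    a real .arpa file. The vocabulary, the probabilities, and the names
    bigram_prob and joint_prob are all made up for illustration, and the
    sentence boundary tokens are left out to keep it short. An unlisted
    bigram gets its probability as the backoff weight of the first word
    times the unigram probability of the second word:<br>
<pre>
```python
import math
from itertools import product

# Toy backoff bigram model (hypothetical values, not taken from a real file).
# Stored the way an ARPA file stores them: log10 probabilities and
# log10 backoff weights.
VOCAB = ["a", "b", "c"]
UNIGRAM_LOGP = {
    "a": math.log10(0.5),
    "b": math.log10(0.3),
    "c": math.log10(0.2),
}
# Only some bigrams are listed, as in a real LM.
BIGRAM_LOGP = {
    ("a", "b"): math.log10(0.6),
    ("b", "a"): math.log10(0.5),
    ("b", "c"): math.log10(0.25),
}
# Backoff weights chosen so that each context's distribution sums to 1.
BACKOFF_LOGW = {
    "a": math.log10(0.4 / 0.7),
    "b": math.log10(0.25 / 0.3),
    "c": 0.0,
}


def bigram_prob(first, second):
    """P(second | first): the listed value if present, otherwise backoff."""
    if (first, second) in BIGRAM_LOGP:
        return 10 ** BIGRAM_LOGP[(first, second)]
    return 10 ** (BACKOFF_LOGW[first] + UNIGRAM_LOGP[second])


def joint_prob(first, second):
    """P(first, second) = P(first) * P(second | first)."""
    return 10 ** UNIGRAM_LOGP[first] * bigram_prob(first, second)


# Sum over the listed bigrams only, then over every possible bigram.
listed_sum = sum(joint_prob(a, b) for (a, b) in BIGRAM_LOGP)
full_sum = sum(joint_prob(a, b) for a, b in product(VOCAB, repeat=2))

print(round(listed_sum, 6), round(full_sum, 6))  # about 0.525 and 1.0
```
</pre>
    The listed bigrams alone account for only part of the total, and the
    remainder comes from the backed-off bigrams.<br>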

    <blockquote

cite="mid:690089649.2148876.1376516479475.JavaMail.root@sz0105a.emeryville.ca.mail.comcast.net"

      type="cite">

      <div style="font-family: Arial; font-size: 12pt; color: #000000">

        <div><br>

        </div>

        <div>I think it's more likely that I don't understand what the

          values in the left column represent in the "\n-grams" sections

          for n >= 2. Is there a way to use the values in an .arpa

          file to reconstruct joint probabilities for bigrams (and other

          higher order n-grams) in order to verify that they actually do

          sum to 1.0 for each "\n-grams" section in the file?</div>

      </div>

    </blockquote>

    You are assuming above that the first column contains conditional
    N-gram log probabilities (base 10), and that is correct.<br>

    <br>
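    As a rough sketch of reading one such line in Python: the function
    name parse_arpa_line and the sample line are made up, and the code
    assumes tab-separated fields (log10 probability, the words, then an
    optional log10 backoff weight).<br>
<pre>
```python
def parse_arpa_line(line):
    """Split one n-gram line into (words, probability, backoff weight)."""
    fields = line.strip().split("\t")
    log10_prob = float(fields[0])      # first column: conditional log10 probability
    words = tuple(fields[1].split())   # the n-gram itself, space-separated
    # The backoff weight column is optional; a missing weight is taken
    # as log10 weight 0, i.e. a weight of 1.
    log10_bow = float(fields[2]) if len(fields) == 3 else 0.0
    return words, 10 ** log10_prob, 10 ** log10_bow


# Hypothetical sample line, not from a real model.
words, prob, bow = parse_arpa_line("-0.30103\tthe cat\t-0.1249387")
print(words, round(prob, 4), round(bow, 4))
```
</pre>
    Here prob is P(cat | the), not the joint probability of the pair.<br>
    <br>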

    Andreas <br>

    <br>

  </body>

</html>