On 3/19/2014 10:57 AM, Stefy D. wrote:

> Dear Andreas,
>
> thank you very much for replying.
>
> I trained both LMs using the "-unk" option like this:
>
> $NGRAMCOUNT_FILE -order 3 -interpolate -kndiscount -unk -text $WORKING_DIR$OUT_CORPUS"lowercased."$TARGET -lm $WORKING_DIR"lm_a/lmodel.lm"

That explains why you are not getting OOVs reported in the ppl output. Unknown words are mapped to <unk>, and thus the LM has a probability for <unk>.
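
For reference, a minimal evaluation call of the kind in question (file names here are placeholders, not your actual paths) would be:

    ngram -order 3 -unk -lm lmodel.lm -ppl test.lowercased.txt

With -unk given at both training and evaluation time, unknown test words receive the <unk> probability instead of showing up in the "OOVs" count of the ppl summary line.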

> For the OOV rate I created a vocabulary list from the training data and used the unigram counts of the test set together with the compute-oov-rate script, like this:
>
> $NGRAMCOUNT_FILE -order 1 -write-vocab "vocabularyTargetUnigram.txt" -text $WORKING_DIR$OUT_CORPUS"lowercased."$TARGET -sort
>
> $NGRAMCOUNT_FILE -order 1 -text $WORKING_DIR"test.lowercased."$TARGET -write "unigramCounts_testdata.txt" -sort
>
> $OOVCOUNT_FILE vocabularyTargetUnigram.txt unigramCounts_testdata.txt
>
> This is how I got the OOV rate mentioned in the first mail. Could you please let me know if I used the right commands to compute it?

You did it right.
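
As a sanity check, you can also compute the OOV rate directly from those two files, e.g. with awk (a sketch assuming the formats that -write-vocab and -write produce: one word per line in the vocabulary file, and "word count" pairs in the counts file):

    awk 'NR == FNR { vocab[$1] = 1; next }
         { tokens += $2; if (!($1 in vocab)) oov += $2 }
         END { printf "OOV tokens: %d / %d = %.2f%%\n", oov, tokens, 100 * oov / tokens }' \
        vocabularyTargetUnigram.txt unigramCounts_testdata.txt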

> You said I should train LM_A using the vocabulary of corpus A + corpus X so that the perplexities can be compared. So I should train LM_A using only corpus A, but with the vocabulary of A + X? Sorry if I am confused, but I thought that for estimating an LM the vocabulary should come from the same corpus used for the estimation. I am using these LMs in SMT systems (a baseline and an adapted one). If I influence the baseline LM with vocabulary from the adaptation data, then the baseline is not really a baseline. Please tell me if I am thinking about this incorrectly.

You are right. What this illustrates is that perplexity alone is not a sufficient metric for comparing LMs. In your scenario (LM adaptation) the expansion of the vocabulary is a key component of the adaptation process, but LMs with different vocabularies are no longer comparable by ppl. My suggestion to unify the vocabularies was a workaround to allow you to still use a perplexity comparison.
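
Concretely, the unification could look like this (a sketch with placeholder file names): build one word list over the concatenation of corpus A and corpus X, then pass it to ngram-count via -vocab when training each LM:

    cat corpusA.txt corpusX.txt > corpusAX.txt
    ngram-count -order 1 -text corpusAX.txt -write-vocab vocab_AX.txt
    ngram-count -order 3 -interpolate -kndiscount -vocab vocab_AX.txt -text corpusA.txt -lm lm_a.lm

With -vocab, words from the list that never occur in corpus A still enter LM_A as unigrams (receiving probability mass through discounting), so both LMs end up defined over the same vocabulary.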

> Thank you for introducing me to statistical significance.
>
> To generate a table of word-level probabilities on the same test set, should I use get-unigram-probs? But where do I specify the test set?
>
> $UNIGRAMPROBS_FILE linear=1 $WORKING_DIR"lm_a/lmodel.arpa."$TARGET > table_A.out

No, you get the word probabilities from the output of ngram -debug 2 -ppl (you need to write a small Perl script or similar to extract the probabilities).
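
For example, in the -debug 2 output every scored word produces a line of the form "p( word | history ...) = [2gram] 0.1234 [ -0.9087 ]", where the bracketed number at the end is the log10 probability. A sketch that pulls those out, assuming exactly that line format:

    ngram -order 3 -lm lm_a.lm -ppl test.lowercased.txt -debug 2 \
      | awk '/^\tp\(/ { print $(NF-1) }' > probs_A.txt

Running the same command with the other LM gives probs_B.txt; since both runs score the same test set, line i of each file refers to the same word position.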

> To count how many words had a lower/equal/greater probability in LM_B, is using the compare-ppls script OK? For example, I get this output when applying it to my two LMs (ngram -debug 2 on the same test set as in the previous commands):
>
> $COMPARE_PPLS $WORKING_DIR"ppl_files/ppl_A_detail.ppl" $WORKING_DIR"ppl_files/ppl_B_detail.ppl"
>
> output: total 22450, equal 0, different 22450, greater 11447

Yes, it seems compare-ppls extracts exactly the statistics I was talking about. I had forgotten about it ...
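
If you want to cross-check those counts against the probabilities you extracted yourself, the same statistics fall out of the two per-word lists (probs_A.txt and probs_B.txt are the assumed names from the sketch above, one log10 probability per line, aligned by position):

    paste probs_A.txt probs_B.txt | awk '
      { total++; if ($1 == $2) equal++; else different++; if ($2 > $1) greater++ }
      END { printf "total %d, equal %d, different %d, greater %d\n", total, equal, different, greater }'

A sign test on "greater" out of "different" then tells you whether LM_B winning on 11447 of 22450 words is more than chance variation.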

Andreas