can't get right counts-entropy

Andreas Stolcke stolcke at speech.sri.com
Mon Jan 28 14:44:41 PST 2008


SAI TANG HUANG wrote:
> Hi,
>
> Thanks for your answer. However I'm still a bit confused. I ran the abc example you used in your last email and I got a count file of:
>
> sai at uk-notebook:~/Desktop$ ngram-count -text abc.txt
> <s>        1
> <s> a      1
> <s> a b    1
> a          1
> a b        1
> a b c      1
> b          1
> b c        1
> b c </s>   1
> c          1
> c </s>     1
> </s>       1
>
> According to you, I should only keep  
>
> <s> a
> <s> a b
> a b c
> b c </s>
> 
> because the rest of them are "context". I understand that all the unigrams are contexts, but why do I keep "<s> a" and chuck away all the other bigrams?
>
> Also, what exactly do you mean by "context" and "event"? In P(a|b) = P(a,b)/P(b).
>
> Is P(b) the context and P(a,b) the event then?
>   
Since the <s> and </s> tags easily get mangled in email, I will write SB for 
"sentence begin" and SE for "sentence end" below.
Let me try to explain better which ngrams are relevant for computing 
perplexity.  By "events" I mean the tokens that the LM predicts.  Those are 
all the tokens except the initial SB:
    a
    b
    c
    SE

Now, for each of those tokens you add the context that the LM uses to 
condition its prediction.  For a trigram LM these are the two preceding 
tokens, except near the beginning of the sentence, where you cannot go back 
beyond the SB token.  So the contexts together with the predicted tokens are

    SB a
    SB a b
    a b c
    b c SE

Those are the ngrams that you need to feed to ngram -counts to get a 
valid perplexity.
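Concretely, for the sentence "a b c" these four ngrams contribute

    log P(a | SB) + log P(b | SB a) + log P(c | a b) + log P(SE | b c)

to the total logprob, and ppl is 10^(-logprob/N), where N is the number of 
predicted events (4 here; ppl1 leaves out the SE events, so N would be 3).

As a rough, untested sketch of the filtering (assuming a trigram LM, the 
standard <s>/</s> tags in the count file, and an output file name 
event_counts picked just for illustration), a small gawk script along these 
lines should do it:

    # keep the highest-order (trigram) entries, plus ngrams that start
    # with <s> but are not the bare <s> unigram; the fields are the ngram
    # tokens followed by the count, so NF = ngram order + 1
    gawk 'NF == 4 || ($1 == "<s>" && NF > 2)' count_file > event_counts
    ngram -lm lm_file -counts event_counts

On the abc example this keeps exactly the four ngrams listed above.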
I hope this makes more sense now.

Andreas

> Thanks a lot,
>
> Sai
>
> ----------------------------------------
>   
>> Date: Wed, 23 Jan 2008 01:36:32 +0200
>> From: stolcke at speech.sri.com
>> To: sai_tang_huang at hotmail.com
>> CC: srilm-user at speech.sri.com
>> Subject: Re: can't get right counts-entropy
>>
>> SAI TANG HUANG wrote:
>>     
>>> Hi,
>>>
>>> I have created a counts file and a back-off LM file from a text file with sentences with the following command:
>>>
>>> sai at uk-notebook:~/Desktop$ ngram-count -text Merged_File.txt -lm lm_file -write count_file 
>>>
>>> Then I ran the ngram program with -counts; here is the output:
>>>
>>> sai at uk-notebook:~/Desktop$ ngram -lm lm_file -counts count_file 
>>> file count_file: 23640 sentences, 460074 words, 0 OOVs
>>> 7880 zeroprobs, logprob= -1.03103e+06 ppl= 146.821 ppl1= 190.575
>>> sai at uk-notebook:~/Desktop$ 
>>>
>>> I fail to understand the output. I read that the -counts option does something with a counts file (that would be my count_file). I don't understand why there are 7880 zeroprobs. When I run ngram with -ppl I get:
>>>   
>>>       
>> The 7880 zeroprobs are probably due to the <s> tokens output by the
>> ngram-count program.  You cannot use the ngram-count output directly as
>> input to ngram -counts.  See below.
>>     
>>> sai at uk-notebook:~/Desktop$ ngram -lm lm_file -debug 0 -ppl Merged_File.txt 
>>> file Merged_File.txt: 7880 sentences, 153358 words, 0 OOVs
>>> 0 zeroprobs, logprob= -270778 ppl= 47.7932 ppl1= 58.2985
>>> sai at uk-notebook:~/Desktop$ 
>>>
>>> Why does -ppl yield 0 zeroprobs while -counts gives me 7880 zeroprobs? Also, why are the ppl and ppl1 values different from the -ppl ones?
>>>
>>> If there is a more detailed manual or document describing these values then I'm willing to read it.
>>>   
>>>       
>> This is not yet well documented.  To use ngram -counts correctly you must
>> only feed it those N-grams that correspond to "events" in the LM, not
>> those that only appear as "context".  That means you need to filter the
>> ngram-count output and retain only the ngrams that
>>
>> - are of the highest order (e.g., trigrams for a trigram LM), OR
>> - start with <s> (but not the <s> unigram, see above).
>>
>> For example, the sentence "a b c" in conjunction with a trigram LM 
>> should generate only the ngrams
>>
>> <s> a
>> <s> a b
>> a b c
>> b c </s>
>>
>> You can do this filtering with a small perl or gawk script.
>>
>> Sounds like another topic for the FAQ file.
>>
>> Andreas
>>
>>
>>     
>>> Thanks a lot,
>>>
>>> Sai
>>>   
>>>       
>>     
>
>   




