[SRILM User List] Re: A confusion of the interpolated language model

海龙 史 shl.thcn at yahoo.com.cn
Fri Aug 28 22:21:25 PDT 2009


Hi, thanks for your concern!
I do know that a back-off weight is not a probability, but with the interpolated mod-kn smoothing method, bows are not supposed to be greater than 1.
From the SRILM man page ngram-discount(7), I've got this:
For back-off smoothing, there is

(1)    p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)

where f(a_z) depends on the smoothing method, and bow(a_) follows from the normalization constraint:

       Sum_Z p(a_z) = 1
       Sum_Z1 f(a_z) + Sum_Z0 bow(a_) p(_z) = 1

(2)    bow(a_) = (1 - Sum_Z1 f(a_z)) / Sum_Z0 p(_z)
               = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 p(_z))
               = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))
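Equation (2) can be sketched in a few lines of Python (a toy illustration with made-up numbers, not SRILM code):

```python
# Toy illustration of equation (2): the back-off weight bow(a_) that
# normalizes p(a_z), given discounted estimates f(a_z) on Z1 and the
# lower-order distribution p(_z).

def backoff_weight(f, p):
    """f: f(a_z) for words z in Z1 (c(a_z) > 0); p: p(_z) over all of Z."""
    sum_f = sum(f.values())            # Sum_Z1 f(a_z)
    sum_p1 = sum(p[z] for z in f)      # Sum_Z1 p(_z)
    return (1.0 - sum_f) / (1.0 - sum_p1)

p = {"a": 0.5, "b": 0.3, "c": 0.2}    # lower-order p(_z), sums to 1
f = {"a": 0.4, "b": 0.25}             # discounted f(a_z) for Z1 = {a, b}
bow = backoff_weight(f, p)            # (1 - 0.65) / (1 - 0.8) = 1.75

# The full distribution sums to 1 -- and note that bow > 1 is possible
# even in back-off form, since bow only rescales the unseen-word mass.
total = sum(f.values()) + bow * sum(p[z] for z in p if z not in f)
```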
but for interpolated smoothing, there is

(3)    f(a_z) = g(a_z) + bow(a_) p(_z)
(4)    p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)

and from the normalization constraint:

       Sum_Z p(a_z) = 1
       Sum_Z1 g(a_z) + Sum_Z bow(a_) p(_z) = 1

(5)    bow(a_) = 1 - Sum_Z1 g(a_z)

(where Z is the set of all words in the vocabulary, Z0 the set of all words with c(a_z) = 0, and Z1 the set of all words with c(a_z) > 0)

However, in the SRILM source code it seems that the interpolated bows are first calculated using (5), the probs and bows are then converted into back-off form using (3), and finally the back-off version of the bows is recomputed using (2). I just don't understand why SRILM does not use the bow calculated with (5) directly.
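Numerically the two routes do agree for an exact, unpruned model, since 1 - Sum_Z1 f = 1 - Sum_Z1 g - bow * Sum_Z1 p = bow * (1 - Sum_Z1 p). A toy check with made-up numbers (not SRILM code):

```python
# Toy check: converting the interpolated model to back-off form via (3)
# and recomputing the bow via (2) reproduces the interpolated bow from
# (5), as long as nothing has been pruned or otherwise modified.

p = {"a": 0.5, "b": 0.3, "c": 0.2}   # lower-order p(_z), sums to 1
g = {"a": 0.3, "b": 0.2}             # interpolated g(a_z) for Z1 = {a, b}

bow5 = 1.0 - sum(g.values())                     # equation (5)
f = {z: g[z] + bow5 * p[z] for z in g}           # equation (3)
bow2 = (1.0 - sum(f.values())) / (1.0 - sum(p[z] for z in g))  # equation (2)

# bow2 == bow5 up to rounding here; once entries are pruned the f
# values change, and only a recomputed bow keeps the model normalized.
```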
Besides, I previously used entropy pruning to construct a language model:

~ ngram-count -read merge_counts_1994-2003.gz -gt1min 0 -gt2min 0 -gt3min 0 -kndiscount -interpolate -prune 0.000000001 -order 3 -vocab ChWord.lexno -lm 1994-2003_lm_pruned1e-9.lm

and there was definitely no bow greater than 1.
So this problem is weird, and I wonder whether any of you knows the reason. Also, was the command I used to build the mod-kn discounted language model (where I want to exclude 3-grams with a count of 1) correct?
~ ngram-count -read merge_counts_1994-2003.gz -gt1min 0 -gt2min 0 -gt3min 2 -kndiscount -interpolate -order 3 -vocab ChWord.lexno -lm 1994-2003_lm_all_pruned.lm

Thank you very much!



史海龙
Hailoon Shi
W63, EE Dept., Tsinghua Univ., Beijing, China
Share happiness, and it is doubled





________________________________
From: Yannick Estève <yannick.esteve at lium.univ-lemans.fr>
To: 海龙 史 <shl.thcn at yahoo.com.cn>
Cc: srilm-user at speech.sri.com
Sent: Thursday, 2009/8/27, 4:19:44 PM
Subject: Re: [SRILM User List] A confusion of the interpolated language model

Hi,

Back-off weights are not probabilities: they can be greater than 1.
So, your values are normal. You can find an explanation of back-off weight computation here, particularly for the modified Kneser-Ney discounting method:
http://www.speech.sri.com/projects/srilm/manpages/pdfs/chen-goodman-tr-10-98.pdf
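For example, the ARPA file format stores log10 values, so the positive bow field in the line you quoted corresponds to a linear weight above 1 (a quick check, not SRILM code):

```python
# The ARPA LM format stores log10 probabilities and back-off weights,
# so a positive value in the bow column means bow > 1.
log10_bow = 0.1270406        # the bow field from the quoted <s> line
bow = 10.0 ** log10_bow      # linear back-off weight, about 1.34
```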

Regards,
Yannick Estève
LIUM - University of Le Mans
France

On 27 Aug 09 at 09:21, 海龙 史 wrote:


>I am a new student user of SRILM from Asia. Here I used the command below to construct an interpolated mod-kn discounted language model:
>~ ngram-count -read merge_counts_1994-2003.gz -gt1min 0 -gt2min 0 -gt3min 2 -kndiscount -interpolate -order 3 -vocab ChWord.lexno -lm 1994-2003_lm_all_pruned.lm
>
>
> However, in my model several N-grams' back-off weight (bow) appears to be greater than 1. That is, in the text LM file I've got the line:
>-6.457229    <s> 1635    0.1270406
>(here we use an index to represent a Chinese word)
>in which the log10(bow) is greater than 0. We don't think a normal interpolated discounting method can produce an N-gram bow greater than 1; besides, this only occurred for a few (fewer than 5) different N-grams. So I am confused and would like to ask whether someone has encountered this or happens to know what is wrong.
>Thank you very much!
>
>史海龙
>Hailoon Shi
>W63, EE Dept., Tsinghua Univ., PRC
>
>
>
>_______________________________________________
>SRILM-User site list
>SRILM-User at speech.sri.com
>http://www.speech.sri.com/mailman/listinfo/srilm-user




