From stolcke at icsi.berkeley.edu Thu Apr 10 22:49:27 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 10 Apr 2014 22:49:27 -0700
Subject: [SRILM User List] FW: Create Phone Lattices
In-Reply-To: 
References: <1397045953.27410.YahooMailNeo@web125803.mail.ne1.yahoo.com>
Message-ID: <53478267.2050404@icsi.berkeley.edu>

On 4/9/2014 3:40 PM, Andreas Stolcke wrote:
>
> *From:* Dmitriy Ivanko [mailto:dmitriy_ivanko at yahoo.com]
> *Sent:* Wednesday, April 9, 2014 5:19 AM
> *To:* Andreas Stolcke
> *Subject:* Create Phone Lattices
>
> Hello, Andreas Stolcke!
> Thank you for your program "lattice-tool.exe" in SRILM.
> I'm sorry for my English.
> Can you help me? I am trying to create an N-gram language model using
> the EM algorithm and lattices. I have lattices, but I don't know: is it
> possible to create a language model using the EM algorithm?
> As in the article:
>
> "LANGUAGE RECOGNITION USING PHONE LATTICES" *J.L. Gauvain*, A.
> Messaoudi, and H. Schwenk.
>
> Anyway, thank you!
> Best Regards,
> Dmitriy Ivanko.

Yes, you can use lattice-tool -write-ngrams (plus options to specify the
lattices, ngram order, etc.) to compute expected ngram counts from
lattices. The lattices should be in HTK format. You can then estimate LMs
from the expected ngram counts (using ngram-count -float-counts ...).

I have personally used this method to implement the Gauvain et al.
language recognition method, and it works great. I'm not sure how well it
works for other tasks.

Andreas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From zhangj at computing.dcu.ie Wed Apr 16 03:20:28 2014
From: zhangj at computing.dcu.ie (jian zhang)
Date: Wed, 16 Apr 2014 11:20:28 +0100
Subject: [SRILM User List] ppl output from ngram interpret
Message-ID: 

Hi Andreas,

I am confused about the ppl output from ngram. The following are the
outputs for two sentences:

resumption of the session
p( resumption | <s> ) = [1gram] 6.41856e-07 [ -6.19256 ]
p( of | resumption ...) = [2gram] 0.547254 [ -0.261811 ]
*p( the | of ...) = [2gram] 0.0826684 [ -1.08266 ]*
p( session | the ...) = [1gram] 1.21666e-06 [ -5.91483 ]
p( </s> | session ...) = [1gram] 0.00150439 [ -2.82264 ]
1 sentences, 4 words, 0 OOVs
0 zeroprobs, logprob= -16.2745 ppl= 1798.46 ppl1= 11711.9
4 words, rank1= 0.25 rank5= 0.5 rank10= 0.5
5 words+sents, rank1wSent= 0.2 rank5wSent= 0.4 rank10wSent= 0.4 qloss= 0.899274 absloss= 0.873714

you have requested a debate on this subject in the course of the next few days , during this part-session .
p( you | <s> ) = [2gram] 0.000716442 [ -3.14482 ]
p( have | you ...) = [2gram] 0.0179397 [ -1.74618 ]
p( requested | have ...) = [1gram] 6.43992e-06 [ -5.19112 ]
p( a | requested ...) = [1gram] 0.00378035 [ -2.42247 ]
p( debate | a ...) = [2gram] 0.000358849 [ -3.44509 ]
p( on | debate ...) = [2gram] 0.0598839 [ -1.22269 ]
p( this | on ...) = [2gram] 0.00443142 [ -2.35346 ]
p( subject | this ...) = [2gram] 9.54276e-05 [ -4.02033 ]
p( in | subject ...) = [2gram] 0.0436281 [ -1.36023 ]
p( the | in ...) = [2gram] 0.147714 [ -0.830578 ]
p( course | the ...) = [3gram] 0.00139691 [ -2.85483 ]
p( of | course ...) = [3gram] 0.579381 [ -0.237035 ]
*p( the | of ...) = [2gram] 0.0762541 [ -1.11774 ]*
p( next | the ...) = [3gram] 0.00123622 [ -2.9079 ]
p( few | next ...) = [3gram] 0.0245328 [ -1.61025 ]
p( days | few ...) = [2gram] 0.00340647 [ -2.46769 ]
p( , | days ...) = [2gram] 0.15756 [ -0.802555 ]
p( during | , ...) = [2gram] 0.000749831 [ -3.12504 ]
p( this | during ...) = [3gram] 0.0352358 [ -1.45302 ]
p( part-session | this ...) = [1gram] 9.0905e-07 [ -6.04141 ]
p( . | part-session ...) = [1gram] 0.0254746 [ -1.59389 ]
p( </s> | . ...) = [2gram] 0.809733 [ -0.091658 ]
1 sentences, 21 words, 0 OOVs
0 zeroprobs, logprob= -50.04 ppl= 188.168 ppl1= 241.466
21 words, rank1= 0.142857 rank5= 0.428571 rank10= 0.47619
22 words+sents, rank1wSent= 0.181818 rank5wSent= 0.454545 rank10wSent= 0.5 qloss= 0.930912 absloss= 0.909386

My two questions:
1. There are 2-gram p( the | of ...)
computed in both sentences; why do they have different probabilities (the
first sentence gives 0.0826684, the second gives 0.0762541)?
2. Is there a parameter setting for ngram that can print out the actual
tokens instead of an ellipsis?

Thanks,
Jian
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at icsi.berkeley.edu Wed Apr 16 09:50:00 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 16 Apr 2014 09:50:00 -0700
Subject: [SRILM User List] ppl output from ngram interpret
In-Reply-To: 
References: 
Message-ID: <534EB4B8.5000804@icsi.berkeley.edu>

On 4/16/2014 3:20 AM, jian zhang wrote:
> Hi Andreas,
>
> I am confused about the ppl output from ngram.
> [...]
>
> My two questions:
> 1. There are 2-gram p( the | of ...) computed in both sentences; why do
> they have different probabilities (the first sentence gives 0.0826684,
> the second gives 0.0762541)?

Because the backoff weights are dependent on the trigram context.
So the first probability equals
    bow("resumption of") * p("the" | "of")
whereas the second probability is
    bow("course of") * p("the" | "of")

> 2. Is there a parameter setting for ngram that can print out the actual
> tokens instead of an ellipsis?

No, unfortunately. The idea behind the output format was to keep the
number of fields constant so as to facilitate parsing with awk/perl/etc.

Andreas
-------------- next part --------------
An HTML attachment was scrubbed...
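Andreas's arithmetic can be checked with a toy model. Below is a minimal Python sketch of the trigram-to-bigram backoff lookup; the split of each log probability into a backoff weight plus a shared bigram score is made up for illustration, but the sums are the two starred log probabilities from the output above:

```python
# Minimal sketch of a backed-off n-gram lookup, illustrating why the same
# bigram p( the | of ...) gets different values under different trigram
# contexts. The bow/bigram split below is hypothetical; only the sums
# (-1.08266 and -1.11774) come from the ngram output in this thread.
trigram_logprob = {}                      # no trigram ending in "the" here
bigram_logprob = {("of", "the"): -1.0}    # hypothetical log10 p(the | of)
bow = {                                   # hypothetical log10 backoff weights
    ("resumption", "of"): -0.08266,
    ("course", "of"): -0.11774,
}

def logp(word, ctx):
    """log10 p(word | ctx) with backoff from trigram to bigram."""
    if ctx + (word,) in trigram_logprob:
        return trigram_logprob[ctx + (word,)]
    # no trigram found: add the context's backoff weight to the bigram score
    return bow.get(ctx, 0.0) + bigram_logprob[(ctx[-1], word)]

p1 = 10 ** logp("the", ("resumption", "of"))   # first sentence's context
p2 = 10 ** logp("the", ("course", "of"))       # second sentence's context
```

With these made-up weights, p1 and p2 reproduce the 0.0826684 and 0.0762541 seen in the two starred lines: the bigram score is shared, and only the backoff weight of the trigram context differs.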
URL: 

From zhangj at computing.dcu.ie Wed Apr 16 10:44:14 2014
From: zhangj at computing.dcu.ie (jian zhang)
Date: Wed, 16 Apr 2014 18:44:14 +0100
Subject: [SRILM User List] ppl output from ngram interpret
In-Reply-To: <534EB4B8.5000804@icsi.berkeley.edu>
References: <534EB4B8.5000804@icsi.berkeley.edu>
Message-ID: 

Hi Andreas and ???,

Thanks. I understand now.

Jian

On Wed, Apr 16, 2014 at 5:50 PM, Andreas Stolcke wrote:
> Because the backoff weights are dependent on the trigram context.
> So the first probability equals
>     bow("resumption of") * p("the" | "of")
> whereas the second probability is
>     bow("course of") * p("the" | "of")
> [...]

-- 
Jian Zhang
Centre for Next Generation Localisation (CNGL)
Dublin City University
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From tarek.ahmed at rdi-eg.com Wed Apr 23 08:21:20 2014
From: tarek.ahmed at rdi-eg.com (tarek)
Date: Wed, 23 Apr 2014 17:21:20 +0200
Subject: [SRILM User List] SRILM commercial license
Message-ID: <5357DA70.4040507@rdi-eg.com>

hello,

We are a software company working in the natural language processing
field. We would like to use the SRILM tools (especially lattice-tool) in
our products. I am asking about the license issue.
best regards,
tarek abuamer

From ismail.indonesia at gmail.com Mon Apr 28 03:01:14 2014
From: ismail.indonesia at gmail.com (Ismail Rusli)
Date: Mon, 28 Apr 2014 17:01:14 +0700
Subject: [SRILM User List] Right way to build LM
Message-ID: <535E26EA.30804@gmail.com>

Dear all,

I attempted to build an n-gram LM from Wikipedia text. I have cleaned up
all unwanted lines. I have approximately 36M words. I split the text into
90:10 proportions. Then, from the 90% portion, I split again into 4 joint
training sets of increasing size (the largest is about 1M sentences).

The commands I used are the following:

1. Count n-gram and vocabulary:
ngram-count -text 1M -order 3 -write count.1M -write-vocab vocab.1M -unk

2. Build LM with ModKN:
ngram-count -vocab vocab.1M -read count.1M -order 3 -lm kn.lm -kndiscount

3. Calculate perplexity:
ngram -ppl test -order 3 -lm kn.lm

My questions are:
1. Did I do it right?
2. Is there any optimization I can do in building the LM?
3. How do I calculate perplexity in log base 2 instead of log base 10?

Thanks in advance.

Ismail
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at icsi.berkeley.edu Mon Apr 28 16:20:26 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 28 Apr 2014 16:20:26 -0700
Subject: [SRILM User List] Right way to build LM
In-Reply-To: <535E26EA.30804@gmail.com>
References: <535E26EA.30804@gmail.com>
Message-ID: <535EE23A.1080400@icsi.berkeley.edu>

On 4/28/2014 3:01 AM, Ismail Rusli wrote:
> [...]
>
> The commands I used are the following:
>
> 1.
Count n-gram and vocabulary:
> ngram-count -text 1M -order 3 -write count.1M -write-vocab vocab.1M -unk
>
> 2. Build LM with ModKN:
> ngram-count -vocab vocab.1M -read count.1M -order 3 -lm kn.lm -kndiscount

There is no need to specify -vocab if you are getting it from the same
training data as the counts. The use of -vocab is to specify a vocabulary
that differs from that of the training data. In fact you can combine 1
and 2 in one command that is equivalent:

ngram-count -text 1M -order 3 -unk -lm kn.lm -kndiscount

Also, if you do use two steps, be sure to include the -unk option in the
second step.

> 3. Calculate perplexity:
> ngram -ppl test -order 3 -lm kn.lm
>
> My questions are:
> 1. Did I do it right?

It looks like you did.

> 2. Is there any optimization I can do in building the LM?

a. Try different -order values
b. Different smoothing methods.
c. Possibly class-based models (interpolated with word-based)
d. If you want to increase training data size significantly, check the
methods for conserving memory on the FAQ page.

> 3. How do I calculate perplexity in log base 2 instead of log base 10?

Perplexity is not dependent on the base of the logarithm (the log base is
matched by the number you exponentiate to get the ppl).

Andreas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ismail.indonesia at gmail.com Mon Apr 28 19:38:44 2014
From: ismail.indonesia at gmail.com (Ismail Rusli)
Date: Tue, 29 Apr 2014 09:38:44 +0700
Subject: [SRILM User List] Right way to build LM
In-Reply-To: <535EE23A.1080400@icsi.berkeley.edu>
References: <535E26EA.30804@gmail.com> <535EE23A.1080400@icsi.berkeley.edu>
Message-ID: <535F10B4.5060101@gmail.com>

Thanks for the answer, Andreas.

As I read in the paper by Chen and Goodman (1999), they used held-out
data to optimize the parameters in the language model. How do I do this
in SRILM? Does SRILM optimize parameters when I use -kndiscount?
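For reference, the discounts that -kndiscount applies are closed-form functions of the counts-of-counts (the Chen & Goodman formulas Andreas cites in his reply below). A minimal Python sketch, with made-up counts-of-counts purely for illustration:

```python
# Hedged sketch of the closed-form modified Kneser-Ney discounts used by
# ngram-count -kndiscount (Chen & Goodman's equations (26)). n1..n4 are
# counts-of-counts: the number of n-grams seen exactly 1..4 times.
def mkn_discounts(n1, n2, n3, n4):
    y = n1 / (n1 + 2.0 * n2)
    d1 = 1.0 - 2.0 * y * n2 / n1      # discount for n-grams with count == 1
    d2 = 2.0 - 3.0 * y * n3 / n2      # discount for count == 2
    d3 = 3.0 - 4.0 * y * n4 / n3      # discount for counts >= 3
    return d1, d2, d3

# sample counts-of-counts, made up for illustration
d1, d2, d3 = mkn_discounts(n1=1000, n2=400, n3=200, n4=100)
```

Nothing is tuned on held-out data here: the discounts come directly from the training counts, which is the point of Andreas's answer.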
I tried -kn to save the parameters in a file and included this file when
building the LM, but it turned out my perplexity got bigger.

And just one more: do you have a link to a good tutorial on using
class-based models with SRILM?

Ismail

On 04/29/2014 06:20 AM, Andreas Stolcke wrote:
> [...]
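Andreas's point that perplexity does not depend on the log base can be checked against the first sentence's output earlier in this thread. A minimal Python sketch; the recomputed values agree with the reported ppl figures up to the rounding of the printed logprob:

```python
import math

# Recompute ngram's ppl figures for "resumption of the session" from its
# reported total log10 probability, then redo the same computation in
# base 2 to show the base cancels out.
logprob10 = -16.2745          # logprob reported by ngram (log10)
words, sents = 4, 1           # "5 words+sents" in the output: 4 words + </s>

ppl = 10 ** (-logprob10 / (words + sents))   # ppl (denominator includes </s>)
ppl1 = 10 ** (-logprob10 / words)            # ppl1 (words only)

# Same computation in base 2: convert the log prob, exponentiate with 2.
logprob2 = logprob10 * math.log2(10)
ppl_base2 = 2 ** (-logprob2 / (words + sents))
```

ppl comes out near the reported 1798.46 and ppl1 near 11711.9, and ppl_base2 equals ppl to floating-point precision: changing the log base changes logprob, but the matching exponentiation base gives the same perplexity.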
> > Andreas
> -------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at icsi.berkeley.edu Tue Apr 29 23:39:04 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 29 Apr 2014 23:39:04 -0700
Subject: [SRILM User List] Right way to build LM
In-Reply-To: <535F10B4.5060101@gmail.com>
References: <535E26EA.30804@gmail.com> <535EE23A.1080400@icsi.berkeley.edu> <535F10B4.5060101@gmail.com>
Message-ID: <53609A88.1050204@icsi.berkeley.edu>

On 4/28/2014 7:38 PM, Ismail Rusli wrote:
> Thanks for the answer, Andreas.
>
> As I read in the paper by Chen and Goodman (1999), they used held-out
> data to optimize the parameters in the language model. How do I do this
> in SRILM? Does SRILM optimize parameters when I use -kndiscount?

SRILM just uses the formulas for estimating the discounts from the
count-of-counts, i.e., equations (26) in the Chen & Goodman technical
report.

> I tried -kn to save the parameters in a file and included this file
> when building the LM, but it turned out my perplexity got bigger.

You can save the discounting parameters using:

1) ngram-count -read COUNTS -kndiscount -kn1 K1 -kn2 K2 -kn3 K3
   (no -lm argument!)

Then you can read them back in for LM estimation using

2) ngram-count -read COUNTS -kndiscount -kn1 K1 -kn2 K2 -kn3 K3 -lm LM

and the result will be identical to the second command when run without
the -kn1/2/3 options.

Now, if you want, you can manipulate the discounting parameters before
invoking command 2. For example, you could perform a search over the D1,
D2, D3 parameters optimizing perplexity on a held-out set, just like C&G
did. But you have to implement that search yourself by writing some
wrapper scripts.

Also consider the interpolated version of KN smoothing. Just add the
ngram-count -interpolate option; it usually gives slightly better
results.

> And just one more: do you have a link to a good tutorial on using
> class-based models with SRILM?
There is a basic tutorial at
http://ssli.ee.washington.edu/ssli/people/sarahs/srilm.html .

Andreas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ismail.indonesia at gmail.com Tue Apr 29 23:53:39 2014
From: ismail.indonesia at gmail.com (Ismail Rusli)
Date: Wed, 30 Apr 2014 13:53:39 +0700
Subject: [SRILM User List] Right way to build LM
In-Reply-To: <53609A88.1050204@icsi.berkeley.edu>
References: <535E26EA.30804@gmail.com> <535EE23A.1080400@icsi.berkeley.edu> <535F10B4.5060101@gmail.com> <53609A88.1050204@icsi.berkeley.edu>
Message-ID: <53609DF3.4080602@gmail.com>

Right, thanks Andreas. It's getting clearer to me now.

Regards,
Ismail

On 04/30/2014 01:39 PM, Andreas Stolcke wrote:
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From asosimi at unilag.edu.ng Fri May 9 10:34:14 2014
From: asosimi at unilag.edu.ng (ADEYANJU A. sosimi)
Date: Fri, 9 May 2014 18:34:14 +0100 (WAT)
Subject: [SRILM User List] make all error
Message-ID: <1126106028.8818.1399656854435.JavaMail.root@unilag.edu.ng>

I am new to SRILM. On running the "make all" command, the error below was
displayed; I verified the packages and everything was in order. Kindly
help provide an insight into resolving this bug. Thanks as I await your
kind response.

/srilm/common/Makefile.common.targets:85: recipe for target '../obj/cygwin/version.o' failed
make[1]: *** [../obj/cygwin/version.o] Error 1
make[1]: Leaving directory '/cygdrive/c/cygwin/srilm/misc/src'
Makefile:106: recipe for target 'all' failed
make: *** [all] Error 1

From catherine.gasnier at epfl.ch Wed May 14 05:42:31 2014
From: catherine.gasnier at epfl.ch (Catherine Gasnier)
Date: Wed, 14 May 2014 14:42:31 +0200
Subject: [SRILM User List] Fwd: installation on mac lion error on ngram class
In-Reply-To: 
References: 
Message-ID: 

I had almost the same problem. You probably have two versions of the
iconv library which don't use exactly the same symbols; this can happen
if you have one version installed with MacPorts. For some reason, the
build does not link against the library corresponding to the header it
has included, so there is a symbol mismatch. Right after the error
message you copied, there must be the exact command that failed, which
may help figure it out.
You can do a 'find / -name "*iconv*"' to find where your (possibly
duplicate) iconv libraries are. In any case I would recommend having just
one copy, or just not using MacPorts but Homebrew, which does not
duplicate libraries.

Hope that helps.
Catherine

*mohsen jadidi* mohsen.jadidi at gmail.com
> *Wed Mar 13 08:03:20 PDT 2013*
> ------------------------------
>
> I managed to run my program after installing the srilm
> with NO_ICONV=anything . Otherwise it keeps giving me this error:
>
> Undefined symbols for architecture x86_64:
>   "_iconv", referenced from:
>       File::fgetsUTF8(char*, int) in libmisc.a(File.o)
>   "_iconv_close", referenced from:
>       File::reopen(char const*) in libmisc.a(File.o)
>       File::reopen(std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, int) in libmisc.a(File.o)
>       File::reopen(char const*, unsigned long, int) in libmisc.a(File.o)
>       File::reopen(char const*, char const*) in libmisc.a(File.o)
>       File::~File() in libmisc.a(File.o)
>       File::~File() in libmisc.a(File.o)
>   "_iconv_open", referenced from:
>       File::fgetsUTF8(char*, int) in libmisc.a(File.o)
> ld: symbol(s) not found for architecture x86_64
> collect2: ld returned 1 exit status
>
> I read some pages, and they suggested the problem comes from having
> different versions of iconv (the Mac default library and the MacPorts
> installation). I couldn't work out how to fix the problem. I tried to
> set ADDITIONAL_CFLAGS = /usr/lib/libiconv.2.dylib but it didn't work.
> Here is some information you might find useful.
> For the default version in /usr/lib I have:
>
> ls -ll libiconv*
> lrwxr-xr-x 1 root wheel 16 Oct 8 2011 libiconv.2.4.0.dylib -> libiconv.2.dylib
> -r-xr-xr-x 1 root wheel 2105216 Oct 8 2011 libiconv.2.dylib
> lrwxr-xr-x 1 root wheel 20 Oct 8 2011 libiconv.dylib -> libiconv.2.4.0.dylib
>
> none of them have (*) to indicate which one my compiler uses?!
Also, > > nm libiconv.2.dylib | grep iconv > 00000000000f1af0 S ___iconv_2VersionNumber > 00000000000f1b90 S ___iconv_2VersionString > 00000000000f60f0 D __libiconv_version > 000000000000a1e1 T _iconv > 000000000000a5a0 T _iconv_canonicalize > 0000000000013164 T _iconv_close > 0000000000013171 T _iconv_open > 000000000000a72c T _iconvctl > 000000000000a20f T _iconvlist > 0000000000014dbd T _libiconv_relocate > 0000000000014cff T _libiconv_set_relocation_prefix > > For Macport version I have : > > -rw-r--r-- 1 root admin 1072264 Apr 4 2012 libiconv.2.dylib > -rw-r--r-- 1 root admin 1098856 Apr 4 2012 libiconv.a > lrwxr-xr-x 1 root admin 16 Apr 4 2012 libiconv.dylib -> > libiconv.2.dylib > -rw-r--r-- 1 root admin 914 Apr 4 2012 libiconv.la > > and also: > > nm libiconv.a | grep iconv > libiconv.a(iconv.o): > 0000000000016780 D __libiconv_version > 000000000000ac10 T _iconv_canonicalize > 00000000000f9908 S _iconv_canonicalize.eh > 000000000000a810 T _libiconv > 00000000000f97d0 S _libiconv.eh > 00000000000159f0 T _libiconv_close > 00000000000fa6c0 S _libiconv_close.eh > 0000000000015a00 T _libiconv_open > 00000000000fa6f0 S _libiconv_open.eh > 0000000000014950 T _libiconv_open_into > 00000000000fa518 S _libiconv_open_into.eh > 000000000000adc0 T _libiconvctl > 00000000000f9940 S _libiconvctl.eh > 000000000000a850 T _libiconvlist > 00000000000f9830 S _libiconvlist.eh > libiconv.a(localcharset.o): > libiconv.a(relocatable.o): > 00000000000000c0 T _libiconv_relocate > 00000000000001d0 S _libiconv_relocate.eh > 0000000000000000 T _libiconv_set_relocation_prefix > 0000000000000198 S _libiconv_set_relocation_prefix.eh > > > Do you have any suggestion ? > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From asosimi at unilag.edu.ng Thu May 15 10:16:03 2014 From: asosimi at unilag.edu.ng (ADEYANJU A. 
sosimi) Date: Thu, 15 May 2014 18:16:03 +0100 (WAT) Subject: [SRILM User List] Interpreting ngram.lm file In-Reply-To: <5370FE56.4070207@icsi.berkeley.edu> Message-ID: <1077097162.280046.1400174163579.JavaMail.root@unilag.edu.ng> Dear all, Have just generated a LM file but I need your tips in interpreting the results. See some extract of .lm file below. Anticipating a quick response. Kindest regards \data\ ngram 1=16536 ngram 2=7998 ngram 3=3200 \1-grams: -0.8297427 -99 -0.1990616 -2.382785 a -0.3606565 -4.994339 aabo -4.994339 aad??ta -3.596113 aago -0.1141468 -4.994339 aaw?? -4.994339 aaye? -4.994339 aay?? -4.994339 aa?re? -4.994339 aa?ru?n -4.994339 aa?r? -4.994339 aa?ra?bi?nrin -4.83583 aba -4.994339 abaja?de -4.477304 abala -4.630644 abara -0.6027466 -4.994339 abaram?j? -4.994339 aba? -4.994339 aba?mi? \2-grams: -2.143121 a -0.0767099 -3.407348 aago 0.003341295 -3.833085 ab?nugan 0.008342609 -4.054651 ad? -3.622831 ad?b??w?l? -3.833085 ad?gb?t?? -4.054651 ad?g?t?? -4.054651 ad?g?k? -3.407348 ad?k??y? -4.054651 ad?n?y? -3.774448 ad?f?l? -3.269046 agba?ra -2.968016 agbe?gbe? -4.054651 agbo -3.407348 agb?ra -3.774448 agb?gb? -3.310438 agb??gil?re -4.054651 age?mo? -3.134347 agogo -0.0162892 -3.833085 ajo -3.356196 ??p??l?p?? -3.407348 ??r?? -0.1759373 -2.306117 ??r?? 0.03070859 -1.854306 a -2.501609 a ba? -1.104184 a b? -0.05966994 -2.501609 a d? -2.501609 a fagb?ra -0.07674935 -1.502123 a fi -0.05625019 -2.501609 a fipa? -1.803154 a f?? -2.069788 a gb?? -0.05816576 -2.501609 a j? -1.643453 a k? -0.3194264 -2.221406 a k?? -1.502123 a k? -0.1097101 -2.280043 a le -1.957232 a le? -1.611268 a l? -1.957232 a l?h?n 0.1669339 -1.854306 a l? -1.414973 a m?a -0.05020244 -2.501609 a m?? -2.501609 a m? -2.501609 a m? -2.280043 a m?? -1.206611 aj? ? -0.2693426 -0.3606157 aj? -1.335277 aj? ?s?n -1.335277 aj? ?m? 0.1278539 -0.3462729 aka?po? i?j?ba -0.8581563 akitiyan -0.3462729 akoni t?b? -0.4431829 ako??we? a?gba? -0.8037053 ak??k???? -1.589311 ak??k???? ni? 
-1.589311 ak??k???? n? -1.589311 ak??k???? so? -1.367745 ak??k???? yor?b? -0.1815158 -1.367745 ak??k???? y?? -1.589311 ak??k???? ?ti -1.367745 ak??k???? ?d? -0.1630108 -0.2213341 aladani k?k?k? -0.3462729 alaga ?gb?? -0.7530956 ala?ga i?j?ba -0.9746618 ala?ga ?gb?? -0.1574904 ala?gba?ra -0.2213341 ala?papo?? ja?de -0.4768893 ale?? olu?wa -0.353785 aloh?n -0.9746618 al?gb? -0.9746618 al?gb? b?s?????b? -0.9746618 al?gb? l?j?d? 0.1278539 -0.8233941 al?d??ni k?k?k? -0.4785899 al?d??r? -0.1247067 al?gb?? -0.2216167 al?gb?d? in? -0.2213341 al?g? ?jo?ba -0.4431829 al?k????b??r?? \3-grams: -1.763766 a fi -2.02365 a gb?? -1.167317 a k? -1.614281 a k?? -1.483251 a k? -1.763766 a le? -1.483251 a l? -2.02365 a m?a -2.02365 a m?? -1.614281 a s? -1.614281 a ti -1.614281 a ? -1.088136 a ? -0.9368684 a ? -2.02365 a ? -2.02365 a ?e -0.5383909 b? a b? -1.436117 b? a j? -1.026747 b? a s??r?? -1.436117 b? a w? -0.4658403 b? a b? -1.026747 b? a k? -1.026747 b? a s?e -1.026747 b? a ti -1.436117 b? a ? -0.3861521 d? a l?h?n -1.114962 ni a fi -1.374847 ni a f?? -1.374847 ni a ti -1.198756 p? a m?a -0.9388713 p? a ? -1.198756 p? a ? -1.271307 ti? a ti -0.6378412 t? a b? -0.7056006 t? a b? -1.266508 t? a fi -0.7594231 ?w?r?n ad?l?w?? -1.032424 agogo m?r?n -0.4995385 ??r?? aje? -0.9355144 aj? ? -0.1610316 or? aj? -0.7725398 o?r?? aj? -0.4583931 aka?po? i?j?ba -0.4583931 ?re akoni t?b? -1.178552 ak??k???? y?? -0.7773587 ?wo?n ak??k???? -1.361483 ?wo?n ak??k???? so? -1.361483 ?wo?n ak??k???? yor?b? -1.361483 ?wo?n ak??k???? ?d? -0.8563331 ala?ga i?j?ba -0.4995385 ew? aloh?n -0.3334544 l?b?? al?gb? b?s?????b? -0.6378412 e?g?n al?r? -0.2264513 k?b?y?s? al?s?e? ?kej? -0.2251149 ?b?t? al?w?? m??rin -0.3334544 ?y?nb? al?w?? funfun -0.2954186 k?b?y?s? al?y?l?w? o?ba -0.2190236 ok??w? al??dani k?r?je -0.3334544 oko?w? al??d?ni k?r?je -0.3334544 gb? al??? 
-0.3645163 gan an

From stolcke at icsi.berkeley.edu Thu May 15 11:13:28 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 15 May 2014 11:13:28 -0700
Subject: [SRILM User List] Interpreting ngram.lm file
In-Reply-To: <1077097162.280046.1400174163579.JavaMail.root@unilag.edu.ng>
References: <1077097162.280046.1400174163579.JavaMail.root@unilag.edu.ng>
Message-ID: <537503C8.40306@icsi.berkeley.edu>

Try searching for "srilm ngram-format".

Andreas

On 5/15/2014 10:16 AM, ADEYANJU A. sosimi wrote:
> Dear all,
>
> I have just generated an LM file, but I need your tips on interpreting
> the results. See an extract of the .lm file below. Anticipating a quick
> response. Kindest regards
>
> \data\
> ngram 1=16536
> ngram 2=7998
> ngram 3=3200
>
> \1-grams:
>

From amontalvo at cenatav.co.cu Thu May 15 11:32:15 2014
From: amontalvo at cenatav.co.cu (Ana)
Date: Thu, 15 May 2014 14:32:15 -0400
Subject: [SRILM User List] problems building SRI tools
Message-ID: <5375082F.4050109@cenatav.co.cu>

Hi all:

I've just started with SRILM, and I wonder if it is possible to build it
on an x86_64 platform. Should I create a file
common/Makefile.machine.x86_64? Has someone tried to do this before?

thx in advance
Ana Montalvo

From stolcke at icsi.berkeley.edu Thu May 15 14:06:17 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 15 May 2014 14:06:17 -0700
Subject: [SRILM User List] problems building SRI tools
In-Reply-To: <5375082F.4050109@cenatav.co.cu>
References: <5375082F.4050109@cenatav.co.cu>
Message-ID: <53752C49.1070609@icsi.berkeley.edu>

On 5/15/2014 11:32 AM, Ana wrote:
> Hi all:
> I've just started with SRILM, and I wonder if it is possible to build
> it on an x86_64 platform.
> Should I create a file common/Makefile.machine.x86_64?
> Has someone tried to do this before?

MACHINE_TYPE=i686-m64 is for x86_64 Linux machines. It should be detected
automatically if you type "make". If not, use

make MACHINE_TYPE=i686-m64 ...
Andreas

From yifenliu at gmail.com Fri May 16 01:51:55 2014 From: yifenliu at gmail.com (Liu yifen) Date: Fri, 16 May 2014 16:51:55 +0800 Subject: [SRILM User List] problem after srilm installation Message-ID:

Dear Sir,

SRILM is now successfully installed within the Cygwin environment. However, when I try to recompile with *make all*, it fails with the error message below. The test command *make test* works fine. The currently installed gcc/g++ compiler is 4.8.2. Can anyone help me resolve this problem?

$ make all
for subdir in misc dstruct lm flm lattice utils; do \
	(cd $subdir/src; make SRILM=/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= all) || exit 1; \
done
make[1]: Entering directory `/srilm/misc/src'
make[1]: Nothing to be done for `all'.
make[1]: Leaving directory `/srilm/misc/src'
make[1]: Entering directory `/srilm/dstruct/src'
g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -c -g -O2 -o ../obj/cygwin/testArray.o testArray.cc
g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -L../../lib/cygwin -g -O2 -o ../bin/cygwin/testArray.exe ../obj/cygwin/testArray.o ../obj/cygwin/libdstruct.a -lm ../../lib/cygwin/libmisc.a -ltcl -lm -liconv
test -f ../bin/cygwin/testArray.exe
g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -c -g -O2 -o ../obj/cygwin/testMap.o testMap.cc
g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -L../../lib/cygwin -g -O2 -o ../bin/cygwin/testMap.exe ../obj/cygwin/testMap.o ../obj/cygwin/libdstruct.a -lm ../../lib/cygwin/libmisc.a -ltcl -lm -liconv
test -f ../bin/cygwin/testMap.exe
g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -c -g -O2 -o ../obj/cygwin/benchHash.o benchHash.cc
g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I.
-I../../include -L../../lib/cygwin -g -O2 -o ../bin/cygwin/benchHash.exe ../obj/cygwin/benchHash.o ../obj/cygwin/libdstruct.a -lm ../../lib/cygwin/libmisc.a -ltcl -lm -liconv
test -f ../bin/cygwin/benchHash.exe
g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -c -g -O2 -o ../obj/cygwin/testHash.o testHash.cc
g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -L../../lib/cygwin -g -O2 -o ../bin/cygwin/testHash.exe ../obj/cygwin/testHash.o ../obj/cygwin/libdstruct.a -lm ../../lib/cygwin/libmisc.a -ltcl -lm -liconv
test -f ../bin/cygwin/testHash.exe
g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -c -g -O2 -o ../obj/cygwin/testSizes.o testSizes.cc
g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -L../../lib/cygwin -g -O2 -o ../bin/cygwin/testSizes.exe ../obj/cygwin/testSizes.o ../obj/cygwin/libdstruct.a -lm ../../lib/cygwin/libmisc.a -ltcl -lm -liconv
test -f ../bin/cygwin/testSizes.exe
g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -c -g -O2 -o ../obj/cygwin/testCachedMem.o testCachedMem.cc
g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -L../../lib/cygwin -g -O2 -o ../bin/cygwin/testCachedMem.exe ../obj/cygwin/testCachedMem.o ../obj/cygwin/libdstruct.a -lm ../../lib/cygwin/libmisc.a -ltcl -lm -liconv
test -f ../bin/cygwin/testCachedMem.exe
g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -c -g -O2 -o ../obj/cygwin/testBlockMalloc.o testBlockMalloc.cc
g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I.
-I../../include -L../../lib/cygwin -g -O2 -o ../bin/cygwin/testBlockMalloc.exe ../obj/cygwin/testBlockMalloc.o ../obj/cygwin/libdstruct.a -lm ../../lib/cygwin/libmisc.a -ltcl -lm -liconv
test -f ../bin/cygwin/testBlockMalloc.exe
g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -c -g -O2 -o ../obj/cygwin/testMap2.o testMap2.cc
testMap2.cc: In function 'int Delete(ClientData, Tcl_Interp*, int, char**)':
testMap2.cc:114:11: error: cannot convert 'Boolean {aka bool}' to 'char**' in assignment
 result = myMap2.remove(key1, key2);
        ^
/srilm/common/Makefile.common.targets:93: recipe for target `../obj/cygwin/testMap2.o' failed
make[1]: *** [../obj/cygwin/testMap2.o] Error 1
make[1]: Leaving directory `/srilm/dstruct/src'
Makefile:106: recipe for target `all' failed
make: *** [all] Error 1

yifenliu

From stolcke at icsi.berkeley.edu Fri May 16 08:11:18 2014 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 16 May 2014 08:11:18 -0700 Subject: [SRILM User List] problem after srilm installation In-Reply-To: References: Message-ID: <53762A96.8090305@icsi.berkeley.edu>

On 5/16/2014 1:51 AM, Liu yifen wrote:
> Dear Sir,
>
> SRILM is now successfully installed within the Cygwin environment. However, when I try to recompile with /make all/, it fails with the error message below. The test command /make test/ works fine. The currently installed gcc/g++ compiler is 4.8.2. Can anyone help me resolve this problem?
>

This is a known problem with one of the test programs. However, it doesn't affect the functioning of the main programs. So just don't run "make all". (The installation instructions don't require it.) The problem is fixed in the current beta release.
Andreas

From csponay at gmail.com Wed May 21 06:08:43 2014 From: csponay at gmail.com (charmaine ponay) Date: Wed, 21 May 2014 06:08:43 -0700 Subject: [SRILM User List] possible source of error Message-ID:

Hi, what could be a possible source of error when I get these messages after executing the command make World?

srilm_iconv.h:15:25: fatal error: iconv.h: No such file or directory
 # include_next
 ^
compilation terminated.
/srilm/common/Makefile.common.targets:93: recipe for target '../obj/cygwin/File.o' failed
make[2]: *** [../obj/cygwin/File.o] Error 1
make[2]: Leaving directory '/srilm/misc/src'
Makefile:107: recipe for target 'release-libraries' failed
make[1]: *** [release-libraries] Error 1
make[1]: Leaving directory '/srilm'
Makefile:56: recipe for target 'World' failed
make: *** [World] Error 2

Regards,
*Charmaine Salvador - Ponay*
Instructor
Information and Computer Studies Dept.
University of Santo Tomas
From stolcke at icsi.berkeley.edu Wed May 21 12:53:42 2014 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 21 May 2014 12:53:42 -0700 Subject: [SRILM User List] possible source of error In-Reply-To: References: Message-ID: <537D0446.4000401@icsi.berkeley.edu>

On 5/21/2014 6:08 AM, charmaine ponay wrote:
> hi what could be a possible source of error when i get these messages after i execute the command make World
>
> srilm_iconv.h:15:25: fatal error: iconv.h: No such file or directory
> # include_next

Start the Cygwin setup tool. When you get to the "select packages" screen, search for "iconv" and select the two matching packages (libiconv and libiconv2) for installation. Then rebuild SRILM.

Andreas

From csponay at gmail.com Wed May 21 19:54:01 2014 From: csponay at gmail.com (charmaine ponay) Date: Wed, 21 May 2014 19:54:01 -0700 Subject: [SRILM User List] possible source of error In-Reply-To: References: Message-ID:

Hi, I think I was already "successful" with the make World command.
I am, however, getting this message after the make all command. Please help (again). Thanks

g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -c -g -O2 -o ../obj/cygwin/testMap2.o testMap2.cc
testMap2.cc: In function 'int Delete(ClientData, Tcl_Interp*, int, char**)':
testMap2.cc:114:11: error: cannot convert 'Boolean {aka bool}' to 'char**' in assignment
 result = myMap2.remove(key1, key2);
        ^
/srilm/common/Makefile.common.targets:93: recipe for target '../obj/cygwin/testMap2.o' failed
make[1]: *** [../obj/cygwin/testMap2.o] Error 1
make[1]: Leaving directory '/srilm/dstruct/src'
Makefile:107: recipe for target 'all' failed
make: *** [all] Error 1

Regards,
*Charmaine Salvador - Ponay*
Instructor
Information and Computer Studies Dept.
University of Santo Tomas

From gragnkedir at gmail.com Fri May 23 00:03:50 2014 From: gragnkedir at gmail.com (Gragn Kedir) Date: Thu, 22 May 2014 23:03:50 -0800 Subject: [SRILM User List] Hi Message-ID:

My system is Ubuntu 14.04 ...
64-bit, in a virtual machine, machine type i686. I am new to machine translation, and while installing Moses I ran into difficulty at the following SRILM step:

7c7
< # SRILM = /home/speech/stolcke/project/srilm/devel
---
> SRILM = /home/jschroe1/demo/tools/srilm

I don't understand what this means. Also, I can't download Moses. When I run

svn co https://svn.sourceforge.net/svnroot/mosesdecoder/trunk mosesdecoder

the error is

svn: E000111: Unable to connect to a repository at URL 'https://svn.sourceforge.net/svnroot/mosesdecoder/trunk'
svn: E000111: Error running context: Connection refused

and when I run

git clone git://github.com/moses-smt/mosesdecoder.git moses

the error is

Cloning into 'moses'...
fatal: unable to connect to github.com: github.com[0: 67.215.65.132]: errno=Connection refused

Please help me. I am working on Ubuntu 14.04 ... 64-bit, in a virtual machine.

From thipem at gmail.com Fri Jun 6 04:41:00 2014 From: thipem at gmail.com (Thipe Modipa) Date: Fri, 6 Jun 2014 13:41:00 +0200 Subject: [SRILM User List] unknown Word found in LM Message-ID:

Hi,

I am decoding utterances with a dictionary containing about 800 unique words against a language model containing about 63K unique words, and I get the following warning:

WARNING [-9999] ReadARPAunigram: unknown Word 'h' found in LM -- ignored in HDecode

Will this warning have a negative impact on word recognition accuracy, or what is its general effect?

Thanks
Thipe
From tonyr at cantabresearch.com Fri Jun 6 04:47:19 2014 From: tonyr at cantabresearch.com (Tony Robinson) Date: Fri, 06 Jun 2014 12:47:19 +0100 Subject: [SRILM User List] unknown Word found in LM In-Reply-To: References: Message-ID: <5391AA47.3000001@cantabResearch.com>

On 06/06/2014 12:41 PM, Thipe Modipa wrote:
> Hi,
>
> I am decoding utterances with a dictionary containing about 800 unique words with a language model containing about 63K unique words and I get the following warning:
>
> WARNING [-9999] ReadARPAunigram: unknown Word 'h' found in LM -- ignored in HDecode
>
> Will this warning have a negative impact on the word recognition accuracy, or what is the general effect?
>
> Thanks
>
> Thipe

This is far more likely to be an HTK problem than an SRILM problem. You need to check whether you really do have a word 'h' in your language model and whether it has an associated entry in your pronunciation dictionary. Chances are that it is in the language model but not in the pronunciation dictionary, so HDecode is simply ignoring it. This won't have a big impact; you probably didn't want to recognise the word anyway.

Tony

--
** Cantab is hiring: www.cantabResearch.com/openings **
Dr A J Robinson, Founder, Cantab Research Ltd
Phone direct: 01223 778240 office: 01223 794497
Company reg no GB 05697423, VAT reg no 925606030
51 Canterbury Street, Cambridge, CB4 3QG, UK

From stolcke at icsi.berkeley.edu Mon Jun 9 14:38:08 2014 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Mon, 09 Jun 2014 14:38:08 -0700 Subject: [SRILM User List] FW: question about SRILM non-events feature In-Reply-To: <60dd107450de493fb644f40672957b0e@BL2PR03MB193.namprd03.prod.outlook.com> References: <60dd107450de493fb644f40672957b0e@BL2PR03MB193.namprd03.prod.outlook.com> Message-ID: <53962940.1080405@icsi.berkeley.edu>

> *From:* K.
Richardson [mailto:kazimir.richardson at gmail.com]
> *Sent:* Monday, June 9, 2014 3:56 AM
> *To:* Andreas Stolcke
> *Subject:* question about SRILM non-events feature
>
> Hi Andreas,
>
> I apologize if there is some other official channel for asking SRILM technical questions (I tried writing to the srilm mailing list, but it bounced).

You need to join the mailing list to be able to post questions.

> I am using SRILM as a black box in an MT system. I am trying to build an LM that enforces that every sequence starts with some default value, e.g. X, such that X never occurs elsewhere in some other n-gram.

So do you want to (1) force X to always occur after <s>, or do you want to (2) prevent it from occurring elsewhere, or both?

You can do (1) by manipulating the conditional probability of the bigram <s> X to be 1, and 0 for all other bigrams starting with <s>. You can do (2) by giving X a unigram probability of 0 and having it not occur in any other ngrams (other than those starting with <s>). The zero probability prevents X from getting probability via backoff. After you manipulate the probabilities you should use ngram -renorm to recompute the backoff weights.

> Is it possible to enforce this? Is this within the purview of what the -nonevents option does? I have been having a hard time understanding how this option works, and specifically how you specify the associated non-events file.

Non-events are tags like <s> that are not predicted by the LM but that can occur in the history (context) portion of an N-gram to condition the next word. It doesn't sound like that's what you want here.

Andreas
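The zeroing step above can be sketched concretely. This is a hypothetical illustration: the helper name `zero_unigram`, the word X, and the toy ARPA fragment are all made up; -99 is the value SRILM conventionally writes in ARPA files to stand in for a log10 probability of minus infinity:

```shell
# Sketch: force the unigram probability of a word to (effectively) zero by
# rewriting its \1-grams entry to -99, the ARPA stand-in for log10(0).
# zero_unigram is a made-up helper, not an SRILM command.
zero_unigram() {    # usage: zero_unigram WORD < in.lm > out.lm
    awk -v w="$1" '
        /^\\1-grams:/ { in_uni = 1 }       # entering the unigram section
        /^\\2-grams:/ { in_uni = 0 }       # past the unigram section
        in_uni && $2 == w { $1 = "-99" }   # zero out the entry for WORD
        { print }
    '
}

# Demo on a toy ARPA fragment; the X line becomes: -99 X -0.2
printf '%s\n' '\1-grams:' '-0.5 X -0.2' '-0.3 foo -0.1' '\2-grams:' | zero_unigram X
```

After editing a real model this way, the backoff weights would be recomputed as suggested above, e.g. with `ngram -lm patched.lm -renorm -write-lm final.lm`.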
From k.nebhi at sheffield.ac.uk Wed Jun 11 06:00:19 2014 From: k.nebhi at sheffield.ac.uk (kamel nebhi) Date: Wed, 11 Jun 2014 14:00:19 +0100 Subject: [SRILM User List] stdin and disambig tool Message-ID:

Hello,

I'm trying to use the disambig tool, but instead of passing a file as an argument I need to send a string on STDIN. I tried this, but it doesn't work:

cat <<< "hello" | disambig -lm /Users/home/Truecaser_Joshua/model/lm/training.5gram.europarl.lm -keep-unk -order 5 -map /Users/home/Truecaser_Joshua/model/lm/true-case.map

and

echo "hello" | disambig -lm /Users/home/Truecaser_Joshua/model/lm/training.5gram.europarl.lm -keep-unk -order 5 -map /Users/home/Truecaser_Joshua/model/lm/true-case.map

How can I do this?

Best

--
Kamel Nebhi
Visiting Academics - PhD student
Room G35
Department of Computer Science
University of Sheffield
UK

From stolcke at icsi.berkeley.edu Wed Jun 11 09:11:57 2014 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 11 Jun 2014 09:11:57 -0700 Subject: [SRILM User List] stdin and disambig tool In-Reply-To: References: Message-ID: <53987FCD.2040907@icsi.berkeley.edu>

You need to use "-" as the argument to the option for specifying the input file:

echo hello | disambig -text - [other options ...]
Andreas

> Best
>
> --
> Kamel Nebhi
> Visiting Academics - PhD student
> Room G35
> Department of Computer Science
> University of Sheffield
> UK
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

From k.nebhi at sheffield.ac.uk Wed Jun 11 11:07:13 2014 From: k.nebhi at sheffield.ac.uk (kamel nebhi) Date: Wed, 11 Jun 2014 19:07:13 +0100 Subject: [SRILM User List] stdin and disambig tool In-Reply-To: <53987FCD.2040907@icsi.berkeley.edu> References: <53987FCD.2040907@icsi.berkeley.edu> Message-ID:

Thanks Andreas, it works, but it's very slow. It seems that the STDIN argument is slower than the file argument... Is that just a feeling, or is there anything I can do to solve it?

Best

2014-06-11 17:11 GMT+01:00 Andreas Stolcke :
> On 6/11/2014 6:00 AM, kamel nebhi wrote:
> > Hello,
> > I'm trying to use the disambig tool, but instead of passing a file as an argument I need to send a string on STDIN. I tried this, but it doesn't work:
> > cat <<< "hello" | disambig -lm /Users/home/Truecaser_Joshua/model/lm/training.5gram.europarl.lm -keep-unk -order 5 -map /Users/home/Truecaser_Joshua/model/lm/true-case.map
> > and
> > echo "hello" | disambig -lm /Users/home/Truecaser_Joshua/model/lm/training.5gram.europarl.lm -keep-unk -order 5 -map /Users/home/Truecaser_Joshua/model/lm/true-case.map
> > How can I do this?
>
> You need to use "-" as the argument to the option for specifying the input file:
>
> echo hello | disambig -text - [other options ...]
> Andreas
>
> Best
>
> --
> Kamel Nebhi
> Visiting Academics - PhD student
> Room G35
> Department of Computer Science
> University of Sheffield
> UK
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

--
Kamel Nebhi
Visiting Academics - PhD student
Room G35
Department of Computer Science
University of Sheffield
UK

From stolcke at icsi.berkeley.edu Wed Jun 11 11:23:53 2014 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 11 Jun 2014 11:23:53 -0700 Subject: [SRILM User List] stdin and disambig tool In-Reply-To: References: <53987FCD.2040907@icsi.berkeley.edu> Message-ID: <53989EB9.5050101@icsi.berkeley.edu>

On 6/11/2014 11:07 AM, kamel nebhi wrote:
> Thanks Andreas, it works, but it's very slow.
> It seems that the STDIN argument is slower than the file argument...
> Is that just a feeling, or is there anything I can do to solve it?

There is no reason it should be slower.

Of course, if you invoke the disambig tool for every line of input, then you're going to incur the startup overhead (reading the LM, for one thing) over and over, and that's going to make it slow.

If you want to generate input lines with "echo" then you need to take care to invoke disambig only once, e.g.,

while true
do
	echo "a line of input"

	# check whether to break the loop etc.

done | disambig -text - other options ....

Hope this helps,

Andreas

From k.nebhi at sheffield.ac.uk Thu Jun 12 07:53:47 2014 From: k.nebhi at sheffield.ac.uk (kamel nebhi) Date: Thu, 12 Jun 2014 15:53:47 +0100 Subject: [SRILM User List] ignore xml markup or other annotations Message-ID:

Dear all,

I'm using the disambig tool, and I want to know whether it is possible to ignore some XML markup or annotations during processing. For example, in the sentence

*I actually live in <PERS>Paris</PERS>*.
I want to ignore the *PERS* tags, but I need to keep them for my evaluation.

Best

--
Kamel Nebhi
Visiting Academics - PhD student
Room G35
Department of Computer Science
University of Sheffield
UK

From k.nebhi at sheffield.ac.uk Thu Jun 12 07:54:08 2014 From: k.nebhi at sheffield.ac.uk (kamel nebhi) Date: Thu, 12 Jun 2014 15:54:08 +0100 Subject: [SRILM User List] stdin and disambig tool In-Reply-To: <53989EB9.5050101@icsi.berkeley.edu> References: <53987FCD.2040907@icsi.berkeley.edu> <53989EB9.5050101@icsi.berkeley.edu> Message-ID:

Thanks a lot Andreas. It's very clear.

Best

--
Kamel Nebhi
Visiting Academics - PhD student
Room G35
Department of Computer Science
University of Sheffield
UK
From stolcke at icsi.berkeley.edu Thu Jun 12 10:38:31 2014 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 12 Jun 2014 10:38:31 -0700 Subject: [SRILM User List] ignore xml markup or other annotations In-Reply-To: References: Message-ID: <5399E597.20501@icsi.berkeley.edu>

On 6/12/2014 7:53 AM, kamel nebhi wrote:
> Dear all,
>
> I'm using the disambig tool, and I want to know whether it is possible to ignore some XML markup or annotations during processing. For example, in the sentence
> /I actually live in <PERS>Paris</PERS>/.
> I want to ignore the /PERS/ tags, but I need to keep them for my evaluation.

SRILM does not do text processing because it is too application-dependent. Instead, most tools support reading/writing to/from stdin/stdout, so you can assemble a pipeline that combines text processing and SRILM tools.

Andreas

From sfischer at ymail.com Fri Jun 13 12:16:05 2014 From: sfischer at ymail.com (Stefan Fischer) Date: Fri, 13 Jun 2014 21:16:05 +0200 Subject: [SRILM User List] Usage of make-big-lm and -interpolate option Message-ID:

Hello,

I read that using make-big-lm is preferable to using ngram-count directly. Even though my corpus is not very big, how do I switch from ngram-count to make-big-lm?

This is what I'm using so far:

ngram-count -order 3 -kndiscount -interpolate -unk -text training.txt -vocab at_least_twice.txt -lm lm.arpa

Is this the right way to use make-big-lm? Do I have to pass more options to ngram-count if I am only interested in counts?

ngram-count -write counts.gz -text training.txt
make-big-lm -read counts.gz -order 3 -kndiscount -interpolate -unk -text training.txt -vocab at_least_twice.txt -lm lm.arpa

My second question is w.r.t. the -interpolate option. I get the following warning several times:

warning: 2.01524e-06 backoff probability mass left for ".
dunno" -- disabling interpolation

Is this just for my information, or is it a sign of bad parameters?

Regards,
Stefan

From stolcke at icsi.berkeley.edu Sun Jun 15 19:07:24 2014 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sun, 15 Jun 2014 19:07:24 -0700 Subject: [SRILM User List] Usage of make-big-lm and -interpolate option In-Reply-To: References: Message-ID: <539E515C.3030503@ICSI.Berkeley.EDU>

On 06/13/2014 12:16 PM, Stefan Fischer wrote:
> Hello,
>
> I read that using make-big-lm is preferable to using ngram-count directly. Even though my corpus is not very big, how do I switch from ngram-count to make-big-lm?
>
> This is what I'm using so far:
> ngram-count -order 3 -kndiscount -interpolate -unk -text training.txt -vocab at_least_twice.txt -lm lm.arpa
>
> Is this the right way to use make-big-lm? Do I have to pass more options to ngram-count if I am only interested in counts?
> ngram-count -write counts.gz -text training.txt
> make-big-lm -read counts.gz -order 3 -kndiscount -interpolate -unk -text training.txt -vocab at_least_twice.txt -lm lm.arpa

You did it right.

> My second question is w.r.t. the -interpolate option.
> I get the following warning several times:
> warning: 2.01524e-06 backoff probability mass left for ". dunno" -- disabling interpolation
> Is this just for my information, or is it a sign of bad parameters?

It's just for information. Sometimes there is no backoff probability mass left for the lower-order ngram estimates, and it doesn't make sense to apply interpolation in that case, so the code falls back on standard KN smoothing (without interpolation).

Andreas

From sfischer at ymail.com Fri Jun 27 07:46:41 2014 From: sfischer at ymail.com (Stefan Fischer) Date: Fri, 27 Jun 2014 16:46:41 +0200 Subject: [SRILM User List] Usage of make-big-lm and -interpolate option In-Reply-To: <539E515C.3030503@ICSI.Berkeley.EDU> References: <539E515C.3030503@ICSI.Berkeley.EDU> Message-ID:

Thanks for your reply!
There is one thing I don't understand: the training.txt file contains 857661 words, and there are 4848 OOVs that all occur only once.

So OOVs make up a fraction of 0.00565 (0.565%) of training.txt. If I use ngram-count directly, p(<unk>) is 0.00600, which is close to the actual fraction. If I use ngram-count + make-big-lm, p(<unk>) is 0.03206, which is about 5 times higher than the actual fraction.

Do you have any explanation for that? It seems counter-intuitive ... Is my corpus large enough for the -kndiscount option?

Regards,
Stefan

2014-06-16 4:07 GMT+02:00 Andreas Stolcke :
> On 06/13/2014 12:16 PM, Stefan Fischer wrote:
>> Hello,
>>
>> I read that using make-big-lm is preferable to using ngram-count directly. Even though my corpus is not very big, how do I switch from ngram-count to make-big-lm?
>>
>> This is what I'm using so far:
>> ngram-count -order 3 -kndiscount -interpolate -unk -text training.txt -vocab at_least_twice.txt -lm lm.arpa
>>
>> Is this the right way to use make-big-lm? Do I have to pass more options to ngram-count if I am only interested in counts?
>> ngram-count -write counts.gz -text training.txt
>> make-big-lm -read counts.gz -order 3 -kndiscount -interpolate -unk -text training.txt -vocab at_least_twice.txt -lm lm.arpa
>
> You did it right.
>
>> My second question is w.r.t. the -interpolate option.
>> I get the following warning several times:
>> warning: 2.01524e-06 backoff probability mass left for ". dunno" -- disabling interpolation
>> Is this just for my information, or is it a sign of bad parameters?
>
> It's just for information. Sometimes there is no backoff probability mass left for the lower-order ngram estimates, and it doesn't make sense to apply interpolation in that case, so the code falls back on standard KN smoothing
>
> Andreas

From stolcke at icsi.berkeley.edu Fri Jun 27 14:06:37 2014 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 27 Jun 2014 14:06:37 -0700 Subject: [SRILM User List] Usage of make-big-lm and -interpolate option In-Reply-To: References: <539E515C.3030503@ICSI.Berkeley.EDU> Message-ID: <53ADDCDD.9060700@icsi.berkeley.edu>

On 6/27/2014 7:46 AM, Stefan Fischer wrote:
> Thanks for your reply!
>
> There is one thing I don't understand: the training.txt file contains 857661 words, and there are 4848 OOVs that all occur only once.
>
> So OOVs make up a fraction of 0.00565 (0.565%) of training.txt. If I use ngram-count directly, p(<unk>) is 0.00600, which is close to the actual fraction. If I use ngram-count + make-big-lm, p(<unk>) is 0.03206, which is about 5 times higher than the actual fraction.

The main difference, for your purposes, between ngram-count and make-big-lm is that the latter computes the discounting parameters from the entire vocabulary. ngram-count limits the vocabulary (according to the -vocab option, which I'm assuming you're using) upon reading the counts, and then estimates the discounting parameters. This difference affects how much probability mass is held back from the unigram estimates and distributed over all words. There should be a message about that in the output. You can compare them to see if that explains the difference.

Andreas
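The effect Andreas describes can be made concrete with a toy calculation. A hypothetical sketch (the `kn_discount` helper and the counts below are made up, not SRILM output): a common Kneser-Ney discount estimate is D = n1 / (n1 + 2*n2), where n1 and n2 are the numbers of n-gram types seen exactly once and exactly twice. Computing D before vs. after restricting the counts to a vocabulary changes n1 and n2, and with them how much probability mass is held back and redistributed:

```shell
# D = n1 / (n1 + 2*n2): the usual estimate of the absolute (KN) discount,
# from the counts-of-counts n1 (types seen once) and n2 (types seen twice).
kn_discount() {    # reads "word count" lines on stdin, prints D
    awk '$2 == 1 { n1++ }
         $2 == 2 { n2++ }
         END { printf "%.4f\n", n1 / (n1 + 2 * n2) }'
}

# Toy counts: full vocabulary vs. the same counts with rare words removed
printf 'a 4\nb 2\nc 1\nd 1\ne 1\n' | kn_discount    # n1=3, n2=1 -> 0.6000
printf 'a 4\nb 2\nc 1\n'           | kn_discount    # n1=1, n2=1 -> 0.3333
```

A larger D means more probability mass is discounted from the observed n-grams, which (together with how out-of-vocabulary words are mapped to <unk>) is one way the two tools can arrive at different unigram mass for <unk>.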