From nickakosu at yahoo.com Mon Oct 24 22:08:07 2011 From: nickakosu at yahoo.com (Nick Akosu) Date: Tue, 25 Oct 2011 06:08:07 +0100 (BST) Subject: [SRILM User List] Difficulty installing SRILM Message-ID: <1319519287.76246.YahooMailNeo@web24603.mail.ird.yahoo.com> Hi everyone, I need to use SRILM to do some language modeling research, but I am having difficulty installing the package. I am using Windows 7. I need advice/directions; kindly help. Nicholas Akosu From burkay at mit.edu Tue Oct 25 12:41:43 2011 From: burkay at mit.edu (Burkay Gur) Date: Tue, 25 Oct 2011 15:41:43 -0400 Subject: [SRILM User List] Question about 3-gram Language Model with OOV triplets Message-ID: <4EA710F7.7020802@mit.edu> Hi, I have just started using SRILM, and it is a great tool. But I ran across this issue. The situation is that I have: corpusA.txt corpusB.txt What I want to do is create two different 3-gram language models, one for each corpus. But I want to make sure that if a triplet is non-existent in the other corpus, a smoothed probability is still assigned to it. For example: if corpusA has triplet counts: this is a 1 is a test 1 and corpusB has triplet counts: that is a 1 is a test 1 then the final counts for corpusA should be: this is a 1 is a test 1 that is a 0 because "that is a" is in B but not A. Similarly, corpusB should be: that is a 1 is a test 1 this is a 0 After the counts are set up, some smoothing algorithm might be used. I have manually tried to set the triple word counts to 0, but it does not seem to work, as they are omitted from the 3-grams. Can you recommend any other ways of doing this?
Thank you, Burkay From stolcke at icsi.berkeley.edu Tue Oct 25 15:13:05 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 25 Oct 2011 15:13:05 -0700 Subject: [SRILM User List] Difficulty installing SRILM In-Reply-To: <1319519287.76246.YahooMailNeo@web24603.mail.ird.yahoo.com> References: <1319519287.76246.YahooMailNeo@web24603.mail.ird.yahoo.com> Message-ID: <4EA73471.7080202@icsi.berkeley.edu> Nick Akosu wrote: > Hi everyone, > I need to use SRILM to do some language modeling research but I am > having difficulty installing the package. I am using windows 7. Please > I need advice/directions. Please try to follow the directions in the INSTALL and doc/README.windows files; then, if you run into problems, ask again with specific information (error messages, output from make). Andreas From burkay at mit.edu Tue Oct 25 15:16:49 2011 From: burkay at mit.edu (Burkay Gur) Date: Tue, 25 Oct 2011 18:16:49 -0400 Subject: [SRILM User List] Follow Up: Question about 3-gram Language Model with OOV triplets In-Reply-To: <4EA710F7.7020802@mit.edu> References: <4EA710F7.7020802@mit.edu> Message-ID: <4EA73551.6070104@mit.edu> To follow up: when I edit the .count file and add 0 counts for some trigrams, they are not included in the final .lm file when I read from the .count file and create the language model. On 10/25/11 3:41 PM, Burkay Gur wrote: > Hi, > > I have just started using SRILM, and it is a great tool. But I ran > across this issue. The situation is that I have: > > corpusA.txt > corpusB.txt > > What I want to do is create two different 3-gram language models for > both corpora. But I want to make sure that if a triplet is > non-existent in the other corpus, then a smoothed probability should > be assigned to that.
For example; > > if corpusA has triplet counts: > > this is a 1 > is a test 1 > > and corpusB has triplet counts: > > that is a 1 > is a test 1 > > then the final counts for corpusA should be: > > this is a 1 > is a test 1 > that is a 0 > > because "that is a" is in B but not A. > > similarly corpusB should be: > > that is a 1 > is a test 1 > this is a 0 > > After the counts are setup, some smoothing algorithm might be used. I > have manually tried to make the triple word counts 0, however it does > not seem to work. As they are omitted from 3-grams. > > Can you recommend any other ways of doing this? > > Thank you, > Burkay > From stolcke at icsi.berkeley.edu Tue Oct 25 15:38:41 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 25 Oct 2011 15:38:41 -0700 Subject: [SRILM User List] Follow Up: Question about 3-gram Language Model with OOV triplets In-Reply-To: <4EA73551.6070104@mit.edu> References: <4EA710F7.7020802@mit.edu> <4EA73551.6070104@mit.edu> Message-ID: <4EA73A71.1020002@icsi.berkeley.edu> Burkay Gur wrote: > To follow up, basically, when I edit the .count file and add 0 counts > for some trigrams, they will not be included in the final .lm file, > when I try to read from the .count file and create a language model. A zero count is completely equivalent to a non-existent count, so what you're seeing is expected. It is not clear what precisely you want to happen. As a result of discounting and backing off, your LM, even without the unobserved trigram, will already assign a non-zero probability to that trigram. That's exactly what the ngram smoothing algorithms are for. If you want to inject some specific statistical information from another dataset into your target LM, you could interpolate (mix) the two LMs to obtain a third LM. See the description of the ngram -mix-lm option. Andreas > > On 10/25/11 3:41 PM, Burkay Gur wrote: >> Hi, >> >> I have just started using SRILM, and it is a great tool. But I ran >> across this issue.
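The -mix-lm interpolation Andreas suggests amounts to linearly mixing the two models' probabilities. The sketch below is not SRILM code; it is a toy Python model of that operation, using made-up trigram probabilities purely for illustration:

```python
# Linear interpolation of two trigram models, the operation behind
# ngram -mix-lm.  The probability tables are hand-made toy values,
# not real SRILM output; SRILM mixes full backoff models, while this
# sketch mixes plain probability tables.

def interpolate(p_a, p_b, lam=0.5):
    """p_mix(t) = lam * p_a(t) + (1 - lam) * p_b(t) over the union of
    trigrams.  Here a trigram absent from one table contributes 0;
    SRILM would use that model's backoff estimate instead, so every
    trigram stays non-zero."""
    return {t: lam * p_a.get(t, 0.0) + (1 - lam) * p_b.get(t, 0.0)
            for t in set(p_a) | set(p_b)}

p_a = {("this", "is", "a"): 0.5, ("is", "a", "test"): 0.5}   # "corpusA"
p_b = {("that", "is", "a"): 0.5, ("is", "a", "test"): 0.5}   # "corpusB"

p_mix = interpolate(p_a, p_b)
# "that is a" never occurred in corpusA, yet the mixture gives it
# probability 0.25, and the mixed distribution still sums to 1.
assert p_mix[("that", "is", "a")] == 0.25
assert abs(sum(p_mix.values()) - 1.0) < 1e-12
```

With real models, the analogous SRILM invocation would be along the lines of `ngram -lm A.lm -mix-lm B.lm -lambda 0.5 -write-lm mix.lm`.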
The situation is that I have: >> >> corpusA.txt >> corpusB.txt >> >> What I want to do is create two different 3-gram language models for >> both corpora. But I want to make sure that if a triplet is >> non-existent in the other corpus, then a smoothed probability should >> be assigned to that. For example; >> >> if corpusA has triplet counts: >> >> this is a 1 >> is a test 1 >> >> and corpusB has triplet counts: >> >> that is a 1 >> is a test 1 >> >> then the final counts for corpusA should be: >> >> this is a 1 >> is a test 1 >> that is a 0 >> >> because "that is a" is in B but not A. >> >> similarly corpusB should be: >> >> that is a 1 >> is a test 1 >> this is a 0 >> >> After the counts are setup, some smoothing algorithm might be used. I >> have manually tried to make the triple word counts 0, however it does >> not seem to work. As they are omitted from 3-grams. >> >> Can you recommend any other ways of doing this? >> >> Thank you, >> Burkay >> > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From burkay at mit.edu Tue Oct 25 17:29:35 2011 From: burkay at mit.edu (Burkay Gur) Date: Tue, 25 Oct 2011 20:29:35 -0400 Subject: [SRILM User List] Follow Up: Question about 3-gram Language Model with OOV triplets In-Reply-To: <4EA73A71.1020002@icsi.berkeley.edu> References: <4EA710F7.7020802@mit.edu> <4EA73551.6070104@mit.edu> <4EA73A71.1020002@icsi.berkeley.edu> Message-ID: <4EA7546F.6020506@mit.edu> thank you, i understand that. but the problem is, like you said, how do we introduce these "unobserved trigrams" into the language model. i ll give another example if it helps: say you have this test.count file: 1-gram this is a test 2-gram this is is a a test 3-gram this is a is a test then, say you want to extend this language model with this trigram: "this is not" which basically has no previous count. 
and without smoothing in the 3-gram model, it will have zero probability. but how do we make sure that the smoothed language model has a non-zero probability for this additional trigram? i thought i could do this manually by updating the test.count with "this is not" with count 0. but apparently this is not working... On 10/25/11 6:38 PM, Andreas Stolcke wrote: > Burkay Gur wrote: >> To follow up, basically, when I edit the .count file and add 0 counts >> for some trigrams, they will not be included in the final .lm file, >> when I try to read from the .count file and create a language model. > A zero count is complete equivalent to a non-existent count, so what > you're seeing it expected. > > It is not clear what precisely you want to happen. As a result of > discounting and backing off, your LM, even without the unobserved > trigram, will already assign a non-zero probability to that trigram. > That's exactly what the ngram smoothing algorithms are for. > > If you want to inject some specific statistical information rom > another dataset into your target LM you could interpolate (mix) the > two LMs to obtain a third LM. See the description of the ngram > -mix-lm option. > > Andreas > >> >> On 10/25/11 3:41 PM, Burkay Gur wrote: >>> Hi, >>> >>> I have just started using SRILM, and it is a great tool. But I ran >>> across this issue. The situation is that I have: >>> >>> corpusA.txt >>> corpusB.txt >>> >>> What I want to do is create two different 3-gram language models for >>> both corpora. But I want to make sure that if a triplet is >>> non-existent in the other corpus, then a smoothed probability should >>> be assigned to that. For example; >>> >>> if corpusA has triplet counts: >>> >>> this is a 1 >>> is a test 1 >>> >>> and corpusB has triplet counts: >>> >>> that is a 1 >>> is a test 1 >>> >>> then the final counts for corpusA should be: >>> >>> this is a 1 >>> is a test 1 >>> that is a 0 >>> >>> because "that is a" is in B but not A.
>>> >>> similarly corpusB should be: >>> >>> that is a 1 >>> is a test 1 >>> this is a 0 >>> >>> After the counts are setup, some smoothing algorithm might be used. >>> I have manually tried to make the triple word counts 0, however it >>> does not seem to work. As they are omitted from 3-grams. >>> >>> Can you recommend any other ways of doing this? >>> >>> Thank you, >>> Burkay >>> >> >> _______________________________________________ >> SRILM-User site list >> SRILM-User at speech.sri.com >> http://www.speech.sri.com/mailman/listinfo/srilm-user > From stolcke at icsi.berkeley.edu Tue Oct 25 19:10:40 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 25 Oct 2011 19:10:40 -0700 Subject: [SRILM User List] Follow Up: Question about 3-gram Language Model with OOV triplets In-Reply-To: <4EA7546F.6020506@mit.edu> References: <4EA710F7.7020802@mit.edu> <4EA73551.6070104@mit.edu> <4EA73A71.1020002@icsi.berkeley.edu> <4EA7546F.6020506@mit.edu> Message-ID: <4EA76C20.6010005@icsi.berkeley.edu> Burkay Gur wrote: > thank you, i understand that. but the problem is, like you said, how > do we introduce these "unobserved trigrams" into the language model. i > ll give another example if it helps: > > say you have this test.count file: > > 1-gram > this > is > a > test > > 2-gram > this is > is a > a test > > 3-gram > this is a > is a test > > then, say you want to extend this language model with this trigram: > > "this is not" > > which basically has no previous count. and without smoothing in the > 3-gram model, it will have zero probability. but how do we make sure > that the smooth language model has a non-zero probability for this > additional trigram? > > i thought i could do this my manually by updating the test.count with > "this is not" with count 0. but apparently this is not working.. The smoothed 3gram LM will have a non-zero probability, for ALL trigrams, trust me ;-) Try echo "this is not" | ngram -lm LM -ppl - -debug 2 to see it in action. 
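To see why a smoothed model covers every trigram, here is a toy interpolated model in Python. It uses Witten-Bell-style interpolation as a stand-in for SRILM's default Good-Turing/Katz backoff (the numbers therefore differ from real SRILM output, but the qualitative behavior is the same): the unseen trigram "this is not" gets a non-zero probability, and the distribution over the whole vocabulary still sums to 1.

```python
# Toy interpolated Witten-Bell smoothing.  This is NOT SRILM's default
# scheme, just a minimal stand-in showing why a smoothed trigram model
# assigns every trigram a non-zero probability.
from collections import Counter

tokens = "this is a test".split()
vocab = sorted(set(tokens) | {"not"})   # closed vocabulary including "not"

# n-gram counts for n = 1..3, as ngram-count would collect them
counts = {n: Counter(tuple(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
          for n in (1, 2, 3)}

def p_wb(word, history=()):
    """P(word | history), interpolating each order with the next lower
    one; the recursion bottoms out in a uniform distribution."""
    if not history:
        lower = 1.0 / len(vocab)                 # uniform base case
        c_h = sum(counts[1].values())            # total tokens
        t_h = len(counts[1])                     # distinct word types
        c_hw = counts[1].get((word,), 0)
    else:
        lower = p_wb(word, history[1:])          # back off to shorter history
        n = len(history) + 1
        c_h = sum(c for g, c in counts[n].items() if g[:-1] == history)
        t_h = sum(1 for g in counts[n] if g[:-1] == history)
        c_hw = counts[n].get(history + (word,), 0)
    if c_h == 0:                                 # history never seen at all
        return lower
    return (c_hw + t_h * lower) / (c_h + t_h)

p = p_wb("not", ("this", "is"))    # trigram never seen in training
assert p > 0                       # ...yet it gets probability mass
assert abs(sum(p_wb(w, ("this", "is")) for w in vocab) - 1.0) < 1e-9
```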
Andreas > > On 10/25/11 6:38 PM, Andreas Stolcke wrote: >> Burkay Gur wrote: >>> To follow up, basically, when I edit the .count file and add 0 >>> counts for some trigrams, they will not be included in the final .lm >>> file, when I try to read from the .count file and create a language >>> model. >> A zero count is complete equivalent to a non-existent count, so >> what you're seeing it expected. >> >> It is not clear what precisely you want to happen. As a result of >> discounting and backing off, your LM, even without the unobserved >> trigram, will already assign a non-zero probability to that trigram. >> That's exactly what the ngram smoothing algorithms are for. >> >> If you want to inject some specific statistical information rom >> another dataset into your target LM you could interpolate (mix) the >> two LMs to obtain a third LM. See the description of the ngram >> -mix-lm option. >> >> Andreas >> >>> >>> On 10/25/11 3:41 PM, Burkay Gur wrote: >>>> Hi, >>>> >>>> I have just started using SRILM, and it is a great tool. But I ran >>>> across this issue. The situation is that I have: >>>> >>>> corpusA.txt >>>> corpusB.txt >>>> >>>> What I want to do is create two different 3-gram language models >>>> for both corpora. But I want to make sure that if a triplet is >>>> non-existent in the other corpus, then a smoothed probability >>>> should be assigned to that. For example; >>>> >>>> if corpusA has triplet counts: >>>> >>>> this is a 1 >>>> is a test 1 >>>> >>>> and corpusB has triplet counts: >>>> >>>> that is a 1 >>>> is a test 1 >>>> >>>> then the final counts for corpusA should be: >>>> >>>> this is a 1 >>>> is a test 1 >>>> that is a 0 >>>> >>>> because "that is a" is in B but not A. >>>> >>>> similarly corpusB should be: >>>> >>>> that is a 1 >>>> is a test 1 >>>> this is a 0 >>>> >>>> After the counts are setup, some smoothing algorithm might be used. >>>> I have manually tried to make the triple word counts 0, however it >>>> does not seem to work. 
As they are omitted from 3-grams. >>>> >>>> Can you recommend any other ways of doing this? >>>> >>>> Thank you, >>>> Burkay >>>> >>> >>> _______________________________________________ >>> SRILM-User site list >>> SRILM-User at speech.sri.com >>> http://www.speech.sri.com/mailman/listinfo/srilm-user >> > From burkay at mit.edu Tue Oct 25 19:53:06 2011 From: burkay at mit.edu (Burkay Gur) Date: Tue, 25 Oct 2011 22:53:06 -0400 Subject: [SRILM User List] Follow Up: Question about 3-gram Language Model with OOV triplets In-Reply-To: <4EA76C20.6010005@icsi.berkeley.edu> References: <4EA710F7.7020802@mit.edu> <4EA73551.6070104@mit.edu> <4EA73A71.1020002@icsi.berkeley.edu> <4EA7546F.6020506@mit.edu> <4EA76C20.6010005@icsi.berkeley.edu> Message-ID: <50DBF2C0-634E-4391-8379-FD5017CF198E@mit.edu> But we have not even added "this is not" into the language model yet. If it is not a hard task, can you write a sample to show me how this works? On Oct 25, 2011, at 10:10 PM, Andreas Stolcke wrote: > Burkay Gur wrote: >> thank you, i understand that. but the problem is, like you said, how do we introduce these "unobserved trigrams" into the language model. i ll give another example if it helps: >> >> say you have this test.count file: >> >> 1-gram >> this >> is >> a >> test >> >> 2-gram >> this is >> is a >> a test >> >> 3-gram >> this is a >> is a test >> >> then, say you want to extend this language model with this trigram: >> >> "this is not" >> >> which basically has no previous count. and without smoothing in the 3-gram model, it will have zero probability. but how do we make sure that the smooth language model has a non-zero probability for this additional trigram? >> >> i thought i could do this my manually by updating the test.count with "this is not" with count 0. but apparently this is not working.. > The smoothed 3gram LM will have a non-zero probability, for ALL trigrams, trust me ;-) > > Try > echo "this is not" | ngram -lm LM -ppl - -debug 2 > > to see it in action. 
> > Andreas > >> >> On 10/25/11 6:38 PM, Andreas Stolcke wrote: >>> Burkay Gur wrote: >>>> To follow up, basically, when I edit the .count file and add 0 counts for some trigrams, they will not be included in the final .lm file, when I try to read from the .count file and create a language model. >>> A zero count is complete equivalent to a non-existent count, so what you're seeing it expected. >>> >>> It is not clear what precisely you want to happen. As a result of discounting and backing off, your LM, even without the unobserved trigram, will already assign a non-zero probability to that trigram. That's exactly what the ngram smoothing algorithms are for. >>> >>> If you want to inject some specific statistical information rom another dataset into your target LM you could interpolate (mix) the two LMs to obtain a third LM. See the description of the ngram -mix-lm option. >>> >>> Andreas >>> >>>> >>>> On 10/25/11 3:41 PM, Burkay Gur wrote: >>>>> Hi, >>>>> >>>>> I have just started using SRILM, and it is a great tool. But I ran across this issue. The situation is that I have: >>>>> >>>>> corpusA.txt >>>>> corpusB.txt >>>>> >>>>> What I want to do is create two different 3-gram language models for both corpora. But I want to make sure that if a triplet is non-existent in the other corpus, then a smoothed probability should be assigned to that. For example; >>>>> >>>>> if corpusA has triplet counts: >>>>> >>>>> this is a 1 >>>>> is a test 1 >>>>> >>>>> and corpusB has triplet counts: >>>>> >>>>> that is a 1 >>>>> is a test 1 >>>>> >>>>> then the final counts for corpusA should be: >>>>> >>>>> this is a 1 >>>>> is a test 1 >>>>> that is a 0 >>>>> >>>>> because "that is a" is in B but not A. >>>>> >>>>> similarly corpusB should be: >>>>> >>>>> that is a 1 >>>>> is a test 1 >>>>> this is a 0 >>>>> >>>>> After the counts are setup, some smoothing algorithm might be used. I have manually tried to make the triple word counts 0, however it does not seem to work. 
As they are omitted from 3-grams. >>>>> >>>>> Can you recommend any other ways of doing this? >>>>> >>>>> Thank you, >>>>> Burkay >>>>> >>>> >>>> _______________________________________________ >>>> SRILM-User site list >>>> SRILM-User at speech.sri.com >>>> http://www.speech.sri.com/mailman/listinfo/srilm-user >>> >> > From stolcke at icsi.berkeley.edu Tue Oct 25 20:54:41 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 25 Oct 2011 20:54:41 -0700 Subject: [SRILM User List] Follow Up: Question about 3-gram Language Model with OOV triplets In-Reply-To: Your message of Tue, 25 Oct 2011 22:53:06 -0400. <50DBF2C0-634E-4391-8379-FD5017CF198E@mit.edu> Message-ID: <201110260354.p9Q3sfS9009896@fruitcake.ICSI.Berkeley.EDU> In message <50DBF2C0-634E-4391-8379-FD5017CF198E at mit.edu> you wrote: > But we have not even added "this is not" into the language model yet. If it is not a hard task, can you write a sample to show me how this works? There is no need to "add" this trigram to the LM. It can compute a non-zero probability for it even if it hasn't occurred in the training data. I suggest you review the basics of N-gram LM smoothing as described in the two textbook chapters referenced at http://www.speech.sri.com/projects/srilm/ . Andreas > > On Oct 25, 2011, at 10:10 PM, Andreas Stolcke wrote: > > > Burkay Gur wrote: > >> thank you, i understand that. but the problem is, like you said, how do we introduce these "unobserved trigrams" into the language model. i ll give another example if it helps: > >> > >> say you have this test.count file: > >> > >> 1-gram > >> this > >> is > >> a > >> test > >> > >> 2-gram > >> this is > >> is a > >> a test > >> > >> 3-gram > >> this is a > >> is a test > >> > >> then, say you want to extend this language model with this trigram: > >> > >> "this is not" > >> > >> which basically has no previous count. and without smoothing in the 3-gram model, it will have zero probability.
but how do we make sure that the smooth language model has a non-zero probability for this additional trigram? > >> > >> i thought i could do this my manually by updating the test.count with "this is not" with count 0. but apparently this is not working.. > > The smoothed 3gram LM will have a non-zero probability, for ALL trigrams, trust me ;-) > > > > Try > > echo "this is not" | ngram -lm LM -ppl - -debug 2 > > > > to see it in action. > > > > Andreas > > > >> > >> On 10/25/11 6:38 PM, Andreas Stolcke wrote: > >>> Burkay Gur wrote: > >>>> To follow up, basically, when I edit the .count file and add 0 counts for some trigrams, they will not be included in the final .lm file, when I try to read from the .count file and create a language model. > >>> A zero count is complete equivalent to a non-existent count, so what you're seeing it expected. > >>> > >>> It is not clear what precisely you want to happen. As a result of discounting and backing off, your LM, even without the unobserved trigram, will already assign a non-zero probability to that trigram. That's exactly what the ngram smoothing algorithms are for. > >>> > >>> If you want to inject some specific statistical information rom another dataset into your target LM you could interpolate (mix) the two LMs to obtain a third LM. See the description of the ngram -mix-lm option. > >>> > >>> Andreas > >>> > >>>> > >>>> On 10/25/11 3:41 PM, Burkay Gur wrote: > >>>>> Hi, > >>>>> > >>>>> I have just started using SRILM, and it is a great tool. But I ran across this issue. The situation is that I have: > >>>>> > >>>>> corpusA.txt > >>>>> corpusB.txt > >>>>> > >>>>> What I want to do is create two different 3-gram language models for both corpora. But I want to make sure that if a triplet is non-existent in the other corpus, then a smoothed probability should be assigned to that.
For example; > >>>>> > >>>>> if corpusA has triplet counts: > >>>>> > >>>>> this is a 1 > >>>>> is a test 1 > >>>>> > >>>>> and corpusB has triplet counts: > >>>>> > >>>>> that is a 1 > >>>>> is a test 1 > >>>>> > >>>>> then the final counts for corpusA should be: > >>>>> > >>>>> this is a 1 > >>>>> is a test 1 > >>>>> that is a 0 > >>>>> > >>>>> because "that is a" is in B but not A. > >>>>> > >>>>> similarly corpusB should be: > >>>>> > >>>>> that is a 1 > >>>>> is a test 1 > >>>>> this is a 0 > >>>>> > >>>>> After the counts are setup, some smoothing algorithm might be used. I have manually tried to make the triple word counts 0, however it does not seem to work. As they are omitted from 3-grams. > >>>>> > >>>>> Can you recommend any other ways of doing this? > >>>>> > >>>>> Thank you, > >>>>> Burkay > >>>>> > >>>> > >>>> _______________________________________________ > >>>> SRILM-User site list > >>>> SRILM-User at speech.sri.com > >>>> http://www.speech.sri.com/mailman/listinfo/srilm-user > >>> > >> > > > --Andreas From stolcke at icsi.berkeley.edu Wed Oct 26 15:52:33 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 26 Oct 2011 15:52:33 -0700 Subject: [SRILM User List] Follow Up: Question about 3-gram Language Model with OOV triplets In-Reply-To: <4EA82FA5.6070103@mit.edu> References: <4EA710F7.7020802@mit.edu> <4EA73551.6070104@mit.edu> <4EA73A71.1020002@icsi.berkeley.edu> <4EA7546F.6020506@mit.edu> <4EA76C20.6010005@icsi.berkeley.edu> <4EA82FA5.6070103@mit.edu> Message-ID: <4EA88F31.10004@icsi.berkeley.edu> Burkay Gur wrote: > Try > echo "this is not" | ngram -lm LM -ppl - -debug 2 > > > ok, this returns a non-zero probability. but i want to now include > "this is not" in the language model. and still have all the > probabilities in the language model sum up to 1. > > in other words i want to expand my language model with multiple > tri-grams that are unseen events.
> > maybe if i tell you the main reason why i want to do this, it will be > more clear. > > i am trying to find the symmetric KL divergence of two distributions, > and these two distributions will be two language models. > > the formula for symmetric KL divergence is: > > i being all trigrams in both models: > > sum[ p(i) * log(p(i) / q(i)) ] + sum[ q(i) * > log(q(i) / p(i)) ] > > sums are over all i's. > > p(i) is the probability in language model 1, and q(i) is the > probability in language model 2. > > since we are doing this over all i's, it means we have to include the > probabilities of trigrams that occur in one LM and not the other in > that particular LM. otherwise we will get a log(0) error. so we will > need some kind of smoothing. But you don't get log(0), because the LM is smoothed and therefore the p's and q's are all > 0. BTW, you only get a problem when the probability in the denominator is 0 while the weight in front of the log is not, because 0 * log(0) = 0. So you can sum over the UNION of all ngrams in both models, and when you need to compute the p(i) or q(i) for an ngram that is not in the particular model, you use the backoff estimate (i.e., just what SRILM will compute when you ask it for a probability that is not explicitly represented in the model). BTW, for this type of thing you want to use ngram -counts, and then postprocess the output. Andreas > > say LM1 has these trigrams: > > a 1/3 > b 1/3 > c 1/3 > > and LM2 has these: > > a 1/2 > d 1/2 > > now when we're doing the KL divergence calculation, we need to make > sure "d" is in LM1, and also "b" and "c" are in LM2. otherwise we'll > get log(0). so we'll need to modify LM1 and LM2 by smoothing, so they > include non-zero probabilities for b, c and d, and still each sum > up to 1. > > if we use the test-training approach, and try to see the probabilities > of unseen events, we are not updating our current LM to include those > unseen events. in fact that is what i want to do.
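The recipe Andreas gives (sum over the union of trigrams, take smoothed probabilities from each model, rely on the 0 * log 0 = 0 convention) can be sketched in Python. The probability tables below are illustrative toy values, not SRILM output; in practice they would come from each model, e.g. via ngram -counts:

```python
# Symmetric KL divergence over the union of trigrams from two smoothed
# models.  The toy tables p and q below are hand-made for illustration.
import math

def sym_kl(p, q):
    """D(p||q) + D(q||p) over the union of n-grams.  A term whose
    weight is 0 contributes 0 (the 0 * log 0 = 0 convention); if one
    model assigns probability 0 to an n-gram the other model needs,
    the division fails -- which is exactly why both models must be
    smoothed first."""
    total = 0.0
    for g in set(p) | set(q):
        pi, qi = p.get(g, 0.0), q.get(g, 0.0)
        if pi > 0:
            total += pi * math.log(pi / qi)
        if qi > 0:
            total += qi * math.log(qi / pi)
    return total

# Smoothed toy distributions: each sums to 1 and covers both models' trigrams.
p = {("this", "is", "a"): 0.45, ("is", "a", "test"): 0.45, ("that", "is", "a"): 0.10}
q = {("that", "is", "a"): 0.45, ("is", "a", "test"): 0.45, ("this", "is", "a"): 0.10}

assert sym_kl(p, p) == 0.0   # identical distributions: zero divergence
assert sym_kl(p, q) > 0      # differing distributions: positive divergence
```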
include a list of > unseen trigrams, (that might might possibly have lower orders of > n-grams in the model) in that language model. > > > On 10/25/11 10:10 PM, Andreas Stolcke wrote: >> Burkay Gur wrote: >>> thank you, i understand that. but the problem is, like you said, how >>> do we introduce these "unobserved trigrams" into the language model. >>> i ll give another example if it helps: >>> >>> say you have this test.count file: >>> >>> 1-gram >>> this >>> is >>> a >>> test >>> >>> 2-gram >>> this is >>> is a >>> a test >>> >>> 3-gram >>> this is a >>> is a test >>> >>> then, say you want to extend this language model with this trigram: >>> >>> "this is not" >>> >>> which basically has no previous count. and without smoothing in the >>> 3-gram model, it will have zero probability. but how do we make sure >>> that the smooth language model has a non-zero probability for this >>> additional trigram? >>> >>> i thought i could do this my manually by updating the test.count >>> with "this is not" with count 0. but apparently this is not working.. >> The smoothed 3gram LM will have a non-zero probability, for ALL >> trigrams, trust me ;-) >> >> Try >> echo "this is not" | ngram -lm LM -ppl - -debug 2 >> >> to see it in action. >> >> Andreas >> >>> >>> On 10/25/11 6:38 PM, Andreas Stolcke wrote: >>>> Burkay Gur wrote: >>>>> To follow up, basically, when I edit the .count file and add 0 >>>>> counts for some trigrams, they will not be included in the final >>>>> .lm file, when I try to read from the .count file and create a >>>>> language model. >>>> A zero count is complete equivalent to a non-existent count, so >>>> what you're seeing it expected. >>>> >>>> It is not clear what precisely you want to happen. As a result of >>>> discounting and backing off, your LM, even without the unobserved >>>> trigram, will already assign a non-zero probability to that >>>> trigram. That's exactly what the ngram smoothing algorithms are for. 
>>>> >>>> If you want to inject some specific statistical information rom >>>> another dataset into your target LM you could interpolate (mix) the >>>> two LMs to obtain a third LM. See the description of the ngram >>>> -mix-lm option. >>>> >>>> Andreas >>>> >>>>> >>>>> On 10/25/11 3:41 PM, Burkay Gur wrote: >>>>>> Hi, >>>>>> >>>>>> I have just started using SRILM, and it is a great tool. But I >>>>>> ran across this issue. The situation is that I have: >>>>>> >>>>>> corpusA.txt >>>>>> corpusB.txt >>>>>> >>>>>> What I want to do is create two different 3-gram language models >>>>>> for both corpora. But I want to make sure that if a triplet is >>>>>> non-existent in the other corpus, then a smoothed probability >>>>>> should be assigned to that. For example; >>>>>> >>>>>> if corpusA has triplet counts: >>>>>> >>>>>> this is a 1 >>>>>> is a test 1 >>>>>> >>>>>> and corpusB has triplet counts: >>>>>> >>>>>> that is a 1 >>>>>> is a test 1 >>>>>> >>>>>> then the final counts for corpusA should be: >>>>>> >>>>>> this is a 1 >>>>>> is a test 1 >>>>>> that is a 0 >>>>>> >>>>>> because "that is a" is in B but not A. >>>>>> >>>>>> similarly corpusB should be: >>>>>> >>>>>> that is a 1 >>>>>> is a test 1 >>>>>> this is a 0 >>>>>> >>>>>> After the counts are setup, some smoothing algorithm might be >>>>>> used. I have manually tried to make the triple word counts 0, >>>>>> however it does not seem to work. As they are omitted from 3-grams. >>>>>> >>>>>> Can you recommend any other ways of doing this? 
>>>>>> >>>>>> Thank you, >>>>>> Burkay >>>>>> >>>>> >>>>> _______________________________________________ >>>>> SRILM-User site list >>>>> SRILM-User at speech.sri.com >>>>> http://www.speech.sri.com/mailman/listinfo/srilm-user >>>> >>> >> > From wuxichuan.go at gmail.com Tue Nov 1 10:36:39 2011 From: wuxichuan.go at gmail.com (Xichuan Wu) Date: Tue, 1 Nov 2011 18:36:39 +0100 Subject: [SRILM User List] Problem on Installing SRILM Message-ID: Hi All, I have been trying to install SRILM but have run into a problem that googling has not helped with. Some info about the platform: Win7, 32-bit, Cygwin including '*csh*' and '*tcsh*'. I am working with the *Joshua decoder*. After downloading and unzipping *srilm.tgz*, I tried the *make* command and got the following: make: /sbin/machine-type: Command not found mkdir include lib bin mkdir: cannot create directory `include': File exists mkdir: cannot create directory `lib': File exists mkdir: cannot create directory `bin': File exists make: [dirs] Error 1 (ignored) make init make[1]: /sbin/machine-type: Command not found make[1]: Entering directory `/cygdrive/f/CL/Drei/Project/Joshua/srilm' make[1]: Entering directory `/cygdrive/f/CL/Drei/Project/Joshua/srilm' for subdir in misc dstruct lm flm lattice utils; do \ (cd $subdir/src; make SRILM= MACHINE_TYPE= OPTION= MAKE_PIC= init) || exit 1; \ done make[2]: Entering directory `/cygdrive/f/CL/Drei/Project/Joshua/srilm/misc/src' Makefile:24: /common/Makefile.common.variables: No such file or directory Makefile:139: /common/Makefile.common.targets: No such file or directory make[2]: *** No rule to make target `/common/Makefile.common.targets'. Stop.
make[2]: Leaving directory `/cygdrive/f/CL/Drei/Project/Joshua/srilm/misc/src' make[1]: *** [init] Error 1 make[1]: Leaving directory `/cygdrive/f/CL/Drei/Project/Joshua/srilm' make: *** [World] Error 2 I then changed the top-level *Makefile*, specifically setting *PACKAGE_DIR = F:/CL/Drei/Project/Joshua/srilm*, which is the directory *srilm.tgz* was unzipped into. When I try the make command again, I get the following: Makefile:100: *** target pattern contains no `%'. Stop. I know there is some problem with line 100 in the *Makefile*, which is: package: $(PACKAGE_DIR)/EXCLUDE $(TAR) cvzXf $(PACKAGE_DIR)/EXCLUDE $(PACKAGE_DIR)/srilm-$(RELEASE).tar.gz . Where should I add `%'? Or is there some other problem? Please help. Thanks. Xichuan From stolcke at icsi.berkeley.edu Tue Nov 1 14:27:37 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 01 Nov 2011 14:27:37 -0700 Subject: [SRILM User List] Problem on Installing SRILM In-Reply-To: Your message of Tue, 01 Nov 2011 18:36:39 +0100. Message-ID: <201111012128.pA1LRbXX022986@fruitcake.ICSI.Berkeley.EDU> You didn't set the SRILM variable correctly. Either edit the top-level Makefile or invoke make with make SRILM=/absolute/path/to/srilm World Do not change PACKAGE_DIR; change the SRILM variable instead. Do not use DOS-style path names (F:\...). Use Cygwin paths, like /home/username/srilm. Andreas In message you wrote: > > Hi All, > > I have been trying to install SRILM but confronted with one problem, which > googling does not help. Some infos about the platform: Win7, 32bit, Cygwin > including '*csh*' and '*tcsh*'. I am working with *Joshua decoder*.
> > After downloading and unzipping *srilm.tgz*, I tried *make* command and got > the following: > make: /sbin/machine-type: Command not found > mkdir include lib bin > mkdir: cannot create directory `include': File exists > mkdir: cannot create directory `lib': File exists > mkdir: cannot create directory `bin': File exists > make: [dirs] Error 1 (ignored) > make init > make[1]: /sbin/machine-type: Command not found > make[1]: Entering directory `/cygdrive/f/CL/Drei/Project/Joshua/srilm' > make[1]: Entering directory `/cygdrive/f/CL/Drei/Project/Joshua/srilm' > for subdir in misc dstruct lm flm lattice utils; do \ > (cd $subdir/src; make SRILM= MACHINE_TYPE= OPTION= MAKE_PIC= init) || exit > 1; \ > done > make[2]: Entering directory > `/cygdrive/f/CL/Drei/Project/Joshua/srilm/misc/src' > Makefile:24: /common/Makefile.common.variables: No such file or directory > Makefile:139: /common/Makefile.common.targets: No such file or directory > make[2]: *** No rule to make target `/common/Makefile.common.targets'. > Stop. > make[2]: Leaving directory > `/cygdrive/f/CL/Drei/Project/Joshua/srilm/misc/src' > make[1]: *** [init] Error 1 > make[1]: Leaving directory `/cygdrive/f/CL/Drei/Project/Joshua/srilm' > make: *** [World] Error 2 > > > After I changed *Makefile* in the top level, specifically *PACKAGE_DIR = > F:/CL/Drei/Project/Joshua/srilm*, where the directory is the one > *srilm.tgz*unzipped into. When I try make command, what I get is the > following: > > Makefile:100: *** target pattern contains no `%'. Stop. > From d_emps at yahoo.com Sat Nov 19 07:13:02 2011 From: d_emps at yahoo.com (Simon h s) Date: Sat, 19 Nov 2011 07:13:02 -0800 (PST) Subject: [SRILM User List] problem installing ubuntu 11.10 Message-ID: <1321715582.63739.YahooMailNeo@web110608.mail.gq1.yahoo.com> Dear all,? 
I have a problem when compiling SRILM in Ubuntu 11.10. After installing all the required packages mentioned in INSTALL, I have the following error when running make World: make[2]: Entering directory `/home/ndriks/Thesis/tools/atools/srilm/misc/src' /usr/bin/gcc -march=athlon64 -m64 -Wall -Wno-unused-variable -Wno-uninitialized -D_FILE_OFFSET_BITS=64 /usr/include/tcl8.5/tcl.h -I. -I../../include -c -g -O3 -o ../obj/i686/option.o option.c gcc: fatal error: cannot specify -o with -c, -S or -E with multiple files compilation terminated. make[2]: *** [../obj/i686/option.o] Error 4 make[2]: Leaving directory `/home/ndriks/Thesis/tools/atools/srilm/misc/src' make[1]: *** [release-libraries] Error 1 make[1]: Leaving directory `/home/ndriks/Thesis/tools/atools/srilm' make: *** [World] Error 2 The full error is attached. FYI I'm using srilm 1.5.12 Please help? Thanks in advance -- Simon H S -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: makeworld.out Type: application/octet-stream Size: 13574 bytes Desc: not available URL: From wuxichuan.go at gmail.com Sat Nov 19 10:00:59 2011 From: wuxichuan.go at gmail.com (Xichuan Wu) Date: Sat, 19 Nov 2011 19:00:59 +0100 Subject: [SRILM User List] problem installing ubuntu 11.10 In-Reply-To: <1321715582.63739.YahooMailNeo@web110608.mail.gq1.yahoo.com> References: <1321715582.63739.YahooMailNeo@web110608.mail.gq1.yahoo.com> Message-ID: Hi Simon, I got this working recently on Ubuntu 11.10. Here is a suggestion you can follow: 1.
In the file common/Makefile.machine.$MACHINE_TYPE (under the SRILM top-level directory), change GCC_FLAGS = -march=athlon64 -m64 -Wall -Wno-unused-variable -Wno-uninitialized to GCC_FLAGS = -march=athlon64 -m64 -Wall -Wno-unused-variable -Wno-uninitialized -fPIC Then run with the command: make MAKE_PIC=yes MACHINE_TYPE=$MACHINE_TYPE World Note that 1) here $MACHINE_TYPE refers to your machine type (you can try either "i686-m64" or "i686-ubuntu"); 2) before the make command, you might need to use "make clean" to clean what's left from the previous compile. 2. Consult the "Joshua technical support" Google group, where you will find more info. Good luck! Xichuan On Sat, Nov 19, 2011 at 4:13 PM, Simon h s wrote: > Dear all, > > I have a problem when compiling SRILM in Ubuntu 11.10 > > after installing all required package mentioned in INSTALL, I have the > following error when running make World: > > make[2]: Entering directory > `/home/ndriks/Thesis/tools/atools/srilm/misc/src' > /usr/bin/gcc -march=athlon64 -m64 -Wall -Wno-unused-variable > -Wno-uninitialized -D_FILE_OFFSET_BITS=64 /usr/include/tcl8.5/tcl.h -I. > -I../../include -c -g -O3 -o ../obj/i686/option.o option.c > gcc: fatal error: cannot specify -o with -c, -S or -E with multiple files > compilation terminated. > make[2]: *** [../obj/i686/option.o] Error 4 > make[2]: Leaving directory > `/home/ndriks/Thesis/tools/atools/srilm/misc/src' > make[1]: *** [release-libraries] Error 1 > make[1]: Leaving directory `/home/ndriks/Thesis/tools/atools/srilm' > make: *** [World] Error 2 > > full error is attached. > > FYI I'm using srilm 1.5.12 > > Please help? > Thanks before > > -- > Simon H S > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed...
URL: From stolcke at icsi.berkeley.edu Sat Nov 19 14:39:35 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sat, 19 Nov 2011 14:39:35 -0800 Subject: [SRILM User List] problem installing ubuntu 11.10 In-Reply-To: <1321715582.63739.YahooMailNeo@web110608.mail.gq1.yahoo.com> References: <1321715582.63739.YahooMailNeo@web110608.mail.gq1.yahoo.com> Message-ID: <4EC83027.3000507@icsi.berkeley.edu> Simon h s wrote: > Dear all, > > I have a problem when compiling SRILM in Ubuntu 11.10 > > after installing all required package mentioned in INSTALL, I have the > following error when running make World: > > make[2]: Entering directory > `/home/ndriks/Thesis/tools/atools/srilm/misc/src' > /usr/bin/gcc -march=athlon64 -m64 -Wall -Wno-unused-variable > -Wno-uninitialized -D_FILE_OFFSET_BITS=64 /usr/include/tcl8.5/tcl.h > -I. -I../../include -c -g -O3 -o ../obj/i686/option.o option.c > gcc: fatal error: cannot specify -o with -c, -S or -E with multiple files > compilation terminated. > make[2]: *** [../obj/i686/option.o] Error 4 > make[2]: Leaving directory > `/home/ndriks/Thesis/tools/atools/srilm/misc/src' > make[1]: *** [release-libraries] Error 1 > make[1]: Leaving directory `/home/ndriks/Thesis/tools/atools/srilm' > make: *** [World] Error 2 I believe the problem is caused by having TCL_INCLUDE = /usr/include/tcl8.5/tcl.h You should use TCL_INCLUDE = -I/usr/include/tcl8.5 instead in Makefile.machine.i686 (or Makefile.machine.i686-m64). 
Andreas From dmytro.prylipko at ovgu.de Fri Dec 16 01:47:57 2011 From: dmytro.prylipko at ovgu.de (Dmytro Prylipko) Date: Fri, 16 Dec 2011 10:47:57 +0100 Subject: [SRILM User List] A problem with expanding class-based LMs Message-ID: <4EEB13CD.1050007@ovgu.de> Hi Andreas, I have a class-based LM, which gives a particular perplexity value on the test set: ngram -ppl test.fold3.txt -lm 2-gram.class.dd150.fold3.lm -classes class.dd150.fold3.defs -order 2 -vocab ../all.wlist file test.fold3.txt: 1397 sentences, 37403 words, 0 OOVs 427 zeroprobs, logprob= -72617.1 ppl= 78.0551 ppl1= 92.0235 I expanded it and got a word-level model: ngram -lm 2-gram.class.dd150.fold3.lm -classes class.dd150.fold3.defs -order 2 -write-lm 2-gram.class.dd150.expanded_exact.fold3.lm -expand-classes 2 -expand-exact 2 -vocab ../all.wlist But the new model gives a different result: ngram -ppl test.fold3.txt -lm 2-gram.class.dd150.expanded_exact.fold3.lm -order 2 -vocab ../all.wlist file test.fold3.txt: 1397 sentences, 37403 words, 0 OOVs 0 zeroprobs, logprob= -78108.4 ppl= 103.063 ppl1= 122.544 You can see there are no more zeroprobs in the new one, which affects the perplexity. I can show you detailed output from both models: Class-based: gruess gott frau traub p( gruess | <s> ) = [OOV][2gram] 0.0167159 [ -1.77687 ] p( gott | gruess ...) = [OOV][1gram][OOV][2gram] 0.658525 [ -0.181428 ] p( frau | gott ...) = [OOV][1gram][OOV][2gram] 0.119973 [ -0.920917 ] p( traub | frau ...) = [OOV][OOV] 0 [ -inf ] p( </s> | traub ...) = [1gram] 0.0377397 [ -1.4232 ] 1 sentences, 4 words, 0 OOVs 1 zeroprobs, logprob= -4.30242 ppl= 11.9016 ppl1= 27.1731 And the same sentence with expanded LM: gruess gott frau traub p( gruess | <s> ) = [2gram] 0.0167159 [ -1.77687 ] p( gott | gruess ...) = [2gram] 0.658525 [ -0.181428 ] p( frau | gott ...) = [2gram] 0.119973 [ -0.920917 ] p( traub | frau ...) = [1gram] 3.84699e-14 [ -13.4149 ] p( </s> | traub ...)
= [1gram] 0.0377397 [ -1.4232 ] 1 sentences, 4 words, 0 OOVs 0 zeroprobs, logprob= -17.7173 ppl= 3495.1 ppl1= 26873.5 From my point of view it looks like a computational error; such small probabilities should be treated as zero. BTW, how can zero probabilities appear there? They should be smoothed, right? I divided my corpus into 10 folds and performed these actions on all of them. With 6 folds everything is fine, perplexities are almost the same for both models, but with the other 4 parts I have such a problem. I would greatly appreciate any help. Sincerely yours, Dmytro Prylipko. From dyuret at ku.edu.tr Mon Dec 19 03:52:13 2011 From: dyuret at ku.edu.tr (Deniz Yuret) Date: Mon, 19 Dec 2011 13:52:13 +0200 Subject: [SRILM User List] lines starting with ## skipped Message-ID: Hi, I was working on the reuters rcv1 corpus and while investigating a discrepancy in the language model output I realized that the ngram command skips lines in the test file that start with '##'. Is this a documented feature or a bug? best, deniz From stolcke at icsi.berkeley.edu Mon Dec 19 11:32:19 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Mon, 19 Dec 2011 11:32:19 -0800 Subject: [SRILM User List] lines starting with ## skipped In-Reply-To: References: Message-ID: <4EEF9143.2050804@icsi.berkeley.edu> Deniz Yuret wrote: > Hi, > > I was working on the reuters rcv1 corpus and while investigating a > discrepancy in the language model output I realized that the ngram > command skips lines in the test file that start with '##'. Is this a > documented feature or a bug? > Yes, it's a feature of the File::getline() function, but not documented. In the API you can disable this by setting the skipComments variable in the File object to false. There is currently no way to do it at the command line (but would be easy to add an option). A workaround is to insert a space character at the beginning of each input line.
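That command-line workaround is easy to script with sed; the file names below are made up for illustration, and the final ngram call is left as a comment since it needs a real LM file:

```shell
# Shift every line right by one space so no line begins with '##'
# (a leading space does not change whitespace-based tokenization).
printf '## reuters doc id\nthe quick brown fox\n' > /tmp/ppl_input.txt
sed 's/^/ /' /tmp/ppl_input.txt > /tmp/ppl_input.spaced.txt
cat /tmp/ppl_input.spaced.txt
# then score as usual, e.g.:  ngram -lm reuters.lm -ppl /tmp/ppl_input.spaced.txt
```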
Andreas From fmang at ieee.org Tue Dec 20 19:26:28 2011 From: fmang at ieee.org (Federico Ang) Date: Wed, 21 Dec 2011 11:26:28 +0800 Subject: [SRILM User List] lattice-ngram test seg fault on Ubuntu with x86_64 Message-ID: Hello, I successfully compiled SRILM 1.6.0 with Ubuntu 11.04 on an Intel Core i5 with -march=core2 -m64 (I edited the i686-m64 makefile) and with Tcl 8.5, Gawk 3.1.8, and gcc/g++ 4.6.1. On make test, all tests give IDENTICAL results except for the lattice-ngram test, which gives DIFFERS for both stdout and stderr. Investigating the problem, I found that the stdout for the output is empty. On the other hand, stderr output is exactly the same as the stderr reference except that there's a Segmentation Fault on the last line. I don't know how to investigate further. Please advise so I can have all tests pass. Best, Federico Ang DSP Laboratory, EEE Institute Univ. of the Phils., Diliman -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Tue Dec 20 23:22:05 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 20 Dec 2011 23:22:05 -0800 Subject: [SRILM User List] lattice-ngram test seg fault on Ubuntu with x86_64 In-Reply-To: References: Message-ID: <4EF1891D.2090601@icsi.berkeley.edu> Federico Ang wrote: > Hello, > > I successfully compiled SRILM 1.6.0 with Ubuntu 11.04 on an Intel Core > i5 with -march=core2 -m64 (I edited the i686-m64 makefile) and with > Tcl 8.5, Gawk 3.1.8, and gcc/g++ 4.6.1 . On make test, all test gives > IDENTICAL results except for the lattice-ngram test, which gives > DIFFERS for both stdout and stderr. Investigating the problem, I > found that the stdout for the output is empty. On the other hand, > stderr output is exactly the same as stderr reference except that > there's Segmentation Fault on the last line. I don't know how to > investigate further. Please advise so I can have all test passed.
Check if you get the same problem with default compiler options (without -march=core2) and, if possible, with older versions of gcc. I have not seen core dumps on any tests, including with Ubuntu systems I have access to, though the compiler versions might have been less recent. Andreas From stolcke at icsi.berkeley.edu Wed Dec 21 14:47:42 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 21 Dec 2011 14:47:42 -0800 Subject: [SRILM User List] lattice-ngram test seg fault on Ubuntu with x86_64 In-Reply-To: References: <4EF1891D.2090601@icsi.berkeley.edu> Message-ID: <4EF2620E.6080808@icsi.berkeley.edu> Federico Ang wrote: > You were right. gcc-4.5 did the trick, and it's not about the > architecture/instruction set. Thank you so much! :) Glad to hear it. cc-ing srilm-user for the record. Andreas > > Best, > Fed Ang > > On Wed, Dec 21, 2011 at 3:22 PM, Andreas Stolcke > > wrote: > > Federico Ang wrote: > > Hello, > > I successfully compiled SRILM 1.6.0 with Ubuntu 11.04 on an > Intel Core i5 with -march=core2 -m64 (I edited the i686-m64 > makefile) and with Tcl 8.5, Gawk 3.1.8, and gcc/g++ 4.6.1 . > On make test, all test gives IDENTICAL results except for the > lattice-ngram test, which gives DIFFERS for both stdout and > stderr. Investigating the problem, I found that the stdout > for the output is empty. On the other hand, stderr output is > exactly the same as stderr reference except that there's > Segmentation Fault on the last line. I don't know how to > investigate further. Please advise so I can have all test passed. > > Check if you get the same problem with default compiler options > (without -march=core2) and, if possible, with older versions of > gcc. I have not seen core dumps on any tests, including with > Ubuntu systems I have access to, though the compiler versions > might have been less recent.
> > Andreas > > From stolcke at icsi.berkeley.edu Wed Dec 21 17:03:51 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 21 Dec 2011 17:03:51 -0800 Subject: [SRILM User List] A problem with expanding class-based LMs In-Reply-To: <4EEB13CD.1050007@ovgu.de> References: <4EEB13CD.1050007@ovgu.de> Message-ID: <4EF281F7.5080805@icsi.berkeley.edu> My guess is that your class definitions contain multiple words per expansion, such as "GREETING" expanding to "gruess gott". In that case a bigram expansion of the LM will not have as much predictive power as the original class bigram LM. Try using -expand-classes 3 (or even higher). Andreas Dmytro Prylipko wrote: > Hi Andreas, > > I have a class-based LM, which gives a particular perplexity value on > the test set: > > ngram -ppl test.fold3.txt -lm 2-gram.class.dd150.fold3.lm -classes > class.dd150.fold3.defs -order 2 -vocab ../all.wlist > > file test.fold3.txt: 1397 sentences, 37403 words, 0 OOVs > 427 zeroprobs, logprob= -72617.1 ppl= 78.0551 ppl1= 92.0235 > > I expanded it and got a word-level model: > > ngram -lm 2-gram.class.dd150.fold3.lm -classes class.dd150.fold3.defs > -order 2 -write-lm 2-gram.class.dd150.expanded_exact.fold3.lm > -expand-classes 2 -expand-exact 2 -vocab ../all.wlist > > > But the new model provides different result: > > ngram -ppl test.fold3.txt -lm > 2-gram.class.dd150.expanded_exact.fold3.lm -order 2 -vocab ../all.wlist > > file test.fold3.txt: 1397 sentences, 37403 words, 0 OOVs > 0 zeroprobs, logprob= -78108.4 ppl= 103.063 ppl1= 122.544 > > You can see there is no more zeroprobs in the new one, which .affects > the perplexity. > > > I can show you detailed output from both models: > > Class-based: > > gruess gott frau traub > p( gruess | ) = [OOV][2gram] 0.0167159 [ -1.77687 ] > p( gott | gruess ...) = [OOV][1gram][OOV][2gram] 0.658525 [ > -0.181428 ] > p( frau | gott ...) = [OOV][1gram][OOV][2gram] 0.119973 [ > -0.920917 ] > p( traub | frau ...) 
= [OOV][OOV] 0 [ -inf ] > p( | traub ...) = [1gram] 0.0377397 [ -1.4232 ] > 1 sentences, 4 words, 0 OOVs > 1 zeroprobs, logprob= -4.30242 ppl= 11.9016 ppl1= 27.1731 > > > And the same sentence with expanded LM: > > gruess gott frau traub > p( gruess | ) = [2gram] 0.0167159 [ -1.77687 ] > p( gott | gruess ...) = [2gram] 0.658525 [ -0.181428 ] > p( frau | gott ...) = [2gram] 0.119973 [ -0.920917 ] > p( traub | frau ...) = [1gram] 3.84699e-14 [ -13.4149 ] > p( | traub ...) = [1gram] 0.0377397 [ -1.4232 ] > 1 sentences, 4 words, 0 OOVs > 0 zeroprobs, logprob= -17.7173 ppl= 3495.1 ppl1= 26873.5 > > > From my point of view it looks like a computational error, such a > small probabilities should be treated as zero. > BTW, how can zero probabilities appear there? They should be smoothed, > right? > > I divided my corpus on 10 folds and performed these actions on all of > them. With 6 folds everything is fine, perplexities are almost the > same for both models, but with other 4 parts I have such a problem. > > I would be greatly appreciated for any help. > > Sincerely yours, > Dmytro Prylipko. > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From stolcke at icsi.berkeley.edu Thu Dec 22 15:19:49 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 22 Dec 2011 15:19:49 -0800 Subject: [SRILM User List] A problem with expanding class-based LMs In-Reply-To: Your message of Thu, 22 Dec 2011 20:50:43 +0100. <4EF38A13.7020309@ovgu.de> Message-ID: <201112222319.pBMNJnV5024999@fruitcake.ICSI.Berkeley.EDU> The problem turns out to be a sensitivity in the backoff computation to sums of probabilities that are exactly zero versus numerically equal to zero (less than Prob_Epsilon). 
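The exactly-zero versus numerically-zero distinction is easy to see with ordinary doubles. Here is a toy awk check, not SRILM code; the 3e-06 value is an assumption, taken from my reading of the Prob_Epsilon constant in SRILM's Prob.h:

```shell
# Subtract a total mass of 1.0 in ten 0.1-sized steps: rounding leaves a
# tiny residue, so an exact '== 0' test misfires while an epsilon test holds.
awk 'BEGIN {
  eps = 3e-06                       # assumed value of SRILM Prob_Epsilon
  s = 1.0
  for (i = 0; i < 10; i++) s -= 0.1
  printf "residue=%g exactly_zero=%d below_epsilon=%d\n", s, (s == 0.0), (s < eps && s > -eps)
}'
```

The residue depends on the order of summation, which is exactly why iterating over sorted arrays versus hash tables produced different perplexities before the patch.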
In your case, the sum of unigram probs of the expanded LM is sometimes very slightly less than 1, causing some probability mass to be distributed over all the unseen words, and the perplexity to be changed noticeably. The patch below will catch these cases and produce consistent results independent of these small numerical differences (which result from probabilities being summed in a different order, depending on whether the iteration is over sorted arrays or hash tables). Andreas

diff -c -r1.122 NgramLM.cc
*** lm/src/NgramLM.cc	30 May 2011 23:46:38 -0000	1.122
--- lm/src/NgramLM.cc	22 Dec 2011 22:27:58 -0000
***************
*** 2118,2125 ****
       * unigrams, which we achieve by giving them zero probability.
       */
      if (order == 0 /*&& numerator > 0.0*/) {
	  distributeProb(numerator, context);
!     } else if (numerator == 0.0 && denominator == 0.0) {
	  node->bow = LogP_One;
      } else {
	  node->bow = ProbToLogP(numerator) - ProbToLogP(denominator);
--- 2118,2131 ----
       * unigrams, which we achieve by giving them zero probability.
       */
      if (order == 0 /*&& numerator > 0.0*/) {
+	  if (numerator < Prob_Epsilon) {
+	      /*
+	       * Avoid spurious non-zero unigram probabilities
+	       */
+	      numerator = 0.0;
+	  }
	  distributeProb(numerator, context);
!     } else if (numerator < Prob_Epsilon && denominator < Prob_Epsilon) {
	  node->bow = LogP_One;
      } else {
	  node->bow = ProbToLogP(numerator) - ProbToLogP(denominator);

In message <4EF38A13.7020309 at ovgu.de> you wrote: > > I had repeated expansion with different binaries and got different > results again. > I attached the source files and corresponding scripts to this e-mail. I > did not include the expanded models since they are too large, but they > are also available. > > I hope this will help you to investigate the problem. > > Sincerely yours, > Dmytro Prylipko. > > On 12/22/2011 7:38 PM, Andreas Stolcke wrote: > > Dmytro Prylipko wrote: > >> I tried expansion also on trigrams with the same problem. > >> Actually I managed to cope with it.
I compiled the SRILM with the "_c" > >> option and expanded my bigrams with that binary. It helped > >> (perplexity measures became the same), however in this case another > >> bigrams (expanded ok with usual binary) had the problem described > >> before. Is it a bug? > > You should never get different results (other than sorting order, > > e.g., in counts files) with the regular and the _c version. > > Can you send me the inputs involved? > > > > Andreas > > From ghenryww at roadrunner.com Fri Dec 23 12:21:56 2011 From: ghenryww at roadrunner.com (Gil Henry) Date: Fri, 23 Dec 2011 12:21:56 -0800 Subject: [SRILM User List] ngram count Message-ID: <000d01ccc1b0$84633200$8d299600$@com> I have subscribed! I am getting the message ngram count no command when I execute ngram count with all of the necessary parameters and proper syntax. Reference SRILM FAQ A1, I have tried scripts; nothing works. Srilm/bin has commands for ngram and ngram-count (console display). make World, make all, make cleanest run with no errors. make test runs with with all "identical", no "differs". Thanks, any help will be appreciated. Gilbert L. Henry ghenryww at roadrunner.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Fri Dec 23 23:59:47 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 23 Dec 2011 23:59:47 -0800 Subject: [SRILM User List] ngram count In-Reply-To: <000d01ccc1b0$84633200$8d299600$@com> References: <000d01ccc1b0$84633200$8d299600$@com> Message-ID: <4EF58673.8060709@icsi.berkeley.edu> Gil Henry wrote: > > I have subscribed! I am getting the message ngram count no command > when I execute ngram count with all of the necessary parameters and > proper syntax. Reference SRILM FAQ A1, I have tried scripts; nothing > works. Srilm/bin has commands for ngram and ngram-count (console > display). make World, make all, make cleanest run with no errors. 
make > test runs with all "identical", no "differs". > If the tests succeed and the bin directory is populated, then the build was successful, and your only problem is that you cannot find the binaries in your executable search path. Make sure the PATH variable includes $SRILM/bin/$MACHINE_TYPE, where $MACHINE_TYPE is the platform name you built for. If you can't manage that, ask for help from a local Linux or Windows expert. Andreas > Thanks, any help will be appreciated. > > Gilbert L. Henry > > ghenryww at roadrunner.com > > ------------------------------------------------------------------------ > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From saman_2004 at yahoo.com Mon Dec 26 20:26:28 2011 From: saman_2004 at yahoo.com (Saman Noorzadeh) Date: Mon, 26 Dec 2011 20:26:28 -0800 (PST) Subject: [SRILM User List] big difference between ppl and ppl1 Message-ID: <1324959988.52750.YahooMailNeo@web162006.mail.bf1.yahoo.com> I made 2 models of 2 languages, Dutch and English, to do language recognition. I got the following perplexities:

Model: Dutch    Test: English   ppl: 55    ppl1: 2*10^18
Model: Dutch    Test: Dutch     ppl: 303   ppl1: 400
Model: English  Test: Dutch     ppl: 600   ppl1: 3122
Model: English  Test: English   ppl: 227   ppl1: 1897

I think it is reasonable if I have a large perplexity when my model and test are different, but why ppl=55 when having a Dutch model and an English test? and Why is there a BIG difference in their ppl and ppl1? Thanks in advance -------------- next part -------------- An HTML attachment was scrubbed...
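On the ppl versus ppl1 question: as I read the ngram(1) man page, the two differ only in the normalization, with end-of-sentence tokens counted for ppl but not for ppl1, and OOVs and zeroprobs excluded from both. The awk check below reproduces the class-based LM figures reported earlier in this digest:

```shell
# ppl  = 10^(-logprob / (words - OOVs - zeroprobs + sentences))
# ppl1 = 10^(-logprob / (words - OOVs - zeroprobs))
awk 'BEGIN {
  logprob = -72617.1; sentences = 1397; words = 37403; oovs = 0; zeroprobs = 427
  denom = words - oovs - zeroprobs
  printf "ppl=%.1f ppl1=%.1f\n", 10 ^ (-logprob / (denom + sentences)), 10 ^ (-logprob / denom)
}'
# ngram reported ppl= 78.0551 ppl1= 92.0235 for these statistics
```

A tiny ppl next to an astronomical ppl1, as in the Dutch-model/English-test row, likely means nearly every test word was an OOV and therefore skipped: little besides the end-of-sentence events is left in the denominators, so neither number reflects a genuinely good fit.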
URL: From burkay at mit.edu Tue Dec 27 00:56:32 2011 From: burkay at mit.edu (Burkay Gur) Date: Tue, 27 Dec 2011 10:56:32 +0200 Subject: [SRILM User List] big difference between ppl and ppl1 In-Reply-To: <1324959988.52750.YahooMailNeo@web162006.mail.bf1.yahoo.com> References: <1324959988.52750.YahooMailNeo@web162006.mail.bf1.yahoo.com> Message-ID: <0B3009A4-3E4E-4982-A4DF-D52FAC17A9F6@mit.edu> Is your Dutch model arranged so that there is one sentence on each line? Also which command are you using? I recommend using -gt1max 1 -gt2max 1 -gt3max 1 and -ukndiscount for kneser ney smoothing. These will give you more accurate perplexities. -Burkay Sent from my iPad On Dec 27, 2011, at 6:26 AM, Saman Noorzadeh wrote: > > I made 2 models of 2 languages, Dutch and English, to make a language recognition. > I got the following perplexities: > > Model: Dutch Test: English ppl:55 ppl2: 2* 10^18 > Model: Dutch Test: Dutch ppl:303 ppl2: 400 > Model: English Test: Dutch ppl: 600 ppl2: 3122ses n > Model: English Test: English ppl: 227 ppl2: 1897 > > I think it is reasonable if I have a large perplexity when my model and test are different but why ppl=55 when having a Duch model and an English test? > and > Why is there a BIG difference in their ppl and ppl1 ? > > Thanks in advance > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From saman_2004 at yahoo.com Tue Dec 27 03:58:34 2011 From: saman_2004 at yahoo.com (Saman Noorzadeh) Date: Tue, 27 Dec 2011 03:58:34 -0800 (PST) Subject: [SRILM User List] big difference between ppl and ppl1 In-Reply-To: <0B3009A4-3E4E-4982-A4DF-D52FAC17A9F6@mit.edu> References: <1324959988.52750.YahooMailNeo@web162006.mail.bf1.yahoo.com> <0B3009A4-3E4E-4982-A4DF-D52FAC17A9F6@mit.edu> Message-ID: <1324987114.82504.YahooMailNeo@web162004.mail.bf1.yahoo.com> Yes, both of my texts are 1 sentence per line (but some sentences are a little long!). I used the gtmax options but the results were almost the same. The commands I use are the following: to count: ngram-count -order 3 -write-vocab language.voc -text language_tain.txt -write language.bo to make the model: ngram-count -order 3 language.bo -lm language.BO -gt2min 1 -gt3min 2 testing Perplexity: ngram -lm language.BO -ppl language_test.txt Thank you Saman ________________________________ From: Burkay Gur To: Saman Noorzadeh Cc: Srilm group Sent: Tuesday, December 27, 2011 12:56 AM Subject: Re: [SRILM User List] big difference between ppl and ppl1 Is your Dutch model arranged so that there is one sentence on each line? Also which command are you using? I recommend using -gt1max 1 -gt2max 1 -gt3max 1 and -ukndiscount for Kneser-Ney smoothing. These will give you more accurate perplexities. -Burkay Sent from my iPad On Dec 27, 2011, at 6:26 AM, Saman Noorzadeh wrote: > >I made 2 models of 2 languages, Dutch and English, to do language recognition. >I got the following perplexities: > > >Model: Dutch Test: English ppl: 55 ppl1: 2*10^18 >Model: Dutch Test: Dutch ppl: 303 ppl1: 400 >Model: English Test: Dutch ppl: 600 ppl1: 3122 > >Model: English Test: English ppl: 227 ppl1: 1897 > > >I think it is reasonable if I have a large perplexity when my model and test are different but why ppl=55 when having a Dutch model and an English test?
>and > >Why is there a BIG difference in their ppl and ppl1? > > >Thanks in advance > > > > > _______________________________________________ >SRILM-User site list >SRILM-User at speech.sri.com >http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From burkay at mit.edu Tue Dec 27 05:32:16 2011 From: burkay at mit.edu (Burkay Gur) Date: Tue, 27 Dec 2011 15:32:16 +0200 Subject: [SRILM User List] big difference between ppl and ppl1 In-Reply-To: <1324987114.82504.YahooMailNeo@web162004.mail.bf1.yahoo.com> References: <1324959988.52750.YahooMailNeo@web162006.mail.bf1.yahoo.com> <0B3009A4-3E4E-4982-A4DF-D52FAC17A9F6@mit.edu> <1324987114.82504.YahooMailNeo@web162004.mail.bf1.yahoo.com> Message-ID: <5E431A43-7D70-4C07-BD2B-AE86B2B5C145@mit.edu> To get lower and more relevant perplexities I'd recommend getting rid of the -order 3 and adding Kneser-Ney smoothing. Also make sure the corpora are not too small. Sent from my iPad On Dec 27, 2011, at 1:58 PM, Saman Noorzadeh wrote: > Yes both of my texts are 1 sentence per line, (but some sentences are a little long!) > I used gtmax options but the results were almost the same > the commands I use are as following: > > to count: > ngram-count -order 3 -write-vocab language.voc -text language_tain.txt -write language.bo > > to make the model: > ngram-count -order 3 language.bo -lm language.BO -gt2min 1 -gt3min 2 > > testing Perplexity: > ngram -lm language.BO -ppl language_test.txt > > Thank you > Saman > From: Burkay Gur > To: Saman Noorzadeh > Cc: Srilm group > Sent: Tuesday, December 27, 2011 12:56 AM > Subject: Re: [SRILM User List] big difference between ppl and ppl1 > > Is your Dutch model arranged so that there is one sentence on each line? Also which command are you using? I recommend using -gt1max 1 -gt2max 1 -gt3max 1 and -ukndiscount for Kneser-Ney smoothing. These will give you more accurate perplexities.
> > -Burkay > > Sent from my iPad > > On Dec 27, 2011, at 6:26 AM, Saman Noorzadeh wrote: > >> >> I made 2 models of 2 languages, Dutch and English, to do language recognition. >> I got the following perplexities: >> >> Model: Dutch Test: English ppl: 55 ppl1: 2*10^18 >> Model: Dutch Test: Dutch ppl: 303 ppl1: 400 >> Model: English Test: Dutch ppl: 600 ppl1: 3122 >> Model: English Test: English ppl: 227 ppl1: 1897 >> >> I think it is reasonable if I have a large perplexity when my model and test are different but why ppl=55 when having a Dutch model and an English test? >> and >> Why is there a BIG difference in their ppl and ppl1? >> >> Thanks in advance >> >> >> _______________________________________________ >> SRILM-User site list >> SRILM-User at speech.sri.com >> http://www.speech.sri.com/mailman/listinfo/srilm-user > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From eragani at gmail.com Thu Dec 29 21:13:36 2011 From: eragani at gmail.com (anil krishna eragani) Date: Fri, 30 Dec 2011 04:13:36 -0100 Subject: [SRILM User List] Difficulty installing SRILM Message-ID: make[2]: Entering directory `/home/eragani/Documents/Nlp_Tools/srilm/misc/src' gcc -m32 -mtune=pentium3 -Wall -Wno-unused-variable -Wno-uninitialized -D_FILE_OFFSET_BITS=64 -I/usr/include -I. -I../../include -c -g -O3 -o ../obj/i686/option.o option.c In file included from /usr/include/time.h:4:0, from /usr/include/sys/types.h:133, from /usr/include/stdlib.h:320, from option.c:23: /usr/include/v8.h:79:1: error: unknown type name 'namespace' /usr/include/v8.h:79:14: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token In file included from /usr/include/sys/types.h:133:0, from /usr/include/stdlib.h:320, from option.c:23: /usr/include/time.h:6:1: error: unknown type name 'namespace'
/usr/include/time.h:6:14: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token option.c:34:57: error: unknown type name 'time_t' option.c: In function 'Opt_Parse': option.c:195:5: warning: implicit declaration of function 'ParseTime' [-Wimplicit-function-declaration] option.c:196:9: error: 'time_t' undeclared (first use in this function) option.c:196:9: note: each undeclared identifier is reported only once for each function it appears in option.c:196:17: error: expected expression before ')' token option.c: At top level: option.c:400:5: error: unknown type name 'time_t' make[2]: *** [../obj/i686/option.o] Error 1 make[2]: Leaving directory `/home/eragani/Documents/Nlp_Tools/srilm/misc/src' make[1]: *** [release-libraries] Error 1 make[1]: Leaving directory `/home/eragani/Documents/Nlp_Tools/srilm' make: *** [World] Error 2 -------------- next part -------------- An HTML attachment was scrubbed... URL: From eragani at gmail.com Thu Dec 29 21:22:49 2011 From: eragani at gmail.com (anil krishna eragani) Date: Fri, 30 Dec 2011 04:22:49 -0100 Subject: [SRILM User List] Difficulty installing SRILM Message-ID: uname -a Linux anil-laptop 2.6.40.6-0.fc15.i686.PAE #1 SMP Tue Oct 4 00:44:38 UTC 2011 i686 i686 i386 GNU/Linux gcc version 4.6.1 20110908 (Red Hat 4.6.1-9) (GCC) make[2]: Entering directory `/home/eragani/Documents/Nlp_Tools/srilm/misc/src' gcc -m32 -mtune=pentium3 -Wall -Wno-unused-variable -Wno-uninitialized -D_FILE_OFFSET_BITS=64 -I/usr/include -I. -I../../include -c -g -O3 -o ../obj/i686/option.o option.c In file included from /usr/include/time.h:4:0, from /usr/include/sys/types.h:133, from /usr/include/stdlib.h:320, from option.c:23: /usr/include/v8.h:79:1: error: unknown type name 'namespace' /usr/include/v8.h:79:14: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{'
token In file included from /usr/include/sys/types.h:133:0, from /usr/include/stdlib.h:320, from option.c:23: /usr/include/time.h:6:1: error: unknown type name 'namespace' /usr/include/time.h:6:14: error: expected '=', ',', ';', 'asm' or '__attribute__' before '{' token option.c:34:57: error: unknown type name 'time_t' option.c: In function 'Opt_Parse': option.c:195:5: warning: implicit declaration of function 'ParseTime' [-Wimplicit-function-declaration] option.c:196:9: error: 'time_t' undeclared (first use in this function) option.c:196:9: note: each undeclared identifier is reported only once for each function it appears in option.c:196:17: error: expected expression before ')' token option.c: At top level: option.c:400:5: error: unknown type name 'time_t' make[2]: *** [../obj/i686/option.o] Error 1 make[2]: Leaving directory `/home/eragani/Documents/Nlp_Tools/srilm/misc/src' make[1]: *** [release-libraries] Error 1 make[1]: Leaving directory `/home/eragani/Documents/Nlp_Tools/srilm' make: *** [World] Error 2 -------------- next part -------------- An HTML attachment was scrubbed... URL: