From stolcke at icsi.berkeley.edu Wed Jul 3 00:02:44 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 03 Jul 2013 00:02:44 -0700 Subject: [SRILM User List] About language model rescoring output In-Reply-To: References: Message-ID: <51D3CC94.2040006@icsi.berkeley.edu> On 6/25/2013 6:51 AM, yuan liang wrote: > Hi all, > > I want to rescore a bigram lattice use a trigram language model. > I tired: > > lattice-tool -in-lattice INPUTLATTICE -read-htk -lm TRIGRAM_LM > -order 3 -old-expansion -out-lattice -write-htk OUTPUTLATTICE > > The problem is: > In the output lattice, there is no acoustic model score for each arc, > each arc only has new language model score, did I miss some parameters? No. If the input lattices contain a= fields on nodes or links then so should the output lattice. When I run the above command on some lattices I get outputs containining lines like the following J=3 S=0 E=4 W=uhhuh a=-89.5173 l=-1.42187 Andreas From Joris.Pelemans at esat.kuleuven.be Wed Jul 3 11:22:01 2013 From: Joris.Pelemans at esat.kuleuven.be (Joris Pelemans) Date: Wed, 03 Jul 2013 20:22:01 +0200 Subject: [SRILM User List] OOV terminology Message-ID: <51D46BC9.3050709@esat.kuleuven.be> Hello all, My question is perhaps a little bit of topic, but I'm hoping for your cooperation, since it's LM related. Say we have a training corpus with lexicon V_train. Since some of the words have near-zero counts, we choose to exclude them from our LM. This gives us a new lexicon, let's call it V_final. However this also gives us two types of OOV words: those not in V_train and those not in V_final. I was wondering whether there are standard terms in the literature for these two types of OOVs. I have read my share of papers, but none of them seem to make this distinction. Kind regards, Joris From S.N.Maijers at student.ru.nl Wed Jul 3 13:05:44 2013 From: S.N.Maijers at student.ru.nl (Sander Maijers) Date: Wed, 03 Jul 2013 22:05:44 +0200 Subject: [SRILM User List] OOV terminology In-Reply-To: <51D46BC9.3050709@esat.kuleuven.be> References: <51D46BC9.3050709@esat.kuleuven.be> Message-ID: <51D48418.8090100@student.ru.nl> On 03-07-13 20:22, Joris Pelemans wrote: > Hello all, > > My question is perhaps a little bit of topic, but I'm hoping for your > cooperation, since it's LM related. > > Say we have a training corpus with lexicon V_train. Since some of the > words have near-zero counts, we choose to exclude them from our LM. This > gives us a new lexicon, let's call it V_final. However this also gives > us two types of OOV words: those not in V_train and those not in > V_final. I was wondering whether there are standard terms in the > literature for these two types of OOVs. I have read my share of papers, > but none of them seem to make this distinction. > > Kind regards, > > Joris > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user Hi Joris, In my view the vocabulary is a superset of the actual set of the wordforms for which all wordform sequences (the N-permutations of vocabulary words, with repetion) are modeled in the N-gram LM. What limits the hypothesized transcript produced by an ASR system, is the intersection between the sets of: a. the wordforms in the pronunciation lexicon (the mapping between acoustic feature sequences and orthographic representations) b. 
the target words of the wordform sequences in the LM (as opposed to history words) The vocabulary does not matter then: is just an optional means to constrain the potential richness (given the written training data) of an N-gram LM that you are creating. You can use a vocabulary as a constraint ('-limit-vocab' in' ngram-count'), and/or use it to facilitate a preprocessed form of training data by means of special tokens that aren't really words (such as "" or a 'proper name class' token). So, the vocabulary may contain superfluous words. Only after you realize that this is not an issue, you could think about it further and say that after you have created and pruned an LM, you can find out which words were actually redundant in your vocabulary given the same written training data you used to create that LM, and you could just as well drop those and those words from the vocabulary you had already before creating your LM. Maybe that reduces the size of your vocabulary as much as you hope. Will this be worthwhile? Not for the ASR task, you see. The term OOV comes in handy as shorthand to denote words that are in the written training data but not in the vocabulary. It is not precise, you could just as well use an element-out-of-set notation (short and clear) in reports. Maybe you have read the article: "Detection of OOV Words Using Generalized Word Models and a Semantic Class Language Model" by Schaaf, which was a top Google result for me. This author confuses the pronunciation lexicon with the vocabulary. While you can, confusingly, call a word that was not transcribed correctly because, for one, it was not modeled by the pronunciation lexicon 'OOV', I think it is not okay to confuse the concepts vocabulary and pronunciation lexicon as he does. I hope this clears up any confusion? From shiyang1983 at gmail.com Wed Jul 3 13:18:45 2013 From: shiyang1983 at gmail.com (yangyang shi) Date: Wed, 3 Jul 2013 22:18:45 +0200 Subject: [SRILM User List] OOV terminology In-Reply-To: <51D46BC9.3050709@esat.kuleuven.be> References: <51D46BC9.3050709@esat.kuleuven.be> Message-ID: Hi Joris, Is this a type of cut-off? If you set cut-off == 3, that means the words occurs less than 3 times will be considered as OOV. Cheers, Yangyang Shi On Wed, Jul 3, 2013 at 8:22 PM, Joris Pelemans < Joris.Pelemans at esat.kuleuven.be> wrote: > Hello all, > > My question is perhaps a little bit of topic, but I'm hoping for your > cooperation, since it's LM related. > > Say we have a training corpus with lexicon V_train. Since some of the > words have near-zero counts, we choose to exclude them from our LM. This > gives us a new lexicon, let's call it V_final. However this also gives us > two types of OOV words: those not in V_train and those not in V_final. I > was wondering whether there are standard terms in the literature for these > two types of OOVs. I have read my share of papers, but none of them seem to > make this distinction. > > Kind regards, > > Joris > ______________________________**_________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/**mailman/listinfo/srilm-user > -- Met vriendelijke groet, Yangyang Shi TU Delft / Interactive Intelligence Group HB12.290, EWI, Mekelweg 4, 2628 CD Delft, T +31 (0) 152782549 E shiyang1983 at gmail.com; yangyangshi at ieee.org -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From lqin at cs.cmu.edu Wed Jul 3 13:35:38 2013 From: lqin at cs.cmu.edu (Long Qin) Date: Wed, 3 Jul 2013 16:35:38 -0400 (EDT) Subject: [SRILM User List] OOV terminology In-Reply-To: References: <51D46BC9.3050709@esat.kuleuven.be> Message-ID: <27817.209.114.136.178.1372883738.squirrel@webmail.cs.cmu.edu> Hi Joris, As far as I know, there is no standard common term to distinguish OOV words that appearing in the LM training data but cutoffed and OOV words not in the data. Generally, the vocabulary of a recognizer is the mutual share of words between its lexicon and LM. From that point of view, those two types of OOVs will have the same effect on recognition - the ASR system cannot recognize them. But for OOV word detection, normally it is easier to detect OOV words which appear in the traing text but not in the vocabulary. Because we know the pronunciation of those words and we know where in a sentence they may appear. Thanks, Long On Wed, July 3, 2013 4:18 pm, yangyang shi wrote: > Hi Joris, > > > Is this a type of cut-off? If you set cut-off == 3, that means the words > occurs less than 3 times will be considered as OOV. > > Cheers, > > > Yangyang Shi > > > > On Wed, Jul 3, 2013 at 8:22 PM, Joris Pelemans < > Joris.Pelemans at esat.kuleuven.be> wrote: > > >> Hello all, >> >> >> My question is perhaps a little bit of topic, but I'm hoping for your >> cooperation, since it's LM related. >> >> Say we have a training corpus with lexicon V_train. Since some of the >> words have near-zero counts, we choose to exclude them from our LM. This >> gives us a new lexicon, let's call it V_final. However this also gives >> us two types of OOV words: those not in V_train and those not in >> V_final. I >> was wondering whether there are standard terms in the literature for >> these two types of OOVs. I have read my share of papers, but none of >> them seem to make this distinction. >> >> Kind regards, >> >> >> Joris >> ______________________________**_________________ >> SRILM-User site list >> SRILM-User at speech.sri.com >> http://www.speech.sri.com/**mailman/listinfo/srilm-user> h.sri.com/mailman/listinfo/srilm-user> >> > > > > -- > Met vriendelijke groet, > > > Yangyang Shi > > > TU Delft / Interactive Intelligence Group > HB12.290, EWI, > Mekelweg 4, > 2628 CD Delft, > T +31 (0) 152782549 > E shiyang1983 at gmail.com; yangyangshi at ieee.org > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From Joris.Pelemans at esat.kuleuven.be Wed Jul 3 14:05:30 2013 From: Joris.Pelemans at esat.kuleuven.be (Joris Pelemans) Date: Wed, 03 Jul 2013 23:05:30 +0200 Subject: [SRILM User List] OOV terminology In-Reply-To: <51D48418.8090100@student.ru.nl> References: <51D46BC9.3050709@esat.kuleuven.be> <51D48418.8090100@student.ru.nl> Message-ID: <51D4921A.8080301@esat.kuleuven.be> Sander, Thank you for your elaborate reply, but it doesn't really answer my question. I am not confused about the different sets of words. I know why they are there and what they are used for, but I'm wondering whether there is a standard term to denote each set individually. Let me rephrase my question with a very simple example: Given a single training sentence, "wrong is wrong" and a language model with cut-off 1, what are the terms to denote the following sets: 1. {wrong, is}? 2. {wrong}? 3. {is}? 4. all other English words? I am especially interested in terms that differentiate between sets 3 and 4, if such terms exist. 
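In code, the distinction can be pinned down concretely. A minimal Python sketch, where the labels "cut-off OOV" and "unseen OOV" are only placeholder names, not established terms:

from collections import Counter

def oov_classifier(train_tokens, cutoff=1):
    # Build the two vocabularies from the example above and return a function
    # that assigns any word to one of the four sets.
    counts = Counter(train_tokens)
    v_train = set(counts)                                   # set 1: every training word form
    v_final = {w for w, c in counts.items() if c > cutoff}  # set 2: words kept in the LM
    def classify(w):
        if w in v_final:
            return "in vocabulary"
        if w in v_train:
            return "cut-off OOV"   # set 3: seen in training, excluded by the cut-off
        return "unseen OOV"        # set 4: never seen in training
    return classify

classify = oov_classifier("wrong is wrong".split())
# classify("wrong") -> "in vocabulary", classify("is") -> "cut-off OOV",
# classify("cat")   -> "unseen OOV"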
Regards, Joris On 07/03/13 22:05, Sander Maijers wrote: > On 03-07-13 20:22, Joris Pelemans wrote: >> Hello all, >> >> My question is perhaps a little bit of topic, but I'm hoping for your >> cooperation, since it's LM related. >> >> Say we have a training corpus with lexicon V_train. Since some of the >> words have near-zero counts, we choose to exclude them from our LM. This >> gives us a new lexicon, let's call it V_final. However this also gives >> us two types of OOV words: those not in V_train and those not in >> V_final. I was wondering whether there are standard terms in the >> literature for these two types of OOVs. I have read my share of papers, >> but none of them seem to make this distinction. >> >> Kind regards, >> >> Joris >> _______________________________________________ >> SRILM-User site list >> SRILM-User at speech.sri.com >> http://www.speech.sri.com/mailman/listinfo/srilm-user > > Hi Joris, > > In my view the vocabulary is a superset of the actual set of the > wordforms for which all wordform sequences (the N-permutations of > vocabulary words, with repetion) are modeled in the N-gram LM. > > What limits the hypothesized transcript produced by an ASR system, is > the intersection between the sets of: > a. the wordforms in the pronunciation lexicon (the mapping between > acoustic feature sequences and orthographic representations) > b. the target words of the wordform sequences in the LM (as opposed to > history words) > > The vocabulary does not matter then: is just an optional means to > constrain the potential richness (given the written training data) of > an N-gram LM that you are creating. You can use a vocabulary as a > constraint ('-limit-vocab' in' ngram-count'), and/or use it to > facilitate a preprocessed form of training data by means of special > tokens that aren't really words (such as "" or a 'proper name > class' token). > > So, the vocabulary may contain superfluous words. Only after you > realize that this is not an issue, you could think about it further > and say that after you have created and pruned an LM, you can find out > which words were actually redundant in your vocabulary given the same > written training data you used to create that LM, and you could just > as well drop those and those words from the vocabulary you had already > before creating your LM. Maybe that reduces the size of your > vocabulary as much as you hope. Will this be worthwhile? Not for the > ASR task, you see. > > The term OOV comes in handy as shorthand to denote words that are in > the written training data but not in the vocabulary. It is not > precise, you could just as well use an element-out-of-set notation > (short and clear) in reports. Maybe you have read the article: > "Detection of OOV Words Using Generalized Word Models and a Semantic > Class Language Model" by Schaaf, which was a top Google result for me. > This author confuses the pronunciation lexicon with the vocabulary. > While you can, confusingly, call a word that was not transcribed > correctly because, for one, it was not modeled by the pronunciation > lexicon 'OOV', I think it is not okay to confuse the concepts > vocabulary and pronunciation lexicon as he does. > > I hope this clears up any confusion? > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From S.N.Maijers at student.ru.nl Wed Jul 3 14:30:18 2013 From: S.N.Maijers at student.ru.nl (Sander Maijers) Date: Wed, 03 Jul 2013 23:30:18 +0200 Subject: [SRILM User List] OOV terminology In-Reply-To: <51D4921A.8080301@esat.kuleuven.be> References: <51D46BC9.3050709@esat.kuleuven.be> <51D48418.8090100@student.ru.nl> <51D4921A.8080301@esat.kuleuven.be> Message-ID: <51D497EA.7030804@student.ru.nl> On 03-07-13 23:05, Joris Pelemans wrote: > Sander, > > Thank you for your elaborate reply, but it doesn't really answer my > question. I am not confused about the different sets of words. I know > why they are there and what they are used for, but I'm wondering whether > there is a standard term to denote each set individually. Let me > rephrase my question with a very simple example: > > Given a single training sentence, "wrong is wrong" and a language model > with cut-off 1, what are the terms to denote the following sets: > > 1. {wrong, is}? > 2. {wrong}? > 3. {is}? > 4. all other English words? > > I am especially interested in terms that differentiate between sets 3 > and 4, if such terms exist. > > Regards, > > Joris My response was an attemp to clear up you confusion as it appeared to me from what you wrote about V_final. Such a vocabulary simply does not exist, unless you make one physically. You only mentioned excluding words from an LM. I am confident that there are no terms for those sets of hypothetical vocabularies you list. You can of course give them names and describe their meaning, like the vocabulary V_n = { w | \forall W \in V(C(w) > n)} where C is a function that counts the number times a word occurs in the written training data. But, do you have an opinion as to why such terms would be needed and why they would be better than a definition like the previous one? > On 07/03/13 22:05, Sander Maijers wrote: >> On 03-07-13 20:22, Joris Pelemans wrote: >>> Hello all, >>> >>> My question is perhaps a little bit of topic, but I'm hoping for your >>> cooperation, since it's LM related. >>> >>> Say we have a training corpus with lexicon V_train. Since some of the >>> words have near-zero counts, we choose to exclude them from our LM. This >>> gives us a new lexicon, let's call it V_final. However this also gives >>> us two types of OOV words: those not in V_train and those not in >>> V_final. I was wondering whether there are standard terms in the >>> literature for these two types of OOVs. I have read my share of papers, >>> but none of them seem to make this distinction. >>> >>> Kind regards, >>> >>> Joris >>> _______________________________________________ >>> SRILM-User site list >>> SRILM-User at speech.sri.com >>> http://www.speech.sri.com/mailman/listinfo/srilm-user >> >> Hi Joris, >> >> In my view the vocabulary is a superset of the actual set of the >> wordforms for which all wordform sequences (the N-permutations of >> vocabulary words, with repetion) are modeled in the N-gram LM. >> >> What limits the hypothesized transcript produced by an ASR system, is >> the intersection between the sets of: >> a. the wordforms in the pronunciation lexicon (the mapping between >> acoustic feature sequences and orthographic representations) >> b. the target words of the wordform sequences in the LM (as opposed to >> history words) >> >> The vocabulary does not matter then: is just an optional means to >> constrain the potential richness (given the written training data) of >> an N-gram LM that you are creating. 
You can use a vocabulary as a >> constraint ('-limit-vocab' in' ngram-count'), and/or use it to >> facilitate a preprocessed form of training data by means of special >> tokens that aren't really words (such as "" or a 'proper name >> class' token). >> >> So, the vocabulary may contain superfluous words. Only after you >> realize that this is not an issue, you could think about it further >> and say that after you have created and pruned an LM, you can find out >> which words were actually redundant in your vocabulary given the same >> written training data you used to create that LM, and you could just >> as well drop those and those words from the vocabulary you had already >> before creating your LM. Maybe that reduces the size of your >> vocabulary as much as you hope. Will this be worthwhile? Not for the >> ASR task, you see. >> >> The term OOV comes in handy as shorthand to denote words that are in >> the written training data but not in the vocabulary. It is not >> precise, you could just as well use an element-out-of-set notation >> (short and clear) in reports. Maybe you have read the article: >> "Detection of OOV Words Using Generalized Word Models and a Semantic >> Class Language Model" by Schaaf, which was a top Google result for me. >> This author confuses the pronunciation lexicon with the vocabulary. >> While you can, confusingly, call a word that was not transcribed >> correctly because, for one, it was not modeled by the pronunciation >> lexicon 'OOV', I think it is not okay to confuse the concepts >> vocabulary and pronunciation lexicon as he does. >> >> I hope this clears up any confusion? >> >> > From Joris.Pelemans at esat.kuleuven.be Wed Jul 3 15:07:03 2013 From: Joris.Pelemans at esat.kuleuven.be (Joris Pelemans) Date: Thu, 04 Jul 2013 00:07:03 +0200 Subject: [SRILM User List] OOV terminology In-Reply-To: <27817.209.114.136.178.1372883738.squirrel@webmail.cs.cmu.edu> References: <51D46BC9.3050709@esat.kuleuven.be> <27817.209.114.136.178.1372883738.squirrel@webmail.cs.cmu.edu> Message-ID: <51D4A087.7050107@esat.kuleuven.be> Hi Long, On 07/03/13 22:35, Long Qin wrote: > But for OOV word detection, normally it is easier to > detect OOV words which appear in the traing text but not in the > vocabulary. Because we know the pronunciation of those words and we know > where in a sentence they may appear. Exactly! I find it very surprising that no standard terms are used in this research area, since different techniques should be developed for these two sets. Anyway, I guess the terms "cut-off" OOV and perhaps "unseen" OOV should suffice. Regards, Joris From Joris.Pelemans at esat.kuleuven.be Tue Jul 30 06:19:43 2013 From: Joris.Pelemans at esat.kuleuven.be (Joris Pelemans) Date: Tue, 30 Jul 2013 15:19:43 +0200 Subject: [SRILM User List] Reobtaining time information Message-ID: <51F7BD6F.60700@esat.kuleuven.be> Hello all, I am currently extracting n-best lists from lattices using the SRILM lattice-tool for LM rescoring purposes. Unfortunately the extraction removes the detailed time information that is present in the lattice. I was wondering whether it is possible to reobtain this after finding the best hypothesis from the n-best list i.e. search the lattice for this hypothesis and reobtain the time information? By the way, I am using the -ignore-vocab option during n-best list extraction. Does this complicate matters or can you also use this option to find the time information? 
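The "search the lattice for this hypothesis" idea can at least be prototyped outside of SRILM. A rough Python sketch, assuming HTK SLF lattices that carry times (t=) on the node lines and words (W=) on the link lines, as in the lattice line quoted at the top of this digest, and an acyclic lattice; the parser is deliberately minimal and the function and file names are made up:

from collections import defaultdict

def parse_slf(path):
    # Minimal HTK SLF reader: keep node start times (t= on I= lines) and
    # word links (W= on J= lines); everything else is ignored.
    times, links = {}, defaultdict(list)
    with open(path) as f:
        for line in f:
            fields = dict(p.split('=', 1) for p in line.split() if '=' in p)
            if 'I' in fields and 't' in fields:
                times[int(fields['I'])] = float(fields['t'])
            elif 'J' in fields and 'W' in fields:
                links[int(fields['S'])].append((int(fields['E']), fields['W']))
    return times, links

def align(times, links, words, node=0):
    # Depth-first search for a lattice path whose word labels spell out
    # `words`; returns [(word, start_time, end_time)] or None if none matches.
    if not words:
        return []
    for nxt, w in links.get(node, []):
        if w in ('!NULL', '<s>', '</s>'):               # skip non-word tokens
            rest = align(times, links, words, nxt)
            if rest is not None:
                return rest
        elif w == words[0]:
            rest = align(times, links, words[1:], nxt)
            if rest is not None:
                return [(w, times[node], times[nxt])] + rest
    return None

# times, links = parse_slf('utt1.lat')
# print(align(times, links, "the dog runs".split()))

A hypothesis can of course correspond to more than one lattice path, so the times recovered this way are only one consistent alignment, not necessarily the one the decoder used.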
Thanks in advance, Joris From stolcke at icsi.berkeley.edu Sat Aug 3 16:36:02 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sat, 03 Aug 2013 16:36:02 -0700 Subject: [SRILM User List] Reobtaining time information In-Reply-To: <51F7BD6F.60700@esat.kuleuven.be> References: <51F7BD6F.60700@esat.kuleuven.be> Message-ID: <51FD93E2.4070705@icsi.berkeley.edu> On 7/30/2013 6:19 AM, Joris Pelemans wrote: > Hello all, > > I am currently extracting n-best lists from lattices using the SRILM > lattice-tool for LM rescoring purposes. Unfortunately the extraction > removes the detailed time information that is present in the lattice. > I was wondering whether it is possible to reobtain this after finding > the best hypothesis from the n-best list i.e. search the lattice for > this hypothesis and reobtain the time information? > > By the way, I am using the -ignore-vocab option during n-best list > extraction. Does this complicate matters or can you also use this > option to find the time information? There was a discussion about this recently. It turns out there is currently no way to dump the alignment information associated with N-best lists, though that information is available internally. Part of the issue is that there is no standard format (that I'm aware of) to represent N-best lists with alignment, pronunciation, and other acoustic information. If you are interested and willing to put in some work I could advise on how to extend SRILM with this capability. FYI, there is a way to include alignments info in confusion network output, in case that helps. Andreas From tm-oleary at comcast.net Wed Aug 14 14:41:19 2013 From: tm-oleary at comcast.net (tm-oleary at comcast.net) Date: Wed, 14 Aug 2013 21:41:19 +0000 (UTC) Subject: [SRILM User List] Understanding what the values in .arpa files represent In-Reply-To: <509209796.2148181.1376515330079.JavaMail.root@sz0105a.emeryville.ca.mail.comcast.net> Message-ID: <690089649.2148876.1376516479475.JavaMail.root@sz0105a.emeryville.ca.mail.comcast.net> I would like to get a good understanding of what the values in .arpa files represent so I can do a better job on a project I am working on. I have found some documentation about .arpa files on the SRILM web site as well as in some other places that describe the values in the first column of the "\n-grams" sections of the file as conditional probabilities. I assumed from this that if I had an .arpa file containing all of the unigrams and bigrams of a corpus, that [1] for all unigrams, the sum of 10^unigram_value would equal 1.0 and [2] for all bigrams, the sum of (10^bigram_value * 10^unigram_value_of_first_term_in_bigram) would also equal 1.0, since the joint probability p(a, b) = p(b|a) * p(a). It turns out that [1] is true, but for the .arpa file I have been working with, the [2] sum is about .68. I was expecting that [2] might sum to something less than 1.0 to due to probability mass redistributed for smoothing purposes, but that wouldn't account for .32 of the total, would it? I think it's more likely that I don't understand what the values in the left column represent in the "\n-grams" sections for n >= 2. Is there a way to use the values in an .arpa file to reconstruct joint probabilities for bigrams (and other higher order n-grams) in order to verify that they actually do sum to 1.0 for each "\n-grams" section in the file? Thanks, Mike -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From londis at 163.com Thu Aug 15 00:15:31 2013 From: londis at 163.com (HU Rile) Date: Thu, 15 Aug 2013 15:15:31 +0800 (CST) Subject: [SRILM User List] question about using the Google Web N-gram corpus to build an LM Message-ID: <58519bc7.8a77.14080d44c86.Coremail.londis@163.com> Hi, I would like to build an LM using the Google Web 1T corpus. And I followed the steps on http://www-speech.sri.com/projects/srilm/manpages/srilm-faq.7.html. But when I used ngram-count to estimate the mixture weights, the program can not run and gave the response "google.countlm.0: line 22: reached EOF before \end\ format error in init-lm file". I tried to add \end\ to the end of google.countlm.0, but it did not work. Here is the content of my google.countlm.0: order 3 vocabsize 13588391 totalcount 1024908267229 countmodulus 40 mixweights 15 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 google-counts /home/hurile/googleweb1T/googleLM/ Could someone please tell me how can i solve the problem? Thanks a lot! Rile Hu -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Thu Aug 15 14:15:14 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 15 Aug 2013 14:15:14 -0700 Subject: [SRILM User List] question about using the Google Web N-gram corpus to build an LM In-Reply-To: <58519bc7.8a77.14080d44c86.Coremail.londis@163.com> References: <58519bc7.8a77.14080d44c86.Coremail.londis@163.com> Message-ID: <520D44E2.4010403@icsi.berkeley.edu> On 8/15/2013 12:15 AM, HU Rile wrote: > Hi, > I would like to build an LM using the Google Web 1T corpus. And I > followed the steps on > http://www-speech.sri.com/projects/srilm/manpages/srilm-faq.7.html. > But when I used ngram-count to estimate the mixture weights, the > program can not run and gave the response "google.countlm.0: line 22: > reached EOF before \end\ > format error in init-lm file". > I tried to add \end\ to the end of googl! e.countlm.0, but it did not > work. > Here is the content of my google.countlm.0: > order 3 > vocabsize 13588391 > totalcount 1024908267229 > countmodulus 40 > mixweights 15 > 0.5 0.5 0.5 > 0.5 0.5 0.5 > 0.5 0.5 0.5 > 0.5 0.5 0.5 > 0.5 0.5 0.5 > 0.5 0.5 0.5 > 0.5 0.5 0.5 > 0.5 0.5 0.5 > 0.5 0.5 0.5 > 0.5 0.5 0.5 > 0.5 0.5 0.5 > 0.5 0.5 0.5 > 0.5 0.5 0.5 > 0.5 0.5 0.5 > 0.5 0.5 0.5 > 0.5 0.5 0.5 > google-counts /home/hurile/googleweb1T/google! LM/ > > Could someone please tell me how can i so lve the problem? Thanks a lot! > > Rile Hu > You probably forgot the -count-lm option. Without it, ngram-count will try to interpret the -lm file as a standard ngram LM (where the \end\ line is expected). Andreas -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From stolcke at icsi.berkeley.edu Thu Aug 15 17:17:41 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 15 Aug 2013 17:17:41 -0700 Subject: [SRILM User List] Understanding what the values in .arpa files represent In-Reply-To: <690089649.2148876.1376516479475.JavaMail.root@sz0105a.emeryville.ca.mail.comcast.net> References: <690089649.2148876.1376516479475.JavaMail.root@sz0105a.emeryville.ca.mail.comcast.net> Message-ID: <520D6FA5.7090602@icsi.berkeley.edu> On 8/14/2013 2:41 PM, tm-oleary at comcast.net wrote: > I would like to get a good understanding of what the values in .arpa > files represent so I can do a better job on a project I am working on. > I have found some documentation about .arpa files on the SRILM web > site as well as in some other places that describe the values in the > first column of the "\n-grams" sections of the file as conditional > probabilities. > > I assumed from this that if I had an .arpa file containing all of the > unigrams and bigrams of a corpus, that [1] for all unigrams, the sum > of 10^unigram_value would equal 1.0 and [2] for all bigrams, the sum > of (10^bigram_value * 10^unigram_value_of_first_term_in_bigram) would > also equal 1.0, since the joint probability p(a, b) = p(b|a) * p(a). > It turns out that [1] is true, but for the .arpa file I have been > working with, the [2] sum is about .68. I was expecting that [2] might > sum to something less than 1.0 to due to probability mass > redistributed for smoothing purposes, but that wouldn't account for > .32 of the total, would it? You assume that the LM contains all possible N-grams of a given order (in your case, all bigrams). That is not true. It only lists the N-grams that occur in the training data, and that occur frequently enough (subject to the -gtNmin parameters). The probabilities of unlisted N-grams are computed by backoff. For an explanation search for "backoff computation language model". So if you summed over all possible bigrams then you should get the sum = 1 as you expect. > > I think it's more likely that I don't understand what the values in > the left column represent in the "\n-grams" sections for n >= 2. Is > there a way to use the values in an .arpa file to reconstruct joint > probabilities for bigrams (and other higher order n-grams) in order to > verify that they actually do sum to 1.0 for each "\n-grams" section in > the file? You are assuming above that the first column contains conditional ngram log probabilities, and that is correct. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmp84 at cam.ac.uk Thu Sep 19 04:13:10 2013 From: jmp84 at cam.ac.uk (Juan Pino) Date: Thu, 19 Sep 2013 12:13:10 +0100 Subject: [SRILM User List] arpa header number of 4g to big for int Message-ID: Hello, I am running this command with version 1.7.0 (the purpose is to fix the format of my input lm): srilm1.7.0/bin/i686-m64/ngram -debug 1 -order 4 -lm MY_LM_IN_ARPA_FORMAT -write-lm MY_OUTPUT_LM I get this error: line 6: ngram number -1840328771 out of range This is because I have this header in my input lm: ngram 4=2454638525 So the number of 4grams is bigger than the maximum 32-bit int. I've fixed it by replacing int nNgrams; by long nNgrams; at line 497 in lm/src/NgramLM.cc and by replacing } else if (sscanf(line, "ngram %d=%d", &thisOrder, &nNgrams) == 2) { by } else if (sscanf(line, "ngram %d=%ld", &thisOrder, &nNgrams) == 2) { at line 515 in lm/src/NgramLM.cc Are there other places in the code that I should change ? 
Is there a better solution for my problem ? Thanks very much, Juan -------------- next part -------------- An HTML attachment was scrubbed... URL: From venkataraman.anand at gmail.com Thu Sep 19 08:34:16 2013 From: venkataraman.anand at gmail.com (Anand Venkataraman) Date: Thu, 19 Sep 2013 08:34:16 -0700 Subject: [SRILM User List] arpa header number of 4g to big for int In-Reply-To: References: Message-ID: Juan, One of the things I would probably first check is to see if you're including way too many 4-grams then necessary. To reduce noise and one-off occurrences for higher order ngrams, you should probably at least use the -gt4min 2 option. In most cases the quality of the resultant LM improves although the count of actual ngrams included decreases. Did you do this? & On Thu, Sep 19, 2013 at 4:13 AM, Juan Pino wrote: > Hello, > > I am running this command with version 1.7.0 (the purpose is to fix the > format of my input lm): > > srilm1.7.0/bin/i686-m64/ngram -debug 1 -order 4 -lm MY_LM_IN_ARPA_FORMAT > -write-lm MY_OUTPUT_LM > > I get this error: > > line 6: ngram number -1840328771 out of range > > This is because I have this header in my input lm: > ngram 4=2454638525 > > So the number of 4grams is bigger than the maximum 32-bit int. > > I've fixed it by replacing > int nNgrams; > by > long nNgrams; > at line 497 in lm/src/NgramLM.cc and by replacing > } else if (sscanf(line, "ngram %d=%d", &thisOrder, &nNgrams) == 2) { > by > } else if (sscanf(line, "ngram %d=%ld", &thisOrder, &nNgrams) == 2) { > at line 515 in lm/src/NgramLM.cc > > Are there other places in the code that I should change ? Is there a > better solution for my problem ? > > Thanks very much, > > Juan > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmp84 at cam.ac.uk Thu Sep 19 08:49:08 2013 From: jmp84 at cam.ac.uk (Juan Pino) Date: Thu, 19 Sep 2013 16:49:08 +0100 Subject: [SRILM User List] arpa header number of 4g to big for int In-Reply-To: References: Message-ID: Hi Anand, Thanks for the tip. The context is: -- I generate a 4gram with the kenlm toolkit. -- I don't think the kenlm toolkit has an option equivalent to -gt4min -- I can't generate a Kneser-Ney lm with srilm because of memory constraints. Probably the memory requirements with -gt4min 2 go down so I would have to check this. I still would like to know how to modify srilm to handle more ngrams than the max 32-bit int. Best, Juan On Thu, Sep 19, 2013 at 4:34 PM, Anand Venkataraman < venkataraman.anand at gmail.com> wrote: > Juan, > > One of the things I would probably first check is to see if you're > including way too many 4-grams then necessary. To reduce noise and one-off > occurrences for higher order ngrams, you should probably at least use the > -gt4min 2 option. In most cases the quality of the resultant LM improves > although the count of actual ngrams included decreases. Did you do this? 
> > & > > > On Thu, Sep 19, 2013 at 4:13 AM, Juan Pino wrote: > >> Hello, >> >> I am running this command with version 1.7.0 (the purpose is to fix the >> format of my input lm): >> >> srilm1.7.0/bin/i686-m64/ngram -debug 1 -order 4 -lm MY_LM_IN_ARPA_FORMAT >> -write-lm MY_OUTPUT_LM >> >> I get this error: >> >> line 6: ngram number -1840328771 out of range >> >> This is because I have this header in my input lm: >> ngram 4=2454638525 >> >> So the number of 4grams is bigger than the maximum 32-bit int. >> >> I've fixed it by replacing >> int nNgrams; >> by >> long nNgrams; >> at line 497 in lm/src/NgramLM.cc and by replacing >> } else if (sscanf(line, "ngram %d=%d", &thisOrder, &nNgrams) == 2) { >> by >> } else if (sscanf(line, "ngram %d=%ld", &thisOrder, &nNgrams) == 2) { >> at line 515 in lm/src/NgramLM.cc >> >> Are there other places in the code that I should change ? Is there a >> better solution for my problem ? >> >> Thanks very much, >> >> Juan >> >> _______________________________________________ >> SRILM-User site list >> SRILM-User at speech.sri.com >> http://www.speech.sri.com/mailman/listinfo/srilm-user >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Thu Sep 19 13:27:35 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 19 Sep 2013 13:27:35 -0700 Subject: [SRILM User List] arpa header number of 4g to big for int In-Reply-To: References: Message-ID: <523B5E37.2060700@icsi.berkeley.edu> The attached patch should fix it. Note this still doesn't support vocabularies larger than 2^32, but the number of higher-order ngrams can now be 2^64. Thanks for reporting this problem! Andreas On 9/19/2013 4:13 AM, Juan Pino wrote: > Hello, > > I am running this command with version 1.7.0 (the purpose is to fix > the format of my input lm): > > srilm1.7.0/bin/i686-m64/ngram -debug 1 -order 4 -lm > MY_LM_IN_ARPA_FORMAT -write-lm MY_OUTPUT_LM > > I get this error: > > line 6: ngram number -1840328771 out of range > > This is because I have this header in my input lm: > ngram 4=2454638525 > > So the number of 4grams is bigger than the maximum 32-bit int. > > I've fixed it by replacing > int nNgrams; > by > long nNgrams; > at line 497 in lm/src/NgramLM.cc and by replacing > } else if (sscanf(line, "ngram %d=%d", &thisOrder, &nNgrams) == 2) { > by > } else if (sscanf(line, "ngram %d=%ld", &thisOrder, &nNgrams) == 2) { > at line 515 in lm/src/NgramLM.cc > > Are there other places in the code that I should change ? Is there a > better solution for my problem ? > > Thanks very much, > > Juan > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- *** lm/src/Ngram.h.dist 2013-07-02 20:23:07.385694200 -0700 --- lm/src/Ngram.h 2013-09-19 12:13:48.378147500 -0700 *************** *** 99,105 **** /* * Statistics */ ! virtual unsigned int numNgrams(unsigned int n) const; virtual void memStats(MemStats &stats); /* --- 99,105 ---- /* * Statistics */ ! virtual Count numNgrams(unsigned int n) const; virtual void memStats(MemStats &stats); /* *** lm/src/NgramLM.cc.dist 2013-09-19 12:15:13.124134400 -0700 --- lm/src/NgramLM.cc 2013-09-19 12:16:15.577406400 -0700 *************** *** 407,417 **** { char *line; unsigned maxOrder = 0; /* maximal n-gram order in this model */ ! 
unsigned numNgrams[maxNgramOrder + 1]; /* the number of n-grams for each order */ ! unsigned numRead[maxNgramOrder + 1]; /* Number of n-grams actually read */ ! unsigned numOOVs = 0; /* Numer of n-gram skipped due to OOVs */ int state = -1 ; /* section of file being read: * -1 - pre-header, 0 - header, * 1 - unigrams, 2 - bigrams, ... */ --- 407,417 ---- { char *line; unsigned maxOrder = 0; /* maximal n-gram order in this model */ ! Count numNgrams[maxNgramOrder + 1]; /* the number of n-grams for each order */ ! Count numRead[maxNgramOrder + 1]; /* Number of n-grams actually read */ ! Count numOOVs = 0; /* Numer of n-gram skipped due to OOVs */ int state = -1 ; /* section of file being read: * -1 - pre-header, 0 - header, * 1 - unigrams, 2 - bigrams, ... */ *************** *** 487,493 **** case 0: /* ngram header */ unsigned thisOrder; ! int nNgrams; if (backslash && sscanf(line, "\\%d-grams", &state) == 1) { /* --- 487,493 ---- case 0: /* ngram header */ unsigned thisOrder; ! long long nNgrams; if (backslash && sscanf(line, "\\%d-grams", &state) == 1) { /* *************** *** 505,511 **** } continue; ! } else if (sscanf(line, "ngram %d=%d", &thisOrder, &nNgrams) == 2) { /* * scanned a line of the form * ngram = --- 505,511 ---- } continue; ! } else if (sscanf(line, "ngram %d=%lld", &thisOrder, &nNgrams) == 2) { /* * scanned a line of the form * ngram = *************** *** 775,781 **** Ngram::writeWithOrder(File &file, unsigned order) { unsigned i; ! unsigned howmanyNgrams[maxNgramOrder + 1]; VocabIndex context[maxNgramOrder + 2]; VocabString scontext[maxNgramOrder + 1]; --- 775,781 ---- Ngram::writeWithOrder(File &file, unsigned order) { unsigned i; ! Count howmanyNgrams[maxNgramOrder + 1]; VocabIndex context[maxNgramOrder + 2]; VocabString scontext[maxNgramOrder + 1]; *************** *** 787,793 **** for (i = 1; i <= order; i++ ) { howmanyNgrams[i] = numNgrams(i); ! file.fprintf("ngram %d=%d\n", i, howmanyNgrams[i]); } for (i = 1; i <= order; i++ ) { --- 787,793 ---- for (i = 1; i <= order; i++ ) { howmanyNgrams[i] = numNgrams(i); ! file.fprintf("ngram %d=%lld\n", i, (long long)howmanyNgrams[i]); } for (i = 1; i <= order; i++ ) { *************** *** 1461,1473 **** return false; } ! unsigned int Ngram::numNgrams(unsigned int order) const { if (order < 1) { return 0; } else { ! unsigned int howmany = 0; makeArray(VocabIndex, context, order + 1); --- 1461,1473 ---- return false; } ! Count Ngram::numNgrams(unsigned int order) const { if (order < 1) { return 0; } else { ! Count howmany = 0; makeArray(VocabIndex, context, order + 1); From jmp84 at cam.ac.uk Thu Sep 19 14:42:41 2013 From: jmp84 at cam.ac.uk (Juan Pino) Date: Thu, 19 Sep 2013 22:42:41 +0100 Subject: [SRILM User List] arpa header number of 4g to big for int In-Reply-To: <523B5E37.2060700@icsi.berkeley.edu> References: <523B5E37.2060700@icsi.berkeley.edu> Message-ID: Thanks very much, this works! I have attached the patch wrt 1.7.0, it's almost the same. Best, Juan On Thu, Sep 19, 2013 at 9:27 PM, Andreas Stolcke wrote: > The attached patch should fix it. Note this still doesn't support > vocabularies larger than 2^32, but the number of higher-order ngrams can > now be 2^64. > > Thanks for reporting this problem! 
> > Andreas > > > > On 9/19/2013 4:13 AM, Juan Pino wrote: > > Hello, > > I am running this command with version 1.7.0 (the purpose is to fix the > format of my input lm): > > srilm1.7.0/bin/i686-m64/ngram -debug 1 -order 4 -lm MY_LM_IN_ARPA_FORMAT > -write-lm MY_OUTPUT_LM > > I get this error: > > line 6: ngram number -1840328771 out of range > > This is because I have this header in my input lm: > ngram 4=2454638525 > > So the number of 4grams is bigger than the maximum 32-bit int. > > I've fixed it by replacing > int nNgrams; > by > long nNgrams; > at line 497 in lm/src/NgramLM.cc and by replacing > } else if (sscanf(line, "ngram %d=%d", &thisOrder, &nNgrams) == 2) { > by > } else if (sscanf(line, "ngram %d=%ld", &thisOrder, &nNgrams) == 2) { > at line 515 in lm/src/NgramLM.cc > > Are there other places in the code that I should change ? Is there a > better solution for my problem ? > > Thanks very much, > > Juan > > > _______________________________________________ > SRILM-User site listSRILM-User at speech.sri.comhttp://www.speech.sri.com/mailman/listinfo/srilm-user > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ngramlm-64bit-1.7.0.patch Type: application/octet-stream Size: 3707 bytes Desc: not available URL: From akmalcuet00 at yahoo.com Mon Sep 23 15:10:03 2013 From: akmalcuet00 at yahoo.com (Md. Akmal Haidar) Date: Mon, 23 Sep 2013 15:10:03 -0700 (PDT) Subject: [SRILM User List] Fw: Interpolate trigram Probabilities to an n-gram LM In-Reply-To: <1379968415.89728.YahooMailNeo@web161005.mail.bf1.yahoo.com> References: <1379968415.89728.YahooMailNeo@web161005.mail.bf1.yahoo.com> Message-ID: <1379974203.26238.YahooMailNeo@web161006.mail.bf1.yahoo.com> Hi, 1. Is it possible to interpolate some trigram probabilities (say they are in file t.txt) with an n-gram LM ?? SRILM gives results with the warning (no bow for prefix of trigram of t.txt). -lm n-gram.lm -lambda .9 -mix-lm t.txt -ppl test.txt 2. When the trigram probabilities in t.txt changes (newt.txt), the results are exactly the same as above.?? -lm n-gram.lm -lambda .9 -mix-lm newt.txt -ppl test.txt Is above interpolation is OK?Is there any other methods that are required to interpolate these trigram probabilities to an n-gram LM? Format of t.txt/newt.txt \data\ ngram 3=242 \3-grams: .... \end\ Thanks Best Regards Akmal -------------- next part -------------- An HTML attachment was scrubbed... URL: From akmalcuet00 at yahoo.com Mon Sep 23 13:33:35 2013 From: akmalcuet00 at yahoo.com (Md. Akmal Haidar) Date: Mon, 23 Sep 2013 13:33:35 -0700 (PDT) Subject: [SRILM User List] Interpolate trigram Probabilities to an n-gram LM Message-ID: <1379968415.89728.YahooMailNeo@web161005.mail.bf1.yahoo.com> Hi, 1. Is it possible to interpolate some trigram probabilities (say they are in file t.txt) with an n-gram LM ?? SRILM gives results with the warning (no bow for prefix of trigram of t.txt). -lm n-gram.lm -lambda .9 -mix-lm t.txt -ppl test.txt 2. When the trigram probabilities in t.txt changes (newt.txt), the results are exactly the same as above.?? -lm n-gram.lm -lambda .9 -mix-lm newt.txt -ppl test.txt Is above interpolation is OK?Is there any other methods that are required to interpolate these trigram probabilities to an n-gram LM? Format of t.txt/newt.txt \data\ ngram 3=242 \3-grams: .... \end\ Thanks Best Regards Akmal -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From stolcke at icsi.berkeley.edu Tue Sep 24 12:04:00 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 24 Sep 2013 12:04:00 -0700 Subject: [SRILM User List] Interpolate trigram Probabilities to an n-gram LM In-Reply-To: <1379968415.89728.YahooMailNeo@web161005.mail.bf1.yahoo.com> References: <1379968415.89728.YahooMailNeo@web161005.mail.bf1.yahoo.com> Message-ID: <5241E220.5080205@icsi.berkeley.edu> On 9/23/2013 1:33 PM, Md. Akmal Haidar wrote: > Hi, > > 1. Is it possible to interpolate some trigram probabilities (say they > are in file t.txt) with an n-gram LM ? > SRILM gives results with the warning (no bow for prefix of trigram of > t.txt). > -lm n-gram.lm -lambda .9 -mix-lm t.txt -ppl test.txt > 2. When the trigram probabilities in t.txt changes (newt.txt), the > results are exactly the same as above. > -lm n-gram.lm -lambda .9 -mix-lm newt.txt -ppl test.txt > > Is above interpolation is OK?Is there any other methods that are > required to interpolate these trigram probabilities to an n-gram LM? The above would be fine if newt.txt contained a well-formed LM. The format you generated is incomplete. As implied by the warning message, for each trigram "a b c" also need the history portion ("a b") to be included as a bigram. Therefore, you should include a line -99 a b 0 for every such history (plus the appropriate ngram count information in the header). You also need a unigram section containing all words of your vocabulary. -99 a 0 (the final 0's are the log backoff weights). Now, giving 0 (log = -99) probabilities to all your unigrams and bigrams is suboptimal because there will be cases where you don't have a matching trigram and then the backoff will result in probability 0. This is not the end of the world since you presumably are interpolating with another model that will yield a non-zero probability, but it should be better to estimate a non-zero probability for those unigrams and bigrams. If you do, then run the resulting model through ngram -lm newt.txt -renorm -write-lm newt-norm.txt to recompute the backoff weights. Finally, interpolate. Andreas > > Format of t.txt/newt.txt > \data\ > ngram 3=242 > \3-grams: > .... > \end\ > > Thanks > Best Regards > Akmal > > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From medmediani at gmail.com Sat Sep 28 00:21:18 2013 From: medmediani at gmail.com (Mohammed Mediani) Date: Sat, 28 Sep 2013 09:21:18 +0200 Subject: [SRILM User List] 1-count Higher order ngrams not excluded by gtmin Message-ID: Dear Andreas, I noticed that when I train a 6-gram KN LM, I get some 1-count ngrams which are no prefixes of any higher order ngrams in the 4 and 3 models. Are those another exception besides the one stated in Warning4 ( http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html)? Many thanks for your help Best regards, Med -------------- next part -------------- An HTML attachment was scrubbed... URL: From okuru13 at ku.edu.tr Sat Sep 28 05:58:14 2013 From: okuru13 at ku.edu.tr (Onur Kuru) Date: Sat, 28 Sep 2013 15:58:14 +0300 Subject: [SRILM User List] missing 3grams Message-ID: I got confused with the LM file I had. In the data section it says you have 0 3grams although as you can see from the training corpus I have 6 3-grams. Therefore I don't have the conditional probabilities in \3-grams section. 
Here is my training corpus:
the dog runs
the cat walks
When I run the command:
ngram-count -text training.gz -lm training.lm.gz -order 3 -gt1max 0 -gt2max 0 -gt3max 0
I got:
\data\
ngram 1=7
ngram 2=7
ngram 3=0
\1-grams:
-0.60206	</s>
-99	<s>	-99
-0.90309	cat	-99
-0.90309	dog	-99
-0.90309	runs	-99
-0.60206	the	-7.356873
-0.90309	walks	-99
\2-grams:
0	<s> the
0	cat walks
0	dog runs
0	runs </s>
-0.30103	the cat
-0.30103	the dog
0	walks </s>
\3-grams:
\end\
-------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Sat Sep 28 09:31:28 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sat, 28 Sep 2013 11:31:28 -0500 Subject: [SRILM User List] missing 3grams In-Reply-To: References: Message-ID: <52470460.7040307@icsi.berkeley.edu> On 9/28/2013 7:58 AM, Onur Kuru wrote: > I got confused with the LM file I had. > In the data section it says you have *0* 3grams although as you can > see from the training corpus I have *6* 3-grams. > Therefore I don't have the conditional probabilities in *\3-grams *section. > > Here is my training corpus: > the dog runs > the cat walks > > When I run the command: > ngram-count -text training.gz -lm training.lm.gz -order 3 -gt1max 0 > -gt2max 0 -gt3max 0
you need -gt3min 1
Andreas
-------------- next part -------------- An HTML attachment was scrubbed... URL: From otheremailid at aol.com Mon Sep 30 22:46:33 2013 From: otheremailid at aol.com (E) Date: Tue, 1 Oct 2013 01:46:33 -0400 (EDT) Subject: [SRILM User List] Count-lm reference request In-Reply-To: References: Message-ID: <8D08C80B82D9757-1094-339A7@webmail-d268.sysops.aol.com> Hello, I'm trying to understand the meaning of the "google.countlm.0" file as given in the FAQ section on creating an LM from the Web1T corpus. From what I read in Sec. 11.4.1, Deleted Interpolation Smoothing, in Spoken Language Processing by Huang et al. (equation 11.22), the bigram case is P(w_i | w_{i-1}) = \lambda * P_{MLE}(w_i | w_{i-1}) + (1 - \lambda) * P(w_i) They call the \lambda's the mixture weights. I wonder if they are conceptually the same as the ones used in google.countlm. If so, why are they arranged in a 15x5 matrix? Where can I read more about the same? Thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL:
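A note on this last question, hedged because it is based on reading the FAQ recipe rather than the count-LM code: the mixture weights in google.countlm.0 do appear to play the same role as the deleted-interpolation lambdas of equation 11.22, except that there is not a single lambda. If the file layout shown earlier in this digest is read as one row per history-count bucket (countmodulus 40 sets the bucket width; "mixweights 15" is followed by 16 rows, one per bucket) and one column per n-gram order, then each bucket and order gets its own weight, which is what the ngram-count step mentioned earlier in this digest estimates. A toy Python illustration of equation 11.22 with such count-bucketed weights; the function and variable names are made up:

def interpolated_prob(w, h, uni, bi, total, mixweights, countmodulus=40):
    # Deleted interpolation, bigram case: lambda * P_MLE(w|h) + (1-lambda) * P(w),
    # where lambda is shared by all histories whose count falls in the same bucket.
    h_count = uni.get(h, 0)
    bucket = min(h_count // countmodulus, len(mixweights) - 1)
    lam = mixweights[bucket]
    p_uni = uni.get(w, 0) / total                            # P(w_i), maximum likelihood
    p_bi = bi.get((h, w), 0) / h_count if h_count else 0.0   # P_MLE(w_i | w_{i-1})
    return lam * p_bi + (1 - lam) * p_uni

# Example: 16 buckets, all weights initialized to 0.5 as in google.countlm.0.
# uni = {"the": 100, "dog": 10}; bi = {("the", "dog"): 5}; total = 1000
# interpolated_prob("dog", "the", uni, bi, total, [0.5] * 16)   # -> 0.03

The -count-lm entry in the ngram man page is probably the closest thing to further documentation, alongside the srilm-faq page already cited in this digest.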