From nlp at pobox.sk Tue Jul 20 08:26:23 2004
From: nlp at pobox.sk (Robert Wagner)
Date: Tue, 20 Jul 2004 17:26:23 +0200
Subject: Interpolation of word-based and POS-based ngrams
Message-ID: <200407201526.i6KFQNr4006030@www3.pobox.sk>

Hello SRILM users!

Does anybody know if there is an implementation of interpolation
weights in SRILM? I have an ordinary word-based ngram and
part-of-speech-based ngram and want to interpolate them to create an HMM
model for disfluency detection (using the hidden-ngram tool). Is it
possible to do it directly in SRILM?

Regards
Robert Wagner

____________________________________
http://www.logofun.pobox.sk - make your phone happy

From stolcke at speech.sri.com Tue Jul 20 08:54:16 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 20 Jul 2004 08:54:16 PDT
Subject: Interpolation of word-based and POS-based ngrams
In-Reply-To: Your message of Tue, 20 Jul 2004 17:26:23 +0200.
    <200407201526.i6KFQNr4006030@www3.pobox.sk>
Message-ID: <200407201554.i6KFsGj09165@conga.speech.sri.com>

In message <200407201526.i6KFQNr4006030 at www3.pobox.sk> you wrote:
> Hello SRILM users!
> Does anybody know if there is an implementation of interpolation
> weights in SRILM? I have an ordinary word-based ngram and
> part-of-speech-based ngram and want to interpolate them to create an HMM
> model for disfluency detection (using the hidden-ngram tool). Is it
> possible to do it directly in SRILM?

By using the options

    -lm
    -classes
    -simple-classes
    -lambda
    -mix-lm

with hidden-ngram you can tell it to use an interpolated LM where
(one or both of) the component models are class-based.

For details see the man page.

--Andreas

From nlp at pobox.sk Tue Jul 20 10:01:10 2004
From: nlp at pobox.sk (Robert Wagner)
Date: Tue, 20 Jul 2004 19:01:10 +0200
Subject: Interpolation of word-based and POS-based ngrams
Message-ID: <200407201701.i6KH1A9R016054@www6.pobox.sk>

Hi Andreas,
my problem is that I use different data for both models.
The word-based model uses a text consisting of recognized words; the
POS-based class model uses a text consisting of recognized words' POS.
I have estimated this model simply by using the ngram-count tool from the
text where words were replaced by their POS tags.
POS-based classes are also not typical "simple classes"...

Robert

P.S.
It would be ideal to obtain the interpolation weights from SRILM as well;-)

>
> In message <200407201526.i6KFQNr4006030 at www3.pobox.sk> you wrote:
> > Hello SRILM users!
> > Does anybody know if there is an implementation of interpolation
> > weights in SRILM? I have an ordinary word-based ngram and
> > part-of-speech-based ngram and want to interpolate them to create an HMM
> > model for disfluency detection (using the hidden-ngram tool). Is it
> > possible to do it directly in SRILM?
>
> By using the options
>
>     -lm
>     -classes
>     -simple-classes
>     -lambda
>     -mix-lm
>
> with hidden-ngram you can tell it to use an interpolated LM where
> (one or both of) the component models are class-based.
>
> For details see the man page.
>
> --Andreas
>

____________________________________
http://www.pobox.sk/ - reliable and secure service

From katrin at ssli-mail.ee.washington.edu Tue Jul 20 11:25:54 2004
From: katrin at ssli-mail.ee.washington.edu (Katrin Kirchhoff)
Date: Tue, 20 Jul 2004 11:25:54 -0700
Subject: Interpolation of word-based and POS-based ngrams
In-Reply-To: <200407201701.i6KH1A9R016054@www6.pobox.sk>; from nlp@pobox.sk on Tue, Jul 20, 2004 at 07:01:10PM +0200
References: <200407201701.i6KH1A9R016054@www6.pobox.sk>
Message-ID: <20040720112554.B16976@duck.ee.washington.edu>

As far as I know, you need to write your own script to compute
P(word|POS) and create the classes file.
There's a compute-best-mix.gawk script in SRILM for estimating
interpolation weights.

KK

On Tue, Jul 20, 2004 at 07:01:10PM +0200, Robert Wagner wrote:
> Hi Andreas,
> my problem is that I use different data for both models.
> The word-based model uses a text consisting of recognized words; the
> POS-based class model uses a text consisting of recognized words' POS.
> I have estimated this model simply by using the ngram-count tool from the
> text where words were replaced by their POS tags.
> POS-based classes are also not typical "simple classes"...
>
> Robert
>
> P.S.
> It would be ideal to obtain the interpolation weights from SRILM as well;-)
>
> >
> > In message <200407201526.i6KFQNr4006030 at www3.pobox.sk> you wrote:
> > > Hello SRILM users!
> > > Does anybody know if there is an implementation of interpolation
> > > weights in SRILM? I have an ordinary word-based ngram and
> > > part-of-speech-based ngram and want to interpolate them to create an HMM
> > > model for disfluency detection (using the hidden-ngram tool). Is it
> > > possible to do it directly in SRILM?
> >
> > By using the options
> >
> >     -lm
> >     -classes
> >     -simple-classes
> >     -lambda
> >     -mix-lm
> >
> > with hidden-ngram you can tell it to use an interpolated LM where
> > (one or both of) the component models are class-based.
> >
> > For details see the man page.
> >
> > --Andreas
> >
>
> ____________________________________
> http://www.pobox.sk/ - reliable and secure service
>

--
-----------------------------------------------------------------
Katrin Kirchhoff
Dept of Electrical Engineering, University of Washington
M422 EE/CS Building, Box 352500, Seattle, WA, 98195
Phone: (206) 616 5494    katrin at ee.washington.edu
-----------------------------------------------------------------

From stolcke at speech.sri.com Tue Jul 20 11:50:22 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 20 Jul 2004 11:50:22 PDT
Subject: Interpolation of word-based and POS-based ngrams
In-Reply-To: Your message of Tue, 20 Jul 2004 19:01:10 +0200.
    <200407201701.i6KH1A9R016054@www6.pobox.sk>
Message-ID: <200407201850.LAA27699@huge>

In message <200407201701.i6KH1A9R016054 at www6.pobox.sk> you wrote:
> Hi Andreas,
> my problem is that I use different data for both models. The
> word-based model uses a text consisting of recognized words; the POS-based
> class model uses a text consisting of recognized words' POS. I have
> estimated this model simply by using the ngram-count tool from the
> text where words were replaced by their POS tags.
> POS-based classes are also not typical "simple classes"...

You are right of course. While hidden-ngram can theoretically handle
general class ngrams, the implementation is currently not able to handle
anything but toy examples. The reason is that general class-based models
are no longer Markovian: they require the complete word history. This
means hidden-ngram has to keep complete distinct histories for every
hypothesis, which quickly becomes infeasible.

With some small changes to the code one could approximate the full class
N-gram by truncating the ngram context used to a fixed length (say 4).
This might not hurt you much in practice, and would enable use of
class-based N-grams in hidden-ngram decoding. Let me know if you're
interested in that.

The other possibility (also an approximation) for now is to expand the
class ngram into a word ngram, but this might also fail due to resource
limitations, depending on your vocabulary size.

--Andreas

>
> Robert
>
> P.S.
> It would be ideal to obtain the interpolation weights from SRILM as well;-)
>
> >
> > In message <200407201526.i6KFQNr4006030 at www3.pobox.sk> you wrote:
> > > Hello SRILM users!
> > > Does anybody know if there is an implementation of interpolation
> > > weights in SRILM? I have an ordinary word-based ngram and
> > > part-of-speech-based ngram and want to interpolate them to create an HMM
> > > model for disfluency detection (using the hidden-ngram tool). Is it
> > > possible to do it directly in SRILM?
> >
> > By using the options
> >
> >     -lm
> >     -classes
> >     -simple-classes
> >     -lambda
> >     -mix-lm
> >
> > with hidden-ngram you can tell it to use an interpolated LM where
> > (one or both of) the component models are class-based.
> >
> > For details see the man page.
> >
> > --Andreas
> >
>
> ____________________________________
> http://www.pobox.sk/ - reliable and secure service
>

From duh at ee.washington.edu Fri Aug 13 14:52:33 2004
From: duh at ee.washington.edu (Kevin Duh)
Date: Fri, 13 Aug 2004 14:52:33 -0700
Subject: Memory problem with ngram-count
Message-ID: <411D3821.8090308@ee.washington.edu>

Hi,

I'm running into some memory limitations with ngram-count and am
wondering if anyone has any suggestions.

I have a very large text file (more than 1GB) as input to ngram-count. I
divided this text into smaller files and used the 'make-batch-counts'
and 'merge-batch-counts' commands to create a large count-file. Then, I
tried to use 'ngram-count -read myfile.counts -lm ...' to estimate a
language model. I receive the following error:

ngram-count: /SRILM/include/LHash.cc:127: void LHash<KeyT,
DataT>::alloc(unsigned int) [with KeyT = VocabIndex, DataT =
Trie]: Assertion `body != 0' failed.

Does anyone have any suggestions for solving this problem?

Thanks in advance,
Kevin

-----------------------------
Kevin Duh
Graduate Research Assistant
Dept. of Electrical Engineering
University of Washington
http://ssli.ee.washington.edu/people/duh

From stolcke at speech.sri.com Fri Aug 13 20:07:15 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 13 Aug 2004 20:07:15 PDT
Subject: Memory problem with ngram-count
In-Reply-To: Your message of Fri, 13 Aug 2004 14:52:33 -0700.
    <411D3821.8090308@ee.washington.edu>
Message-ID: <200408140307.UAA13114@huge>

In message <411D3821.8090308 at ee.washington.edu> you wrote:
> Hi,
>
> I'm running into some memory limitations with ngram-count and am
> wondering if anyone has any suggestions.
>
> I have a very large text file (more than 1GB) as input to ngram-count. I
> divided this text into smaller files and used the 'make-batch-counts'
> and 'merge-batch-counts' commands to create a large count-file. Then, I
> tried to use 'ngram-count -read myfile.counts -lm ...' to estimate a
> language model. I receive the following error:
>
> ngram-count: /SRILM/include/LHash.cc:127: void LHash<KeyT,
> DataT>::alloc(unsigned int) [with KeyT = VocabIndex, DataT =
> Trie]: Assertion `body != 0' failed.
>
> Does anyone have any suggestions for solving this problem?

1. Use a binary compiled for "compact" memory use. If you are lucky
   (the person who installed SRILM did a thorough job) you should find
   these installed in $SRILM/bin/${MACHINE_TYPE}_c/ ...

2. Use the make-big-lm script. See the training-scripts(1) man page
   for details.

3. Find a machine with more memory or swap space.

4. Some combination of the above.

--Andreas

From dw229 at hermes.cam.ac.uk Mon Aug 23 06:21:37 2004
From: dw229 at hermes.cam.ac.uk (Daniel Walker)
Date: Mon, 23 Aug 2004 14:21:37 +0100 (BST)
Subject: ngram-count and float counts
Message-ID:

I'm trying to create a back-off file using ngram-count and a count file
that has float "counts" derived from another model but get the error
message:

error in discount estimator for order 1

The -float-counts option says that it only supports certain types of
discounting but doesn't say which. Could this be the problem? Otherwise,
what sort of data problems could cause this message, does anyone know?

Thanks,

Dan

From stolcke at speech.sri.com Mon Aug 23 07:47:28 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 23 Aug 2004 07:47:28 PDT
Subject: ngram-count and float counts
In-Reply-To: Your message of Mon, 23 Aug 2004 14:21:37 +0100.
Message-ID: <200408231447.HAA21348@huge>

The only discounting methods that support fractional counts are

    -cdiscount*
    -wbdiscount*

(You can look in include/Discount.h and see which classes do NOT
have a member function

    virtual Boolean estimate(NgramCounts<FloatCount> &counts, unsigned order)
        { return false; };
)

--Andreas

In message you wrote:
> I'm trying to create a back-off file using ngram-count and a count file that
> has float "counts" derived from another model but get the error message:
>
> error in discount estimator for order 1
>
> The -float-counts option says that it only supports certain types of
> discounting but doesn't say which. Could this be the problem? Otherwise, what
> sort of data problems could cause this message, does anyone know?
>
> Thanks,
>
> Dan
>

From wangc at csail.mit.edu Fri Sep 3 10:27:44 2004
From: wangc at csail.mit.edu (Chao Wang)
Date: Fri, 03 Sep 2004 13:27:44 -0400
Subject: -unk flag
Message-ID: <4138A990.5060500@csail.mit.edu>

Could someone please tell me what the -unk flag will do to the
probability model? It seems that, with the -unk flag, the language model
will give a very good probability to unknown words, even when the
training sentences don't contain any unknown words. In fact, I found
that the probability for a sentence in the training data is inferior to
that of a sentence composed entirely of unknown words (the number of
words is the same in the two sentences). This is quite unexpected.

Thanks a lot!

Chao

--
Chao Wang, PhD
Spoken Language Systems Group
MIT CSAIL
http://www.sls.csail.mit.edu/wangc

From stolcke at speech.sri.com Mon Sep 6 13:54:16 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 06 Sep 2004 13:54:16 PDT
Subject: -unk flag
In-Reply-To: Your message of Fri, 03 Sep 2004 13:27:44 -0400.
    <4138A990.5060500@csail.mit.edu>
Message-ID: <200409062054.NAA11957@huge>

In message <4138A990.5060500 at csail.mit.edu> you wrote:
> Could someone please tell me what the -unk flag will do to the probability
> model?
> It seems that, with the -unk flag, the language model will give a very
> good probability to unknown words, even when the training sentences don't
> contain any unknown words. In fact, I found that the probability for a
> sentence in the training data is inferior to that of a sentence composed
> entirely of unknown words (the number of words is the same in the two
> sentences). This is quite unexpected.

ngram-count -unk builds an LM that has <unk> as a word type and assigns
non-zero probability to it (the default is not to include <unk> in the
LM). All words that are not in the training data or not listed in the
-vocab file are mapped to <unk>. This is what is commonly known as an
"open-vocabulary" LM.

ngram -unk should be used to evaluate an LM that contains <unk>.
(A warning will be issued if -unk is not specified and the LM contains a
non-zero probability for <unk>).

The behavior you describe is certainly not expected. But to figure out
why it happens one would have to look at the data and exact command
invocations you are using.

--Andreas
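
For readers following the thread, the word-to-<unk> mapping described in the last message can be pictured roughly as below. This is only a minimal Python sketch of the idea, not SRILM's code: the helper names build_vocab and map_to_unk and the toy sentences are made up for illustration, and whitespace tokenization is assumed.

```python
UNK = "<unk>"  # the unknown-word token named in the message above


def build_vocab(sentences):
    """Collect the word types seen in whitespace-tokenized sentences.

    Hypothetical helper; stands in for an explicit vocabulary file.
    """
    return {tok for sent in sentences for tok in sent.split()}


def map_to_unk(tokens, vocab):
    """Replace every token that is not in `vocab` with the <unk> token."""
    return [tok if tok in vocab else UNK for tok in tokens]


if __name__ == "__main__":
    train = ["the cat sat", "the dog sat"]  # toy training text (assumed)
    vocab = build_vocab(train)
    # "bird" never occurs in the toy training text, so it is mapped to <unk>.
    print(map_to_unk("the bird sat".split(), vocab))
```

In an open-vocabulary model <unk> is then treated like any other word type, so the probability it receives depends on how many tokens were mapped to it when the model was estimated.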