From kyawkyawzinn at gmail.com Fri Apr 1 00:42:28 2011 From: kyawkyawzinn at gmail.com (Kyaw Kyaw Zin) Date: 1 Apr 2011 08:42:28 +0100 Subject: [SRILM User List] Light A Candle With SocialKonnekt And Pray For Japan Message-ID: Hi srilm-user , I just light a candle for Japan Victims. Join us to pray for those who have lost their lives and hope for the best for those who have survived. It is time to light a candle and Pray... Please Light a Candle Now at: http://www.socialkonnekt.com/Tsunami/ Warm Regards, Kyaw Kyaw Zin From stolcke at icsi.berkeley.edu Fri Apr 1 08:26:20 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 01 Apr 2011 08:26:20 -0700 Subject: [SRILM User List] Light A Candle With SocialKonnekt And Pray For Japan In-Reply-To: References: Message-ID: <4D95EE9C.1040106@icsi.berkeley.edu> Kyaw Kyaw Zin wrote: > Hi srilm-user , > > I just light a candle for Japan Victims. > Join us to pray for those who have lost their lives and hope for the best for those who have survived. > It is time to light a candle and Pray... > > Please Light a Candle Now at: http://www.socialkonnekt.com/Tsunami/ > > Warm Regards, > Kyaw Kyaw Zin > Please do not use srilm-user for spam of this nature (however well-meaning). srilm-user is only to be used for discussion relating to SRILM. Thanks Andreas From philpot at isi.edu Thu Apr 7 16:23:22 2011 From: philpot at isi.edu (Andrew Philpot) Date: Thu, 07 Apr 2011 16:23:22 -0700 Subject: [SRILM User List] unexplained failure in ngram-count Message-ID: <4D9E476A.9050007@isi.edu> ngram-count failed on a largish file (but not the largest I was considering applying it to). The file contained 325 million short sentences, and was 10 GB in total size. I am running on a 64-bit machine with 16GB physical memory (2 2.3 Ghz CPU x 4 cores) running RH EL5. The precise invocation was: ./bin/i686-m64/ngram-count -text all_lm5 -no-sos -no-eos -unk -write-vocab vocab5.txt -write counts5.txt -order 5 -lm lm5.txt The precise error message was: ngram-count: /home/eh-01/philpot/srilm/include/LHash.cc:138: void LHash::alloc(unsigned int) [with KeyT = unsigned int, DataT = Trie]: Assertion `body != 0' failed. Is this likely to be a problem with my data, with my machine's memory, with the code, or what? Andrew From stolcke at icsi.berkeley.edu Thu Apr 7 16:32:35 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 07 Apr 2011 16:32:35 -0700 Subject: [SRILM User List] unexplained failure in ngram-count In-Reply-To: <4D9E476A.9050007@isi.edu> References: <4D9E476A.9050007@isi.edu> Message-ID: <4D9E4993.4070500@icsi.berkeley.edu> Andrew Philpot wrote: > > ngram-count failed on a largish file (but not the largest I was > considering applying it to). The file contained 325 million short > sentences, and was 10 GB in total size. > > I am running on a 64-bit machine with 16GB physical memory (2 2.3 Ghz > CPU x 4 cores) running RH EL5. > > The precise invocation was: > > ./bin/i686-m64/ngram-count -text all_lm5 -no-sos -no-eos -unk > -write-vocab vocab5.txt -write counts5.txt -order 5 -lm lm5.txt > > The precise error message was: > > ngram-count: /home/eh-01/philpot/srilm/include/LHash.cc:138: void > LHash::alloc(unsigned int) [with KeyT = unsigned int, > DataT = Trie]: Assertion `body != 0' > failed. > > Is this likely to be a problem with my data, with my machine's memory, > with the code, or what? Please check the FAQ item on large data and memory issues. 
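(In short: that assertion failure means ngram-count ran out of memory.) The usual workaround for data this size is to split the corpus, count the pieces separately, merge the counts, and estimate the model with make-big-lm, which needs much less memory than a single ngram-count -lm run. Untested, and the file and directory names below are placeholders, but the recipe looks roughly like this:

  split -l 10000000 all_lm5 piece.
  ls piece.* > file-list
  make-batch-counts file-list 10 cat counts -order 5 -no-sos -no-eos
  merge-batch-counts counts
  make-big-lm -read counts/*.ngrams.gz -name biglm -order 5 -unk -lm lm5.txt

(make-big-lm reads the single merged count file that merge-batch-counts leaves in the counts directory.) Details are in the FAQ: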
http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html Andreas > > Andrew > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From zeeshankhans at gmail.com Mon Apr 11 10:59:25 2011 From: zeeshankhans at gmail.com (zeeshan khan) Date: Mon, 11 Apr 2011 19:59:25 +0200 Subject: [SRILM User List] Interpreting ngram output with -debug 2 , -cache and -cache-lambda options Message-ID: Hi all, I want to understand the debug 2 output given by ngram tool using (and not using) the -cache and -cache-lambda options. here are the two commands using (and not using) the -cache and -cache-lambda options : ngram -unk "UNKNOWN" -order 4 -lm -ppl -debug 2 -cache 350 -cache-lambda 0.1 AND ngram -unk "UNKNOWN" -order 4 -lm -ppl -debug 2 I have the following questions: 1. What is the meaning of [cache=xxxx] in each line and how is it calculated. 2. I cannot understand why the 2 probabilities are different in those lines of the output where the cache-probability is zero eg; in first 5 lines of both outputs. 3. Can there be any case where the first entry in each line i.e. [ngram] will be different among the two outputs ? if yes, how can it be ? and here are the first few lines of the outputs of each command: ------------------------------------------------------------------------------------------------------------------------ WITHOUT the -cache and -cache-lambda options: ------------------------------------------------------------------------------------------------------------------------ this is a podcast of the highlights from today's woman's hour copyright issues mean that we can't always include all the items from the programme p( this | ) = [2gram] 0.0155235 [ -1.80901 ] p( is | this ...) = [3gram] 0.384267 [ -0.415367 ] p( a | is ...) = [4gram] 0.171555 [ -0.765597 ] p( podcast | a ...) = [4gram] 7.7717e-06 [ -5.10948 ] p( of | podcast ...) = [4gram] 0.108064 [ -0.966317 ] p( the | of ...) = [4gram] 0.366697 [ -0.435692 ] p( highlights | the ...) = [3gram] 4.88751e-05 [ -4.31091 ] p( from | highlights ...) = [4gram] 0.077328 [ -1.11166 ] p( today's | from ...) = [4gram] 0.00790939 [ -2.10186 ] p( woman's | today's ...) = [2gram] 9.67272e-06 [ -5.01445 ] p( hour | woman's ...) = [3gram] 0.218998 [ -0.659561 ] p( copyright | hour ...) = [1gram] 3.56089e-06 [ -5.44844 ] p( issues | copyright ...) = [2gram] 0.0196718 [ -1.70615 ] p( mean | issues ...) = [2gram] 0.00024042 [ -3.61903 ] p( that | mean ...) = [3gram] 0.211744 [ -0.674189 ] p( we | that ...) = [3gram] 0.0179052 [ -1.74702 ] p( can't | we ...) = [4gram] 0.0186763 [ -1.72871 ] p( always | can't ...) = [4gram] 0.00198593 [ -2.70204 ] p( include | always ...) = [3gram] 0.000752505 [ -3.12349 ] p( all | include ...) = [3gram] 0.00575442 [ -2.24 ] p( the | all ...) = [4gram] 0.314584 [ -0.502263 ] p( items | the ...) = [4gram] 0.00158827 [ -2.79908 ] p( from | items ...) = [4gram] 0.0124186 [ -1.90593 ] p( the | from ...) = [4gram] 0.415841 [ -0.381072 ] p( programme | the ...) = [3gram] 0.000297532 [ -3.52647 ] p( | programme ...) 
= [4gram] 0.288492 [ -0.539866 ] 1 sentences, 25 words, 0 OOVs 0 zeroprobs, logprob= -55.3437 ppl= 134.463 ppl1= 163.586 ----------------------------------------------------------------------------------------------------------------------- WITH the -cache and -cache-lambda options: ----------------------------------------------------------------------------------------------------------------------- this is a podcast of the highlights from today's woman's hour copyright issues mean that we can't always include all the items from the programme p( this | ) = [2gram][cache=0] 0.0139712 [ -1.85477 ] p( is | this ...) = [3gram][cache=0] 0.34584 [ -0.461124 ] p( a | is ...) = [4gram][cache=0] 0.154399 [ -0.811355 ] p( podcast | a ...) = [4gram][cache=0] 6.99453e-06 [ -5.15524 ] p( of | podcast ...) = [4gram][cache=0] 0.0972579 [ -1.01207 ] p( the | of ...) = [4gram][cache=0] 0.330028 [ -0.48145 ] p( highlights | the ...) = [3gram][cache=0] 4.39876e-05 [ -4.35667 ] p( from | highlights ...) = [4gram][cache=0] 0.0695952 [ -1.15742 ] p( today's | from ...) = [4gram][cache=0] 0.00711845 [ -2.14761 ] p( woman's | today's ...) = [2gram][cache=0] 8.70545e-06 [ -5.06021 ] p( hour | woman's ...) = [3gram][cache=0] 0.197098 [ -0.705318 ] p( copyright | hour ...) = [1gram][cache=0] 3.2048e-06 [ -5.4942 ] p( issues | copyright ...) = [2gram][cache=0] 0.0177047 [ -1.75191 ] p( mean | issues ...) = [2gram][cache=0] 0.000216378 [ -3.66479 ] p( that | mean ...) = [3gram][cache=0] 0.190569 [ -0.719947 ] p( we | that ...) = [3gram][cache=0] 0.0161147 [ -1.79278 ] p( can't | we ...) = [4gram][cache=0] 0.0168087 [ -1.77447 ] p( always | can't ...) = [4gram][cache=0] 0.00178733 [ -2.74779 ] p( include | always ...) = [3gram][cache=0] 0.000677254 [ -3.16925 ] p( all | include ...) = [3gram][cache=0] 0.00517898 [ -2.28576 ] p( the | all ...) = [4gram][cache=0.05] 0.288126 [ -0.540418 ] p( items | the ...) = [4gram][cache=0] 0.00142944 [ -2.84483 ] p( from | items ...) = [4gram][cache=0.0454545] 0.0157222 [ -1.80349 ] p( the | from ...) = [4gram][cache=0.0869565] 0.382953 [ -0.416855 ] p( programme | the ...) = [3gram][cache=0] 0.000267779 [ -3.57222 ] p( | programme ...) = [4gram][cache=0] 0.259643 [ -0.585623 ] 1 sentences, 25 words, 0 OOVs 0 zeroprobs, logprob= -56.3676 ppl= 147.226 ppl1= 179.764 ----------------------------------------------------------------------------------------------------------------------- best regards, Zeeshan Khan -------------- next part -------------- An HTML attachment was scrubbed... URL: From philpot at isi.edu Mon Apr 18 14:31:33 2011 From: philpot at isi.edu (Andrew Philpot) Date: Mon, 18 Apr 2011 14:31:33 -0700 Subject: [SRILM User List] harmonizing results with/without '-use-server' Message-ID: <4DACADB5.2060700@isi.edu> Testing on a rather small language model, I notice that I get different results for a given input lattice/pfsg depending on whether I interrogate the LM directly on the command line or resident in a server. Precisely, the server is started thus: ngram -lm data/lm5a.lm -unk -server-port 2525 and then invoked via a command such as lattice-tool -in-lattice simple.pfsg -use-server 2525 at cent64.isi.edu -nbest-decode 10 -out-nbest-dir server-out/ while the command line version is invoked thus: lattice-tool -unk -in-lattice simple.pfsg -lm data/lm5a.lm -nbest-decode 10 -out-nbest-dir cmdline-out/ It's my intention and understanding that these two would be equivalent, but they are not. 
As far as I can tell, the nature of the discrepancy is that the server generates only acoustic probabilities, no LM probabilities (well, they are all 0). Also the order of returned results is different, but that very well could be due to the former issue. Finally, the acoustic probabilities are equal (within a 10-best window) in the server-based case, but vary slightly in the command line-based case.

I'd like to have results equivalent to the command-line invocation, but with the potential speedup provided by the -server-port/-use-server case. Is this possible, and if so, which parameter adjustments do I need to make?

Andrew

From fabian_in_hongkong at hotmail.com Wed Apr 20 01:08:41 2011
From: fabian_in_hongkong at hotmail.com (Fabian -)
Date: Wed, 20 Apr 2011 10:08:41 +0200
Subject: [SRILM User List] classes-format question
Message-ID:

Hi,
I'm still experimenting with class-based (actually POS) LMs. I use my own 61 classes/PoS. I built a class LM which works fine for decoding. But I also want to compute the perplexity. If I build a mapping file as mentioned in the classes-format manual page (with probabilities=1) I get a ppl of 8.
So I computed the probabilities for mapping class x to word j as follows:

# word j in class x
---------------------------
# occurrences of class x

Now I get a ppl of ~1300. This seems a bit high!? I have a total of 20k mappings with a vocab of 12k! The LM is an interpolation of a pure 3g class LM and a 3g word LM. The word LM usually has a ppl of ~500. The ASR error rates of the word-based and interpolated LMs are similar, though.

Can you help me?
Thanks,
Fabian
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From stolcke at icsi.berkeley.edu Wed Apr 20 13:59:59 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 20 Apr 2011 13:59:59 -0700
Subject: [SRILM User List] classes-format question
In-Reply-To:
References:
Message-ID: <4DAF494F.1070006@icsi.berkeley.edu>

Fabian - wrote:
> Hi,
> I'm still experimenting with class-based (actually POS) LMs. I use my
> own 61 classes/PoS. I built a class LM which works fine for decoding.
> But I also want to compute the perplexity. If I build a mapping file
> as mentioned in the classes-format manual page (with
> probabilities=1) I get a ppl of 8.
You mean when you replace all the words with their class labels?
> So I computed the probabilities for mapping class x to word j as follows:
>
> # word j in class x
> ---------------------------
> # occurrences of class x
>
> Now I get a ppl of ~1300. This seems a bit high!?
It depends. You might have to smooth these probabilities, just like
ngram probabilities. Try

# word j in class x + 1
---------------------------
# occurrences of class x + # classes

>
> I have a total of 20k mappings with a vocab of 12k! The LM is an
> interpolation of a pure 3g class LM and a 3g word LM. The word LM
> usually has a ppl of ~500. The ASR error rates of the word-based and
> interpolated LMs are similar, though.
Make sure you use -bayes 0 when interpolating word and class-based LMs.
You should not merge LMs of different types statically (without -bayes).

Andreas

>
> Can you help me?
> Thanks, > Fabian > ------------------------------------------------------------------------ > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From stolcke at icsi.berkeley.edu Wed Apr 20 21:03:55 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 20 Apr 2011 21:03:55 -0700 Subject: [SRILM User List] classes-format question In-Reply-To: <4DAF494F.1070006@icsi.berkeley.edu> References: <4DAF494F.1070006@icsi.berkeley.edu> Message-ID: <4DAFACAB.6070908@icsi.berkeley.edu> Andreas Stolcke wrote: > Fabian - wrote: >> Hi, >> I'm still experimenting with class-based (actually POS) LMs. I use my >> own 61 classes/PoS. I built a class LM which works fine for decoding. >> But I also want to compute the perplexity. If I built a mapping file >> like mentioned in the classes-format manual page (with >> probabilities=1) I get a ppl of 8. > You mean when you replace all the words with their class labels? > >> So I computed the probabilities for mapping class x to word j as >> followed: >> >> # word j in class x >> --------------------------- >> #occurences of class x >> >> Now I get a ppl of ~1300. This seems a bit high!? > It depends. You might have to smooth these probabilities, just like > ngram probabilities. Try > > # word j in class x + 1 > --------------------------- > #occurences of class x + # classes Correction: the add-1 smoothing formula for class membership should read: # word j in class x + 1 --------------------------- #occurences of class x + # word-types Andreas From stolcke at icsi.berkeley.edu Wed Apr 20 23:27:31 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 20 Apr 2011 23:27:31 -0700 Subject: [SRILM User List] harmonizing results with/without '-use-server' In-Reply-To: Your message of Mon, 18 Apr 2011 14:31:33 -0700. <4DACADB5.2060700@isi.edu> Message-ID: <201104210627.p3L6RVCU023428@fruitcake.ICSI.Berkeley.EDU> In message <4DACADB5.2060700 at isi.edu>you wrote: > > Testing on a rather small language model, I notice that I get > different results for a given input lattice/pfsg depending on whether > I interrogate the LM directly on the command line or resident in a > server. > > Precisely, the server is started thus: > > ngram -lm data/lm5a.lm -unk -server-port 2525 > > and then invoked via a command such as > > lattice-tool -in-lattice simple.pfsg -use-server 2525 at cent64.isi.edu > -nbest-decode 10 -out-nbest-dir server-out/ > > while the command line version is invoked thus: > > lattice-tool -unk -in-lattice simple.pfsg -lm data/lm5a.lm > -nbest-decode 10 -out-nbest-dir cmdline-out/ > > It's my intention and understanding that these two would be > equivalent, but they are not. As far as I can tell, the nature of the > discrepancy is that the server generates only acoustic probabilities, > no LM probabilities (well, they are all 0). Also the order of > returned results is different, but that very well could be due to the > former issue. Finally, the acoustic probabilities are are equal > (within a 10-best window) in the server-based case, but vary slightly > in the command line-based case. > > I'd like to have results equivalent to the command-line invocation, > but with the potential speedup provided by the > -server-port/-use-server case. Is this possible, and if so, which > parameter adjustments do I need to make? Andrew, due to a bug, the -use-server LM was completely ignored in nbest generation. 
The patch below will fix this and should give you the expected results.

Andreas

diff -c -r1.155 lattice-tool.cc
*** lattice/src/lattice-tool.cc	29 Jan 2011 05:56:35 -0000	1.155
--- lattice/src/lattice-tool.cc	19 Apr 2011 00:25:54 -0000
***************
*** 544,550 ****
      }

      if (viterbiDecode) {
! 	LM *plm = lmFile ? &lm : 0;

  	if (outputCTM) {
  	    NBestWordInfo *bestwords = new NBestWordInfo[maxWordsPerLine + 1];
--- 544,550 ----
      }

      if (viterbiDecode) {
! 	LM *plm = (lmFile || useServer) ? &lm : 0;

  	if (outputCTM) {
  	    NBestWordInfo *bestwords = new NBestWordInfo[maxWordsPerLine + 1];
***************
*** 610,616 ****
  					nbestMaxHyps, nbestDuplicates);
  	    }
  	} else {
! 	    LM *plm = lmFile ? &lm : 0;

  	    lat.decodeNBest(nbestDecode, nbestOut, noiseWords, plm, order,
  			    maxDegree, beamwidth > 0.0 ?
--- 610,616 ----
  					nbestMaxHyps, nbestDuplicates);
  	    }
  	} else {
! 	    LM *plm = (lmFile || useServer) ? &lm : 0;

  	    lat.decodeNBest(nbestDecode, nbestOut, noiseWords, plm, order,
  			    maxDegree, beamwidth > 0.0 ?

From stolcke at icsi.berkeley.edu Wed Apr 20 23:46:05 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 20 Apr 2011 23:46:05 -0700
Subject: [SRILM User List] Variable N-grams
In-Reply-To: Your message of Tue, 08 Feb 2011 15:33:20 +0330.
Message-ID: <201104210646.p3L6k5kh024657@fruitcake.ICSI.Berkeley.EDU>

In message you wrote:
>
> hi all,
> I read a paper titled "Variable N-grams and Extensions for Conversational
> Speech Language modeling". I wonder is there any option in SRILM that helps
> me to make a variable N-gram language model?

You can achieve a "variable N-gram" type LM by first building a high-order LM and then pruning the ngrams that don't give much perplexity gain (see the ngram -prune option). However, building the unpruned LM first might run into memory limitations. Also, there are known issues with pruning KN-smoothed Ngram models.

A group in Helsinki developed an LM toolkit that implements selective growing of ngrams and handles the KN smoothing properly. See http://users.ics.tkk.fi/vsiivola/papers/is2007less.pdf for more information.

Andreas

From stolcke at icsi.berkeley.edu Fri Apr 22 13:18:35 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 22 Apr 2011 13:18:35 -0700
Subject: [SRILM User List] classes-format question + followup question
In-Reply-To:
References: , <4DAF494F.1070006@icsi.berkeley.edu>
Message-ID: <4DB1E29B.4080400@icsi.berkeley.edu>

Fabian - wrote:
> Hi,
>
> thank you again for the quick help!
> I added the smoothing and the PPL dropped to 720 which is a bit
> better, but still above the range ~500 which would "feel" correct.
> Anyways,...
>
You might want to verify that your probabilities are normalized correctly. Try ngram -debug 3 -ppl <text>.
>
> ...I have another question:
>
> why can't I use the static interpolation for interpolating one class
> LM and one word LM? I use a class-based LM (from ngram-count) or one
> class-based LM with my own tags with the word-based LM. In the
> documentation it only says -mix-lm with static interpolation won't
> work correctly?
I didn't realize you want to interpolate two class-based LMs. That should work, you just need to keep the class labels distinct, and combine the class definition files into one file.
> I want to build interpolated LMs (with -write-lm) to use them in my
> ASR, so far I simply used the static interpolation, which seems to
> work more or less OK.
You should be able to run ngram -mix-lm -write-lm with two class-based LMs but WITHOUT using the -classes option when doing so. If you include -classes, the class definitions will be appended to the LM file.
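For example (untested, and the file names are made up), with two class LMs whose definition files use disjoint class labels:

  ngram -order 3 -lm class1.lm -mix-lm class2.lm -lambda 0.5 -write-lm merged.lm
  cat class1.defs class2.defs > all.defs
  ngram -order 3 -lm merged.lm -classes all.defs -ppl test.txt

That is, the class definitions stay out of the merge step and are supplied only when the merged model is actually used.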
Andreas > > > -Fabian > > > Date: Wed, 20 Apr 2011 13:59:59 -0700 > > From: stolcke at ICSI.Berkeley.EDU > > To: fabian_in_hongkong at hotmail.com > > CC: srilm-user at speech.sri.com > > Subject: Re: [SRILM User List] classes-format question > > > > Fabian - wrote: > > > Hi, > > > I'm still experimenting with class-based (actually POS) LMs. I use my > > > own 61 classes/PoS. I built a class LM which works fine for decoding. > > > But I also want to compute the perplexity. If I built a mapping file > > > like mentioned in the classes-format manual page (with > > > probabilities=1) I get a ppl of 8. > > You mean when you replace all the words with their class labels? > Yes > > > > > > So I computed the probabilities for mapping class x to word j as > followed: > > > > > > # word j in class x > > > --------------------------- > > > #occurences of class x + .... > > > > > > > > Now I get a ppl of ~1300. This seems a bit high!? > > It depends. You might have to smooth these probabilities, just like > > ngram probabilities. > > Try > > > > # word j in class x + 1 > > --------------------------- > > #occurences of class x + # classes > > > > > > > > > > I have a total of 20k mappings with a vocab of 12k! The LM is an > > > interpolation of a pure 3g class LM and a 3g word LM. The word LM has > > > usually a ppl of ~500. The ASR Error rate of the word based and > > > interpolated are similar though. > > Make sure you use -bayes 0 when interpolating word and class-based LMs. > > You should not merge LMs of different types statically (without -bayes). > > > > Andreas > > > > > > > > Can you help me? > > > Thanks, > > > Fabian > > > > ------------------------------------------------------------------------ > > > > > > _______________________________________________ > > > SRILM-User site list > > > SRILM-User at speech.sri.com > > > http://www.speech.sri.com/mailman/listinfo/srilm-user > > From stolcke at icsi.berkeley.edu Wed Apr 27 11:50:53 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 27 Apr 2011 11:50:53 -0700 Subject: [SRILM User List] classes-format question + followup question In-Reply-To: References: , <4DAF494F.1070006@icsi.berkeley.edu> , <4DB1E29B.4080400@icsi.berkeley.edu> Message-ID: <4DB8658D.4080503@icsi.berkeley.edu> Fabian - wrote: > Hi Andreas, > > thank you again for the quick answer. Unfortunately didn't make myself > clear. I really want to interpolate one class LM and one word LM. > Where the classes are part-of-speech tags. So the question is, again, > why is static interpolation not correct/possible? Although the class LM mechanism in SRILM can handle ngrams over a mix of words and classes, empirically it does not work well to merge (statically interpolate) models where one is a purely word-based and the other is a class-based ngram LM. This is because ngram -mix-lm WITHOUT the -bayes 0 option does not just implement the standard interpolation of probability estimates, it also merges the ngrams used for backoff computation (this is explained in the 2002 Interspeech paper). This works fine, and usually improves the results when combining models of the same type, but merging a lower-order ngram with a lower-order class-based LM gives weird results because the class expansions is not applied at the backoff level when performing the merge. For this reason, the ngram man page says (see the last sentence): -mix-lm file Read a second N-gram model for interpolation purposes. 
The second and any additional interpolated models can also be class N-grams (using the same -classes definitions), but are otherwise constrained to be standard N-grams, i.e., the options -df, -tagged, -skip, and -hidden-vocab do not apply to them.
NOTE: Unless -bayes (see below) is specified, -mix-lm triggers a static interpolation of the models in memory. In most cases a more efficient, dynamic interpolation is sufficient, requested by -bayes 0. Also, mixing models of different type (e.g., word-based and class-based) will only work correctly with dynamic interpolation.

So you might just have to re-engineer your application to accept true interpolated LMs, or, if it's feasible, convert the class LM into a word-based LM with ngram -expand-classes BEFORE doing the merging of models. Sorry.

Andreas

> > > thank you again for the quick help!
> > > I added the smoothing and the PPL dropped to 720 which is a bit
> > > better, but still above the range ~500 which would "feel" correct.
> > > Anyways,...
> > >
> > You might want to verify that your probabilities are normalized
> > correctly. Try ngram -debug 3 -ppl <text>.
> Well, it seems that the probabilities are not properly normalized ->
> there are many warnings:
> for example:
> warning: word probs for this context sum to 2.48076 != 1 : ...
>
> > > ...I have another question:
> > >
> > > why can't I use the static interpolation for interpolating one class
> > > LM and one word LM? I use a class-based LM (from ngram-count) or one
> > > class-based LM with my own tags with the word-based LM. In the
> > > documentation it only says -mix-lm with static interpolation won't
> > > work correctly?
> > I didn't realize you want to interpolate two class-based LMs. That
> > should work, you just need to keep the class labels distinct, and
> > combine the class definition files into one file.
> > > I want to build interpolated LMs (with -write-lm) to use them in my
> > > ASR, so far I simply used the static interpolation, which seems to
> > > work more or less OK.
> > You should be able to run ngram -mix-lm -write-lm with two class-based LMs
> > but WITHOUT using the -classes option when doing so.
> > If you include -classes, the class definitions will be appended to the LM file.
>
> Fabian

From stolcke at icsi.berkeley.edu Wed Apr 27 17:26:06 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 27 Apr 2011 17:26:06 -0700
Subject: [SRILM User List] Interpreting ngram output with -debug 2, -cache and -cache-lambda options
In-Reply-To:
References:
Message-ID: <4DB8B41E.1020909@icsi.berkeley.edu>

zeeshan khan wrote:
> Hi all,
> I want to understand the debug 2 output given by the ngram tool using (and
> not using) the -cache and -cache-lambda options.
>
> here are the two commands using (and not using) the -cache and
> -cache-lambda options :
> ngram -unk "UNKNOWN" -order 4 -lm <LM> -ppl <text> -debug 2
> -cache 350 -cache-lambda 0.1
> AND
> ngram -unk "UNKNOWN" -order 4 -lm <LM> -ppl <text> -debug 2
>
> I have the following questions:
> 1. What is the meaning of [cache=xxxx] in each line and how is it
> calculated?
The xxxx part is the conditional probability due to the cache LM alone (i.e., the number of occurrences of the word in the cache window, divided by the total number of words). The probability actually printed is then the interpolation of the two models, (1 - lambda) * p_ngram + lambda * p_cache, with lambda given by -cache-lambda (0.1 in your first command).
> 2. I cannot understand why the 2 probabilities are different in those
> lines of the output where the cache probability is zero, e.g., in the first 5
> lines of both outputs.
Because you're interpolating the standard ngram probability with the cache LM probability.
If the latter is 0 it will "drag down" the overall probability. > 3. Can there be any case where the first entry in each line i.e. > [ngram] will be different among the two outputs ? if yes, how can it be ? The [Ngram] part of the output should always be the same, because it is generated by the ngram LM alone. Andreas > > and here are the first few lines of the outputs of each command: > > > ------------------------------------------------------------------------------------------------------------------------ > WITHOUT the -cache and -cache-lambda options: > ------------------------------------------------------------------------------------------------------------------------ > this is a podcast of the highlights from today's woman's hour > copyright issues mean that we can't always include all the items from > the programme > p( this | ) = [2gram] 0.0155235 [ -1.80901 ] > p( is | this ...) = [3gram] 0.384267 [ -0.415367 ] > p( a | is ...) = [4gram] 0.171555 [ -0.765597 ] > p( podcast | a ...) = [4gram] 7.7717e-06 [ -5.10948 ] > p( of | podcast ...) = [4gram] 0.108064 [ -0.966317 ] > p( the | of ...) = [4gram] 0.366697 [ -0.435692 ] > p( highlights | the ...) = [3gram] 4.88751e-05 [ -4.31091 ] > p( from | highlights ...) = [4gram] 0.077328 [ -1.11166 ] > p( today's | from ...) = [4gram] 0.00790939 [ -2.10186 ] > p( woman's | today's ...) = [2gram] 9.67272e-06 [ -5.01445 ] > p( hour | woman's ...) = [3gram] 0.218998 [ -0.659561 ] > p( copyright | hour ...) = [1gram] 3.56089e-06 [ -5.44844 ] > p( issues | copyright ...) = [2gram] 0.0196718 [ -1.70615 ] > p( mean | issues ...) = [2gram] 0.00024042 [ -3.61903 ] > p( that | mean ...) = [3gram] 0.211744 [ -0.674189 ] > p( we | that ...) = [3gram] 0.0179052 [ -1.74702 ] > p( can't | we ...) = [4gram] 0.0186763 [ -1.72871 ] > p( always | can't ...) = [4gram] 0.00198593 [ -2.70204 ] > p( include | always ...) = [3gram] 0.000752505 [ -3.12349 ] > p( all | include ...) = [3gram] 0.00575442 [ -2.24 ] > p( the | all ...) = [4gram] 0.314584 [ -0.502263 ] > p( items | the ...) = [4gram] 0.00158827 [ -2.79908 ] > p( from | items ...) = [4gram] 0.0124186 [ -1.90593 ] > p( the | from ...) = [4gram] 0.415841 [ -0.381072 ] > p( programme | the ...) = [3gram] 0.000297532 [ -3.52647 ] > p( | programme ...) = [4gram] 0.288492 [ -0.539866 ] > 1 sentences, 25 words, 0 OOVs > 0 zeroprobs, logprob= -55.3437 ppl= 134.463 ppl1= 163.586 > > > > ----------------------------------------------------------------------------------------------------------------------- > WITH the -cache and -cache-lambda options: > ----------------------------------------------------------------------------------------------------------------------- > this is a podcast of the highlights from today's woman's hour > copyright issues mean that we can't always include all the items from > the programme > p( this | ) = [2gram][cache=0] 0.0139712 [ -1.85477 ] > p( is | this ...) = [3gram][cache=0] 0.34584 [ -0.461124 ] > p( a | is ...) = [4gram][cache=0] 0.154399 [ -0.811355 ] > p( podcast | a ...) = [4gram][cache=0] 6.99453e-06 [ > -5.15524 ] > p( of | podcast ...) = [4gram][cache=0] 0.0972579 [ -1.01207 ] > p( the | of ...) = [4gram][cache=0] 0.330028 [ -0.48145 ] > p( highlights | the ...) = [3gram][cache=0] 4.39876e-05 > [ -4.35667 ] > p( from | highlights ...) = [4gram][cache=0] 0.0695952 [ > -1.15742 ] > p( today's | from ...) = [4gram][cache=0] 0.00711845 [ -2.14761 ] > p( woman's | today's ...) = [2gram][cache=0] 8.70545e-06 > [ -5.06021 ] > p( hour | woman's ...) 
= [3gram][cache=0] 0.197098 [ -0.705318 ] > p( copyright | hour ...) = [1gram][cache=0] 3.2048e-06 > [ -5.4942 ] > p( issues | copyright ...) = [2gram][cache=0] 0.0177047 [ > -1.75191 ] > p( mean | issues ...) = [2gram][cache=0] 0.000216378 [ > -3.66479 ] > p( that | mean ...) = [3gram][cache=0] 0.190569 [ -0.719947 ] > p( we | that ...) = [3gram][cache=0] 0.0161147 [ -1.79278 ] > p( can't | we ...) = [4gram][cache=0] 0.0168087 [ -1.77447 ] > p( always | can't ...) = [4gram][cache=0] 0.00178733 [ -2.74779 ] > p( include | always ...) = [3gram][cache=0] 0.000677254 > [ -3.16925 ] > p( all | include ...) = [3gram][cache=0] 0.00517898 [ -2.28576 ] > p( the | all ...) = [4gram][cache=0.05] 0.288126 [ > -0.540418 ] > p( items | the ...) = [4gram][cache=0] 0.00142944 [ -2.84483 ] > p( from | items ...) = [4gram][cache=0.0454545] 0.0157222 [ > -1.80349 ] > p( the | from ...) = [4gram][cache=0.0869565] 0.382953 [ > -0.416855 ] > p( programme | the ...) = [3gram][cache=0] 0.000267779 > [ -3.57222 ] > p( | programme ...) = [4gram][cache=0] 0.259643 [ > -0.585623 ] > 1 sentences, 25 words, 0 OOVs > 0 zeroprobs, logprob= -56.3676 ppl= 147.226 ppl1= 179.764 > > ----------------------------------------------------------------------------------------------------------------------- > > > best regards, > Zeeshan Khan > ------------------------------------------------------------------------ > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From TYCHONG at ntu.edu.sg Thu Apr 28 00:39:23 2011 From: TYCHONG at ntu.edu.sg (Chong Tze Yuang) Date: Thu, 28 Apr 2011 15:39:23 +0800 Subject: [SRILM User List] classes-format question + followup question In-Reply-To: <4DB8658D.4080503@icsi.berkeley.edu> References: , <4DAF494F.1070006@icsi.berkeley.edu> , <4DB1E29B.4080400@icsi.berkeley.edu> <4DB8658D.4080503@icsi.berkeley.edu> Message-ID: Hi, Shouldn't we worry about expanding the class-LM into word-LM will make a extremely large LM? This is due to a class-LM comprises almost all possible n-gram combination. We might have another practical issue the decoder might not have enough memory for this large LM. Best, Chong -----Original Message----- From: srilm-user-bounces at speech.sri.com [mailto:srilm-user-bounces at speech.sri.com] On Behalf Of Andreas Stolcke Sent: Thursday, April 28, 2011 2:51 AM To: Fabian - Cc: srilm-user Subject: Re: [SRILM User List] classes-format question + followup question Fabian - wrote: > Hi Andreas, > > thank you again for the quick answer. Unfortunately didn't make myself > clear. I really want to interpolate one class LM and one word LM. > Where the classes are part-of-speech tags. So the question is, again, > why is static interpolation not correct/possible? Although the class LM mechanism in SRILM can handle ngrams over a mix of words and classes, empirically it does not work well to merge (statically interpolate) models where one is a purely word-based and the other is a class-based ngram LM. This is because ngram -mix-lm WITHOUT the -bayes 0 option does not just implement the standard interpolation of probability estimates, it also merges the ngrams used for backoff computation (this is explained in the 2002 Interspeech paper). 
This works fine, and usually improves the results when combining models of the same type, but merging a lower-order ngram with a lower-order class-based LM gives weird results because the class expansions is not applied at the backoff level when performing the merge. For this reason, the ngram man page says (see the last sentence): -mix-lm file Read a second N-gram model for interpolation purposes. The second and any additional interpolated models can also be class N-grams (using the same -classes definitions), but are otherwise con- strained to be standard N-grams, i.e., the options -df, -tagged, -skip, and -hidden-vocab do not apply to them. NOTE: Unless -bayes (see below) is specified, -mix-lm triggers a static interpolation of the mod- els in memory. In most cases a more efficient, dynamic interpolation is sufficient, requested by -bayes 0. Also, mixing models of different type (e.g., word-based and class-based) will only work correctly with dynamic interpolation. So you might just have to re-engineer your application to accept true interpolated LMs, or, if its feasible, convert the class-LM into a word-based LM with ngram -expand-classes BEFORE doing the merging of models. Sorry. Andreas > > > > thank you again for the quick help! > > > I added the smoothing and the PPL dropped to 720 which is a bit > > > better, but still above the range ~500 which would "feel" correct. > > > Anyways,... > > > > > You might want to verify that your probabilities are normalized > > correctly. Try ngram -debug 3 -ppl . > Well, it seems that the probabilities are not properly normalized -> > there are many warnings: > for example: > warning: word probs for this context sum to 2.48076 != 1 : ... > > > > > > > > > ...I have another question: > > > > > > why can't i use the static interpolation for interpolating one class > > > LMs and word LM? I use a class-based (from ngram-count) or one > > > class-based with my own tags with the word-based LM. In the > > > documentation it only says -mix-lm with static interpolation won't > > > work correct? > > I didn't realize you want to interpolate two class-based LMs. That > > should work, you just need to keep the class labels distinct, nad > > combine the class definition files in to one file. > > > I want to build interpolated LMs (with -write-lm) to use them in my > > > ASR, so far I simply used the static interpolation, which seems to > > > work more or less OK. > > You should be able to ngram -mix-lm -write-lm with two class-based LMs > > but WITHOUT using the -classes option when doing so. > > If you include the -classes it will be appended to the LM file. > > Fabian _______________________________________________ SRILM-User site list SRILM-User at speech.sri.com http://www.speech.sri.com/mailman/listinfo/srilm-user CONFIDENTIALITY: This email is intended solely for the person(s) named and may be confidential and/or privileged. If you are not the intended recipient, please delete it, notify us and do not copy, use, or disclose its content. Thank you. 
Towards A Sustainable Earth: Print Only When Necessary From stolcke at icsi.berkeley.edu Thu Apr 28 09:23:46 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 28 Apr 2011 09:23:46 -0700 Subject: [SRILM User List] classes-format question + followup question In-Reply-To: References: , <4DAF494F.1070006@icsi.berkeley.edu> , <4DB1E29B.4080400@icsi.berkeley.edu> <4DB8658D.4080503@icsi.berkeley.edu> Message-ID: <4DB99492.8010808@icsi.berkeley.edu> Chong Tze Yuang wrote: > Hi, > > Shouldn't we worry about expanding the class-LM into word-LM will make a extremely large LM? This is due to a class-LM comprises almost all possible n-gram combination. We might have another practical issue the decoder might not have enough memory for this large LM. > You should be concerned, that's why I wrote "if feasible" in my email. It's not a solution for large LMs. Andreas > Best, > Chong > > > > -----Original Message----- > From: srilm-user-bounces at speech.sri.com [mailto:srilm-user-bounces at speech.sri.com] On Behalf Of Andreas Stolcke > Sent: Thursday, April 28, 2011 2:51 AM > To: Fabian - > Cc: srilm-user > Subject: Re: [SRILM User List] classes-format question + followup question > > Fabian - wrote: > >> Hi Andreas, >> >> thank you again for the quick answer. Unfortunately didn't make myself >> clear. I really want to interpolate one class LM and one word LM. >> Where the classes are part-of-speech tags. So the question is, again, >> why is static interpolation not correct/possible? >> > Although the class LM mechanism in SRILM can handle ngrams over a mix of > words and classes, empirically it does not work well to merge > (statically interpolate) models where one is a purely word-based and the > other is a class-based ngram LM. This is because ngram -mix-lm WITHOUT > the -bayes 0 option does not just implement the standard interpolation > of probability estimates, it also merges the ngrams used for backoff > computation (this is explained in the 2002 Interspeech paper). This > works fine, and usually improves the results when combining models of > the same type, but merging a lower-order ngram with a lower-order > class-based LM gives weird results because the class expansions is not > applied at the backoff level when performing the merge. > > For this reason, the ngram man page says (see the last sentence): > > -mix-lm file > Read a second N-gram model for interpolation purposes. > The second and any additional interpolated > models can also be class N-grams (using the same -classes > definitions), but are otherwise con- > strained to be standard N-grams, i.e., the options -df, > -tagged, -skip, and -hidden-vocab do not > apply to them. > NOTE: Unless -bayes (see below) is specified, -mix-lm > triggers a static interpolation of the mod- > els in memory. In most cases a more efficient, dynamic > interpolation is sufficient, requested by > -bayes 0. Also, mixing models of different type (e.g., > word-based and class-based) will only work > correctly with dynamic interpolation. > > So you might just have to re-engineer your application to accept true > interpolated LMs, or, if its feasible, convert the class-LM into a > word-based LM with ngram -expand-classes BEFORE doing the merging of > models. Sorry. > > Andreas > > > >>>> thank you again for the quick help! >>>> I added the smoothing and the PPL dropped to 720 which is a bit >>>> better, but still above the range ~500 which would "feel" correct. >>>> Anyways,... 
>>>> >>>> >>> You might want to verify that your probabilities are normalized >>> correctly. Try ngram -debug 3 -ppl . >>> >> Well, it seems that the probabilities are not properly normalized -> >> there are many warnings: >> for example: >> warning: word probs for this context sum to 2.48076 != 1 : ... >> >> >>>> ...I have another question: >>>> >>>> why can't i use the static interpolation for interpolating one class >>>> LMs and word LM? I use a class-based (from ngram-count) or one >>>> class-based with my own tags with the word-based LM. In the >>>> documentation it only says -mix-lm with static interpolation won't >>>> work correct? >>>> >>> I didn't realize you want to interpolate two class-based LMs. That >>> should work, you just need to keep the class labels distinct, nad >>> combine the class definition files in to one file. >>> >>>> I want to build interpolated LMs (with -write-lm) to use them in my >>>> ASR, so far I simply used the static interpolation, which seems to >>>> work more or less OK. >>>> >>> You should be able to ngram -mix-lm -write-lm with two class-based LMs >>> but WITHOUT using the -classes option when doing so. >>> If you include the -classes it will be appended to the LM file. >>> >> Fabian >> > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user > > CONFIDENTIALITY: This email is intended solely for the person(s) named and may be confidential and/or privileged. If you are not the intended recipient, please delete it, notify us and do not copy, use, or disclose its content. Thank you. > > Towards A Sustainable Earth: Print Only When Necessary > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user > From adeoras at jhu.edu Fri Apr 29 12:47:43 2011 From: adeoras at jhu.edu (Anoop Deoras) Date: Fri, 29 Apr 2011 15:47:43 -0400 Subject: [SRILM User List] Inconsistency between mix-lm and compute-best-mix ? Message-ID: <1C7C6040-7261-4681-9B96-554D351FB039@jhu.edu> Hello, I am trying to interpolate two LMs and I see inconsistency in the outputs when 2 different methods are used for interpolation. I will explain my setup : I have two LMs: LM1 and LM2 and I have a text corpus TEXT Step 1: produce debug file using ngram tool with debug=2 option using LM1 and LM2. Lets call them DEBUG1 and DEBUG2 ngram -lm LM1 -order 4 -unk -vocab VOCAB -ppl TEXT -debug 2 > DEBUG1 ngram -lm LM2 -order 4 -unk -vocab VOCAB -ppl TEXT -debug 2 > DEBUG2 Step 2: Get the optimal weights using the command: compute-best-mix DEBUG1 DEBUG2 Let the final best perplexity obtained be denoted as PPL_Step2 Let the weights be LAMBDA, 1-LAMBDA Thus LAMBDA corresponds to LM1. Step3 : Combine LM1 and LM2 linearly with the weights found above and compute the PPL ngram -lm LM1 -order 4 -unk -vocab VOCAB -ppl TEXT -mix-lm LM2 - lambda LAMBDA Let the perplexity obtained be denoted as PPL_Step3 For my setup, PPL_Step3 turns out to be greater than PPL_Step2 and I don't understand why ? Am I missing something while combining the models ? Any pointers would be useful. Thanks and Regards Anoop From stolcke at icsi.berkeley.edu Fri Apr 29 23:28:21 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 29 Apr 2011 23:28:21 -0700 Subject: [SRILM User List] Inconsistency between mix-lm and compute-best-mix ? 
In-Reply-To: <1C7C6040-7261-4681-9B96-554D351FB039@jhu.edu>
References: <1C7C6040-7261-4681-9B96-554D351FB039@jhu.edu>
Message-ID: <4DBBAC05.6050400@icsi.berkeley.edu>

Anoop Deoras wrote:
> Hello,
>
> I am trying to interpolate two LMs and I see an inconsistency in the
> outputs when 2 different methods are used for interpolation.
>
> I will explain my setup:
>
> I have two LMs, LM1 and LM2, and I have a text corpus TEXT.
>
> Step 1: Produce debug files using the ngram tool with the debug=2 option
> for LM1 and LM2. Let's call them DEBUG1 and DEBUG2.
>
> ngram -lm LM1 -order 4 -unk -vocab VOCAB -ppl TEXT -debug 2 > DEBUG1
> ngram -lm LM2 -order 4 -unk -vocab VOCAB -ppl TEXT -debug 2 > DEBUG2
>
> Step 2: Get the optimal weights using the command:
> compute-best-mix DEBUG1 DEBUG2
> Let the final best perplexity obtained be denoted as PPL_Step2.
> Let the weights be LAMBDA, 1-LAMBDA.
> Thus LAMBDA corresponds to LM1.
>
> Step 3: Combine LM1 and LM2 linearly with the weights found above and
> compute the PPL.
>
> ngram -lm LM1 -order 4 -unk -vocab VOCAB -ppl TEXT -mix-lm LM2
> -lambda LAMBDA
> Let the perplexity obtained be denoted as PPL_Step3.
>
>
> For my setup, PPL_Step3 turns out to be greater than PPL_Step2 and I
> don't understand why. Am I missing something while combining the models?
> Any pointers would be useful.

To implement the standard type of linear interpolation that is optimized for by compute-best-mix, use

ngram -mix-lm .... -bayes 0

Without -bayes 0 you are actually merging the ngrams of the two models, and then computing the perplexity. The difference between the two kinds of interpolation is explained in the Interspeech '02 paper.

Andreas

>
> Thanks and Regards
> Anoop
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

From chenmengdx at gmail.com Tue May 3 00:31:34 2011
From: chenmengdx at gmail.com (Meng Chen)
Date: Tue, 3 May 2011 15:31:34 +0800
Subject: [SRILM User List] Generate HTK lattice from Class-based language model
Message-ID:

Hi, I want to generate the HTK lattice in order to do speech recognition with HVite. I have trained a bigram class-based language model and interpolated it with a bigram word-based language model. However, I don't know how to generate the HTK lattice from the interpolated language model. Can anyone tell me how to generate it in detail? Thanks!
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From stolcke at icsi.berkeley.edu Tue May 3 10:24:15 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 03 May 2011 10:24:15 -0700
Subject: [SRILM User List] Generate HTK lattice from Class-based language model
In-Reply-To:
References:
Message-ID: <4DC03A3F.30500@icsi.berkeley.edu>

Meng Chen wrote:
> Hi, I want to generate the HTK lattice in order to do speech
> recognition with HVite. I have trained a bigram class-based language
> model and interpolated it with a bigram word-based language model.
> However, I don't know how to generate the HTK lattice from the
> interpolated language model. Can anyone tell me how to generate it in
> detail?
This is a question for HTK people. SRILM does not generate lattices from acoustic data, it is only a language modeling toolkit.
Andreas From stolcke at icsi.berkeley.edu Tue May 3 19:53:08 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 03 May 2011 19:53:08 -0700 Subject: [SRILM User List] Generate HTK lattice from Class-based language model In-Reply-To: References: <4DC03A3F.30500@icsi.berkeley.edu> <4DC0B93C.9080208@icsi.berkeley.edu> Message-ID: <4DC0BF94.3020503@icsi.berkeley.edu> Meng Chen wrote: > Could I expand the class-based language model to word-based language > model first, then interpolate with the word-based language model to > work around? I mean I could generate the word lattices by HTK, because > there are only words in language model now. Yes, you can do that if the LM and the class membership is small enough to not run out of memory. Andreas > > On Wed, May 4, 2011 at 10:26 AM, Andreas Stolcke > > wrote: > > Meng Chen wrote: > > I wasn't going to generate lattices from acoustic data. I mean > how to generate word lattices from language model which > interpolated the class-based language model with word-based > language model. Should I expand the classes in the langugage > model? Or by some other method? > > You need a speech recognizer or some other similar constraint > (like a machine translation system with an input sentence) to > constrain the lattice generation. > > Andreas > > Thanks! > > > On Wed, May 4, 2011 at 1:24 AM, Andreas Stolcke > > >> wrote: > > Meng Chen wrote: > > Hi, I want to generete the HTK lattice in order to do > speech > recognition with HVite. I have trained a bigram Class-based > language model and interpolated with a bigram word-based > language model. However, I don't know how to generate > the HTK > lattice from the interplolated language model. Can > anyone tell > me how to generate it in details? > > This is a question for HTK people. SRILM does not generate > lattices from acoustic data, it is only a language modeling > toolkit. > > Andreas > > > > > From dresen at gmail.com Wed May 4 14:18:37 2011 From: dresen at gmail.com (=?ISO-8859-1?Q?Andreas_S=F8eborg_Kirkedal?=) Date: Wed, 4 May 2011 23:18:37 +0200 Subject: [SRILM User List] Fwd: trigger model In-Reply-To: <000e0cd1fbbee7ff0904a279b9a2@google.com> References: <000e0cd1fbbee7ff0904a279b9a2@google.com> Message-ID: Hi I have trained a 5-gram language model with SRILM. I would like to interpolate a trigger model to rerank translation hypotheses and I have computed the triggers. Since there is a built in way to interpolate a cache model, I wanted to ask whether this is already possible? Has this been done before? -Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Wed May 4 20:08:18 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 04 May 2011 20:08:18 -0700 Subject: [SRILM User List] Fwd: trigger model In-Reply-To: References: <000e0cd1fbbee7ff0904a279b9a2@google.com> Message-ID: <4DC214A2.8070704@icsi.berkeley.edu> Andreas S?eborg Kirkedal wrote: > Hi > > I have trained a 5-gram language model with SRILM. I would like to > interpolate a trigger model to rerank translation hypotheses and I > have computed the triggers. > > Since there is a built in way to interpolate a cache model, I wanted > to ask whether this is already possible? Has this been done before? > > > -Andreas There is a facility for interpolating a constant (typically large) LM with a second (smaller) LM that changes dynamically, potentially for each sentence. 
Typically, you would make the dynamic LM a unigram LM computed from the sentence history, or other contextual information (i.e., a cache or trigger LM). You have to train the various incarnations of the dynamic LM yourself, and then insert a special tag into the stream of input sentences to indicate where the LM should change. This mechanism is enabled by the ngram -dynamic and -dynamic-lambda options. See the man page for detail on how to invoke it. ngram -debug 1 will allow you to trace the changes in the dynamic LM. Andreas > > ------------------------------------------------------------------------ > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From aarthireddy1212 at yahoo.com Mon May 9 22:25:10 2011 From: aarthireddy1212 at yahoo.com (Aarthi) Date: Mon, 9 May 2011 22:25:10 -0700 (PDT) Subject: [SRILM User List] lattice-tool with concatenate option In-Reply-To: <1C7C6040-7261-4681-9B96-554D351FB039@jhu.edu> References: <1C7C6040-7261-4681-9B96-554D351FB039@jhu.edu> Message-ID: <344739.96922.qm@web65407.mail.ac4.yahoo.com> Hello Andreas I have a set of lattices that I would like to concatenate in a particular order: A1.lat, A2.lat, A3.lat are to be concatenated in that order to form A.lat I was able to concatenate A1.lat and A2.lat to form a lattice B.lat using this: lattice-tool -in-lattice A1.lat -in-lattice2 A2.lat -operation concatenate -out-lattice B.lat And then do: lattice-tool -in-lattice B.lat -in-lattice2 A3.lat -operation concatenate -out-lattice A.lat Is it possible to make this simpler?? Thanks, Aarthi -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Tue May 10 10:17:17 2011 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 10 May 2011 10:17:17 -0700 Subject: [SRILM User List] lattice-tool with concatenate option In-Reply-To: <344739.96922.qm@web65407.mail.ac4.yahoo.com> References: <1C7C6040-7261-4681-9B96-554D351FB039@jhu.edu> <344739.96922.qm@web65407.mail.ac4.yahoo.com> Message-ID: <4DC9731D.60600@icsi.berkeley.edu> Aarthi wrote: > Hello Andreas > > I have a set of lattices that I would like to concatenate in a > particular order: A1.lat, A2.lat, A3.lat are to be concatenated in > that order to form A.lat > > I was able to concatenate A1.lat and A2.lat to form a lattice B.lat > using this: > lattice-tool -in-lattice A1.lat -in-lattice2 A2.lat -operation > concatenate -out-lattice B.lat > > And then do: > lattice-tool -in-lattice B.lat -in-lattice2 A3.lat -operation > concatenate -out-lattice A.lat > > Is it possible to make this simpler? No. You're doing the right thing. If you are comfortable with C++ you can modify the concatenate function to accept a list of lattices. Andreas > Thanks, > Aarthi > ------------------------------------------------------------------------ > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From ariya at jhu.edu Sun May 22 18:53:39 2011 From: ariya at jhu.edu (Ariya Rastrow) Date: Sun, 22 May 2011 21:53:39 -0400 Subject: [SRILM User List] ARPA format for Ngram LMs with Jelinek-Mercer smoothing Message-ID: Hi, I have a question regarding building N-gram LMs with Jelinek-Mercer smoothing. I have optimized the weights using my own scripts on some held-out data and now I am trying to write out the ARPA backoff format of the LM. 
I have the N-gram probabilities and the corresponding weights for 1-grams, 2-grams, and 3-grams. I was wondering if I could use the SRILM toolkit to get the ARPA representation of my LM. I have tried the ngram tool with the -count-lm option along with -write, but then it only writes out the LM in the header format described under the -count-lm option. I know this is an easy task and one can use the weights as the backoff weights to get the ARPA format. Any help would be appreciated.

Thanks,
Ariya
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From gouwsmeister at gmail.com Sun May 22 21:18:38 2011
From: gouwsmeister at gmail.com (Stephan Gouws)
Date: Sun, 22 May 2011 21:18:38 -0700
Subject: [SRILM User List] Lattice decoding problems
Message-ID:

Hi,

I am attempting to decode the most likely sentence, given a confusion network of possible words. I wrote my own script to encode this confusion net ("sausage") in the PFSG format, and it seems to work fine. However, in roughly one out of two cases, I get an error in the decoding process.

I first compute the confusion net probabilities based on my own heuristics (standard, normalised probability scores), which are then converted via the formula 10000.5*log(p) given in the documentation into log-probs. Thereafter I rescore using a language model run in server mode, by passing the PFSG via stdin, as follows:

"lattice-tool -debug 9 -in-lattice - -posterior-decode -zeroprob-word blemish -use-server 12345 at 127.0.0.1"

However, one out of two times it breaks and I am not sure why. Here is sample output with debug turned on as above:

server 12345 at 127.0.0.1: probserver ready
Lattice::readPFSGs: reading in nested PFSGs...
Lattice::readPFSG: reading in PFSG....
Lattice::expandToLM: starting expansion to general LM (maxNodes = 0) ...
In Lattice::removeNode: remove node 0
In Lattice::removeNode: remove node 1
In Lattice::removeNode: remove node 2
In Lattice::removeNode: remove node 3
In Lattice::removeNode: remove node 4
In Lattice::removeNode: remove node 5
In Lattice::removeNode: remove node 6
In Lattice::removeNode: remove node 7
In Lattice::removeNode: remove node 8
In Lattice::removeNode: remove node 9
In Lattice::removeNode: remove node 10
In Lattice::removeNode: remove node 11
In Lattice::removeNode: remove node 12
In Lattice::removeNode: remove node 18
In Lattice::removeNode: remove node 17
In Lattice::removeNode: remove node 16
In Lattice::removeNode: remove node 15
In Lattice::removeNode: remove node 14
In Lattice::removeNode: remove node 13
In Lattice::removeNode: remove node 19
In Lattice::removeNode: remove node 20
Lattice::computeForwardBackward: processing (posterior scale = 8)
Lattice::computeForwardBackward: unnormalized posterior = -6.83499
max-min path posterior = -7.24399

And sometimes I get something like the following, which I don't totally understand:

"WordMesh::normalizeDeletes: word posteriors exceed total: 1.695"

I suspect that there might be some error in how I compute the log-probs, or after rescoring with the LM. However I am not sure what. Any ideas?

Thanks
Stephan

From stolcke at icsi.berkeley.edu Sun May 22 22:45:21 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sun, 22 May 2011 22:45:21 -0700
Subject: [SRILM User List] ARPA format for Ngram LMs with Jelinek-Mercer smoothing
In-Reply-To:
References:
Message-ID: <4DD9F471.5070900@icsi.berkeley.edu>

Ariya Rastrow wrote:
> Hi,
> I have a question regarding building N-gram LMs with Jelinek-Mercer
> smoothing.
Andreas

From ariya at jhu.edu Sun May 22 18:53:39 2011
From: ariya at jhu.edu (Ariya Rastrow)
Date: Sun, 22 May 2011 21:53:39 -0400
Subject: [SRILM User List] ARPA format for Ngram LMs with Jelinek-Mercer smoothing

Hi,

I have a question regarding building N-gram LMs with Jelinek-Mercer smoothing. I have optimized the weights using my own scripts on some held-out data and now I am trying to write out the ARPA backoff format of the LM. I have the N-gram probabilities and the corresponding weights for 1-grams, 2-grams and 3-grams. I was wondering if I could use the SRILM toolkit to get the ARPA representation of my LM. I have tried the ngram script with the -count-lm option along with -write, but then the script only writes out the LM as a header file, as described under the -count-lm option. I know this is an easy task and one can use the weights as the backoff weights to get the ARPA format. Any help would be appreciated.

Thanks,
Ariya

From gouwsmeister at gmail.com Sun May 22 21:18:38 2011
From: gouwsmeister at gmail.com (Stephan Gouws)
Date: Sun, 22 May 2011 21:18:38 -0700
Subject: [SRILM User List] Lattice decoding problems

Hi,

I am attempting to decode the most likely sentence, given a confusion network of possible words. I wrote my own script to encode this confusion net ("sausage") in the PFSG format, and it seems to work fine. However, in roughly one out of two cases, I get an error in the decoding process.

I first compute the confusion net probabilities based on my own heuristics (standard, normalised probability scores), which are then converted into log probs via the formula 10000.5 * log(p) given in the documentation. Thereafter I rescore using a language model run in server mode, by passing the PFSG via stdin, as follows:

lattice-tool -debug 9 -in-lattice - -posterior-decode -zeroprob-word blemish -use-server 12345@127.0.0.1

However, one out of two times it breaks and I am not sure why. Here is sample output with debug turned on as above:

server 12345@127.0.0.1: probserver ready
Lattice::readPFSGs: reading in nested PFSGs...
Lattice::readPFSG: reading in PFSG....
Lattice::expandToLM: starting expansion to general LM (maxNodes = 0) ...
In Lattice::removeNode: remove node 0
In Lattice::removeNode: remove node 1
In Lattice::removeNode: remove node 2
In Lattice::removeNode: remove node 3
In Lattice::removeNode: remove node 4
In Lattice::removeNode: remove node 5
In Lattice::removeNode: remove node 6
In Lattice::removeNode: remove node 7
In Lattice::removeNode: remove node 8
In Lattice::removeNode: remove node 9
In Lattice::removeNode: remove node 10
In Lattice::removeNode: remove node 11
In Lattice::removeNode: remove node 12
In Lattice::removeNode: remove node 18
In Lattice::removeNode: remove node 17
In Lattice::removeNode: remove node 16
In Lattice::removeNode: remove node 15
In Lattice::removeNode: remove node 14
In Lattice::removeNode: remove node 13
In Lattice::removeNode: remove node 19
In Lattice::removeNode: remove node 20
Lattice::computeForwardBackward: processing (posterior scale = 8)
Lattice::computeForwardBackward: unnormalized posterior = -6.83499
max-min path posterior = -7.24399

And sometimes I get something like the following, which I don't totally understand:

"WordMesh::normalizeDeletes: word posteriors exceed total: 1.695"

I suspect that there might be some error in how I compute the log probs, or after rescoring with the LM. However, I am not sure what. Any ideas?

Thanks
Stephan

From stolcke at icsi.berkeley.edu Sun May 22 22:45:21 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sun, 22 May 2011 22:45:21 -0700
Subject: [SRILM User List] ARPA format for Ngram LMs with Jelinek-Mercer smoothing
Message-ID: <4DD9F471.5070900@icsi.berkeley.edu>

Ariya Rastrow wrote:
> Hi,
> I have a question regarding building N-gram LMs with Jelinek-Mercer smoothing. [...]
> I know this is an easy task and one can use the weights as the backoff weights to get the ARPA format. Any help would be appreciated.

If you know how to create the count-LM then you're halfway there.

To get a backoff LM you can first train a backoff LM using one of the standard LM smoothing methods (say GT, the default), then use the count-LM (previously created) to "rescore" the probabilities in the backoff LM (ngram -rescore-ngram option). However, be aware this only approximates the interpolated LM, but the approximation is exact for all ngrams contained in the training data.
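A sketch of the two steps (the file names are hypothetical; see the ngram(1) man page for the exact behavior of -rescore-ngram):

ngram-count -text train.txt -order 3 -lm gt.3bo
ngram -order 3 -count-lm -lm jm.countlm -rescore-ngram gt.3bo -write-lm jm-approx.3bo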
Andreas

From stolcke at icsi.berkeley.edu Mon May 23 11:02:11 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 23 May 2011 11:02:11 -0700
Subject: [SRILM User List] SRILM parameters
In-Reply-To: <4DDA3DD2.50409@gmail.com>
References: <4C331F45.5030509@gmail.com> <4C334A3C.7060007@speech.sri.com> <4DDA3DD2.50409@gmail.com>
Message-ID: <4DDAA123.3040705@icsi.berkeley.edu>

Casey Kennington wrote:
> Andreas,
>
> I've emailed you before and you were quick to respond, so I am hoping I will get another quick response! This time it's about how SRILM calculates perplexity. I know the FAQ says:
>
> ppl = 10^(-logprob / (words - OOVs + sentences))
>
> But what exactly is your logprob? Is that a single logprob of the entire eval corpus? Is it an average logprob of all sentences?

logprob is what ngram -ppl prints out after "logprob=", i.e., the sum of the log probabilities over the entire test set.
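For example, with hypothetical numbers: if ngram -ppl reports logprob= -2000 over 1000 words with 10 OOVs and 50 sentences, then

ppl = 10^(2000 / (1000 - 10 + 50)) = 10^1.923, or about 84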
Andreas

PS. Please direct future questions to the mailing list.

From mshamsuddeen2 at gmail.com Tue May 24 01:20:13 2011
From: mshamsuddeen2 at gmail.com (Muhammad Shamsuddeen Muhammad)
Date: Tue, 24 May 2011 16:20:13 +0800
Subject: [SRILM User List] Remove from Mailing List

I would like to remove myself from this mailing list, so can you assist me on how to do that.

--
Muhammad Shamsuddeen Muhammad
"There is No Knowledge That is Not Power".

From stolcke at icsi.berkeley.edu Tue May 24 07:45:24 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 24 May 2011 16:45:24 +0200
Subject: [SRILM User List] Language model Adaptation Algorithm
Message-ID: <4DDBC484.4090001@icsi.berkeley.edu>

Mehdi hoseini wrote:
> hi
> I want to use SRILM for my language model adaptation. Is this a good procedure for my purpose?
>
> 1: Make a language model on my adaptation data (task-specific data): ForegroundLM.txt
> 2: Make a language model on my general data: BackgroundLM.txt
> 3: Combine these two language models with this command:
>    ngram -lm ForegroundLM.txt -mix-lm BackgroundLM.txt -lambda L -write-lm ADAPTED-LM.txt

This is the most popular approach, yes.

> but how can I find an optimal lambda coefficient? I found "compute-best-mix" but I really didn't get how to use it!
> Is there any better way to use SRILM for language model adaptation?

Please see the ppl-scripts(1) man page (or google "compute-best-mix"). The input to compute-best-mix is the output of ngram -debug 2 -ppl for the two LMs on a held-out tuning set.
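A sketch of the full recipe (file names are hypothetical):

ngram -lm ForegroundLM.txt -ppl heldout.txt -debug 2 > fg.ppl
ngram -lm BackgroundLM.txt -ppl heldout.txt -debug 2 > bg.ppl
compute-best-mix fg.ppl bg.ppl

The last command iterates to the mixture weights that minimize perplexity on the held-out set; the reported weight for the first model is the value to pass to -lambda.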
Andreas

> Best regards
> Mehdi hoseini

From stolcke at icsi.berkeley.edu Tue May 24 08:03:52 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 24 May 2011 17:03:52 +0200
Subject: [SRILM User List] Lattice decoding problems
Message-ID: <4DDBC8D8.1000903@icsi.berkeley.edu>

Stephan Gouws wrote:
> Hi,
>
> I am attempting to decode the most likely sentence, given a confusion network of possible words. I wrote my own script to encode this confusion net ("sausage") in the PFSG format, and it seems to work fine. However, in roughly one out of two cases, I get an error in the decoding process.
>
> I first compute the confusion net probabilities based on my own heuristics (standard, normalised probability scores), which are then converted into log probs via the formula 10000.5 * log(p) given in the documentation. Thereafter I rescore using a language model run in server mode, by passing the PFSG via stdin, as follows:

Lattice-tool has the -read-mesh option which allows it to read CNs directly.

> lattice-tool -debug 9 -in-lattice - -posterior-decode -zeroprob-word blemish -use-server 12345@127.0.0.1
>
> However, one out of two times it breaks and I am not sure why. Here is sample output with debug turned on as above:
>
> server 12345@127.0.0.1: probserver ready
> Lattice::readPFSGs: reading in nested PFSGs...
> Lattice::readPFSG: reading in PFSG....
> Lattice::expandToLM: starting expansion to general LM (maxNodes = 0) ...
> In Lattice::removeNode: remove node 0
> [...]
> In Lattice::removeNode: remove node 20
> Lattice::computeForwardBackward: processing (posterior scale = 8)
> Lattice::computeForwardBackward: unnormalized posterior = -6.83499
> max-min path posterior = -7.24399

Up to here there are no error messages, only debugging and informational messages.

> And sometimes I get something like the following, which I don't totally understand:
>
> "WordMesh::normalizeDeletes: word posteriors exceed total: 1.695"

This is from a routine that performs a sanity check on the CNs to make sure the sum of all words plus the deletion (null word) posterior adds up to 1. It means that the word posteriors exceed 1, which shouldn't happen. To find out why, you should find a (preferably small) test case, and write out the CNs that are created (write-mesh), both without and with LM rescoring.
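Something along these lines, assuming a small test CN in test.cn (the server address is from your setup):

lattice-tool -read-mesh -in-lattice test.cn -write-mesh out-nolm.cn
lattice-tool -read-mesh -in-lattice test.cn -use-server 12345@127.0.0.1 -write-mesh out-lm.cn

Comparing out-nolm.cn and out-lm.cn should show where the posteriors start to misbehave.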
Andreas

From gouwsmeister at gmail.com Tue May 24 12:53:03 2011
From: gouwsmeister at gmail.com (Stephan Gouws)
Date: Tue, 24 May 2011 12:53:03 -0700
Subject: [SRILM User List] Lattice decoding problems
In-Reply-To: <4DDBC8D8.1000903@icsi.berkeley.edu>
References: <4DDBC8D8.1000903@icsi.berkeley.edu>

> Lattice-tool has the -read-mesh option which allows it to read CNs directly.

Thank you for the reply, Andreas. I am going to use SRILM's -read-mesh function. Just to be very clear on the mesh format:

From the documentation, the format is given as

name s
numaligns N
posterior P
align a w1 p1 w2 p2 ...

Now, please correct me where I am wrong here:
- name s can be any string, e.g. name "somename". Do I need quotes?
- numaligns == the number of confusion sets in the CN, plus the initial and end nodes? Do I need explicit initial and end nodes?
- what exactly is P?
- a gives the current confusion set position, starting with 0 for "initial", 1 for the next, etc, and N-1 for "final"?
- each individual confusion set's pi's must sum to 1?

So for this CN:
[
  [(0.2, "a"), (0.8, "b")],
  [(0.3, "c"), (0.7, "d")]
],
I would encode it as:

name "somename"
numaligns 4
posterior P
align 0 "initial" 1.0
align 1 "a" 0.2 "b" 0.8
align 2 "c" 0.3 "d" 0.7
align 3 "final" 1.0

Is this correct? And how do I compute P?

Thank you very much for your help!
Stephan

From fabian_in_hongkong at hotmail.com Tue May 24 13:26:44 2011
From: fabian_in_hongkong at hotmail.com (Fabian -)
Date: Tue, 24 May 2011 22:26:44 +0200
Subject: [SRILM User List] class perplexity -debug 2 output

Hello,

I used "-debug 2" during the perplexity computation of a word language model and a class language model, and don't understand the output.

I start with:

ngram -lm lm.3g -order 3 -mix-lm lm300 -lambda 0.518735 -bayes 0 -classes classes300 -ppl dev.text -debug 2

(lm.3g is the word LM, lm300 the class LM) and get:

? ?? ? ?? ...
p( ? | <s> ) = [OOV][2gram][2gram][OOV] 0.00193608 [ -2.71308 ]
p( ?? | ? ...) = [OOV][1gram][OOV][2gram][1gram][OOV][1gram][OOV] 7.96669e-05 [ -4.09872 ]
p( ? | ?? ...) = [OOV][1gram][OOV][2gram][1gram][OOV][1gram][OOV] 0.00486257 [ -2.31313 ]
p( ?? | ? ...) = [OOV][1gram][OOV][2gram][1gram][OOV][1gram][OOV] 0.000237892 [ -3.62362 ]
p( ? | ?? ...) = [OOV][1gram][OOV][2gram][2gram][OOV][1gram][OOV] 0.00242938 [ -2.6145 ]
...

Are the words covered by 1-grams/2-grams or OOV?

Thank you!

-Fabian

From stolcke at icsi.berkeley.edu Tue May 24 14:22:40 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 24 May 2011 23:22:40 +0200
Subject: [SRILM User List] Lattice decoding problems
References: <4DDBC8D8.1000903@icsi.berkeley.edu>
Message-ID: <4DDC21A0.50003@icsi.berkeley.edu>

Stephan Gouws wrote:
> Thank you for the reply, Andreas. I am going to use SRILM's -read-mesh function. Just to be very clear on the mesh format:
> [...]
> - what exactly is P?

The total posterior probability mass represented by the CN. This is usually 1 but could be something else in certain scenarios.

> - a gives the current confusion set position, starting with 0 for "initial", 1 for the next, etc, and N-1 for "final"?

Create a CN from a simple nbest list, e.g.

nbest-lattice -nbest $SRILM/lm/test/tests/nbest-rover/nbest-lists/sw_40008_A_0003136_0003462.score.gz -use-mesh -write -

and the answers to the above questions will be obvious.

> - each individual confusion set's pi's must sum to 1?

yes, though this is not enforced when the file is read in.
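For what it's worth, a plausible rendering of the two-set example above, assuming word tokens are written unquoted and the usual sentence-boundary tokens delimit the CN (verify against the nbest-lattice output):

name somename
numaligns 4
posterior 1
align 0 <s> 1
align 1 a 0.2 b 0.8
align 2 c 0.3 d 0.7
align 3 </s> 1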
Andreas

From stolcke at icsi.berkeley.edu Tue May 24 14:31:08 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 24 May 2011 23:31:08 +0200
Subject: [SRILM User List] class perplexity -debug 2 output
Message-ID: <4DDC239C.9060304@icsi.berkeley.edu>

Someone else asked this question recently:

http://www-speech.sri.com/pipermail/srilm-user/2011q1/000979.html

FYI, you could have found this page by web-searching for "class ngram ppl output".

Andreas

Fabian - wrote:
> I used "-debug 2" during the perplexity computation of a word language model and a class language model, and don't understand the output.
> [...]
> Are the words covered by 1-grams/2-grams or OOV?

From ariya at jhu.edu Fri May 27 11:25:13 2011
From: ariya at jhu.edu (Ariya Rastrow)
Date: Fri, 27 May 2011 14:25:13 -0400
Subject: [SRILM User List] ARPA format for Ngram LMs with Jelinek-Mercer smoothing
In-Reply-To: <4DD9F471.5070900@icsi.berkeley.edu>
References: <4DD9F471.5070900@icsi.berkeley.edu>

On Mon, May 23, 2011 at 1:45 AM, Andreas Stolcke wrote:
> To get a backoff LM you can first train a backoff LM using one of the standard LM smoothing methods (say GT, the default), then use the count-LM (previously created) to "rescore" the probabilities in the backoff LM (ngram -rescore-ngram option). However, be aware this only approximates the interpolated LM, but the approximation is exact for all ngrams contained in the training data.
>
> Andreas

The reason I wanted to get the ARPA format for the Jelinek-Mercer smoothed LM was to be able to load it in C++ code. I understand the ARPA format would be an approximation, as you mentioned. Can you please let me know what the best way would be to load the N-grams and their probabilities along with the interpolation weights in C++ code and perhaps do the interpolation on the fly? Basically my question is how to use a Jelinek-Mercer LM in C++ code given the fact that I already have the weights and N-gram probabilities (I can make the header file as in -count-lm)?

Thanks,
Ariya

From stolcke at icsi.berkeley.edu Fri May 27 14:22:29 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 27 May 2011 23:22:29 +0200
Subject: [SRILM User List] ARPA format for Ngram LMs with Jelinek-Mercer smoothing
References: <4DD9F471.5070900@icsi.berkeley.edu>
Message-ID: <4DE01615.1000402@icsi.berkeley.edu>

Ariya Rastrow wrote:
> The reason I wanted to get the ARPA format for the Jelinek-Mercer smoothed LM was to be able to load it in C++ code. [...]
> Basically my question is how to use a Jelinek-Mercer LM in C++ code given the fact that I already have the weights and N-gram probabilities (I can make the header file as in -count-lm)?

The whole point of SRILM is to be able to link with C++ through the API. You just need to instantiate the Vocab class and the NgramCountLM class, invoke the read() method, and then use the wordProb() function to obtain conditional probabilities. The man pages for Vocab(3) and LM(3) describe the interface.
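A minimal sketch of that (untested, and the LM file name is hypothetical; the man pages are the authoritative reference):

#include <stdio.h>
#include "Vocab.h"
#include "NgramCountLM.h"
#include "File.h"
#include "Prob.h"

int main()
{
    Vocab vocab;
    NgramCountLM lm(vocab, 3);          /* order-3 count-LM */

    File lmFile("jm.countlm", "r");     /* the -count-lm description file */
    if (!lm.read(lmFile)) {
        fprintf(stderr, "could not read LM\n");
        return 1;
    }

    /* SRILM contexts are given most-recent-word-first and are
     * terminated by Vocab_None; words not in the vocabulary
     * come back from getIndex() as Vocab_None. */
    VocabIndex context[3];
    context[0] = vocab.getIndex("is");
    context[1] = vocab.getIndex("this");
    context[2] = Vocab_None;

    LogP lp = lm.wordProb(vocab.getIndex("a"), context);
    printf("log10 p(a | this is) = %g\n", lp);
    return 0;
}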
Andreas

From stolcke at icsi.berkeley.edu Mon May 30 16:56:34 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 30 May 2011 16:56:34 -0700
Subject: [SRILM User List] Backoff question
In-Reply-To: Your message of Wed, 20 Apr 2011 20:18:12 +0200. <4DAF2364.6060906@tmit.bme.hu>
Message-ID: <201105302356.p4UNuYN9011471@fruitcake.ICSI.Berkeley.EDU>

In message <4DAF2364.6060906 at tmit.bme.hu> you wrote:
> Hi Andreas,
>
> I'd have a question about backoff weights in SRILM. I know they are weights, and not probabilities, but sometimes they become extremely large (e.g., log(BO) = 6) and the converted WFST language model behaves in an unusual way.
>
> I've made a dummy corpus to illustrate my problem. The corpus is in text_ab_1000.txt, the resulting counts and ARPA LM are in text_ab_1000.out and text_ab_1000.out.arpa, and the problematic part of the resulting WFST is in text_ab_1000.jpg.
>
> I have only two symbols, "a" and "b". Having both (b|aaa) and (a|aaa) 4-grams, the (aaa) backoff weight would be unnecessary, but if I build the WFST, there is the backoff link from (a_a_a) to (a_a). In this way I can get from (a_a_a) to (a_a_b) in two ways: b+eps, or BO+b+eps.
> The first route has the weight -2.10037 = log(p(b|aaa)).
> The second route has the weight 3.66358 + (-2.1038) = log( BO(aaa)*p(b|aa) ) = 1.56, which is abnormally high.
>
> My question is whether the backoff weight should have a lower value, or whether the WFST network is incorrectly built?
>
> Thanks for your advice,
> Tibor Fegyó

Tibor,

there was a problem in the BOW computation when the probabilities from the lower-order distribution add up to almost 1. This causes the BOW denominator to be near zero, and causes anomalous values.
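For reference, the standard backoff-weight computation for a context h is

BOW(h) = (1 - sum over seen w of p(w|h)) / (1 - sum over seen w of p(w|h'))

where the sums run over the words observed after h and h' is h with its first word dropped; when the lower-order probabilities of those words sum to nearly 1, the denominator goes to zero and the BOW blows up.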
You can apply the appended patch. It also enables debugging information for the BOW computation with -debug 3 or higher. See if this fixes your problem.

Andreas

*** lm/src/NgramLM.cc	28 Sep 2010 20:17:24 -0000	1.121
--- lm/src/NgramLM.cc	30 May 2011 23:46:38 -0000	1.122
***************
*** 2039,2045 ****
  	    denominator = 0.0;
  	}
  
! 	if (denominator == 0.0 && numerator > Prob_Epsilon) {
  	    /*
  	     * Backoff distribution has no probability left.  To avoid wasting
  	     * probability mass scale the N-gram probabilities to sum to 1.
--- 2039,2045 ----
  	    denominator = 0.0;
  	}
  
! 	if (denominator < Prob_Epsilon && numerator > Prob_Epsilon) {
  	    /*
  	     * Backoff distribution has no probability left.  To avoid wasting
  	     * probability mass scale the N-gram probabilities to sum to 1.
***************
*** 2055,2060 ****
--- 2055,2061 ----
  		*prob += scale;
  	    }
  
+ 	    denominator = 0.0;
  	    numerator = 0.0;
  	    return true;
  	} else if (numerator < 0.0) {
***************
*** 2118,2124 ****
  	 */
  	if (order == 0 /*&& numerator > 0.0*/) {
  	    distributeProb(numerator, context);
! 	} else if (numerator == 0.0 && denominator == 0) {
  	    node->bow = LogP_One;
  	} else {
  	    node->bow = ProbToLogP(numerator) - ProbToLogP(denominator);
--- 2119,2125 ----
  	 */
  	if (order == 0 /*&& numerator > 0.0*/) {
  	    distributeProb(numerator, context);
! 	} else if (numerator == 0.0 && denominator == 0.0) {
  	    node->bow = LogP_One;
  	} else {
  	    node->bow = ProbToLogP(numerator) - ProbToLogP(denominator);
***************
*** 2130,2135 ****
--- 2131,2144 ----
  	    node->bow = LogP_Zero;
  	    result = false;
  	}
+ 
+ 	if (debug(DEBUG_ESTIMATES)) {
+ 	    dout() << "CONTEXT " << (vocab.use(), context)
+ 		   << " numerator " << numerator
+ 		   << " denominator " << denominator
+ 		   << " BOW " << node->bow
+ 		   << endl;
+ 	}
      }
  
      return result;

From maralthemoral at gmail.com Mon Jun 6 04:52:16 2011
From: maralthemoral at gmail.com (Maral Sh.)
Date: Mon, 6 Jun 2011 07:52:16 -0400
Subject: [SRILM User List] Ngram-count and other files missing

Hi,

I've been trying to install SRILM for a while now. I've had a good share of errors here and there and solved some of them with the help of your mailing list and other websites. Still, I have a problem: when I run the make World command, the bin, include, and lib folders are created, and the folder i686 (without gcc in the name) is created in the bin folder with some files in it, but the ngram-count binary, among some other files which I'm not sure what they are, is not being created! I can't figure out where the problem is. I would very much appreciate it if someone could help me solve this problem!

Thanks in advance
Maral

From stolcke at icsi.berkeley.edu Mon Jun 6 08:14:52 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 06 Jun 2011 08:14:52 -0700
Subject: [SRILM User List] Ngram-count and other files missing
Message-ID: <4DECEEEC.1040908@icsi.berkeley.edu>

Maral Sh. wrote:
> I've been trying to install SRILM for a while now. [...] I can't figure out where the problem is.

If you cannot solve the problem, please follow FAQ instructions A1 item f! Without it there is no way anyone can help.

Andreas

From stolcke at icsi.berkeley.edu Wed Jun 15 11:07:12 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 15 Jun 2011 11:07:12 -0700
Subject: [SRILM User List] SRILM bigram VS HTK bigram
Message-ID: <4DF8F4D0.7050305@icsi.berkeley.edu>

Mehdi hoseini wrote:
> hi
> I made a bigram language model on a small document with both the HTK toolkit and the SRILM toolkit, but unfortunately their accuracies in HTK-based ASR are quite different. I mean the SRILM bigram has 10 percent lower accuracy than the one built with HTK. Can you help me find where my mistake is?
>
> Here is my command for building the bigram in SRILM:
>
> ngram-count -text sentences.txt -order 2 -wbdiscount 1 -wbdiscount 2 -lm bigram.txt
>
> sentences.txt has 405 sentences.
>
> I built my acoustic model based on left-to-right HMMs with 2 Gaussian mixtures for triphones using HTK.

It is likely that HTK requires some mapping of vocabulary items for begin/end of sentence. Also, are you sure that the smoothing method used by the HTK LM tools is the same as you used with SRILM?

I don't really have experience building LMs for HTK, so you should inquire on the HTK user forum about this. I know there are plenty of people using SRILM in conjunction with HTK.

Andreas

From mehdi_hoseini at comp.iust.ac.ir Wed Jun 15 13:26:52 2011
From: mehdi_hoseini at comp.iust.ac.ir (Mehdi hoseini)
Date: Wed, 15 Jun 2011 23:56:52 +0330
Subject: [SRILM User List] SRILM bigram VS HTK bigram

hi

I made a bigram language model on a small document with both the HTK toolkit and the SRILM toolkit, but unfortunately their accuracies in HTK-based ASR are quite different. I mean the SRILM bigram has 10 percent lower accuracy than the one built with HTK. Can you help me find where my mistake is?

Here is my command for building the bigram in SRILM:

ngram-count -text sentences.txt -order 2 -wbdiscount 1 -wbdiscount 2 -lm bigram.txt

sentences.txt has 405 sentences.

I built my acoustic model based on left-to-right HMMs with 2 Gaussian mixtures for triphones using HTK. Someone said sentences.txt and I build my model on that.

Best Regards

From andersson at disi.unitn.it Mon Jun 20 04:41:52 2011
From: andersson at disi.unitn.it (Simon Andersson)
Date: Mon, 20 Jun 2011 13:41:52 +0200
Subject: [SRILM User List] Cache model, ngram server
Message-ID: <000001cc2f3f$0cf85c30$26e91490$@unitn.it>

Hello Andreas,

I want to use a cache LM (actually, an interpolated LM with a background trigram LM + unigram cache) with PocketSphinx. As far as I can see there is no way to write out the cache LM, but it should be possible to use it by running ngram with the server option.

So what I'll do then is...

1) Modify PocketSphinx to connect to the ngram probability server (this feature is available in Sphinx 4 but not in PocketSphinx)
2) Run ngram with the server option

Or is there an easier way? I just wanted to check with you that I've understood this correctly and that I'm not doing unnecessary work :-)

Thanks,
- Simon
From s.bakhshaei at yahoo.com Mon Jun 20 06:45:58 2011
From: s.bakhshaei at yahoo.com (Somayeh Bakhshaei)
Date: Mon, 20 Jun 2011 06:45:58 -0700 (PDT)
Subject: [SRILM User List] Arpa LM
Message-ID: <602798.91150.qm@web111723.mail.gq1.yahoo.com>

Dear All,

Hello,

I have a question:
Are the LMs made by SRILM in ARPA format?
How can I make an LM which is .DMP?

Thanks,
------------------
Best Regards,
S.Bakhshaei

From stolcke at icsi.berkeley.edu Mon Jun 20 10:15:46 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 20 Jun 2011 10:15:46 -0700
Subject: [SRILM User List] Arpa LM
In-Reply-To: <602798.91150.qm@web111723.mail.gq1.yahoo.com>
References: <602798.91150.qm@web111723.mail.gq1.yahoo.com>
Message-ID: <4DFF8042.40307@icsi.berkeley.edu>

Somayeh Bakhshaei wrote:
> Are the LMs made by SRILM in ARPA format?

Yes!

> How can I make an LM which is .DMP?

What is DMP? Some application-specific binary format? It is definitely not supported by SRILM.

Andreas

From stolcke at icsi.berkeley.edu Mon Jun 20 10:18:06 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 20 Jun 2011 10:18:06 -0700
Subject: [SRILM User List] Cache model, ngram server
In-Reply-To: <000001cc2f3f$0cf85c30$26e91490$@unitn.it>
References: <000001cc2f3f$0cf85c30$26e91490$@unitn.it>
Message-ID: <4DFF80CE.7000605@icsi.berkeley.edu>

Simon Andersson wrote:
> I want to use a cache LM (actually, an interpolated LM with a background trigram LM + unigram cache) with PocketSphinx. As far as I can see there is no way to write out the cache LM, but it should be possible to use it by running ngram with the server option.
>
> So what I'll do then is...
>
> 1) Modify PocketSphinx to connect to the ngram probability server (this feature is available in Sphinx 4 but not in PocketSphinx)
> 2) Run ngram with the server option
>
> Or is there an easier way? I just wanted to check with you that I've understood this correctly and that I'm not doing unnecessary work :-)

That is exactly what you would need to do.
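For the server side, the invocation might look something like this (the port number and interpolation weights are hypothetical):

ngram -lm background.3bo -order 3 -cache 200 -cache-lambda 0.2 -server-port 12345

Whether the cache is updated the way you intend when probabilities are served one n-gram at a time is worth verifying on a small example.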
I didn't know that Sphinx 4 supports the SRILM client/server protocol. Can you point us to some documentation?

Andreas

From andersson at disi.unitn.it Mon Jun 20 11:29:42 2011
From: andersson at disi.unitn.it (Simon Andersson)
Date: Mon, 20 Jun 2011 20:29:42 +0200 (CEST)
Subject: [SRILM User List] Cache model, ngram server
In-Reply-To: <4DFF80CE.7000605@icsi.berkeley.edu>
References: <000001cc2f3f$0cf85c30$26e91490$@unitn.it> <4DFF80CE.7000605@icsi.berkeley.edu>
Message-ID: <36330.127.0.0.1.1308594582.squirrel@mail.disi.unitn.it>

Nickolay Shmyrev reports that he included the feature in Sphinx 4:

http://nsh.nexiwave.com/2009/11/using-srilm-server-in-sphinx4.html

(He also confirmed to me that it is not in PocketSphinx.)

I'll use Nickolay's code as a reference when making a PocketSphinx version.

- Simon

From nshmyrev at yandex.ru Mon Jun 20 12:15:10 2011
From: nshmyrev at yandex.ru (Nickolay V. Shmyrev)
Date: Mon, 20 Jun 2011 23:15:10 +0400
Subject: [SRILM User List] Cache model, ngram server
In-Reply-To: <36330.127.0.0.1.1308594582.squirrel@mail.disi.unitn.it>
References: <000001cc2f3f$0cf85c30$26e91490$@unitn.it> <4DFF80CE.7000605@icsi.berkeley.edu> <36330.127.0.0.1.1308594582.squirrel@mail.disi.unitn.it>
Message-ID: <1308597310.27627.60.camel@localhost.localdomain>

On Mon, 20/06/2011 at 20:29 +0200, Simon Andersson wrote:
> Nickolay Shmyrev reports that he included the feature in Sphinx 4:
>
> http://nsh.nexiwave.com/2009/11/using-srilm-server-in-sphinx4.html
>
> (He also confirmed to me that it is not in PocketSphinx.)
>
> I'll use Nickolay's code as a reference when making a PocketSphinx version.

Hello Simon

If your goal is only to implement a cache-based LM, using SRILM as a server doesn't seem like an easy way, and there are many important points you need to take care of:

1. During the initialization stage the decoder requests *all* unigram probabilities to build the lextree. You definitely don't want them to be in the cache, so you need to disable the cache during initialization.

2. During the search the decoder stores unigram probabilities internally in the lextree. Most of the words are pruned before they reach the leaves, so a cache on the server will not help you, since the probabilities will be the same. You need to adjust the weights inside the lextree.

3. You need to reset the cache somehow.

Well, I suggest you discuss this implementation issue on the cmusphinx-devel mailing list instead, since this is not really a SRILM issue.
From andersson at disi.unitn.it Mon Jun 20 12:49:12 2011
From: andersson at disi.unitn.it (Simon Andersson)
Date: Mon, 20 Jun 2011 21:49:12 +0200 (CEST)
Subject: [SRILM User List] Cache model, ngram server
In-Reply-To: <1308597310.27627.60.camel@localhost.localdomain>
References: <000001cc2f3f$0cf85c30$26e91490$@unitn.it> <4DFF80CE.7000605@icsi.berkeley.edu> <36330.127.0.0.1.1308594582.squirrel@mail.disi.unitn.it> <1308597310.27627.60.camel@localhost.localdomain>
Message-ID: <57739.127.0.0.1.1308599352.squirrel@mail.disi.unitn.it>

(I'm posting this message to both the SRILM and Sphinx lists...)

What I want to do is to construct language models that can change according to application context. The context-sensitive LM could be built by interpolating one or two trigram models (e.g., general background model + domain model) and a small unigram model (the 'cache' model).

Would it not make sense to use the SRILM server feature for this?

- Simon

From mehdi_hoseini at comp.iust.ac.ir Mon Jun 20 12:29:26 2011
From: mehdi_hoseini at comp.iust.ac.ir (Mehdi hoseini)
Date: Mon, 20 Jun 2011 22:59:26 +0330
Subject: [SRILM User List] Arpa LM
In-Reply-To: <4DFF8042.40307@icsi.berkeley.edu>
References: <602798.91150.qm@web111723.mail.gq1.yahoo.com> <4DFF8042.40307@icsi.berkeley.edu>

hi Somaye

SRILM does not support the DMP format, but you can make a language model with SRILM and then convert it to DMP format with the lm3g2dmp tool (for more details see http://sphinx.subwiki.com/sphinx/index.php/Hello_World_Decoder_QuickStart_Guide).
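Something like this, if memory serves (the names are hypothetical; check the Sphinx guide above for the exact usage):

ngram-count -text corpus.txt -order 3 -lm mymodel.arpa
lm3g2dmp mymodel.arpa .

which should leave a mymodel.arpa.DMP file in the output directory.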
From lluis.formiga at upc.edu Tue Jun 21 01:52:15 2011
From: lluis.formiga at upc.edu (Lluís Formiga i Fanals)
Date: Tue, 21 Jun 2011 10:52:15 +0200
Subject: [SRILM User List] Problems w/ misspellings, CN and lattice-tool
Message-ID: <4E005BBF.5080600@upc.edu>

Dear all,

I am trying to implement the CN-based misspelling correction method published by Bertoldi et al. 2010 (the full citation is at the end of this e-mail). However, I am stuck at step 4, which involves the generation of a word-based CN by means of the lattice-tool of the SRILM toolkit.

Once I have set the unifilar word lattices altogether in SLF format, I call lattice-tool through this command:

lattice-tool -in-lattice wordlattice.slf -read-htk -lm lm/en.lm -write-mesh wordlattice.cn

However, including the language model may completely destroy the original CN form if the input lattice is considerably long (>15 nodes). I have tried to scale the language model impact through the -htk-scale and -htk-wdpenalty options, but even when I set them to 0 the CN still gets destroyed. The only way I can preserve the CN structure is to avoid the -lm option completely. But then the BLEU score of the translations decreases considerably.

Could anyone give me some clues to track down the problem? I can provide an SLF lattice sample alongside dot-generated images of intact and destroyed CNs.

Regards,

Lluís Formiga

[Bertoldi et al. 2010] Nicola Bertoldi, Mauro Cettolo, and Marcello Federico. 2010. Statistical machine translation of texts with misspelled words. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 412-419.

From stolcke at icsi.berkeley.edu Tue Jun 21 10:35:27 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 21 Jun 2011 10:35:27 -0700
Subject: [SRILM User List] Cache model, ngram server
In-Reply-To: <57739.127.0.0.1.1308599352.squirrel@mail.disi.unitn.it>
Message-ID: <4E00D65F.7050106@icsi.berkeley.edu>

Simon Andersson wrote:
> (I'm posting this message to both the SRILM and Sphinx lists...)
>
> What I want to do is to construct language models that can change according to application context. The context-sensitive LM could be built by interpolating one or two trigram models (e.g., general background model + domain model) and a small unigram model (the 'cache' model).
>
> Would it not make sense to use the SRILM server feature for this?

There is some experimental code (not documented in the man page yet) to perform adaptive weighting of interpolated ngram LMs. The class is AdaptiveMix (in AdaptiveMix.cc). A comment before the read() function documents the file format. The ngram(1) options enabling its use are

{ OPT_TRUE, "adapt-mix", &adaptMix, "use adaptive mixture of n-grams model" },
{ OPT_FLOAT, "adapt-decay", &adaptDecay, "history likelihood decay factor" },
{ OPT_UINT, "adapt-iters", &adaptIters, "EM iterations for adaptive mix" },

What this does is reestimate the mixture weights between the LMs based on the history. You could then also use the -cache option to add a unigram cache LM into the mix (but with a static mixture weight, given by -cache-lambda).
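Pulling that together, an invocation could look roughly like this (mix.spec stands for a file in the AdaptiveMix format documented in the source; all names and values are hypothetical):

ngram -adapt-mix -lm mix.spec -adapt-decay 0.9 -adapt-iters 3 -cache 200 -cache-lambda 0.1 -ppl test.txt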
The issues about integration with the decoder raised by Nickolay sound more serious. I'm sorry I cannot help with those.

Andreas

From stolcke at icsi.berkeley.edu Tue Jun 21 11:11:47 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 21 Jun 2011 11:11:47 -0700
Subject: [SRILM User List] Problems w/ misspellings, CN and lattice-tool
In-Reply-To: <4E005BBF.5080600@upc.edu>
References: <4E005BBF.5080600@upc.edu>
Message-ID: <4E00DEE3.1070608@icsi.berkeley.edu>

Lluís Formiga i Fanals wrote:
> I am trying to implement the CN-based misspelling correction method published by Bertoldi et al. 2010. [...]
>
> lattice-tool -in-lattice wordlattice.slf -read-htk -lm lm/en.lm -write-mesh wordlattice.cn
>
> However, including the language model may completely destroy the original CN form if the input lattice is considerably long (>15 nodes). [...]

As per the lattice-tool(1) man page, the sequence of processing steps is such that the -lm option triggers expansion of the CNs into general lattices, so of course whatever special properties your original CNs had might be lost. I haven't read the original paper, so I don't know what those properties are. Can't you contact the authors to find out more specifically how lattice-tool was used?

Andreas
From jianzhang09 at gmail.com Mon Jun 27 03:01:50 2011
From: jianzhang09 at gmail.com (jian zhang)
Date: Mon, 27 Jun 2011 11:01:50 +0100
Subject: [SRILM User List] Unexpected error using make-big-lm

Hi there,

I am very new to the SRILM tools, so I will do my best to explain my problem. I am experimenting with the feature for building a large language model at the moment. The batch counts have been built using the make-batch-counts script. But when I build the LM, I get an unexpected error like this:

+ ngram-count -read - -read-with-mincounts -order 5 -kn1 biglm.kn1 -kn2 biglm.kn2 -kn3 biglm.kn3 -kn4 biglm.kn4 -kn5 biglm.kn5 -lm testlm.lm -interpolate -unk -meta-tag __meta__ -kn-counts-modified
Unexpected error.
/home/srilm/bin/make-big-lm: line 225: 27562 Aborted ngram-count -read - -read-with-mincounts -order $order $gtflags $options
cat: write error: Broken pipe
gawk: (FILENAME=- FNR=953632439) fatal: print to "standard output" failed (Broken pipe)
cat: write error: Broken pipe
gzip: stdout: Broken pipe

I really have no idea what went wrong.

Thanks,
Jian

--
Jian Zhang
PLuTO Project
Centre for Next Generation Localisation (CNGL)
Dublin City University

From stolcke at icsi.berkeley.edu Mon Jun 27 14:00:43 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 27 Jun 2011 14:00:43 -0700
Subject: [SRILM User List] Unexpected error using make-big-lm
Message-ID: <4E08EF7B.7010706@icsi.berkeley.edu>

jian zhang wrote:
> I am very new to the SRILM tools, so I will do my best to explain my problem. [...]
> /home/srilm/bin/make-big-lm: line 225: 27562 Aborted ngram-count -read - -read-with-mincounts -order $order $gtflags $options

The ngram-count process exited with an "abort" signal, meaning it probably ran out of memory.

As a sanity check, try building your LM on a small subset of the data (say, using no more than a million words), using the same procedure. Assuming that works, please check the FAQ page on strategies for dealing with insufficient memory.
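A sketch of such a sanity check, following the make-big-lm recipe (arguments from memory; see training-scripts(1) and srilm-faq(7) for the exact usage):

head -50000 corpus.txt > small.txt
echo small.txt > file-list
make-batch-counts file-list 1 cat counts-small -order 5
make-big-lm -read counts-small/*.ngrams.gz -name smalllm -order 5 -kndiscount -interpolate -lm small.lm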
Andreas

From stolcke at icsi.berkeley.edu Mon Jun 27 14:08:42 2011
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 27 Jun 2011 14:08:42 -0700
Subject: [SRILM User List] Found New Extension LSA-Ngram
Message-ID: <4E08F15A.3010503@icsi.berkeley.edu>

Mehdi hoseini wrote:
> hi,
> I found a free-license SRILM extension that builds LSA+Ngram models, but unfortunately I cannot make it. Would you please help me compile it and patch it into the SRILM project?
> Best regards
>
> here is the link:
>
> http://userver.ftw.at/~pucher/semanlm/semanlm0.91.zip

This extension of SRILM was developed a while ago, so there may be small incompatibilities with the current version. However, the required tweaks should be minor. I suggest you first try to look at the code sections that cause problems and try fixing them yourself. If you hit serious problems, contact the author, Michael Pucher, for help. His current email address can be found in the paper at http://userver.ftw.at/~pucher/papers/phonsim_aaa07.pdf .

Andreas