From ma.farajian at gmail.com Tue Oct 5 22:18:47 2010
From: ma.farajian at gmail.com (amin farajian)
Date: Wed, 6 Oct 2010 08:48:47 +0330
Subject: [SRILM User List] problem in installing SRILM

Hi all,

I am trying to install SRILM on my machine (i486 with Debian). Following the instructions in the INSTALL file, I changed the SRILM variable in the top-level Makefile and the CC and CXX variables in Makefile.machine.i686. I also added NO_TCL=X to this file.
But while trying to install SRILM (with the command "make World"), I ran into these problems:

ar: creating ../obj/i686/libmisc.a
ar: creating ../obj/i686/libdstruct.a
make[2]: [/home/amin/MT/srilm//bin/i686/maxalloc] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/ngram] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/ngram-count] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/ngram-merge] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/ngram-class] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/disambig] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/anti-ngram] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/nbest-lattice] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/nbest-mix] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/nbest-optimize] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/nbest-pron-score] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/segment] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/segment-nbest] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/hidden-ngram] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/multi-ngram] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/fngram-count] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/fngram] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/lattice-tool] Error 1 (ignored)

What is wrong? Since it passes the dependency-checking stage, I don't think the problem is due to missing libraries. All messages from the installation are saved in a file, which is attached.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: log.rtf
Type: application/rtf
Size: 175210 bytes

From stolcke at speech.sri.com Tue Oct 5 22:30:33 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 05 Oct 2010 23:30:33 -0600
Subject: [SRILM User List] problem in installing SRILM
Message-ID: <4CAC0979.7060005@speech.sri.com>

amin farajian wrote:
> I am trying to install SRILM on my machine (i486 with Debian). Following
> the instructions in the INSTALL file, I changed the SRILM variable in the
> top-level Makefile and the CC and CXX variables in Makefile.machine.i686.
> I also added NO_TCL=X to this file. But while trying to install SRILM
> (with the command "make World"), I ran into these problems: [...]

Your log file (in the original post) shows that you are still trying to link with -ltcl. The INSTALL file says:

> TCL_INCLUDE, to whatever is needed to find the Tcl header
> files and library. If Tcl is not available, set NO_TCL=X
> and leave the above variables empty.

So you probably forgot to put

    TCL_LIBRARY =

in Makefile.machine.i686.

Andreas
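For reference, a minimal sketch of the relevant lines in Makefile.machine.i686 when building without Tcl (variable names as used in the INSTALL instructions quoted above; adjust to your setup):

    # build without Tcl support
    NO_TCL = X
    # leave these empty so -ltcl is never passed to the linker
    TCL_INCLUDE =
    TCL_LIBRARY =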
From nakul777 at gmail.com Thu Oct 7 04:37:54 2010
From: nakul777 at gmail.com (nakul sharma)
Date: Thu, 7 Oct 2010 17:07:54 +0530
Subject: [SRILM User List] installing SRILM

Hi all,

I am installing the SRILM software on Ubuntu 10.04 (updated, i686 arch). Going by the INSTALL file, I have set the SRILM variable in the Makefile accordingly and the CC and CXX variables in Makefile.machine.i686, and I have changed NO_TCL to X. It shows the following errors after running the make World command:

make: /sbin/machine-type: Command not found
mkdir include lib bin
mkdir: cannot create directory `include': File exists
mkdir: cannot create directory `lib': File exists
mkdir: cannot create directory `bin': File exists
make: [dirs] Error 1 (ignored)
make init
make[1]: /sbin/machine-type: Command not found
make[1]: Entering directory `/home/nakul/Desktop/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
        (cd $subdir/src; make SRILM= MACHINE_TYPE= OPTION= MAKE_PIC= init) || exit 1; \
done
make[2]: Entering directory `/home/nakul/Desktop/srilm/misc/src'
Makefile:24: /common/Makefile.common.variables: No such file or directory
Makefile:139: /common/Makefile.common.targets: No such file or directory
make[2]: *** No rule to make target `/common/Makefile.common.targets'.  Stop.
make[2]: Leaving directory `/home/nakul/Desktop/srilm/misc/src'
make[1]: *** [init] Error 1
make[1]: Leaving directory `/home/nakul/Desktop/srilm'
make: *** [World] Error 2

There seems to be some dependency problem. Please tell me what should be done.

From stolcke at speech.sri.com Thu Oct 7 09:48:17 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 07 Oct 2010 10:48:17 -0600
Subject: [SRILM User List] installing SRILM
Message-ID: <4CADF9D1.90207@speech.sri.com>

nakul sharma wrote:
> It shows the following errors after running the make World command:
>
> make: /sbin/machine-type: Command not found

This is a FAQ: you don't have tcsh/csh installed, hence the above scripts won't run. tcsh is an optional package on Ubuntu and Cygwin systems.

Andreas
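On Ubuntu, for example, the missing shell can be installed with (standard package name, assuming the usual repositories are enabled):

    sudo apt-get install tcsh

after which $SRILM/sbin/machine-type should run and print the machine type (i686 here).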
From david at unizar.es Thu Oct 14 04:27:39 2010
From: david at unizar.es (david at unizar.es)
Date: Thu, 14 Oct 2010 13:27:39 +0200
Subject: [SRILM User List] nbest-posterior
Message-ID: <20101014132739.2u6lkfow8gocc04o@webmail.unizar.es>

Hi,

I am using nbest-posteriors to process a list of n-best hypotheses. For each one I get the log10 of the acoustic score, and if I convert it to a linear probability, the sum over all hypotheses in each file is always one. So they form the posterior distribution over the chosen number of n-best hypotheses.

Is there a way to obtain just a likelihood, i.e. a score for each hypothesis that does not sum to one over all of them?

Many thanks,
DAVID.

From stolcke at speech.sri.com Thu Oct 14 09:23:21 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 14 Oct 2010 09:23:21 -0700
Subject: [SRILM User List] nbest-posterior
In-Reply-To: <20101014132739.2u6lkfow8gocc04o@webmail.unizar.es>
Message-ID: <4CB72E79.404@speech.sri.com>

david at unizar.es wrote:
> Is there a way to obtain just a likelihood, a score for each hypothesis
> that does not sum to one over all of them?

Just use the original nbest scores then (and add acoustic and LM scores according to whatever weighting you want to apply).

Andreas
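As a concrete sketch of that weighting (log10 scores; the three-column layout assumed here, acoustic score, LM score, word count, then the words, follows the simple SRILM n-best format, and the weight values are placeholders to be tuned for your task):

    # combined score = acoustic + lmw * LM + wtw * #words
    awk -v lmw=8 -v wtw=0 '{ print $1 + lmw * $2 + wtw * $3 }' hyps.nbest

This mirrors the usual n-best rescoring convention of a language model weight plus a word (insertion) penalty.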
From vagarwal at mit.edu Fri Oct 15 10:14:13 2010
From: vagarwal at mit.edu (Vikram Agarwal)
Date: Fri, 15 Oct 2010 13:14:13 -0400
Subject: [SRILM User List] smoothing questions
Message-ID: <4CB88BE5.9020509@mit.edu>

Hello, I am new to SRILM and just have a few questions that I'd greatly appreciate some help on:

1) Trying to use purely ML estimates with no smoothing, I used "ngram-count -cdiscount 0 -order 8 -read counts.txt -lm train.lm", but using this model with ngram -debug 2, I do not get zeroprobs because it automatically backs off. Is there a way to prohibit backoff so that I can retrieve the zeroprob sentences?

2) Suppose I perform: ngram -order 5 -ppl test.txt -lm train.lm. Will I always get the same results if I generated train.lm with ngram-count at order 5 or greater than 5, regardless of which smoothing technique is used and whether backoff/interpolation is employed?

3) My work uses a very small vocabulary (4 letters) but requires smoothing at higher orders (5-8). I read in the FAQ that -ukndiscount -order 7 may be good for modeling OOV words with a letter model. I wonder why ukndiscount was recommended over kndiscount. If kndiscount does not work due to the sparsity of low counts-of-counts, could the extrapolated counts-of-counts generated by "make-big-lm" outperform the ukndiscount method?

Thank you,
Vikram

From stolcke at speech.sri.com Thu Oct 21 10:34:48 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 21 Oct 2010 10:34:48 -0700
Subject: [SRILM User List] [Fwd: Re: Pruning of KN-smoothed models]
Message-ID: <4CC079B8.6010300@speech.sri.com>

This is great. Thanks, Ciprian!

Andreas

-------------- next part --------------
An embedded message was scrubbed...
From: Ciprian Chelba
Subject: Re: Pruning of KN-smoothed models
Date: Wed, 20 Oct 2010 19:50:51 -0700
Size: 8107

From stolcke at speech.sri.com Sun Oct 24 11:05:38 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 24 Oct 2010 11:05:38 -0700
Subject: [SRILM User List] request some question about lattice-tool
Message-ID: <4CC47572.6080207@speech.sri.com>

minglemingle_fight wrote:
> Dear professor,
>
> Sorry to trouble you, and thanks for reading!
>
> I am a student at Zhengzhou University of China; my name is Ming Yin, and
> my major is speech recognition. In recent days I have read a lot of
> articles on confusion networks, and I have done some work with
> lattice-tool. But when I use it to generate the confusion network, there
> is a warning "fail to align 1 word(s), max posterior=4.03825e-013", and
> the 1-best recognition from the CN is worse than 1-pass decoding. I don't
> know why; does the lattice have problems, or is my command wrong?

This message describes a normal condition for confusion network building from lattices, as long as the posterior value printed is small (4e-13 is small). So you did nothing wrong, and you don't have to worry about the message.

Andreas

> my command is:
>
> lattice-tool -in-lattice-list lat_list -read-htk -no-htk-nulls -htk-words-on-nodes -htk-logbase 2.718 -write-mesh-dir out_dir
>
> Thank you for reading this despite your very tight schedule; I look forward to your letter.
> Best wishes!
>
> Yours sincerely, YIN

From ma.farajian at gmail.com Fri Oct 29 06:39:15 2010
From: ma.farajian at gmail.com (amin farajian)
Date: Fri, 29 Oct 2010 17:09:15 +0330
Subject: [SRILM User List] Problem in building a 4gram language model

Hi,

I'm using SRILM to build a 4-gram language model on 570MB of data (about 6,500,000 sentences) using this command:

    tools/srilm/bin/i686/ngram-count -order 4 -interpolate -kndiscount -unk -text work/lm/MonoLing.pe -lm work/lm/Persian4.lm

but even after 24 hours of processing, nothing happens. When I use order 3, the process finishes after 2 minutes and SRILM builds the 3-gram language model; but when I want SRILM to build a 4-gram model on the same file, nothing happens (no output, no message).

What is the problem? How can I fix it?

Bests.

From stolcke at speech.sri.com Fri Oct 29 09:39:15 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 29 Oct 2010 09:39:15 -0700
Subject: [SRILM User List] Problem in building a 4gram language model
In-Reply-To: Your message of Fri, 29 Oct 2010 17:09:15 +0330.
Message-ID: <201010291639.o9TGdGj20857@huge>

Please read the FAQ page (or "man srilm-faq") on the subject of "Large data and memory issues".

--Andreas

In message you wrote:
> but even after 24 hours of processing, nothing happens.
> When I use order 3, the process finishes after 2 minutes and SRILM
> builds the 3-gram language model.
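The FAQ entry in question recommends building large models from merged batch counts rather than directly from text; roughly along these lines (the file and directory names are placeholders, and the exact name of the merged count file depends on the tools' output):

    split -l 500000 work/lm/MonoLing.pe chunk.    # split the corpus
    ls chunk.* > file-list
    make-batch-counts file-list 10 cat counts     # count each batch into ./counts
    merge-batch-counts counts                     # merge into one count file
    make-big-lm -read counts/<merged>.ngrams.gz -name big4 -order 4 \
        -kndiscount -interpolate -unk -lm work/lm/Persian4.lm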
From avneesh.saluja at sv.cmu.edu Wed Nov 10 23:34:54 2010
From: avneesh.saluja at sv.cmu.edu (Avneesh Saluja)
Date: Wed, 10 Nov 2010 23:34:54 -0800
Subject: [SRILM User List] Non-integer weights (-text-has-weights) and KN discounting

Hello SRILM team,

I'm working with your language model, and I'm also using the -text-has-weights feature to provide sentence-level weights for my training corpora. I read in the documentation that modified Kneser-Ney smoothing doesn't support fractional counts (and since I have weights between 0 and 1, I will have fractional counts), so I tried forcing my weights to be integers (by scaling up and rounding), which resulted in a count-of-counts error for modified KN smoothing.

I found this link: http://www-speech.sri.com/pipermail/srilm-user/2006q3/000375.html which discusses a possible reason why this would be the case. So on a whim, I tried using non-scaled, non-rounded weights (i.e. weights between 0 and 1), evaluated using KN discounting, found I didn't get any errors, and got what seems to be a reasonable perplexity for my situation.

My question is: can I trust this number, given that the documentation says fractional counts are only available with absolute or WB discounting? If I can't trust this number, is there any way I can get KN discounting to work in a reliable manner?

Thanks,
Avneesh

From stolcke at speech.sri.com Thu Nov 11 14:54:08 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 11 Nov 2010 14:54:08 -0800
Subject: [SRILM User List] Non-integer weights (-text-has-weights) and KN discounting
Message-ID: <4CDC7410.8030006@speech.sri.com>

Avneesh Saluja wrote:
> My question is: can I trust this number, given that the documentation says
> fractional counts are only available with absolute or WB discounting?

If you read floating-point counts as integers you might just get truncation, unless you have exponent notation (123.4 becomes 123, but 1.23e10 becomes 1!!!), and the results may be sort-of okay.

Are you still collecting counts with -float-counts? You must, because if you use a weight between 0 and 1 without -float-counts you get a weight of 0, and all counts would be zero. You can check this using

% ngram-count -text - -text-has-weights -write -
0.5 a
<s>	0
<s> a	0
<s> a </s>	0
a	0
a </s>	0
</s>	0

but with float counts:

% ngram-count -text - -text-has-weights -write - -float-counts
0.5 a
<s>	0.5
<s> a	0.5
<s> a </s>	0.5
a	0.5
a </s>	0.5
</s>	0.5

In the latter case, if you have sufficient data, the counts will sum up to something > 1 and will then be truncated (see above) when building the LM without -float-counts.

Andreas
From avneesh.saluja at sv.cmu.edu Thu Nov 11 23:28:07 2010
From: avneesh.saluja at sv.cmu.edu (Avneesh Saluja)
Date: Thu, 11 Nov 2010 23:28:07 -0800
Subject: [SRILM User List] Non-integer weights (-text-has-weights) and KN discounting
In-Reply-To: <4CDC7410.8030006@speech.sri.com>

Hi Andreas,

For some reason, -float-counts doesn't seem to work (this is on a file with non-integer sentence weights; since, according to man ngram-count, this works with WB, I tried WB):

$ ~/tools/srilm/bin/i686-m64/ngram-count -order 3 -text training_sets/trainingwithweights.txt -text-has-weights -unk -memuse -wbdiscount -lm weights.lm -float-counts -debug 2
ngram-count: ngram-count.cc:370: int main(int, char**): Assertion `intStats != 0' failed.
Aborted

The code seems OK:

    NgramCounts<FloatCount> *floatStats = !useFloatCounts ? 0 :
        new NgramCounts<FloatCount>(*vocab, order);

    #define USE_STATS(what) (useFloatCounts ? floatStats->what : intStats->what)

    if (useFloatCounts) {
        assert(floatStats != 0);
    } else {
        assert(intStats != 0);
    }

So I'm not sure why I get this error. I haven't found any previous troubleshooting emails related to this.

Thanks for your help,
Avneesh

From stolcke at speech.sri.com Fri Nov 12 12:52:18 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 12 Nov 2010 12:52:18 -0800
Subject: [SRILM User List] Non-integer weights (-text-has-weights) and KN discounting
References: <4CDC7410.8030006@speech.sri.com> <4CDD5A4E.7070205@speech.sri.com>
Message-ID: <4CDDA902.8050602@speech.sri.com>

Avneesh Saluja wrote:
> Hi Andreas, you're right - something is wrong with my 64-bit compilation
> on the current machine - the 32-bit version works fine, and I tried the
> 64-bit version on another machine and it's OK too. Sorry about that, and
> thanks again for your help!

Thanks for verifying that it is not a problem with the code.

Andreas

> On Fri, Nov 12, 2010 at 7:16 AM, Andreas Stolcke wrote:
>> Very strange indeed, since this happens before anything is stored in the
>> counts data structure. Does this happen with a 32-bit binary as well
>> (MACHINE_TYPE=i686)? I cannot replicate the error using either 32- or
>> 64-bit Linux binaries. It could be a compiler problem. What version of
>> gcc are you using?
>>
>> Andreas

From leona at postech.ac.kr Sun Nov 14 15:58:13 2010
From: leona at postech.ac.kr (Hwidong Na)
Date: Mon, 15 Nov 2010 08:58:13 +0900
Subject: [SRILM User List] The backoff weights in the "ngram-count" result
Message-ID: <1289779093.4479.11.camel@pandora>

Hi,

I'm going to utilize the backoff weights given by "ngram-count". According to http://www-speech.sri.com/projects/srilm/manpages/ngram-format.5.html:

> These are optionally followed by the logarithm (base 10)

However, I found that these values range approximately from -5 to +5, which would mean the actual backoff weights range from 10^-5 to 10^+5 instead of from 0.0 to 1.0. What are the actual values of the backoff weights?

--
Hwidong Na
KLE lab, POSTECH, KOREA

From stolcke at speech.sri.com Sun Nov 14 18:30:56 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 14 Nov 2010 18:30:56 -0800
Subject: [SRILM User List] The backoff weights in the "ngram-count" result
In-Reply-To: Your message of Mon, 15 Nov 2010 08:58:13 +0900. <1289779093.4479.11.camel@pandora>
Message-ID: <201011150230.oAF2Uuj08319@huge>

Backoff weights are not probabilities. They can be greater than 1.

--Andreas

From leona at postech.ac.kr Mon Nov 15 00:03:41 2010
From: leona at postech.ac.kr (Hwidong Na)
Date: Mon, 15 Nov 2010 17:03:41 +0900
Subject: [SRILM User List] The backoff weights in the "ngram-count" result
In-Reply-To: <201011150230.oAF2Uuj08319@huge>
Message-ID: <1289808221.7858.12.camel@pandora>

Hi Andreas,

It is confusing to me that backoff weights can be larger than 1. What would be the correct usage of backoff? I used the following model, as (Chen and Goodman, 1998) summarize in the technical report "An Empirical Study of Smoothing Techniques for Language Modeling" (p. 17):

    p_smooth(w_i | w_{i-n+1} ... w_{i-1}) =
          bofw * p_LM(w_i | w_{i-n+1} ... w_{i-1})
        + (1 - bofw) * p_LM(w_i | w_{i-n+2} ... w_{i-1})

where bofw is the backoff weight for the word sequence w_{i-n+1} ... w_{i-1}.

Best regards,
--
Hwidong Na
KLE lab, POSTECH, KOREA
From zeeshankhans at gmail.com Mon Nov 15 08:56:56 2010
From: zeeshankhans at gmail.com (zeeshan khan)
Date: Mon, 15 Nov 2010 17:56:56 +0100
Subject: [SRILM User List] -cache-lambda at 1.0

Hi all,

I am using the SRI toolkit to plot a perplexity curve for a corpus. I am trying to interpolate the main LM with a unigram cache language model based on a history of (let's say) n words. The ngram command provides this via the -cache option to specify the cache size and the -cache-lambda option to specify the interpolation factor. The command looks like this:

    ngram -order 4 -lm <lm file> -cache <cache size> -cache-lambda <interpolation factor> -ppl <test file>

(I have also tried it with -bayes 0; the output is the same.)

The PPL values for some of the interpolation factors (with fixed cache size) are:

    Interpolation factor    PPL
    0.9                     1848.62
    0.999                   93059.1
    0.99999                 4.32174e+06
    1.0                     22.2459

As you can see, the PPL values increase with -cache-lambda from 0.9 up through 0.999 and 0.99999, as expected. But at -cache-lambda = 1.0, the PPL suddenly falls to an extremely low value (from about 4 million at 0.99999 to about 22 at 1.0).

Can you kindly comment on why this happens? Is this behavior at -cache-lambda = 1.0 the result of some error in the way the PPL is calculated by SRILM, or am I missing some options in the command?

Regards,
Zeeshan Khan.

From stolcke at speech.sri.com Mon Nov 15 10:18:35 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 15 Nov 2010 10:18:35 -0800
Subject: [SRILM User List] The backoff weights in the "ngram-count" result
In-Reply-To: <1289808221.7858.12.camel@pandora>
Message-ID: <4CE1797B.2070501@speech.sri.com>

Hwidong Na wrote:
> It is confusing to me that backoff weights can be larger than 1. What
> would be the correct usage of backoff?

In the formula you quote, "bofw" is not the backoff weight; it is the interpolation weight controlling the mixture of higher- and lower-order estimates. It only applies to interpolated smoothing methods (ngram-count -interpolate).

The backoff weight is the \gamma variable in equation (24) on that page. It is computed as described in the 3rd column of the table at the top of p. 17. In fact, the formula given for Katz smoothing will work for all methods (replacing p_katz with the appropriate p-estimate, of course), and it is what is implemented by SRILM.

Andreas
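A worked illustration of how the stored log10 values are used at lookup time (the numbers are made up): suppose the LM has no entry for the trigram "a b c", a bigram entry "b c" with log probability -0.30103, and a backoff weight entry 0.17609 for the history "a b". Then

    log10 p(c | a b) = bow(a b) + log10 p(c | b)
                     = 0.17609 + (-0.30103) = -0.12494

i.e. p(c | a b) = 1.5 * 0.5 = 0.75. The backoff weight 10^0.17609 = 1.5 is greater than 1, yet the resulting probability is perfectly valid.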
From stolcke at speech.sri.com Mon Nov 15 12:10:37 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 15 Nov 2010 12:10:37 -0800
Subject: [SRILM User List] -cache-lambda at 1.0
Message-ID: <4CE193BD.9010804@speech.sri.com>

zeeshan khan wrote:
>     Interpolation factor    PPL
>     0.9                     1848.62
>     0.999                   93059.1
>     0.99999                 4.32174e+06
>     1.0                     22.2459

Check the count of "zeroprob" words. My guess is that your cache LM gives probability 0 to a large number of words (in fact, it does that by design, since it only knows about words that appeared before). When lambda = 1, only few words will have nonzero probability. So the PPL value at 1.0 is really infinity, but ngram -ppl tries to give more useful output by giving you the PPL over the nonzero-probability words and separately reporting the number of zeroprob words.

Andreas
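For reference, the perplexity that ngram -ppl reports is computed from the total log probability with OOV and zero-probability words excluded from the denominator; roughly (treat the exact bookkeeping as an assumption to be checked against the ppl output):

    ppl = 10^(-logprob / (words - OOVs - zeroprobs + sentences))

which is why a flood of zeroprob words can coexist with a small reported PPL value.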
From avneesh.saluja at sv.cmu.edu Thu Nov 18 00:11:17 2010
From: avneesh.saluja at sv.cmu.edu (Avneesh Saluja)
Date: Thu, 18 Nov 2010 00:11:17 -0800
Subject: [SRILM User List] Discrepancies in -text-has-weights

Hi Andreas and SRILM team,

There seem to be some discrepancies when I use the -text-has-weights (and -float-counts) features, or perhaps I'm not understanding them correctly. I ran some experiments to test this. I have a small training set of around 41k sentences and a test set of around 1000 sentences. I ran the following commands:

    ~/tools/srilm/bin/i686/ngram-count -order 3 -text test_baseline.txt -unk -memuse -wbdiscount -gt1min 0 -gt2min 0 -gt3min 0 -lm test_baseline.lm -debug 2

    ~/tools/srilm/bin/i686/ngram -unk -map-unk '<unk>' -lm test_baseline.lm -order 3 > baseline.ppl

I did the same on several training sets with weights prepended to the sentences: 0.1x, 1x, 10x, and 100x (I ran ngram-count on 0.1x with -text-has-weights and -float-counts, and on the others with -text-has-weights only, as those sentences are weighted by integers). I would expect the perplexities in all of these cases to be the same, but instead I got the following results:

            No weights   0.1x      1x        10x       100x
    ppl     61.6418      125.688   61.6418   102.966   102.966

Is there any reason why they are different?

Thanks,
Avneesh

From dianachih at gmail.com Thu Nov 18 09:06:29 2010
From: dianachih at gmail.com (Jie Qi)
Date: Thu, 18 Nov 2010 12:06:29 -0500
Subject: [SRILM User List] make-batch-counts and merge-batch-counts

Hi all and Andreas,

I have a question about the parameters of make-batch-counts and merge-batch-counts. If I have many text files stored in subfolders like 2007/01/01 to 2007/06/31, what is the correct form of file-list, count-dir and start-iter? Can someone give me an example? Thanks!

    make-batch-counts file-list \
        [ batch-size [ filter [ count-dir [ options ... ] ] ] ]
    merge-batch-counts [ -float-counts ] [ -l N ] count-dir [ file-list | start-iter ]

Here is my attempt:

    >> make-batch-counts file-list 10 2007/01/01/*.txt
    >> merge-batch-counts 2007/01/01/*.txt

-Diana

From stolcke at speech.sri.com Thu Nov 18 14:43:05 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 18 Nov 2010 14:43:05 -0800
Subject: [SRILM User List] Discrepancies in -text-has-weights
Message-ID: <4CE5ABF9.8050209@speech.sri.com>

Avneesh Saluja wrote:
> I would expect the perplexities in all of these cases to be the same,
> but instead I got the following results: [...] Is there any reason why
> they are different?

Yes: as you change the scaling factor you are effectively changing the number of samples. The smoothing with Witten-Bell depends on the ratio of type and token frequency. If you duplicate the data x times you have increased the token frequencies by a factor x, but the number of distinct words (types) stays the same.

It is quite intuitive: if you've seen 10 different words in a total sample of 10, your probability estimate that the next word will be a new word type will be much higher than if you had seen 10 word types in a sample of 1000! So scaling up makes your LM overconfident in having seen all the words in training, and scaling down gives too much probability mass to unseen words.

What your results show nicely is that the "natural" counts work best for WB smoothing, which is reassuring since it validates the underlying model (new-word frequency is used to estimate the frequency of unseen words, for each context).

Andreas
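A small numeric sketch of that effect (numbers made up for illustration): Witten-Bell reserves probability mass T / (N + T) for unseen words in a context, where T is the number of distinct word types observed after it and N the total token count. With T = 10 types in N = 10 tokens, the unseen mass is 10/20 = 0.5. Scale the counts up by 100 (N = 1000, T still 10) and it shrinks to 10/1010 ≈ 0.01, the overconfidence described above; scale them down by 10 with float counts (N = 1, T still 10) and it balloons to 10/11 ≈ 0.91, giving far too much mass to unseen words.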
From stolcke at speech.sri.com Thu Nov 18 14:55:15 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 18 Nov 2010 14:55:15 -0800
Subject: [SRILM User List] make-batch-counts and merge-batch-counts
Message-ID: <4CE5AED3.20205@speech.sri.com>

Jie Qi wrote:
> Here is my attempt:
>
>     >> make-batch-counts file-list 10 2007/01/01/*.txt
>     >> merge-batch-counts 2007/01/01/*.txt

You don't list the input files on the command line. You put a list of the input files into a separate file, and then give that to make-batch-counts. The "count-dir" is a new directory that you choose to store the aggregated count information. Make sure it has plenty of disk space. So, for example:

    ls 2007/01/01/*.txt > file-list
    make-batch-counts file-list 10 mycounts
    merge-batch-counts mycounts

should do what you intended to do.

Andreas
From sidhurukku at yahoo.com Fri Nov 19 04:53:54 2010
From: sidhurukku at yahoo.com (Jasleen Sidhu)
Date: Fri, 19 Nov 2010 04:53:54 -0800 (PST)
Subject: [SRILM User List] please help me ...
Message-ID: <163931.2543.qm@web111505.mail.gq1.yahoo.com>

hello,

I am trying to build a language model using SRILM and Moses. The following error message is displayed. I tried installing SRILM 3 times but the same problem arises again, and all the components (gcc, GNU make, Tcl, gawk) are installed properly:

/usr/bin/ld: cannot find -ltcl
collect2: ld returned 1 exit status
/home/srilm/sbin/decipher-install 0555 ../bin/i686/anti-ngram ../../bin/i686
ERROR: File to be installed (../bin/i686/anti-ngram) does not exist.
ERROR: File to be installed (../bin/i686/anti-ngram) is not a plain file.
Usage: decipher-install <mode> <file1> ... <fileN> <directory>
        mode:             file permission mode, in octal
        file1 ... fileN:  files to be installed
        directory:        where the files should be installed
files = ../bin/i686/anti-ngram
directory = ../../bin/i686
mode = 0555
make[2]: [../../bin/i686/anti-ngram] Error 1 (ignored)
g++ -m32 -mtune=pentium3 -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. -I../../include -u matherr -L../../lib/i686 -g -O3 -o ../bin/i686/nbest-lattice ../obj/i686/nbest-lattice.o ../obj/i686/liboolm.a -lm -ldl ../../lib/i686/libflm.a ../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a -ltcl -lm 2>&1 | c++filt
/usr/bin/ld: cannot find -ltcl
collect2: ld returned 1 exit status

[the same "cannot find -ltcl" link failure and decipher-install error repeat for nbest-lattice, nbest-mix, nbest-optimize, nbest-pron-score, segment, segment-nbest, hidden-ngram, multi-ngram, fngram-count, fngram, and lattice-tool]

make[2]: Leaving directory `/home/srilm/lattice/src'
make[2]: Entering directory `/home/srilm/utils/src'
make[2]: Nothing to be done for `release-programs'.
make[2]: Leaving directory `/home/srilm/utils/src'
make[1]: Leaving directory `/home/srilm'
make release-scripts
make[1]: Entering directory `/home/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
        (cd $subdir/src; make SRILM=/home/srilm MACHINE_TYPE=i686 OPTION= MAKE_PIC= release-scripts) || exit 1; \
done

[in each of the misc, dstruct, lm, flm, lattice, and utils subdirectories, make[2] reports "Nothing to be done for `release-scripts'."]

make[1]: Leaving directory `/home/srilm'

please help me, sir. Thank you.
jasleen

From stolcke at speech.sri.com Fri Nov 19 08:21:26 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 19 Nov 2010 08:21:26 -0800
Subject: [SRILM User List] please help me ...
In-Reply-To: Your message of Fri, 19 Nov 2010 04:53:54 -0800. <163931.2543.qm@web111505.mail.gq1.yahoo.com>
Message-ID: <201011191621.oAJGLQj06387@huge>

please read the FAQ section on building and installation.

--Andreas

In message <163931.2543.qm at web111505.mail.gq1.yahoo.com> you wrote:
> hello
> I am trying to build a language model using SRILM and Moses. The following
> error message is displayed. I tried installing SRILM 3 times but the same
> problem arises again, and all the components are installed properly (gcc,
> GNU make, Tcl, gawk):
>
> /usr/bin/ld: cannot find -ltcl
> collect2: ld returned 1 exit status

From stolcke at speech.sri.com Fri Nov 19 18:32:14 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 19 Nov 2010 18:32:14 -0800
Subject: [SRILM User List] how to create features like probability
Message-ID: <4CE7332E.4050905@speech.sri.com>

Jie Qi wrote:
> Hi all and Andreas,
>
> I have used make-batch-counts (I chose to merge all files) and
> merge-batch-counts to construct large count files for file-list0102.
>
> qijie at minus:~/Project/data/nytimes/2007> ls ./01/01/*.txt > file-list0101
> qijie at minus:~/Project/data/nytimes/2007> make-batch-counts file-list0102 all
> qijie at minus:~/Project/data/nytimes/2007> merge-batch-counts counts
>
> Based on the result file of file-list0102-1.ngrams, I made 3 language models (1-gram, 2-gram and 3-gram):
> qijie at ampersand:~/Project/data/nytimes/2007/counts> ngram-count -order 1 -read file-list0102-1.ngrams -lm file-list0102-1.unigramlm
> qijie at ampersand:~/Project/data/nytimes/2007/counts> ngram-count -order 2 -read file-list0102-1.ngrams -lm file-list0102-1.bigramlm
> qijie at ampersand:~/Project/data/nytimes/2007/counts> ngram-count -order 3 -read file-list0102-1.ngrams -lm file-list0102-1.trigramlm
>
> but the .bigramlm and .trigramlm files usually also contain 1-gram entries; how can I remove them and create features based on these language models? Here is my code, but -ppl file-list0102-1 seems wrong:
> ngram -lm file-list0102-1.trigramlm -ppl file-list0102-1 -debug 3 > file-list0102-1.triprob3
>
I'm not sure what you want to achieve. The ARPA format for backoff LMs contains all ngram probabilities up to the maximum order, by definition and by necessity (lower-order estimates are needed for backing off).

If you want to extract individual ngram probability parameters from the LM you can do that with gawk or perl text processing directly from the LM file. The LM contains explicit probabilities only for ngrams observed in training.

If you want to generate the conditional ngram probabilities for a list of arbitrary ngrams, use the option

        ngram -lm LM -counts C -debug 2

where C contains a list of ngrams, each followed by a count value (e.g., 1). This will dump out the probabilities in a format similar to -ppl. You probably have to reformat the output using gawk/perl to suit your needs.
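For instance, if C contains lines like these (the words here are made up),

        new york 1
        york times 1

then

        ngram -lm file-list0102-1.trigramlm -counts C -debug 2

should print the conditional probability of each listed ngram, which a small gawk or grep filter can then turn into feature values. (This is just an untested sketch; only the LM file name is taken from your commands above.)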
Hope this helps.

Andreas

From oatgnaw at gmail.com  Sat Nov 20 23:00:28 2010
From: oatgnaw at gmail.com (Tao Wang)
Date: Sun, 21 Nov 2010 15:00:28 +0800
Subject: [SRILM User List] Discrepancies in -text-has-weights
In-Reply-To: <4CE5ABF9.8050209@speech.sri.com>
References: <4CE5ABF9.8050209@speech.sri.com>
Message-ID: 

Would anyone help to remove me from the mailing list? Thanks.

2010/11/19 Andreas Stolcke

> Avneesh Saluja wrote:
>> Hi Andreas and SRILM team,
>>
>> There seem to be some discrepancies when I use the -text-has-weights (and -float-counts) feature, or perhaps I'm not understanding these features right. I ran some experiments to test it out. I have a small training set of around 41k sentences and a test set of around 1000 sentences. I ran the following commands:
>>
>> ~/tools/srilm/bin/i686/ngram-count -order 3 -text test_baseline.txt -unk -memuse -wbdiscount -gt1min 0 -gt2min 0 -gt3min 0 -lm test_baseline.lm -debug 2
>>
>> ~/tools/srilm/bin/i686/ngram -unk -map-unk '' -lm test_baseline.lm -order 3 > baseline.ppl
>>
>> On several training sets with weights prepended to the sentences: 0.1x, 1x, 10x, and 100x (I ran ngram-count on 0.1x with -text-has-weights and -float-counts, and for the others I ran -text-has-weights only, as the sentences are weighted by integers). I would expect the perplexities in all of these cases to be the same, but instead I got the following results:
>>
>>        No weights   0.1x      1x        10x       100x
>> ppl    61.6418      125.688   61.6418   102.966   102.966
>>
>> Is there any reason why they are different?
>>
> Yes, as you change the scaling factor you are effectively changing the number of samples. The smoothing with Witten-Bell depends on the ratio of type and token frequency. If you duplicate the data x times you have increased the token frequencies by a factor x, but the number of distinct words (types) stays the same.
>
> It is quite intuitive: if you've seen 10 different words among a total sample of 10, your probability estimate that the next word will be a new word type will be much higher than if you had seen 10 word types among a sample of 1000! So scaling up makes your LM overconfident in having seen all the words in training, and scaling down gives too much probability mass to unseen words.
>
> What your results show nicely is that the "natural" counts work best for WB smoothing, which is reassuring since it validates the underlying model (new word frequency is used to estimate the frequency of unseen words, for each context).
>
> Andreas
>
>> Thanks,
>>
>> Avneesh
>>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From fabian_in_hongkong at hotmail.com  Tue Nov 23 00:38:15 2010
From: fabian_in_hongkong at hotmail.com (Fabian -)
Date: Tue, 23 Nov 2010 09:38:15 +0100
Subject: [SRILM User List] Interpolation of multiple class-based+word-based LMs
Message-ID: 

Hi,

I am interested in interpolating multiple class-based and word-based language models. To be precise: 2 word-based (trivial) and 2 class-based. The classes of the 2 class-based LMs are different, as I computed the classes/LMs on different texts.

As I understand the manual of the ngram tool, it is not possible to do this; quote: "the second and any additional interpolated models can also be class N-grams (using the same -classes definitions)". And I do not satisfy "only one classes file". If I just combine the class files (and rename classes in the classes file and LM, to avoid conflicts) I may have words in 2 classes, which may have an adverse effect. But it should be possible to do this, correct?

So, is there a way to interpolate two class-based LMs with two different class definitions to get one class-based LM and one classes file (where one word is only in one class)?

Thank you,
Fabian
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com  Thu Nov 25 09:41:19 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 25 Nov 2010 09:41:19 -0800
Subject: [SRILM User List] Interpolation of multiple class-based+word-based LMs
In-Reply-To: 
References: 
Message-ID: <4CEE9FBF.6080208@speech.sri.com>

Fabian - wrote:
> Hi,
>
> I am interested in interpolating multiple class-based and word-based language models. To be precise: 2 word-based (trivial) and 2 class-based. The classes of the 2 class-based LMs are different, as I computed the classes/LMs on different texts.
>
> As I understand the manual of the ngram tool it is not possible to do this, quote: the second and any additional interpolated models can also be class N-grams (using the same *-classes* definitions). And I do not satisfy "only one classes file". If I just combine the class files (and rename classes in the classes file and LM, to avoid conflicts) I may have words in 2 classes, which may have an adverse effect.
Having the same words in more than one class has no adverse effect. So renaming the classes to be unique to each model, and then merging the class definitions, is exactly the way to go. (You also want to avoid any name clashes between class and word labels.)
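For instance, since the class label is the first field on each line of a classes file, something along these lines would make the labels unique before merging (an untested sketch; all file names and the LM1_/LM2_ prefixes are made up):

        # prefix every class label so the two definition files cannot clash
        gawk '{ $1 = "LM1_" $1; print }' classes1 > classes1.renamed
        gawk '{ $1 = "LM2_" $1; print }' classes2 > classes2.renamed
        cat classes1.renamed classes2.renamed > classes.merged

The same renaming would then have to be applied to the class tokens inside the corresponding LM files (e.g., with sed), so that each LM stays consistent with the merged classes file.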
Andreas

> But it should be possible to do this, correct?
>
> So, is there a way to interpolate two class-based LMs with two different class definitions to get one class-based LM and one classes file (where one word is only in one class)?
>
> Thank you,
> Fabian
> ------------------------------------------------------------------------
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

From stolcke at speech.sri.com  Fri Nov 26 17:36:37 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 26 Nov 2010 17:36:37 -0800
Subject: [SRILM User List] FW: Re: help on building srilm
In-Reply-To: Your message of Fri, 26 Nov 2010 16:31:09 +0100.
Message-ID: <201011270136.oAR1abH21454@huge>

In message you wrote:
>
> Hi,
>
> I just posted this message on the Moses mailing list and I was referred to you.
>
> Could you help me?
>
> ------------------------------------
>
> Hi,
>
> I have compiled SRILM on a machine type of: ppc64
>
> The make world seems to have finished OK. These files are in place:
>
> libdstruct.a
> libflm.a
> liblattice.a
> libmisc.a
> liboolm.a
>
> The make test seems to perform great. However, it hangs (for more than an hour) on this line:
>
> *** Running test ngram-server ***
>
> I have no idea what might cause this. Can anyone help me solve the problem? I have tried to ignore this and compile Moses anyway, but that generates an error during make moses.
>
I have no idea why this test wouldn't work on your machine, but the ngram-server functionality is more dependent on OS specifics than most because it involves networking.

If you don't need it specifically, just disable the test and then rerun the other tests:

        cd $SRILM
        mkdir lm/test/tests.disabled
        mv lm/test/tests/ngram-server lm/test/tests.disabled
        make test

Andreas

From dianachih at gmail.com  Sat Nov 27 00:30:34 2010
From: dianachih at gmail.com (Jie Qi)
Date: Sat, 27 Nov 2010 03:30:34 -0500
Subject: [SRILM User List] how to create features like probability
In-Reply-To: <4CE7332E.4050905@speech.sri.com>
References: <4CE7332E.4050905@speech.sri.com>
Message-ID: 

Hi all and Andreas,

I wish to build n-gram language models on a set of text files, and then use the model on other corpora to compute their max sentence probability, min sentence probability, average sentence probability, and the overall log probability of each article. But I am not quite sure how to do that with SRILM; I am most confused about how to train the model and get the overall and sentence probabilities. Here is my code:

find $1 -type f -iname 'clean*.txt' > file-list200701
make-batch-counts file-list200701 all
merge-batch-counts counts
#build language models
ngram-count -order 1 -read merge-iter0-1.ngrams -lm file-list200701.unigramlm
#get sentence probability
ngram -lm file-list200701.unigramlm -counts merge-iter0-1.ngrams -debug 3 > nytfinance.unigramlm.prob3

Thanks!

Best,
Jie Qi

On Fri, Nov 19, 2010 at 9:32 PM, Andreas Stolcke wrote:

> Jie Qi wrote:
>> Hi all and Andreas,
>>
>> I have used make-batch-counts (I chose to merge all files) and merge-batch-counts to construct large counts files for the file-list0102.
>> qijie at minus:~/Project/data/nytimes/2007> ls ./01/01/*.txt > file-list0101
>> qijie at minus:~/Project/data/nytimes/2007> make-batch-counts file-list0102 all
>> qijie at minus:~/Project/data/nytimes/2007> merge-batch-counts counts
>>
>> Based on the result file of file-list0102-1.ngrams, I made 3 language models (1-gram, 2-gram and 3-gram):
>> qijie at ampersand:~/Project/data/nytimes/2007/counts> ngram-count -order 1 -read file-list0102-1.ngrams -lm file-list0102-1.unigramlm
>> qijie at ampersand:~/Project/data/nytimes/2007/counts> ngram-count -order 2 -read file-list0102-1.ngrams -lm file-list0102-1.bigramlm
>> qijie at ampersand:~/Project/data/nytimes/2007/counts> ngram-count -order 3 -read file-list0102-1.ngrams -lm file-list0102-1.trigramlm
>>
>> but the .bigramlm and .trigramlm files usually also contain 1-gram entries; how can I remove them and create features based on these language models? Here is my code, but -ppl file-list0102-1 seems wrong:
>> ngram -lm file-list0102-1.trigramlm -ppl file-list0102-1 -debug 3 > file-list0102-1.triprob3
>>
> I'm not sure what you want to achieve. The ARPA format for backoff LMs contains all ngram probabilities up to the maximum order, by definition and by necessity (lower-order estimates are needed for backing off).
>
> If you want to extract individual ngram probability parameters from the LM you can do that with gawk or perl text processing directly from the LM file. The LM contains explicit probabilities only for ngrams observed in training.
>
> If you want to generate the conditional ngram probabilities for a list of arbitrary ngrams use the option
>
> ngram -lm LM -counts C -debug 2
>
> where C contains a list of ngrams each followed by a count value (e.g., 1). This will dump out the probabilities in a format similar to -ppl. You probably have to reformat the output using gawk/perl to suit your needs.
>
> Hope this helps.
>
> Andreas
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com  Sun Nov 28 20:44:38 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 28 Nov 2010 20:44:38 -0800
Subject: [SRILM User List] how to create features like probability
In-Reply-To: 
References: <4CE7332E.4050905@speech.sri.com>
Message-ID: <4CF32FB6.5050803@speech.sri.com>

Jie Qi wrote:
> Hi all and Andreas,
>
> I wish to build n-gram language models on a set of text files, and then use the model on other corpora to compute their max sentence probability, min sentence probability, average sentence probability, and the overall log probability of each article. But I am not quite sure how to do that with SRILM; I am most confused about how to train the model and get the overall and sentence probabilities. Here is my code:
>
> find $1 -type f -iname 'clean*.txt' > file-list200701
> make-batch-counts file-list200701 all
> merge-batch-counts counts
> #build language models
> ngram-count -order 1 -read merge-iter0-1.ngrams -lm file-list200701.unigramlm

Everything looks good up to this point.

> #get sentence probability
> ngram -lm file-list200701.unigramlm -counts merge-iter0-1.ngrams -debug 3 > nytfinance.unigramlm.prob3

I'm still confused about what you want to compute. Since merge-iter0-1.ngrams contains counts from the training data you would be computing the total training set likelihood, as well as that of all the individual ngrams occurring in it.
Also, -debug 3 is very slow because it computes the sum of all the conditional probabilities for all histories.

To compute just sentence probabilities (as indicated by your comment) for a list of test sentences contained in TEST (one sentence per line), use

        ngram -lm LM -debug 1 -ppl TEST > TEST.ppl

Andreas

From stolcke at speech.sri.com  Sun Nov 28 20:59:06 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 28 Nov 2010 20:59:06 -0800
Subject: [SRILM User List] smoothing questions
In-Reply-To: <4CB88BE5.9020509@mit.edu>
References: <4CB88BE5.9020509@mit.edu>
Message-ID: <4CF3331A.7080103@speech.sri.com>

Vikram Agarwal wrote:
> Hello,
>
> I am new to SRILM and just have a few questions that I'd greatly appreciate some help on:
>
> 1) Trying to use purely ML estimates w/ no smoothing, I used
> "ngram-count -cdiscount 0 -order 8 -read counts.txt -lm train.lm"
>
> but using this model with ngram -debug 2, I do not get zeroprobs because it automatically backs off. Is there a way to prohibit backoff so that I can retrieve the zeroprob sentences?

I believe you are seeing backing-off only because the default minimum count for ngrams longer than 2 is 2 (so a singleton 4-gram, for example, is not recorded in the model, and triggers backoff). Try using these additional options:

        -gt3min 1 -gt4min 1 -gt5min 1 -gt6min 1 -gt7min 1 -gt8min 1

(Yes, it would be nice to have a single option to do this.)
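Combined with the command from your question (1), that would be, for instance:

        ngram-count -read counts.txt -order 8 -cdiscount 0 \
                -gt3min 1 -gt4min 1 -gt5min 1 -gt6min 1 -gt7min 1 -gt8min 1 \
                -lm train.lm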
> 2) Suppose I perform: ngram -order 5 -ppl test.txt -lm train.lm. Will I always get the same results if I generated train.lm with ngram-count at order 5 or greater than 5, regardless of which smoothing technique is used and whether backoff/interpolation is employed?

The answer is yes for smoothing methods that apply to different ngram lengths uniformly. That is all the methods except those based on Kneser-Ney. For KN, the lower-order distributions are treated differently from the highest order, hence the 5-grams are smoothed differently depending on whether 5 is the highest order or not.

> 3) My work uses a very small vocabulary (4 letters), but requires smoothing at higher orders (5-8). I read in the FAQ that -ukndiscount -order 7 may be good to use for modeling OOV words with a letter model. I wonder why ukndiscount was recommended over kndiscount? If kndiscounting does not work due to the sparsity of low count-of-counts, could the extrapolated count-of-counts generated by "make-big-lm" outperform the ukndiscounting method?

I don't think there was a particular reason to recommend ukndiscount over kndiscount, but your conjecture makes sense. The count-of-counts extrapolation is designed for cases where the lowest-order counts-of-counts (starting with the count of singletons) are missing, so it wouldn't really be relevant in this case.

Andreas

From stolcke at speech.sri.com  Sun Nov 28 22:32:36 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 28 Nov 2010 22:32:36 -0800
Subject: [SRILM User List] how to create features like probability
In-Reply-To: 
References: <4CE7332E.4050905@speech.sri.com> <4CF32FB6.5050803@speech.sri.com>
Message-ID: <4CF34904.1050300@speech.sri.com>

Jie Qi wrote:
> Hi Andreas,
>
> Thanks for your help! My goal is to compute the max sentence probability, min sentence probability, average sentence probability, and overall log probability of the article for every text file in the folder of New York Times Finance. Would you please teach me how to do that with SRILM? Also, is it better to use make-batch-counts, or to write every text file into one file and use ngram-count?

1. Format the data so that each line of the text file contains one sentence. Use one file per article.

2. Run ngram -lm LM -debug 2 -ppl FILE on each text file. This gives you the log probability for each sentence, as well as the total log prob.

3. Post-process the output to get min/max/avg, etc.
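As a rough illustration of step 3 (an untested gawk sketch, assuming the per-sentence summary lines containing "logprob=" that -ppl prints at -debug 1 and above, and a placeholder LM file name):

        ngram -lm LM -debug 2 -ppl article.txt | gawk '
            /^file /   { exit }                          # stop before the file-level summary
            /logprob=/ {
                for (i = 1; i < NF; i++)
                    if ($i == "logprob=") lp = $(i + 1)  # per-sentence log probability
                n++; sum += lp
                if (n == 1 || lp < min) min = lp
                if (n == 1 || lp > max) max = lp
            }
            END { if (n) printf "min %g max %g avg %g total %g\n", min, max, sum / n, sum }'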
For the first and last steps you need to learn gawk or perl. I cannot help you with that, but there are many books and online articles to teach you. Or ask someone local to help you.

Good luck

Andreas

>
> Best,
> Jie
>
> On Sun, Nov 28, 2010 at 11:44 PM, Andreas Stolcke wrote:
>
> Jie Qi wrote:
> > Hi all and Andreas,
> >
> > I wish to build n-gram language models on a set of text files, and then use the model on other corpora to compute their max sentence probability, min sentence probability, average sentence probability, and the overall log probability of each article. But I am not quite sure how to do that with SRILM; I am most confused about how to train the model and get the overall and sentence probabilities. Here is my code:
> >
> > find $1 -type f -iname 'clean*.txt' > file-list200701
> > make-batch-counts file-list200701 all
> > merge-batch-counts counts
> > #build language models
> > ngram-count -order 1 -read merge-iter0-1.ngrams -lm file-list200701.unigramlm
>
> Everything looks good up to this point.
>
> > #get sentence probability
> > ngram -lm file-list200701.unigramlm -counts merge-iter0-1.ngrams -debug 3 > nytfinance.unigramlm.prob3
>
> I'm still confused about what you want to compute. Since merge-iter0-1.ngrams contains counts from the training data you would be computing the total training set likelihood, as well as that of all the individual ngrams occurring in it.
> Also, -debug 3 is very slow because it computes the sum of all the conditional probabilities for all histories.
> To compute just sentence probabilities (as indicated by your comment) for a list of test sentences contained in TEST (one sentence per line), use
>
> ngram -lm LM -debug 1 -ppl TEST > TEST.ppl
>
> Andreas
>

From zeeshankhans at gmail.com  Mon Nov 29 16:19:27 2010
From: zeeshankhans at gmail.com (zeeshan khan)
Date: Tue, 30 Nov 2010 01:19:27 +0100
Subject: [SRILM User List] lattice rescoring with LM + cache
Message-ID: 

Dear all,

I am trying to investigate the effect of a cache on the word error rate improvements on a corpus.

For this, I want to rescore the HTK lattices with the LM + cache and then extract the CTM from the lattice.

Ideally, it should work in a similar way to calculating perplexity with the ngram tool: while rescoring the lattices, the SRILM tool should take a unigram LM based on a history of n words and interpolate the main LM with it (just like done with the -cache and -cache-lambda options of the ngram tool).

I am using the SRILM lattice-tool to rescore the lattice and produce the CTM. The command currently looks like this:

        lattice-tool -order -lm -bayes 0 -in-lattice -read-htk -posterior-decode -output-ctm

But I can't find the proper set of configuration options to achieve what I want to do - ideally there should be -cache and -cache-lambda options like the ones in the SRILM ngram tool. But there isn't any such option in lattice-tool. Can anyone guide me how I can achieve it?

Thanks & Best Regards,

Zeeshan Khan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From zeeshankhans at gmail.com  Mon Nov 29 18:04:06 2010
From: zeeshankhans at gmail.com (zeeshan khan)
Date: Tue, 30 Nov 2010 03:04:06 +0100
Subject: [SRILM User List] nbest list format
Message-ID: 

Dear all,

There are three formats specified by SRILM for the nbest lists generated by using it:
http://www-speech.sri.com/projects/srilm/manpages/nbest-format.5.html

I want to generate nbest lists in the 2nd format, i.e. the Decipher format specified on this page (to preserve the timing information), from HTK lattices. However, by default, I get the nbest lists in the 3rd format specified on the page.

Ideally, I should be able to specify the output format using some option of lattice-tool or nbest-lattice; however, I couldn't find configuration options to specify the output nbest list's format.

Any suggestions on how to do it?

Thanks and Best Regards,
Zeeshan Khan.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com  Mon Nov 29 23:11:55 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 29 Nov 2010 23:11:55 -0800
Subject: [SRILM User List] nbest list format
In-Reply-To: 
References: 
Message-ID: <4CF4A3BB.3000900@speech.sri.com>

zeeshan khan wrote:
> Dear all,
> There are three formats specified by SRILM for the nbest lists generated by using it:
> http://www-speech.sri.com/projects/srilm/manpages/nbest-format.5.html
> I want to generate nbest lists in the 2nd format, i.e. the Decipher format specified on this page (to preserve the timing information), from HTK lattices.
> However, by default, I get the nbest lists in the 3rd format specified on the page.
> Ideally, I should be able to specify the output format using some option of lattice-tool or nbest-lattice; however, I couldn't find configuration options to specify the output nbest list's format.
> Any suggestions on how to do it?

Unfortunately the N-best generation for lattices does not support the output format with time alignment information. It is possible in principle, but the code just wasn't written to do that. If you have a fair amount of time you could dig into lattice/src/LatticeNBest.cc to change it.

Andreas

> Thanks and Best Regards,
> Zeeshan Khan.
> ------------------------------------------------------------------------
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

From stolcke at speech.sri.com  Tue Nov 30 00:01:16 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 30 Nov 2010 00:01:16 -0800
Subject: [SRILM User List] lattice rescoring with LM + cache
In-Reply-To: 
References: 
Message-ID: <4CF4AF4C.5020309@speech.sri.com>

zeeshan khan wrote:
> Dear all,
>
> I am trying to investigate the effect of a cache on the word error rate improvements on a corpus.
>
> For this, I want to rescore the HTK lattices with the LM + cache and then extract the CTM from the lattice.
>
> Ideally, it should work in a similar way to calculating perplexity with the ngram tool: while rescoring the lattices, the SRILM tool should take a unigram LM based on a history of n words and interpolate the main LM with it (just like done with the -cache and -cache-lambda options of the ngram tool).
>
> I am using the SRILM lattice-tool to rescore the lattice and produce the CTM. The command currently looks like this:
>
> lattice-tool -order -lm -bayes 0 -in-lattice -read-htk -posterior-decode -output-ctm
>
> But I can't find the proper set of configuration options to achieve what I want to do - ideally there should be -cache and -cache-lambda options like the ones in the SRILM ngram tool. But there isn't any such option in lattice-tool. Can anyone guide me how I can achieve it?

Applying a cache LM to lattice rescoring is not straightforward, because you're not processing a linear sequence of words. Strictly speaking you'd have to maintain a different cache LM for each partial path through the lattice, which would be very expensive. Also, you don't know what the correct word string is, so you have to think about which words should go into the cache.

What people usually do when dealing with lattices or nbest lists is to compute a cache based on all utterances preceding the utterance to be rescored. So you would compute a unigram LM specific to each utterance, then interpolate that with the main LM (-bayes 0 -mix-lm CACHELM).

As to the question of what words to cache: a sensible approach would be to weight word unigrams according to their posterior probabilities. So use lattice-tool -order 1 -write-ngrams to dump the weighted counts, and ngram-count -float-counts to build the cache LM (you can also turn off smoothing).
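Put together, the pipeline might look something like this (an untested sketch; the file names, the -in-lattice-list batch option, the mixture weight, and the -cdiscount1 0 way of turning off smoothing are assumptions to be checked against the man pages):

        # posterior-weighted unigram counts from the lattices of the preceding utterances
        lattice-tool -read-htk -in-lattice-list prev-lattices.list -order 1 -write-ngrams cache.counts

        # cache unigram LM from the fractional counts, without smoothing
        ngram-count -read cache.counts -float-counts -order 1 -cdiscount1 0 -lm cache.lm

        # rescore the current lattice, interpolating the main LM with the cache LM
        lattice-tool -read-htk -in-lattice current.lat -order 3 -lm MAIN.lm \
                -bayes 0 -mix-lm cache.lm -lambda 0.9 \
                -posterior-decode -output-ctm > current.ctm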
Andreas

>
> Thanks & Best Regards,
>
> Zeeshan Khan
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

From cwsunshine at gmail.com  Tue Nov 30 01:18:11 2010
From: cwsunshine at gmail.com (wei chen)
Date: Tue, 30 Nov 2010 17:18:11 +0800
Subject: [SRILM User List] lattice-tool a-star
Message-ID: 

Hi all,

Can lattice-tool realize the A-star algorithm, which is often used in the 2nd pass of speech recognition? I am quite confused about that. Thanks a lot!

Best wishes,
Wei Chen
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com  Tue Nov 30 10:11:32 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 30 Nov 2010 10:11:32 -0800
Subject: [SRILM User List] lattice-tool a-star
In-Reply-To: 
References: 
Message-ID: <4CF53E54.6080006@speech.sri.com>

wei chen wrote:
> Hi all,
> Can lattice-tool realize the A-star algorithm, which is often used in the 2nd pass of speech recognition? I am quite confused about that. Thanks a lot!

lattice-tool uses a Viterbi algorithm as the default for 1-best decoding. For nbest decoding you have the option to use A-star or Viterbi. See the man page for details.

Andreas

>
> Best wishes,
> Wei Chen
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

From mehdi_hoseini at comp.iust.ac.ir  Wed Dec 1 09:00:35 2010
From: mehdi_hoseini at comp.iust.ac.ir (Mehdi hoseini)
Date: Wed, 01 Dec 2010 20:30:35 +0330
Subject: [SRILM User List] Basic questions
Message-ID: 

-----Original Message-----
From: "Mehdi hoseini"
To: srilm-user at speech.sri.com
Date: Wed, 01 Dec 2010 18:58:31 +0330
Subject: Basic questions

Hi,

First, thanks for your attention. I am new to SRILM and HTK, and I am sorry for my very basic questions. I couldn't get SRILM to run on Linux or Cygwin, so I searched for a Visual Studio solution and found one here: http://www.keithv.com/software/srilm/

I compiled the files and used them for building my language models. How can I use my results (LMs) in HTK? Does SRILM support topic language models like "PLSA" or "LDA LM"? If not, is there any toolkit that covers these models? Thanks.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From zeeshankhans at gmail.com  Mon Dec 6 07:20:15 2010
From: zeeshankhans at gmail.com (zeeshan khan)
Date: Mon, 6 Dec 2010 16:20:15 +0100
Subject: [SRILM User List] Cache-lambda with lowest perplexity
Message-ID: 

Hi all,

I am using the following options to calculate the perplexity with a cache size of, let's say, 500. All I can do is run it for various values of CACHE-LAMBDA and find out manually for which value of CACHE-LAMBDA the lowest perplexity occurs.

        ngram -unk -map-unk '[UNKNOWN]' -lm LM -cache 500 -cache-lambda LAMBDA -ppl CORPUS

Is it possible with the SRI toolkit to somehow automatically get the CACHE-LAMBDA weight which gives the lowest perplexity for the given corpus?

Best Regards,
Zeeshan Khan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com  Mon Dec 6 10:58:08 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 06 Dec 2010 10:58:08 -0800
Subject: [SRILM User List] Cache-lambda with lowest perplexity
In-Reply-To: 
References: 
Message-ID: <4CFD3240.10409@speech.sri.com>

zeeshan khan wrote:
> Hi all,
>
> I am using the following options to calculate the perplexity with a cache size of, let's say, 500. All I can do is run it for various values of CACHE-LAMBDA and find out manually for which value of CACHE-LAMBDA the lowest perplexity occurs.
>
> ngram -unk -map-unk '[UNKNOWN]' -lm LM -cache 500 -cache-lambda LAMBDA -ppl CORPUS
>
> Is it possible with the SRI toolkit to somehow automatically get the CACHE-LAMBDA weight which gives the lowest perplexity for the given corpus?

Yes. You use the same method as used to optimize the linear interpolation of two arbitrary LMs.

1. Generate the probabilities from the basic LM and from the cache LM alone:

        ngram -unk -map-unk '[UNKNOWN]' -lm LM -ppl CORPUS -debug 2 > lm1.ppl
        ngram -unk -map-unk '[UNKNOWN]' -null -cache 500 -cache-lambda 1.0 -ppl CORPUS -debug 2 > cachelm.ppl

2. Use an EM algorithm to estimate the best lambda:

        compute-best-mix cachelm.ppl lm1.ppl

(See the ppl-scripts(1) man page for details on compute-best-mix.)

Of course, for meaningful results you should use a development set separate from the evaluation data to optimize the mixture weight.

Andreas

From stolcke at speech.sri.com  Tue Dec 21 06:09:26 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 21 Dec 2010 06:09:26 -0800
Subject: [SRILM User List] SRILM-user list maintenance
Message-ID: <201012211409.oBLE9QH20710@huge>

The list (and its web interface) were nonfunctional for a while due to a server crash. If you read this, it means that things are back to working. Sorry for any inconvenience.

--Andreas

From amr_desoky at yahoo.com  Wed Dec 29 10:40:25 2010
From: amr_desoky at yahoo.com (Amr Desoky)
Date: Wed, 29 Dec 2010 10:40:25 -0800 (PST)
Subject: [SRILM User List] ARPA LM with only higher order grams?
Message-ID: <606025.9895.qm@web51001.mail.re2.yahoo.com>

Hi,

I am asking: is it possible to have an ARPA LM storing only 3-gram log probabilities?
Assuming that in my application (in which I will use the LM), I will only require the probabilities of these specific 3-grams. Example of the LM:

\data\
ngram 1=0
ngram 2=0
ngram 3=3

\1-grams:

\2-grams:

\3-grams:

\end\

In other words: if I have some method to estimate the probability of some 3-grams needed for 3-gram lattice rescoring for ASR, is it possible to insert the probabilities of these 3-grams into a normal ARPA backoff LM? I did so, but when I tried to normalize the new LM (after adding the new 3-grams), I got the following warnings, and the new grams are filtered out!

warning: no bow for prefix of ngram "w1 w2 w3"
.........(lots of the above warning)
BOW numerator for context "w4 w5" is -0.535204 < 0
.........(lots of the above warning)

Could you tell me why this is happening? Since if some 3-gram probability is there, I will not need to back off, and I will not need to use the lower-order grams to get the probability of this specific 3-gram... yes?

If I do not normalize the new LM, will it be a correct LM, or do you see some bug? Is there some other way to validate the correctness of this LM?

I will appreciate your help very much.

Best regards,
Amr

Amr Ibrahim El-Desoky Mousa
PhD Student, Computer Science (i6), RWTH Aachen University, Aachen, Germany
Cell: +49 0176 56418470
Office: +49 241 8021620
Fax: +49 241 8022219
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com  Wed Dec 29 14:29:10 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 29 Dec 2010 16:29:10 -0600
Subject: [SRILM User List] ARPA LM with only higher order grams?
In-Reply-To: <606025.9895.qm@web51001.mail.re2.yahoo.com>
References: <606025.9895.qm@web51001.mail.re2.yahoo.com>
Message-ID: <4D1BB636.7030008@speech.sri.com>

Amr Desoky wrote:
> Hi,
> I am asking: is it possible to have an ARPA LM storing only 3-gram log probabilities?
> Assuming that in my application (in which I will use the LM), I will only require the probabilities of these specific 3-grams.
> Example of the LM:
>
> \data\
> ngram 1=0
> ngram 2=0
> ngram 3=3
>
> \1-grams:
>
> \2-grams:
>
> \3-grams:
>
> \end\
>
> In other words: if I have some method to estimate the probability of some 3-grams needed for 3-gram lattice rescoring for ASR, is it possible to insert the probabilities of these 3-grams into a normal ARPA backoff LM? I did so, but when I tried to normalize the new LM (after adding the new 3-grams), I got the following warnings, and the new grams are filtered out!
>
> warning: no bow for prefix of ngram "w1 w2 w3"
> .........(lots of the above warning)

This is a sanity check of the backoff format. For each ngram w1 w2 w3 it is checked that the history "w1 w2" has a corresponding backoff weight.

> BOW numerator for context "w4 w5" is -0.535204 < 0
> .........(lots of the above warning)
>
> Could you tell me why this is happening? Since if some 3-gram probability is there, I will not need to back off, and I will not need to use the lower-order grams to get the probability of this specific 3-gram... yes?
>
> If I do not normalize the new LM, will it be a correct LM, or do you see some bug? Is there some other way to validate the correctness of this LM?

As long as you don't renormalize the LM, AND you only use the trigram probabilities, AND you insert dummy unigrams and bigrams (to satisfy the above sanity check) with arbitrary log probabilities and backoff weights (make them 0) you can use the model in the standard way.

Andreas
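For concreteness, a trigram-only model of the kind described might look like this (a made-up sketch: w1, w2, w3 are dummy words, the -99 unigram and bigram log probabilities are arbitrary "effectively zero" values, and every backoff weight is 0, so that each trigram's history has a bow entry):

\data\
ngram 1=3
ngram 2=2
ngram 3=2

\1-grams:
-99	w1	0
-99	w2	0
-99	w3	0

\2-grams:
-99	w1 w2	0
-99	w2 w3	0

\3-grams:
-0.30103	w1 w2 w3
-0.52288	w2 w3 w1

\end\

Such a file is not a normalized language model; it is only usable under the constraints above (query the trigrams directly, and never rely on the backed-off lower-order estimates).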