From deliverable at gmail.com Sat Oct 4 08:46:56 2008
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Sat, 4 Oct 2008 11:46:56 -0400
Subject: make-big-lm out of memory
Message-ID:

I've got a couple corpora where make-big-lm is killed after a couple
days of swapping. I've originally started it after merge-batch-counts
generated the total counts fine, and the kncounts file is produced OK
as a part of the make-big-lm run. I've used a -max-per-file option,
but once the kncounts is done, I understand it no longer applies?
Basically trying to rerun the same command line again reuses the
kncounts and the thing gets killed again. Which strategies do we have
to overcome this?

Cheers,
Alexy

From stolcke at speech.sri.com Sat Oct 4 09:40:33 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 04 Oct 2008 09:40:33 -0700
Subject: make-big-lm out of memory
In-Reply-To:
References:
Message-ID: <48E79C81.4070204@speech.sri.com>

Alexy Khrabrov wrote:
> I've got a couple corpora where make-big-lm is killed after a couple
> days of swapping. I've originally started it after merge-batch-counts
> generated the total counts fine, and the kncounts file is produced OK
> as a part of the make-big-lm run. I've used a -max-per-file option,
> but once the kncounts is done, I understand it no longer applies?
> Basically trying to rerun the same command line again reuses the
> kncounts and the thing gets killed again. Which strategies do we have
> to overcome this?

Increase the count cutoffs or get a machine with more memory.
This is a FAQ. check
http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html .

Andreas

>
> Cheers,
> Alexy

From gwenole.lecorve at irisa.fr Tue Oct 7 04:55:08 2008
From: gwenole.lecorve at irisa.fr (Gwénolé Lecorvé)
Date: Tue, 07 Oct 2008 13:55:08 +0200
Subject: Beginning and end of sentences tags
Message-ID: <48EB4E1C.90802@irisa.fr>

Hi,

I'm currently trying to rescore the language model scores of lattices
generated using the HTK toolkit and personal tools.
Here is an example of a lattice to be rescored:
> VERSION=1.0
> UTTERANCE=/path/to/one.spf
> acscale=1.00
> vocab=/path/to/dic
> N=290 L=942
> I=0 t=0.00 W=<s>
> I=1 t=0.14 W=le v=1
> I=2 t=0.33 W=chien v=1
> I=3 t=0.83 W=miaule v=1
> I=4 t=1.08 W=</s>
> J=0 S=0 E=1 a=-55.36 l=-2973.43
> J=1 S=1 E=2 a=-72.28 l=-48.43
> J=2 S=2 E=3 a=-72.28 l=-87.30
> J=3 S=3 E=4 a=-91.57 l=-145.72
You can see that the tags for beginning/end of sentence are present.
My problem is that once I run lattice-tool (with -htk-words-on-nodes
and -no-htk-nulls) on such a lattice, the result (HTK format) looks
like this:
> # Header (generated by SRILM)
> VERSION=1.1
> UTTERANCE=/path/to/one.spf
> base=2.71828
> dir=f
> vocab=/path/to/di
> start=0
> end=1
> NODES=6 LINKS=5
> # Nodes
> I=0 W=!NULL t=0
> I=1 W=!NULL t=1.08
> I=2 W=le t=0.14 v=1
> I=3 W=chien t=0.33 v=1
> I=4 W=miaule t=0.83 v=1
> I=5 W=!NULL t=1.08
> # Links
> J=0 S=0 E=2 a=-55.36 l=-2.74741
> J=1 S=2 E=3 a=-72.28 l=-9.61595
> J=2 S=3 E=4 a=-72.28 l=-inf
> J=3 S=4 E=5 a=-91.5701 l=-2.87136
> J=4 S=5 E=1 l=-2.87136
Something strange happens: the "bos" and "eos" tags disappear and
!NULL tags are introduced instead.
Why aren't the "bos" and "eos" tags printed anymore, and why are these
!NULL tags used instead?
Can't I just keep the same lattice structure as the one given in input?

I have been facing this problem for several months and still have not
found a solution. I would be really grateful for your help.

Regards,
Gwénolé Lecorvé.

From stolcke at speech.sri.com Tue Oct 7 21:41:43 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 07 Oct 2008 21:41:43 -0700
Subject: Beginning and end of sentences tags
In-Reply-To: <48EB4E1C.90802@irisa.fr>
References: <48EB4E1C.90802@irisa.fr>
Message-ID: <48EC3A07.5060300@speech.sri.com>

Gwénolé Lecorvé wrote:
> Hi,
>
> I'm currently trying to rescore language scores of lattices generated
> using the HTK toolkit and personal tools.
> Here is an example of lattice to be rescored :
>> VERSION=1.0
>> UTTERANCE=/path/to/one.spf
>> acscale=1.00
>> vocab=/path/to/dic
>> N=290 L=942
>> I=0 t=0.00 W=<s>
>> I=1 t=0.14 W=le v=1
>> I=2 t=0.33 W=chien v=1
>> I=3 t=0.83 W=miaule v=1
>> I=4 t=1.08 W=</s>
>> J=0 S=0 E=1 a=-55.36 l=-2973.43
>> J=1 S=1 E=2 a=-72.28 l=-48.43
>> J=2 S=2 E=3 a=-72.28 l=-87.30
>> J=3 S=3 E=4 a=-91.57 l=-145.72
> You can notice that the tags for beginning/end of sentence are present.
> My problem is that once I launch lattice-tool (with
> -htk-words-on-nodes and -no-htk-nulls) on such a lattice the results
> (HTK format) looks like this :
>> # Header (generated by SRILM)
>> VERSION=1.1
>> UTTERANCE=/path/to/one.spf
>> base=2.71828
>> dir=f
>> vocab=/path/to/di
>> start=0
>> end=1
>> NODES=6 LINKS=5
>> # Nodes
>> I=0 W=!NULL t=0
>> I=1 W=!NULL t=1.08
>> I=2 W=le t=0.14 v=1
>> I=3 W=chien t=0.33 v=1
>> I=4 W=miaule t=0.83 v=1
>> I=5 W=!NULL t=1.08
>> # Links
>> J=0 S=0 E=2 a=-55.36 l=-2.74741
>> J=1 S=2 E=3 a=-72.28 l=-9.61595
>> J=2 S=3 E=4 a=-72.28 l=-inf
>> J=3 S=4 E=5 a=-91.5701 l=-2.87136
>> J=4 S=5 E=1 l=-2.87136
> Something strange happens : the "bos" and "eos" tags disappear and
> !NULL tags are introduced instead.
> Why aren't the "bos" and "eos" printed anymore and why are these !NULL
> tagged considered insteand ?
> Can't I just keep the same lattice structure as the one given in input ?
>
> I'm facing this problem since several months and still did not find
> any solution. I would be really grateful if you help me.

<s> and </s> are replaced by !NULL because they are not necessary,
since the start/end of sentence are implicit in the lattice structure.
For example, when rescoring the lattice with an LM the initial node is
implicitly treated as the <s> context.

However, I can see how you would want to preserve these tags for some
applications.
If you download the beta version of SRILM you will find a new option:
lattice-tool -print-sent-tags will output <s> and </s> in the lattice
format (both HTK and PFSG).

Andreas

>
> Regards,
> Gwénolé Lecorvé.

From gwenole.lecorve at irisa.fr Wed Oct 8 06:29:47 2008
From: gwenole.lecorve at irisa.fr (Gwénolé Lecorvé)
Date: Wed, 08 Oct 2008 15:29:47 +0200
Subject: Beginning and end of sentences tags
In-Reply-To: <48EC3A07.5060300@speech.sri.com>
References: <48EB4E1C.90802@irisa.fr> <48EC3A07.5060300@speech.sri.com>
Message-ID: <48ECB5CB.1000301@irisa.fr>

Thank you for this quick and precise answer.
However, when I run my command (see below), I still do not get back
the same lattice structure.
command:
> lattice-tool -in-lattice /path/to/input.lat -out-lattice
> /path/to/output.lat
> -lm $LM
> -htk-logbase 2.71828
> -write-htk
> -read-htk
> -print-sent-tags
> -htk-logzero '-99'
> -no-htk-nulls
> -htk-words-on-nodes
Then the result is as follows:
> # Header (generated by SRILM)
> VERSION=1.1
> UTTERANCE=/path/to/one.spf
> base=2.71828
> dir=f
> vocab=/path/to/dic
> start=0
> end=1
> NODES=6 LINKS=5
> # Nodes
> I=0 W=<s> t=0
> I=1 W=</s> t=1.08
> I=2 W=le t=0.14 v=1
> I=3 W=chien t=0.33 v=1
> I=4 W=miaule t=0.83 v=1
> I=5 W=</s> t=1.08
> # Links
> J=0 S=0 E=2 a=-55.36 l=-2.74741
> J=1 S=2 E=3 a=-72.28 l=-8.60446
> J=2 S=3 E=4 a=-72.28 l=-inf
> J=3 S=4 E=5 a=-91.5701 l=-2.87136
> J=4 S=5 E=1 l=-2.87136

I notice 2 things:
1/ Even if the !NULL tags are replaced by the sentence start/end tags,
one more "eos" tag is added at the end of the lattice. Isn't it a
problem, since a P(</s>|</s>) would then be considered while computing
the posteriors? When writing words on edges the problem is the same
(whereas the "bos" tag disappears).
2/ Despite the "-htk-logzero -99" option, "-inf" is still returned.
After a few additional experiments, it appears that the "-htk-logzero"
option works when, for example, no LM rescoring is applied or when the
"-no-expansion" option is enabled.

I may be misusing the lattice-tool command, but I do not see how to
preserve the original lattice structure (even though I know that SRILM
converts HTK lattices into its own format and that my goal may be
unreachable :-) ).

Best regards,
Gwénolé Lecorvé.

Andreas Stolcke wrote:
> Gwénolé Lecorvé wrote:
>> Hi,
>>
>> I'm currently trying to rescore language scores of lattices generated
>> using the HTK toolkit and personal tools.
>> Here is an example of lattice to be rescored :
>>> VERSION=1.0
>>> UTTERANCE=/path/to/one.spf
>>> acscale=1.00
>>> vocab=/path/to/dic
>>> N=290 L=942
>>> I=0 t=0.00 W=<s>
>>> I=1 t=0.14 W=le v=1
>>> I=2 t=0.33 W=chien v=1
>>> I=3 t=0.83 W=miaule v=1
>>> I=4 t=1.08 W=</s>
>>> J=0 S=0 E=1 a=-55.36 l=-2973.43
>>> J=1 S=1 E=2 a=-72.28 l=-48.43
>>> J=2 S=2 E=3 a=-72.28 l=-87.30
>>> J=3 S=3 E=4 a=-91.57 l=-145.72
>> You can notice that the tags for beginning/end of sentence are present.
>> My problem is that once I launch lattice-tool (with
>> -htk-words-on-nodes and -no-htk-nulls) on such a lattice the results
>> (HTK format) looks like this :
>>> # Header (generated by SRILM)
>>> VERSION=1.1
>>> UTTERANCE=/path/to/one.spf
>>> base=2.71828
>>> dir=f
>>> vocab=/path/to/di
>>> start=0
>>> end=1
>>> NODES=6 LINKS=5
>>> # Nodes
>>> I=0 W=!NULL t=0
>>> I=1 W=!NULL t=1.08
>>> I=2 W=le t=0.14 v=1
>>> I=3 W=chien t=0.33 v=1
>>> I=4 W=miaule t=0.83 v=1
>>> I=5 W=!NULL t=1.08
>>> # Links
>>> J=0 S=0 E=2 a=-55.36 l=-2.74741
>>> J=1 S=2 E=3 a=-72.28 l=-9.61595
>>> J=2 S=3 E=4 a=-72.28 l=-inf
>>> J=3 S=4 E=5 a=-91.5701 l=-2.87136
>>> J=4 S=5 E=1 l=-2.87136
>> Something strange happens : the "bos" and "eos" tags disappear and
>> !NULL tags are introduced instead.
>> Why aren't the "bos" and "eos" printed anymore and why are these
>> !NULL tagged considered insteand ?
>> Can't I just keep the same lattice structure as the one given in input ?
>>
>> I'm facing this problem since several months and still did not find
>> any solution. I would be really grateful if you help me.
> <s> and </s> are replaced by !NULL because they are not necessary,
> since the start/end of sentence are implicit in the lattice structure.
> For example, when rescoring the lattice with an LM the initial node is
> implicitly treat as the <s> context.
>
> However, I can see how you would want to preserve these tags for some
> applications.
> If you download the beta version of srilm you will find a new option:
> lattice-tool -print-sent-tags will output <s> and </s> in the lattice
> format (both HTK and PFSG).
>
> Andreas
>>
>> Regards,
>> Gwénolé Lecorvé.
>
>

From sai_tang_huang at hotmail.com Wed Oct 8 09:40:11 2008
From: sai_tang_huang at hotmail.com (SAI TANG HUANG)
Date: Wed, 8 Oct 2008 18:40:11 +0200
Subject: Beginning and end of sentences tags
In-Reply-To: <48ECB5CB.1000301@irisa.fr>
References: <48EB4E1C.90802@irisa.fr> <48EC3A07.5060300@speech.sri.com> <48ECB5CB.1000301@irisa.fr>
Message-ID:

Hi Andreas,

could you unsubscribe me from this mailing list please? Thanks a lot
for all your help in the past.

Regards,
Sai

From acp06tk at sheffield.ac.uk Wed Oct 15 04:02:12 2008
From: acp06tk at sheffield.ac.uk (Tim Kempton)
Date: Wed, 15 Oct 2008 12:02:12 +0100
Subject: addsmooth on unigrams
In-Reply-To: <200810150534.m9F5YAl04582@ns2>
References: <200810150534.m9F5YAl04582@ns2>
Message-ID: <000601c92eb5$7d57bcb0$78073610$@ac.uk>

Great toolkit and thanks for the recent update. SRILM has been really
useful for some computational phonology problems I've been working on.

I know it's not advised to use the add-one smoothing method, but at the
moment I'm trying to replicate someone else's results. I have a question
on the unigram case because the results aren't what I was expecting.
These results are from version 1.5.6; as far as I am aware addsmooth has
not changed since then.

Following the equation in the ngram-discount manual page

    p(a_z) = (c(a_z) + D) / (c(a_) + D n(*))

I assume that, in the unigram case, c(a_) simplifies to the total number
of word tokens (Jurafsky and Martin, 2000).
When D=0, this appears to be the case, e.g. for the test data below

    p(</s>) = 1/18
    log(1/18) = -1.255273

which matches the results given below.

When D=1 I was expecting:

    p(</s>) = (1+1) / (18+2)
    log(2/20) = -1

However, the result of -0.9242793 below corresponds to a raw probability
very close to 5/42, and I'm not sure where this comes from. If anyone
could explain how this is calculated, I'd be very grateful.

Thanks,

Tim Kempton
PhD student
University of Sheffield, UK

Here is the test run:

[tim at trill i686]$ echo "a a a a a a a a a a a a a a a a a" | ./ngram-count -order 1 -text -
<s>	1
a	17
</s>	1
[tim at trill i686]$ echo "a a a a a a a a a a a a a a a a a" | ./ngram-count -order 1 -text - -addsmooth 0 -lm -

\data\
ngram 1=3

\1-grams:
-1.255273	</s>
-99	<s>
-0.02482358	a

\end\
[tim at trill i686]$ echo "a a a a a a a a a a a a a a a a a" | ./ngram-count -order 1 -text - -addsmooth 1 -lm -

\data\
ngram 1=3

\1-grams:
-0.9242793	</s>
-99	<s>
-0.05504756	a

\end\

From stolcke at speech.sri.com Tue Oct 28 11:24:45 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 28 Oct 2008 11:24:45 -0700
Subject: ngram-class is too time consuming
In-Reply-To:
References:
Message-ID: <490758ED.7010107@speech.sri.com>

??? wrote:
> I want to use the class based Bigram , like this:
> P (w2 | w1) = lambda * Pw (w2 | w1)+ (1-lambda) * P (w2 | G2) * Pc
> (G2| G1)
> where wi belongs to class Gi, i=1, 2, respectively.
> So I used the "ngram-class" program to generate a set of classes using
> some corpus (282,360 unique words),
> And the output classnum is 2,000.
> but I found the time of this program is too long,maybe for 10 days. my
> computer is Core2, 1.8G.
> Here is my command:
> ngram-class -text -numclasses 2000-classes -incremental
>
> does it has some problem? or it is normal?

It's probably normal. 282k is quite a large vocabulary.
You might want to play with different vocab sizes, especially
excluding words with very low counts (such as singletons), because
their statistics are not reliable and won't be clustered properly.
It might be best to group all those words in a special class ahead of
time.

For comparison, running the small test in $SRILM/test

    make TEST=class-ngram

should take about 0.15 seconds of cpu time on a 2.6GHz Opteron machine.

Andreas

From stolcke at speech.sri.com Tue Oct 28 22:51:04 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 28 Oct 2008 22:51:04 PDT
Subject: ngram-class is too time consuming
In-Reply-To: Your message of Tue, 28 Oct 2008 22:19:11 -0700.
             <00163646b9f02245f0045a5d8106@google.com>
Message-ID: <200810290551.m9T5p4618287@ns2>

In message <00163646b9f02245f0045a5d8106 at google.com> you wrote:
>
> I run the "make TEST" ,it output as:
> *** Running test class-ngram ***
> 0.18user 0.05system 0:00.24elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+616outputs (0major+2336minor)pagefaults 0swaps
> class-ngram: stdout output IDENTICAL.
> class-ngram: stderr output IDENTICAL.
>
> so is it right?

Looks ok, yes.

>
> and yesterday, I used a vocabulary of 100K, the highest counts of 282K
> vocabulary ,
> and used the count file by the "ngram-count ",not the text file, also to 2K
> word classes,
> now it has run about 20 hours, it iterated 7685 times now ,and has another
> 90K times of iteration,
> so the time is too long.

Well, you have to be patient when dealing with large data problems.
Note that each iteration takes less time, so the remaining iterations
will go ever faster.

> is it normal? or normally how can I do it to be a little quilk?

You can design (and implement and publish) a new and improved algorithm
that runs fast enough for your purposes! I highly recommend this
solution.

>
> very thanks!

You are welcome!

Andreas

From kbasye1 at jhu.edu Wed Oct 29 08:09:22 2008
From: kbasye1 at jhu.edu (Ken Basye)
Date: Wed, 29 Oct 2008 11:09:22 -0400
Subject: 1.5.7 test failure in nbest-optimize-bleu
Message-ID: <49087CA2.3030504@jhu.edu>

Hi,
I didn't see anything about this in the mailing-list archive. I just
installed 1.5.7 and ran the tests. I saw one failure, in
nbest-optimize-bleu. I looked at the code briefly; the output is coming
from nbest-optimize.cc:1179, but that's about as far as I got. I built
on Mac OS 10.4.11 using this compiler version:

~/work/srilm--$ c++ --version
i686-apple-darwin8-g++-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5370)

I can try another system if that will help, or let me know if anyone
wants more information.
Thanks,
Ken Basye

Here are the differences in the output (note that diff got a bit
confused in the first block; all three blocks are really showing the
same thing, with the reference lambdas going from 0 to 6 but the output
lambdas going from 1 to 7 - in the first block both sides had the same
line for lambda_4)
Only the indices are off; the values are all identical.

~/work/srilm/test/output--$ diff nbest-optimize-bleu.unknown.stderr ../reference/nbest-optimize-bleu.stderr
87,89c87,90
< lambda_1 0
< lambda_2 1
< lambda_3 0
---
> lambda_0 0
> lambda_1 1
> lambda_2 0
> lambda_3 0.2
91,93c92,93
< lambda_5 0.2
< lambda_6 0.7
< lambda_7 0.2
---
> lambda_5 0.7
> lambda_6 0.2
113,119c113,119
< lambda_1 -0.462891
< lambda_2 1.16211
< lambda_3 0.162109
< lambda_4 0.362109
< lambda_5 0.0703125
< lambda_6 0.362109
< lambda_7 0.319922
---
> lambda_0 -0.462891
> lambda_1 1.16211
> lambda_2 0.162109
> lambda_3 0.362109
> lambda_4 0.0703125
> lambda_5 0.362109
> lambda_6 0.319922
157,163c157,163
< lambda_1 -0.098284
< lambda_2 1.1704
< lambda_3 0.149144
< lambda_4 0.41325
< lambda_5 0.0652597
< lambda_6 0.41325
< lambda_7 0.354837
---
> lambda_0 -0.098284
> lambda_1 1.1704
> lambda_2 0.149144
> lambda_3 0.41325
> lambda_4 0.0652597
> lambda_5 0.41325
> lambda_6 0.354837

From wqfengnlpr at gmail.com Wed Oct 29 18:18:30 2008
From: wqfengnlpr at gmail.com (=?GB2312?B?zfXH77fm?=)
Date: Thu, 30 Oct 2008 09:18:30 +0800
Subject: Chinese words in "replace-words-with-classes"
Message-ID:

Hi,
When I used the Chinese word class file with
"replace-words-with-classes", the words in the word file failed to be
replaced by the class names. This is my class format:

CLASS-0512 0.000286 ??
CLASS-0512 0.004003 ????
CLASS-0512 0.002574 ?????
CLASS-0512 0.000095 ????
....

It has about 285K lines, with one word per line, and about 1K classes.
I'm sure all the words in the word file are in this class file.
The command is:

replace-words-with-classes classes= >

Thank you.
Wang

From stolcke at speech.sri.com Fri Nov 7 10:59:25 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 07 Nov 2008 10:59:25 PST
Subject: 1.5.7 test failure in nbest-optimize-bleu
In-Reply-To: Your message of Wed, 29 Oct 2008 11:09:22 -0400.
             <49087CA2.3030504@jhu.edu>
Message-ID: <200811071859.mA7IxPH04099@ns2>

I cannot replicate this bug since I don't have the kind of system you
use. I find it hard to believe that the code would produce output that
is so different from what it is on all other systems I've seen.
I suggest trying a more recent version of the compiler (gcc 4.3.2 is
the latest!).

--Andreas

In message <49087CA2.3030504 at jhu.edu> you wrote:
> Hi,
> I didn't see anything about this in the mailing-list archive. I just
> installed 1.5.7 and ran the
> tests. I saw one failure, in nbest-optimize-bleu. I looked at the
> code briefly; the output is coming from nbest-optimize.cc:1179, but
> that's about as far as I got. I built on Mac OS 10.4.11 using
> this compiler version:
>
> ~/work/srilm--$ c++ --version
> i686-apple-darwin8-g++-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5370)
>
> I can try another system if that will help, or let me know if anyone
> wants more information.
> Thanks,
> Ken Basye
>
>
>
> Here are the differences in the output (note that diff got a bit
> confused in the first block; all three blocks are really showing the
> same thing, with the reference lambdas going from 0 to 6 but the output
> lambdas going from 1 to 7 - in the first block both sides had the same
> line for lambda_4)
> Only the indices are off; the values are all identical.
>
> ~/work/srilm/test/output--$ diff nbest-optimize-bleu.unknown.stderr
> ../reference/nbest-optimize-bleu.stderr
> 87,89c87,90
> < lambda_1 0
> < lambda_2 1
> < lambda_3 0
> ---
> > lambda_0 0
> > lambda_1 1
> > lambda_2 0
> > lambda_3 0.2
> 91,93c92,93
> < lambda_5 0.2
> < lambda_6 0.7
> < lambda_7 0.2
> ---
> > lambda_5 0.7
> > lambda_6 0.2
> 113,119c113,119
> < lambda_1 -0.462891
> < lambda_2 1.16211
> < lambda_3 0.162109
> < lambda_4 0.362109
> < lambda_5 0.0703125
> < lambda_6 0.362109
> < lambda_7 0.319922
> ---
> > lambda_0 -0.462891
> > lambda_1 1.16211
> > lambda_2 0.162109
> > lambda_3 0.362109
> > lambda_4 0.0703125
> > lambda_5 0.362109
> > lambda_6 0.319922
> 157,163c157,163
> < lambda_1 -0.098284
> < lambda_2 1.1704
> < lambda_3 0.149144
> < lambda_4 0.41325
> < lambda_5 0.0652597
> < lambda_6 0.41325
> < lambda_7 0.354837
> ---
> > lambda_0 -0.098284
> > lambda_1 1.1704
> > lambda_2 0.149144
> > lambda_3 0.41325
> > lambda_4 0.0652597
> > lambda_5 0.41325
> > lambda_6 0.354837

From kbasye1 at jhu.edu Sat Nov 8 07:09:27 2008
From: kbasye1 at jhu.edu (Ken Basye)
Date: Sat, 08 Nov 2008 10:09:27 -0500
Subject: 1.5.7 test failure in nbest-optimize-bleu
In-Reply-To: <200811081505.mA8F5iq26837@ns2>
References: <200811081505.mA8F5iq26837@ns2>
Message-ID: <4915ABA7.2070103@jhu.edu>

Terrific, thanks very much.
Ken

Andreas Stolcke wrote:
> In message <49087CA2.3030504 at jhu.edu> you wrote:
>
>> Hi,
>> I didn't see anything about this in the mailing-list archive. I just
>> installed 1.5.7 and ran the
>> tests. I saw one failure, in nbest-optimize-bleu. I looked at the
>> code briefly; the output is coming from nbest-optimize.cc:1179, but
>> that's about as far as I got. I built on Mac OS 10.4.11 using
>> this compiler version:
>>
>> ~/work/srilm--$ c++ --version
>> i686-apple-darwin8-g++-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5370)
>>
>> I can try another system if that will help, or let me know if anyone
>> wants more information.
>> Thanks,
>> Ken Basye
>>
>>
>
> It turns out this problems seems to affect compilation with gcc 4.2 on all
> the platforms I tried (Linux, Solaris), so I assume it's the same bug as
> you describe. The problem has to do with expression evaluation order
> when the expression has side effects (++ operator).
>
> You can apply the following patch, or download the beta release
> (which also has some other fixes for BLEU optimization).
>
> Andreas
>
> *** lm/src/nbest-optimize.cc 2008/11/07 15:14:15 1.58
> --- lm/src/nbest-optimize.cc 2008/11/08 13:49:49 1.59
> ***************
> *** 1176,1182 ****
>   {
>   if (!fixLambdas[k])
>   cerr << "lambda_" << j - 1
> ! << " " << p[ilo][j++] << endl;
>   }
>   }
>   }
> --- 1176,1183 ----
>   {
>   if (!fixLambdas[k])
>   cerr << "lambda_" << j - 1
> ! << " " << p[ilo][j] << endl;
> ! j ++;
>   }
>   }
>   }
>
>

From stolcke at speech.sri.com Sat Nov 8 07:05:44 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 08 Nov 2008 07:05:44 PST
Subject: 1.5.7 test failure in nbest-optimize-bleu
In-Reply-To: Your message of Wed, 29 Oct 2008 11:09:22 -0400.
             <49087CA2.3030504@jhu.edu>
Message-ID: <200811081505.mA8F5iq26837@ns2>

In message <49087CA2.3030504 at jhu.edu> you wrote:
> Hi,
> I didn't see anything about this in the mailing-list archive. I just
> installed 1.5.7 and ran the
> tests. I saw one failure, in nbest-optimize-bleu. I looked at the
> code briefly; the output is coming from nbest-optimize.cc:1179, but
> that's about as far as I got. I built on Mac OS 10.4.11 using
> this compiler version:
>
> ~/work/srilm--$ c++ --version
> i686-apple-darwin8-g++-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5370)
>
> I can try another system if that will help, or let me know if anyone
> wants more information.
> Thanks,
> Ken Basye
>

It turns out this problem seems to affect compilation with gcc 4.2 on
all the platforms I tried (Linux, Solaris), so I assume it's the same
bug as you describe. The problem has to do with expression evaluation
order when the expression has side effects (++ operator).

You can apply the following patch, or download the beta release
(which also has some other fixes for BLEU optimization).

Andreas

*** lm/src/nbest-optimize.cc 2008/11/07 15:14:15 1.58
--- lm/src/nbest-optimize.cc 2008/11/08 13:49:49 1.59
***************
*** 1176,1182 ****
  {
  if (!fixLambdas[k])
  cerr << "lambda_" << j - 1
! << " " << p[ilo][j++] << endl;
  }
  }
  }
--- 1176,1183 ----
  {
  if (!fixLambdas[k])
  cerr << "lambda_" << j - 1
! << " " << p[ilo][j] << endl;
! j ++;
  }
  }
  }

From stolcke at speech.sri.com Tue Nov 25 11:29:06 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 25 Nov 2008 11:29:06 PST
Subject: line length limit
In-Reply-To: Your message of Mon, 24 Nov 2008 19:31:47 -0600.
             <16429778.81227576703241.JavaMail.alexyk@De-Divinatione.local>
Message-ID: <200811251929.mAPJT6924375@ns2>

In message <16429778.81227576703241.JavaMail.alexyk at De-Divinatione.local> you wrote:
>
> Andreas -- a couple questions... I now use sensor data which has no real
> "sentence" meaning, thus 2.5 million observations are all on the same line.
> And ngram-count complains that the line is too long, not surprisingly.
> Is there a way to break it into several lines but teach ngram-count to
> ignore sentence boundaries? In the worst case I can envision manipulating
> the margins by appending/prepending (n-1) stitching chunks, but managing
> it is a nightmare...

That's what the continuous-ngram-count filter is for.
Please see the training-scripts(1) man page.
For example, you could pass continuous-ngram-count as a filter to
make-batch-counts (optional 3rd argument).

>
> Also, I'm now building a full KN model for about 2 billion Russian words.
> I see that in a week of running it the RAM usage gradually grew to about
> 32 GB, with my 16 GB real RAM. Is there any way to estimate how much
> longer can I justify using the box? :)

Yes, you run it on smaller amounts of data, measure time and memory,
and extrapolate. However, it sounds like you don't have enough memory
and should consult the FAQ items on how to reduce memory requirements
for ngram-count.

Andreas

From deliverable at gmail.com Sun Nov 30 19:41:17 2008
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Sun, 30 Nov 2008 22:41:17 -0500
Subject: interpreting -order and -debug results
Message-ID: <3C299FDB-5463-45AC-B73F-24F3BED8EB55@gmail.com>

Greetings -- I've trained a Kneser-Ney model of a Russian corpus with
-order 5 -kndiscount, and started it as a server with -order 5. Then,
to see that 5-grams are indeed working, I feed it a sentence with (a)
an existing first word present in the corpus, (b) a made-up first word
not present in the Russian language. Then I run both 5-word sentences
in two ways: (1) -order 5 -debug 2, (2) -order 0 -debug 3, both with
-ppl. The results, which puzzle me, are below, followed by a
description of the puzzlement.

~ echo c ???? ?????????? ?? ???????? | ngram -use-server -order 5 -debug 2 -ppl -
server : probserver ready
c ???? ?????????? ?? ????????
	p( c | <s> ) = 3.67342e-06 [ -5.43493 ]
	p( ???? | c ...) = 0.00102315 [ -2.99006 ]
	p( ?????????? | ???? ...) = 0.00151464 [ -2.81969 ]
	p( ?? | ?????????? ...) = 0.0218172 [ -1.6612 ]
	p( ???????? | ?? ...) = 0.000925487 [ -3.03363 ]
	p( </s> | ???????? ...) = 0.00693155 [ -2.15917 ]
1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -18.0987 ppl= 1038.6 ppl1= 4166.16

file -: 1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -18.0987 ppl= 1038.6 ppl1= 4166.16

~ echo ??????????? ???? ?????????? ?? ???????? | ngram -use-server -order 5 -debug 2 -ppl -
server : probserver ready
??????????? ???? ?????????? ?? ????????
	p( ??????????? | <s> ) = 0 [ -inf ]
	p( ???? | ??????????? ...) = 0.00014788 [ -3.83009 ]
	p( ?????????? | ???? ...) = 0.00151464 [ -2.81969 ]
	p( ?? | ?????????? ...) = 0.0218172 [ -1.6612 ]
	p( ???????? | ?? ...) = 0.000925487 [ -3.03363 ]
	p( </s> | ???????? ...) = 0.00693155 [ -2.15917 ]
1 sentences, 5 words, 0 OOVs
1 zeroprobs, logprob= -13.5038 ppl= 502.061 ppl1= 2376.54

file -: 1 sentences, 5 words, 0 OOVs
1 zeroprobs, logprob= -13.5038 ppl= 502.061 ppl1= 2376.54

== Notice that from the 3rd line p(word | context ...) on, the
conditional probs are the same, although we're using a 5-gram model and
in the second batch the first word is non-existing! We also have 0 OOVs
reported there (?).

== Now, let's explore what "unlimited ngrams" means with -order 0, and
set -debug 3 too:

~ echo ? ???? ?????????? ?? ???????? | ngram -use-server -order 0 -debug 3 -ppl -
server : probserver ready
? ???? ?????????? ?? ????????
warning: word probs for this context sum to 0.00119158 != 1 : <s>
	p( ? | <s> ) = 0.000113967 [ -3.94322 ] / 0.00119158
warning: word probs for this context sum to 0.0248594 != 1 : ? <s>
	p( ???? | ? ...) = 0.00614229 [ -2.21167 ] / 0.0248594
warning: word probs for this context sum to 0.0135057 != 1 : ???? ? <s>
	p( ?????????? | ???? ...) = 0.0026996 [ -2.5687 ] / 0.0135057
warning: word probs for this context sum to 0.136629 != 1 : ?????????? ???? ? <s>
	p( ?? | ?????????? ...) = 0.0191721 [ -1.71733 ] / 0.136629
warning: word probs for this context sum to 0.00931138 != 1 : ?? ?????????? ???? ? <s>
	p( ???????? | ?? ...) = 0.000925487 [ -3.03363 ] / 0.00931138
warning: word probs for this context sum to 0.243228 != 1 : ???????? ?? ?????????? ???? ? <s>
	p( </s> | ???????? ...) = 0.00693155 [ -2.15917 ] / 0.243228
1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -15.6337 ppl= 403.293 ppl1= 1338.89

file -: 1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -15.6337 ppl= 403.293 ppl1= 1338.89

-----

~ echo ??????????? ???? ?????????? ?? ???????? | ngram -use-server -order 0 -debug 3 -ppl -
server : probserver ready
??????????? ???? ?????????? ?? ????????
warning: word probs for this context sum to 0.00107762 != 1 : p( ??????????? | ) = 0 [ -inf ] / 0.00107762 warning: word probs for this context sum to 0.0136768 != 1 : ??????????? p( ???? | ??????????? ...) = 0.00014788 [ -3.83009 ] / 0.0136768 warning: word probs for this context sum to 0.0105593 != 1 : ???? ??????????? p( ?????????? | ???? ...) = 0.00151464 [ -2.81969 ] / 0.0105593 warning: word probs for this context sum to 0.0891667 != 1 : ?????????? ???? ??????????? p( ?? | ?????????? ...) = 0.0218172 [ -1.6612 ] / 0.0891667 warning: word probs for this context sum to 0.00501918 != 1 : ?? ?????????? ???? ??????????? p( ???????? | ?? ...) = 0.000925487 [ -3.03363 ] / 0.00501918 warning: word probs for this context sum to 0.00712921 != 1 : ???????? ?? ?????????? ???? ??????????? p( | ???????? ...) = 0.00693155 [ -2.15917 ] / 0.00712921 1 sentences, 5 words, 0 OOVs 1 zeroprobs, logprob= -13.5038 ppl= 502.061 ppl1= 2376.54 file -: 1 sentences, 5 words, 0 OOVs 1 zeroprobs, logprob= -13.5038 ppl= 502.061 ppl1= 2376.54 == Now we get more differences, the "real" example, the first one, differs from the "fake" second one in the first 4 lines, the p(|)'s are the same only for the last two lines, 5 and 6. However, the 4th line of the first "real" case has a *lower* p( ?? | ?????????? ...) = 0.0191721 < p( ?? | ?????????? ...) = 0.0218172 in 4th line of the second *fake* case! Again, we see 0 OOVs reported in both cases, despite "???????????" being a fake word with 0 [-Inf] prob. Although the final perplexities are higher for the fake case, I can't be certain, from these results, that the -order 5 option is being honored, and am not sure what -order 0 does here, as well as why some conditional probability can be higher for a fake word. Also, what exactly is the -debug 3 "word probs for this context", and why would they cause a warning for a rather large real corpus, and how should I interpret it? 
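
As a reading aid for the summary lines above: the ppl and ppl1 figures can be recomputed from logprob if OOV and zeroprob tokens are left out of the word count, and ppl additionally counts one end-of-sentence token per sentence. The Python sketch below is only an illustration that reproduces the printed values, not a copy of the SRILM source; the function name srilm_ppl and its parameter names are made up here. It also suggests why the made-up sentence gets a lower perplexity under -order 5: its zero-probability first word is simply dropped from the average.

```python
def srilm_ppl(logprob, words, sentences, oovs=0, zeroprobs=0):
    """Recompute (ppl, ppl1) from the base-10 logprob of an ngram -ppl summary.

    Assumption: OOV and zeroprob tokens are excluded from the token count,
    and ppl (unlike ppl1) also counts one end-of-sentence token per sentence.
    """
    scored_words = words - oovs - zeroprobs
    ppl = 10 ** (-logprob / (scored_words + sentences))
    ppl1 = 10 ** (-logprob / scored_words)
    return ppl, ppl1


# Values copied from the two -order 5 -debug 2 runs above.
real_ppl, real_ppl1 = srilm_ppl(-18.0987, words=5, sentences=1)
fake_ppl, fake_ppl1 = srilm_ppl(-13.5038, words=5, sentences=1, zeroprobs=1)

print("real: ppl=%.1f ppl1=%.1f" % (real_ppl, real_ppl1))
print("fake: ppl=%.1f ppl1=%.1f" % (fake_ppl, fake_ppl1))
```

The results agree with the reported ppl/ppl1 values up to the rounding of logprob, so a sentence containing a zeroprob word is averaged over fewer tokens and its perplexity is not directly comparable to that of a fully scored sentence.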
For reference, here's the model building command I used:

time make-batch-counts list/list-stok 100000 cat counts/5g -order 5 > /dev/null 2>&1;
time merge-batch-counts counts/5g;
time make-big-lm -name lm-ko-kn5 -lm lm-ko-kn5 -max-per-file 100000000 -kndiscount -order 5 -read counts/5g/*.ngrams.gz

-- and here's how I launch the resulting LM server:

ngram -server-port -lm /data/rupress/lm-ko-kn5 -order 5

Cheers,
Alexy

From stolcke at speech.sri.com Mon Dec 1 21:41:08 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 01 Dec 2008 21:41:08 -0800
Subject: interpreting -order and -debug results
In-Reply-To: <3C299FDB-5463-45AC-B73F-24F3BED8EB55@gmail.com>
References: <3C299FDB-5463-45AC-B73F-24F3BED8EB55@gmail.com>
Message-ID: <4934CA74.2040006@speech.sri.com>

Alexy Khrabrov wrote:
> Greetings -- I've trained a Kneser-Ney model of a Russian corpus with
> -order 5 -kndiscount, and started it as a server with -order 5. Then,
> to see that indeed 5-grams are working, I feed it a sentence with (a)
> an existing first word present in the corpus, (b) a made-up first word
> not present in the Russian language. Then I run both 5-word sentences
> in two ways: (1) -order 5 -debug 2 (2) -order 0 -debug 3, both for
> -ppl. The results, which puzzle me, are below, followed by a
> description of the puzzlement.
>
> ~ echo c ???? ?????????? ?? ???????? | ngram -use-server
> -order 5 -debug 2 -ppl -
> server : probserver ready
> c ???? ?????????? ?? ????????
> p( c | ) = 3.67342e-06 [ -5.43493 ]
> p( ???? | c ...) = 0.00102315 [ -2.99006 ]
> p( ?????????? | ???? ...) = 0.00151464 [ -2.81969 ]
> p( ?? | ?????????? ...) = 0.0218172 [ -1.6612 ]
> p( ???????? | ?? ...) = 0.000925487 [ -3.03363 ]
> p( | ???????? ...) = 0.00693155 [ -2.15917 ]
> 1 sentences, 5 words, 0 OOVs
> 0 zeroprobs, logprob= -18.0987 ppl= 1038.6 ppl1= 4166.16
>
> file -: 1 sentences, 5 words, 0 OOVs
> 0 zeroprobs, logprob= -18.0987 ppl= 1038.6 ppl1= 4166.16
> ~ echo ??????????? ???? ?????????? ??
???????? | ngram -use-server
> -order 5 -debug 2 -ppl -
> server : probserver ready
> ??????????? ???? ?????????? ?? ????????
> p( ??????????? | ) = 0 [ -inf ]
> p( ???? | ??????????? ...) = 0.00014788 [ -3.83009 ]
> p( ?????????? | ???? ...) = 0.00151464 [ -2.81969 ]
> p( ?? | ?????????? ...) = 0.0218172 [ -1.6612 ]
> p( ???????? | ?? ...) = 0.000925487 [ -3.03363 ]
> p( | ???????? ...) = 0.00693155 [ -2.15917 ]
> 1 sentences, 5 words, 0 OOVs
> 1 zeroprobs, logprob= -13.5038 ppl= 502.061 ppl1= 2376.54
>
> file -: 1 sentences, 5 words, 0 OOVs
> 1 zeroprobs, logprob= -13.5038 ppl= 502.061 ppl1= 2376.54
>
> == notice that from the 3rd line p(word | context ...), the
> conditional probs are the same, although we're using a 5-gram model
> and in the second batch the first word is non-existing! We also have
> 0 OOVs reported there (?).

The conditional probs can be the same because the N-gram probability might not use the full context. In this case, it might just back off to using one context word. You can verify this by running ngram -ppl with the LM in a file. -debug 2 will display the length of the ngram used in each position. You can also start the SERVER side with ngram -debug 2 to see this information.

About 0 OOVs: The LM client/server implementation has a few limitations relative to evaluating the LM from a file. One such limitation is that the client cannot tell the difference between an OOV and a word with zero probability. Functionally they are the same (both are excluded from the perplexity computation). You see the OOVs being reported as "zeroprob" tokens, rather than OOVs.

>
> == Now, let's explore what "unlimited ngrams" mean with -order 0, and
> set -debug 3 too:

Note that -order 0 on the client side just means that no context truncation happens in the CLIENT. So the full history of each ngram is passed to the server, but then of course there the effective history is limited by the order of the LM.
So, if your SERVER was started with -order 5 then the -order 0 on the client side should have no effect. > > ~ echo ? ???? ?????????? ?? ???????? | ngram -use-server > -order 0 -debug 3 -ppl - > server : probserver ready > ? ???? ?????????? ?? ???????? > > warning: word probs for this context sum to 0.00119158 != 1 : > p( ? | ) = 0.000113967 [ -3.94322 ] / 0.00119158 > > warning: word probs for this context sum to 0.0248594 != 1 : ? > p( ???? | ? ...) = 0.00614229 [ -2.21167 ] / 0.0248594 > > warning: word probs for this context sum to 0.0135057 != 1 : ???? ? > p( ?????????? | ???? ...) = 0.0026996 [ -2.5687 ] / > 0.0135057 > > warning: word probs for this context sum to 0.136629 != 1 : ?????????? > ???? ? > p( ?? | ?????????? ...) = 0.0191721 [ -1.71733 ] / > 0.136629 > > warning: word probs for this context sum to 0.00931138 != 1 : ?? > ?????????? ???? ? > p( ???????? | ?? ...) = 0.000925487 [ -3.03363 ] / 0.00931138 > > warning: word probs for this context sum to 0.243228 != 1 : ???????? > ?? ?????????? ???? ? > p( | ???????? ...) = 0.00693155 [ -2.15917 ] / > 0.243228 > 1 sentences, 5 words, 0 OOVs > 0 zeroprobs, logprob= -15.6337 ppl= 403.293 ppl1= 1338.89 > > file -: 1 sentences, 5 words, 0 OOVs > 0 zeroprobs, logprob= -15.6337 ppl= 403.293 ppl1= 1338.89 > > ----- > > ~ echo ??????????? ???? ?????????? ?? ???????? | ngram -use-server > -order 0 -debug 3 -ppl - > server : probserver ready > ??????????? ???? ?????????? ?? ???????? > > warning: word probs for this context sum to 0.00107762 != 1 : > p( ??????????? | ) = 0 [ -inf ] / 0.00107762 > > warning: word probs for this context sum to 0.0136768 != 1 : > ??????????? > p( ???? | ??????????? ...) = 0.00014788 [ -3.83009 ] / > 0.0136768 > > warning: word probs for this context sum to 0.0105593 != 1 : ???? > ??????????? > p( ?????????? | ???? ...) = 0.00151464 [ -2.81969 ] / > 0.0105593 > > warning: word probs for this context sum to 0.0891667 != 1 : > ?????????? ???? ??????????? > p( ?? | ?????????? ...) 
= 0.0218172 [ -1.6612 ] / > 0.0891667 > > warning: word probs for this context sum to 0.00501918 != 1 : ?? > ?????????? ???? ??????????? > p( ???????? | ?? ...) = 0.000925487 [ -3.03363 ] / 0.00501918 > > warning: word probs for this context sum to 0.00712921 != 1 : ???????? > ?? ?????????? ???? ??????????? > p( | ???????? ...) = 0.00693155 [ -2.15917 ] / > 0.00712921 > 1 sentences, 5 words, 0 OOVs > 1 zeroprobs, logprob= -13.5038 ppl= 502.061 ppl1= 2376.54 > > file -: 1 sentences, 5 words, 0 OOVs > 1 zeroprobs, logprob= -13.5038 ppl= 502.061 ppl1= 2376.54 > > == Now we get more differences, the "real" example, the first one, > differs from the "fake" second one in the first 4 lines, the p(|)'s > are the same only for the last two lines, 5 and 6. However, the 4th > line of the first "real" case has a *lower* p( ?? | ?????????? ...) > = 0.0191721 < p( ?? | ?????????? ...) = 0.0218172 in 4th > line of the second *fake* case! > > Again, we see 0 OOVs reported in both cases, despite "???????????" > being a fake word with 0 [-Inf] prob. See explanation above. > > Although the final perplexities are higher for the fake case, I can't > be certain, from these results, that the -order 5 option is being > honored, and am not sure what -order 0 does here, as well as why some > conditional probability can be higher for a fake word. Also, what > exactly is the -debug 3 "word probs for this context", and why would > they cause a warning for a rather large real corpus, and how should I > interpret it? 
>
> For reference, here's the model building command I used:
>
> time make-batch-counts list/list-stok 100000 cat counts/5g -order 5 >
> /dev/null 2>&1; time merge-batch-counts counts/5g; time make-big-lm
> -name lm-ko-kn5 -lm lm-ko-kn5 -max-per-file 100000000 -kndiscount
> -order 5 -read counts/5g/*.ngrams.gz
>
> -- and here's how I launch the resulting LM server:
>
> ngram -server-port -lm /data/rupress/lm-ko-kn5 -order 5

I don't understand why -order 0 gives you anything different from -order 5, as explained above. I also cannot reproduce this discrepancy with a model I have. So, I would suggest that you start your server ngram with the -debug 2 option and then pay attention to

- what ngrams get passed to the server
- what the ngram length found in the lm is
- what the returned probability is

The last two pieces of information should be identical with -order 0 or 5 on the client side. If not, please email me the output of a short example and we can investigate further.

Andreas

From adeoras1 at jhu.edu Wed Dec 17 11:59:48 2008
From: adeoras1 at jhu.edu (Anoop Deoras)
Date: Wed, 17 Dec 2008 14:59:48 -0500
Subject: CN and Oracle Path from Lattice
In-Reply-To: <006F0F42-9114-4D6A-A447-B2B60A8CB777@jhu.edu>
References: <006F0F42-9114-4D6A-A447-B2B60A8CB777@jhu.edu>
Message-ID: <01C790E3-004B-4C4F-BB89-8584837C7628@jhu.edu>

On Dec 16, 2008, at 12:15 PM, Anoop Deoras wrote:

> Hello,
>
> Could someone please help me with my following questions:
>
> Q1. Is there a way to get Confusion Networks from Lattices such
> that the confusion bins are restricted
> between node $i$ and $i+1$ for any $i$. I am using IBM's
> recognizer and it has Lidia Mangu's CN generation
> tool incorporated. The recognizer generates CN from lattices
> in the fashion I described above.
> However when I use write-mesh command, I don't see a similar
> representation ? Am I missing something
> here ?
>
> Q2.
If I have access to the true utterance and the lattice generated by
> the recognizer, is there a way to find out
> the Oracle path from this lattice ?
>
>
> Thank you in advance,
> Regards
> Anoop

From stolcke at speech.sri.com Wed Dec 17 12:40:04 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 17 Dec 2008 12:40:04 -0800
Subject: CN and Oracle Path from Lattice
In-Reply-To: <01C790E3-004B-4C4F-BB89-8584837C7628@jhu.edu>
References: <006F0F42-9114-4D6A-A447-B2B60A8CB777@jhu.edu> <01C790E3-004B-4C4F-BB89-8584837C7628@jhu.edu>
Message-ID: <494963A4.2010801@speech.sri.com>

Anoop Deoras wrote:
> On Dec 16, 2008, at 12:15 PM, Anoop Deoras wrote:
>
>> Hello,
>>
>> Could someone please help me with my following questions:
>>
>> Q1. Is there a way to get Confusion Networks from Lattices such that
>> the confusion bins are restricted
>> between node $i$ and $i+1$ for any $i$. I am using IBM's
>> recognizer and it has Lidia Mangu's CN generation
>> tool incorporated. The recognizer generates CN from lattices in
>> the fashion I described above.
>> However when I use write-mesh command, I dont see a similar
>> representation ? Am I missing something
>> here ?

The "mesh" output format has lines of the form

align 1 *DELETE* 0.398704 i'm 0.178572 uh 0.135939 oh 0.0958802 i 0.0772781 we're 0.0662928 aw 0.0473337

That is a list of the words and their posterior probabilities between node 1 and node 2 (i and i+1 in general). So is this not exactly what you want in a confusion network?

>>
>> Q2. If I have access to true utterance and the lattice generated by
>> the recognizer, is there a way to find out
>> the Oracle path from this lattice ?

Not easily. However, if you use the -reference option you will find the alignment of the reference words in the mesh output.
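
Returning to the align line shown for Q1: a minimal Python sketch of reading one such line, assuming only the layout visible in that example (the keyword align, a position number, then alternating word / posterior pairs). The function name parse_align_line is hypothetical and not part of SRILM, and real mesh files contain other line types that this sketch does not handle.

```python
def parse_align_line(line):
    """Split one 'align' line into (position, [(word, posterior), ...]).

    Assumed layout (taken from the example line only):
        align <position> <word> <posterior> <word> <posterior> ...
    """
    fields = line.split()
    if len(fields) < 4 or fields[0] != "align" or len(fields) % 2 != 0:
        raise ValueError("not a well-formed align line: %r" % line)
    position = int(fields[1])
    pairs = fields[2:]
    hypotheses = [(pairs[i], float(pairs[i + 1]))
                  for i in range(0, len(pairs), 2)]
    return position, hypotheses


example = ("align 1 *DELETE* 0.398704 i'm 0.178572 uh 0.135939 "
           "oh 0.0958802 i 0.0772781 we're 0.0662928 aw 0.0473337")

position, hypotheses = parse_align_line(example)
best_word, best_posterior = max(hypotheses, key=lambda wp: wp[1])
total = sum(p for _, p in hypotheses)

print(position, best_word, best_posterior, round(total, 4))
```

In this example the posteriors in the bin sum to about 1, and the highest-scoring entry is *DELETE*, i.e. the most likely hypothesis for that slot is that no word is present.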

Andreas

>>
>>
>> Thank you in advance,
>> Regards
>> Anoop

From hashashin at gmail.com Sat Dec 20 14:25:03 2008
From: hashashin at gmail.com (Alberto Simões)
Date: Sat, 20 Dec 2008 22:25:03 +0000
Subject: Compiling SRILM under Xeon
Message-ID: <2b670b7e0812201425v1e99255fmb88b220cbd6c79e9@mail.gmail.com>

Hello

I am trying to compile SRILM but not having luck.

uname -a:
Linux search1.di.uminho.pt 2.6.9-67.0.15.EL #1 Thu May 8 10:38:13 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

What is happening: lots of complaints like this one:

./testLattice.cc:0: error: CPU you selected does not support x86-64 instruction set

Thank you
Alberto
--
Alberto Simões

From ebicici at ku.edu.tr Sat Dec 20 16:00:18 2008
From: ebicici at ku.edu.tr (Ergun Bicici)
Date: Sun, 21 Dec 2008 02:00:18 +0200
Subject: ngram-count -read performance difference for different tokens
In-Reply-To: <4ded78d60804080418wc49503et77dcb2f85b08dab5@mail.gmail.com>
References: <4ded78d60804080418wc49503et77dcb2f85b08dab5@mail.gmail.com>
Message-ID: <4ded78d60812201600k45933f98l2fc40bd7a7221cd2@mail.gmail.com>

Dear SRILM List Members,

I was experimenting with the "-use-server" option of ngram and it appears to work for "-ppl" calculations from text but I was receiving different numbers when working with count files. With some debugging, I realized that this was due to the server receiving tokens from the client.

I made the following modification:

line 352, LM.cc, version 1.5.7:
//vocab.getIndices(words, wids, order + 1, vocab.unkIndex());
vocab.addWords(words, wids, order + 1);

and I am able to get the same results with or without using a server.

I have not checked whether this will affect the "-cache-served-ngrams" policy or whether this may have other impacts on the results.

Regards,
Ergun

Ergun Bicici
Koc University

From stolcke at speech.sri.com Sat Dec 20 20:11:25 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 20 Dec 2008 20:11:25 PST
Subject: ngram-count -read performance difference for different tokens
In-Reply-To: Your message of Sun, 21 Dec 2008 02:00:18 +0200. <4ded78d60812201600k45933f98l2fc40bd7a7221cd2@mail.gmail.com>
Message-ID: <200812210411.mBL4BPS04025@ns2>

In message <4ded78d60812201600k45933f98l2fc40bd7a7221cd2 at mail.gmail.com> you wrote:
>
> Dear SRILM List Members,
>
> I was experimenting with the "-use-server" option of ngram and it appears to
> work for "-ppl" calculations from text but I was receiving different numbers
> when working with count files. With some debugging, I realized that this was
> due to the server receiving tokens from the client.
>
> I made the following modification:
>
> line 352, LM.cc, version 1.5.7:
> //vocab.getIndices(words, wids, order + 1, vocab.unkIndex());
> vocab.addWords(words, wids, order + 1);
>
> and I am able to get the same results with or without using a server.
>
> I have not checked whether this will affect the "-cache-served-ngrams" policy or
> whether this may have other impacts on the results.

Good catch. Actually, the correct fix is

*** LM.cc 2008/12/17 00:17:26 1.66
--- LM.cc 2008/12/21 04:09:50
***************
*** 631,637 ****
      /*
       * Map words to indices
       */
!     vocab.getIndices(words, wids, order + 1, vocab.unkIndex());
  
      /*
       * Update the counts
--- 631,641 ----
      /*
       * Map words to indices
       */
!     if (addUnkWords()) {
!         vocab.addWords(words, wids, order + 1);
!     } else {
!         vocab.getIndices(words, wids, order + 1, vocab.unkIndex());
!     }
  
      /*
       * Update the counts

(compare the code in LM::sentenceProb()).

Andreas

From hashashin at gmail.com Sun Dec 21 13:28:29 2008
From: hashashin at gmail.com (Alberto Simões)
Date: Sun, 21 Dec 2008 21:28:29 +0000
Subject: Where is ngram-count?
Message-ID: <2b670b7e0812211328u38c90a1ekbc5be10d18c79aa4@mail.gmail.com>

Hello.

While I find ngram-count and other manual pages, I can't find the binaries.
After issuing 'make World', where should they be?
(a find can't find it)

Thank you in advance.
Alberto

--
Alberto Simões

From hashashin at gmail.com Sun Dec 21 13:34:55 2008
From: hashashin at gmail.com (Alberto Simões)
Date: Sun, 21 Dec 2008 21:34:55 +0000
Subject: Where is ngram-count?
In-Reply-To: <2b670b7e0812211328u38c90a1ekbc5be10d18c79aa4@mail.gmail.com>
References: <2b670b7e0812211328u38c90a1ekbc5be10d18c79aa4@mail.gmail.com>
Message-ID: <2b670b7e0812211334r17435316ocb3d97f1d1c699b0@mail.gmail.com>

Hi

On Sun, Dec 21, 2008 at 9:28 PM, Alberto Simões wrote:
> Hello.
>
> While I find ngram-count and other manual pages, I can't find the binaries.
> After issuing 'make World', where should they be?
> (a find can't find it)

OK, makefile was silently ignoring errors.
libtcl library was being linked with -ltcl, but on my system it is named libtcl8.5.so

Solved.

Cheers ;)
Alberto

>
> Thank you in advance.
> Alberto
>
> --
> Alberto Simões
>

--
Alberto Simões

From deliverable at gmail.com Sun Dec 21 14:01:29 2008
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Sun, 21 Dec 2008 17:01:29 -0500
Subject: Where is ngram-count?
In-Reply-To: <2b670b7e0812211334r17435316ocb3d97f1d1c699b0@mail.gmail.com>
References: <2b670b7e0812211328u38c90a1ekbc5be10d18c79aa4@mail.gmail.com> <2b670b7e0812211334r17435316ocb3d97f1d1c699b0@mail.gmail.com>
Message-ID: <123661B6-5277-40BD-99C8-D0B807447EEC@gmail.com>

If you don't need tcl, you can define NO_TCL=yes in common/Makefile.machine. -- and you won't have it at all.

Cheers,
Alexy

On Dec 21, 2008, at 4:34 PM, Alberto Simões wrote:

> Hi
>
> On Sun, Dec 21, 2008 at 9:28 PM, Alberto Simões wrote:
>> Hello.
>>
>> While I find ngram-count and other manual pages, I can't find the binaries.
>> After issuing 'make World', where should they be?
>> (a find can't find it)
>
> OK, makefile was silently ignoring errors.
> libtcl library was being linked with -ltcl, but on my system it is named libtcl8.5.so
>
> Solved.
>
> Cheers ;)
> Alberto
>
>>
>> Thank you in advance.
>> Alberto
>>
>> --
>> Alberto Simões
>
> --
> Alberto Simões