From marco.turchi at gmail.com Sat Jan 6 12:24:15 2007
From: marco.turchi at gmail.com (marco turchi)
Date: Sat, 6 Jan 2007 20:24:15 +0000
Subject: compilation problems
Message-ID: <79a042480701061224k1309d5c0ie43a49e376ece31f@mail.gmail.com>

Dear all,
I'm a new user trying to compile and install SRILM, but I have run into some problems.
I set the SRILM home variable inside the Makefile and ran make World, but I get this set of errors:

/enm/local/bin/gcc -mtune=i686 -Wreturn-type -Wimplicit -D_FILE_OFFSET_BITS=64 -I. -I/usr/local/Moses/srilm//include -c -g -O3 -o ../obj/i686/option.o option.c
cc1: invalid option `tune=i686'
make[2]: *** [../obj/i686/option.o] Error 1

I get the same error for other files: qsort.c matherr.c FDiscount.cc Lattice.cc ngram.cc fngram-count.cc lattice-tool.cc

I tried changing the mtune variable without success, so I removed the flag. With this crude workaround I can compile almost all the files, but I get other errors:

/enm/local/bin/g++ -Wreturn-type -Wimplicit -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. -I/usr/local/Moses/srilm//include -c -g -O3 -o ../obj/i686/DFNgram.o DFNgram.cc
Trellis.h:203: sorry, not implemented: use of `enumeral_type' in template type unification

make[2]: *** [../obj/i686/DFNgram.o] Error 1
make[2]: Leaving directory `/usr/local/Moses/srilm/lm/src'
make[2]: Entering directory `/usr/local/Moses/srilm/flm/src'

make[2]: *** No rule to make target `/usr/local/Moses/srilm//lib/i686/liboolm.a', needed by `../bin/i686/lattice-tool'. Stop.

Can you help me?
Also, can you tell me where I can find all the other messages of the mailing list?
Thanks a lot
Marco Turchi

From sanyaade at hotmail.com Sat Jan 6 19:55:04 2007
From: sanyaade at hotmail.com (sanyaade)
Date: Sun, 7 Jan 2007 03:55:04 -0000
Subject: compilation problems
References: <79a042480701061224k1309d5c0ie43a49e376ece31f@mail.gmail.com>
Message-ID:

What platform are you on: Windows, Linux, Unix, etc.?

If you are on Windows, it has to be in your root directory -> c:\srilm (Cygwin installed).

On Linux, put it in your home (/home/srilm) or in the root (/srilm), then:
1.) set the SRILM home variable inside the Makefile and
2.) run make World

Hope this helps!

God blesses!!!

Best regards,
Sanyaade

From marco.turchi at gmail.com Sun Jan 7 06:06:22 2007
From: marco.turchi at gmail.com (marco turchi)
Date: Sun, 7 Jan 2007 14:06:22 +0000
Subject: compilation problems
In-Reply-To:
References: <79a042480701061224k1309d5c0ie43a49e376ece31f@mail.gmail.com>
Message-ID: <79a042480701070606p6f65f00ar4de95b163b3d985d@mail.gmail.com>

Dear Sanyaade,
I'm working under Linux.
I moved srilm into my home directory, but I get the same errors. :-(

Thanks
Marco

From marco.turchi at gmail.com Sun Jan 7 11:44:54 2007
From: marco.turchi at gmail.com (marco turchi)
Date: Sun, 7 Jan 2007 19:44:54 +0000
Subject: compilation problems
In-Reply-To:
References: <79a042480701061224k1309d5c0ie43a49e376ece31f@mail.gmail.com> <79a042480701070606p6f65f00ar4de95b163b3d985d@mail.gmail.com>
Message-ID: <79a042480701071144s367f7ac6q17716624ba3ce78b@mail.gmail.com>

Hi Russel,
you are right, I have gcc 3.2.3. That is not good news. :-)
Where can I find all the other messages of this mailing list?

Thanks
Marco

On 1/7/07, Russell Sheptak wrote:
> Check which version of GCC you're using. I suspect it is version 3.x
> (a simple gcc --version should give you the info you need). I think
> you need gcc 4 to successfully compile it on linux.
>
> rus

From stolcke at speech.sri.com Sun Jan 7 11:51:40 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 07 Jan 2007 11:51:40 PST
Subject: compilation problems
In-Reply-To: Your message of Sun, 07 Jan 2007 19:44:54 +0000. <79a042480701071144s367f7ac6q17716624ba3ce78b@mail.gmail.com>
Message-ID: <200701071951.LAA01733@tonga>

In message <79a042480701071144s367f7ac6q17716624ba3ce78b at mail.gmail.com> you wrote:
> Hi Russel,
> you are right, I have gcc 3.2.3. That is not good news. :-)

What you saw is definitely a problem I have seen with old versions of gcc.
Try 3.4.3 or newer.

> Where can I find all the other messages of this mailing list?

Send a message with the body

    help

to majordomo at speech.sri.com to get instructions on how to retrieve archives of old messages (as well as other documentation).
Andreas

From john at johnfry.org Sun Jan 7 19:58:06 2007
From: john at johnfry.org (John Fry)
Date: Sun, 07 Jan 2007 19:58:06 -0800
Subject: compilation problems
In-Reply-To: <200701071951.LAA01733@tonga> (Andreas Stolcke's message of "Sun, 07 Jan 2007 11:51:40 PST")
References: <200701071951.LAA01733@tonga>
Message-ID: <87wt3y6ugh.fsf@lld.sjsu.edu>

Andreas Stolcke writes:

> Send a message with the body
>
>     help
>
> to majordomo at speech.sri.com to get instructions on how to retrieve
> archives of old messages (as well as other documentation).

Hi Andreas,

Before I start complaining, let me say that SRILM is a fantastic, world-class system, and we're all *extremely* grateful to you for opening it up to us and continuing to support it.

That said, I must point out that using majordomo, a perl script from 1992, to retrieve old messages is completely unworkable. If you don't believe me, try it yourself.

Maybe one of these days you can persuade a summer intern to archive the srilm-user mailing list on the web, where it will be searchable?

Best,
John

From stolcke at speech.sri.com Mon Jan 8 11:06:12 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 08 Jan 2007 11:06:12 -0800
Subject: compilation problems
In-Reply-To: <87wt3y6ugh.fsf@lld.sjsu.edu>
References: <200701071951.LAA01733@tonga> <87wt3y6ugh.fsf@lld.sjsu.edu>
Message-ID: <45A29624.8000209@speech.sri.com>

John Fry wrote:
> Hi Andreas,
>
> Before I start complaining, let me say that SRILM is a fantastic,
> world-class system, and we're all *extremely* grateful to you for
> opening it up to us and continuing to support it.

Thanks, that's nice to hear.

> That said, I must point out that using majordomo, a perl script from
> 1992, to retrieve old messages is completely unworkable. If you don't
> believe me, try it yourself.
>
> Maybe one of these days you can persuade a summer intern to archive
> the srilm-user mailing list on the web, where it will be searchable?

Believe me, converting from majordomo to mailman has been on our to-do list for a while now.
Any day now ...

Andreas

From yozhik at computer.org Tue Jan 16 15:38:55 2007
From: yozhik at computer.org (Tom Murray)
Date: Tue, 16 Jan 2007 15:38:55 -0800
Subject: Bug in lattice-tool?
In-Reply-To: <39abe3570701161526s290a0374w97e7d6326516cb62@mail.gmail.com>
References: <39abe3570701161526s290a0374w97e7d6326516cb62@mail.gmail.com>
Message-ID: <39abe3570701161538j470e9312j7a9e9a5965fa8fa1@mail.gmail.com>

Hi,

I was seeing weird behavior in lattice-tool when mixing an external LM into a lattice for n-best decoding. Tracking things down, I found that if I zeroed out the external LM scores as they were added into the lattice during expansion, the resulting hyp scores were always zero; that is, the scores from the lattice were discarded. I observed this for both HTK and PFSG lattices.

Attached is a patch (to version 1.5.1) which I believe fixes the problem. What I found is that, as old transitions were replaced during expansion (Lattice::expandAddTransition() in LatticeExpand.cc), the old weights were discarded. This caused the problem because the initial transitions loaded from the lattice files were replaced during expansion.

Cheers,
tm

P.S. I also made some changes to functionality; let me know if anyone is interested in them: (1) allowing scaling of the external LM as it's used to reweight the lattice, and (2) outputting (weighted) acoustic and LM scores to the n-best list as they were actually evaluated during decoding. Currently only the original scores from the lattice are output for HTK lattices, and zeros are output for PFSG lattices, because they don't fill the internal HTK structures used for score output.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: LatticeExpand.patch
Type: application/octet-stream
Size: 630 bytes
Desc: not available

From yozhik at computer.org Thu Jan 18 09:55:01 2007
From: yozhik at computer.org (Tom Murray)
Date: Thu, 18 Jan 2007 09:55:01 -0800
Subject: Fwd: Bug in lattice-tool?
In-Reply-To: <200701180657.WAA24896@tonga>
References: <39abe3570701171423p4bb5d962qf6dbed50cca8aeda@mail.gmail.com> <200701180657.WAA24896@tonga>
Message-ID: <39abe3570701180955g5b08279aj4b2c2eb6132259b1@mail.gmail.com>

Thanks, Andreas. I'm forwarding this to the list because I think it may be quite useful to a number of people.

---------- Forwarded message ----------
From: Andreas Stolcke
Date: Jan 17, 2007 10:57 PM
Subject: Re: Bug in lattice-tool?
To: Tom Murray

Tom,

what you are trying to do can be done with lattice-tool as it is, but it requires two passes. That's how we rescore lattices ourselves.

step 1: expand lattice with new LM, write new lattices
step 2: read rescored lattices, choosing scaling factors and decoding 1-best or n-best.

You are trying to combine these steps into one, and it fails because the LM rescoring function overrides the combined scores. This behavior is by design and some other functions depend on it, but it needs to be better documented.

BTW, I don't think your patch will necessarily do the right thing. It simply adds the new LM score to the old combined score, instead of replacing the old LM score in the combination of scores. There are ways to fix this, but it would require more extensive code changes. I would recommend the 2-step approach. It also has the advantage that you can rerun step 2 (n-best decoding) multiple times to try different scaling factors.

One more thing: since your LM does not contain multiwords you need to split the multiwords prior to LM expansion. Simply add the -split-multiwords option in step 1.
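
The difference between adding the new LM score to the old combined score and replacing the old LM score inside the combination can be illustrated with a few lines of Python. This is only a sketch and not SRILM code: the function names, the scale factors, and the score values are invented for illustration, and it assumes that a transition weight is simply acscale * acoustic score + lmscale * LM score, which simplifies what lattice-tool actually computes for HTK lattices.

```python
def combined(ac, lm, acscale=1.0, lmscale=8.0):
    """Weighted combination of an acoustic and an LM log score (assumed form)."""
    return acscale * ac + lmscale * lm


def patched(old_combined, new_lm):
    """Add the new LM score on top of the old combined score.

    The old LM contribution stays in the total.
    """
    return old_combined + new_lm


def rescored(ac, new_lm, acscale=1.0, lmscale=8.0):
    """Replace the old LM score inside the combination with the new one."""
    return combined(ac, new_lm, acscale, lmscale)


# Hypothetical log scores for a single lattice transition.
ac, old_lm, new_lm = -120.0, -3.0, -2.5

old = combined(ac, old_lm)
print("old combined:", old)
print("patched     :", patched(old, new_lm))
print("rescored    :", rescored(ac, new_lm))
```

In this toy setting the two results differ because the first still carries the old LM term and applies no scale to the new one, which is one way to read the objection above.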
Andreas

In message <39abe3570701171423p4bb5d962qf6dbed50cca8aeda at mail.gmail.com> you wrote:
> Hi, Andreas--
>
> What we want to do with lattice-tool is this: generate an n-best list
> from a lattice using an external LM, where the path scores are a
> weighted sum of the AM and LM scores in the lattice and the scores of
> the external LM.
>
> Attached is a tarred directory with an HTK lattice, an LM, and a test
> script test-lattice.sh. Also included is the output of v1.5.1
> lattice-tool, compared with my patched version which adds the
> transition log weights as I described.
>
> The script runs lattice-tool three times, first with default
> -htk-lmscale and -htk-acscale, and then with the lmscale and the
> acscale zeroed out. You can see that the n-best list is the same for
> all three for the v1.5.1 output. For mine it differs.
>
> To give a little more detail of where I think the bug is, according to
> my understanding of what's going on:
>
> When you load the HTK file, you create a node for each HTK edge, and
> then connect this new node from the start node and to the end node.
> The weight of the connection from the start to the new node is the
> weighted sum (according to lmscale, acscale, etc.) of the various
> scores from the HTK edge.
>
> Now, during expansion, old nodes and transitions are replaced by new
> ones, with the old nodes deleted. I printed out all the node indices,
> and the initial nodes corresponding to the HTK edges are deleted
> during this stage. I became convinced of this when I added a line to
> zero out the probs from the external LM, and all the hyp scores during
> n-best output had score = 0.
>
> Please let me know if I'm misunderstanding something. Thanks for your help,
>
> tm

From stolcke at speech.sri.com Tue Feb 6 17:43:50 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 06 Feb 2007 17:43:50 PST
Subject: Google language model
In-Reply-To: Your message of Tue, 06 Feb 2007 15:01:54 -0500. <200702062003.l16K33Jk028807@linus.mitre.org>
Message-ID: <200702070143.l171hp108262@huge>

In message <200702062003.l16K33Jk028807 at linus.mitre.org> you wrote:
> Hi Andreas,
>
> I have been using SRILM for some time now and am interested in using it
> in conjunction with the Google language model.
>
> From looking at the documentation and code, I can see that it reads the
> format, but do not see strategies to keep portions of the model in
> memory and others on disk, for example. Obviously one would need to do
> something like this to hold the entire model. However, I've also used
> and tweaked enough of the code to know you're a serious hacker, and that
> I might have missed something.
>
> One thought I had was to point ngram-count to the Google LM, then use a
> word list to filter only the n-grams that I need SRILM to estimate
> probabilities for. Beyond that, I'm stumped.
>
> So, can you offer any feedback? What are some strategies you recommend
> for using the Google LM?

The Google LM (with nontrivial data size) is really meant to be used in conjunction with the -limit-vocab option, which restricts loading of parameters to a subset of the vocabulary (i.e., the subset used in your test or tuning data). An example of this appears in $SRILM/test/tests/ngram-count-lm-limit-vocab/run-test.

BTW, there is no "Google LM" per se in SRILM. You use the "CountLM" class, and designate the counts to be read in Google format. See the -count-lm option as described in the ngram(1) man page.

Hope this clarifies things.
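
The principle behind restricting a large count collection to the vocabulary of the test data can be sketched in Python. This is an illustration only, not SRILM's implementation: the helper names and the toy counts are made up, and it assumes that an n-gram is kept only if every one of its words is in the vocabulary; SRILM's exact rules (for example for sentence-boundary tags) may differ.

```python
def vocab_from_text(lines):
    """Collect the distinct whitespace-separated tokens of the given lines."""
    return {word for line in lines for word in line.split()}


def limit_counts(counts, vocab):
    """Keep only the n-grams whose words all belong to vocab (assumed rule)."""
    return {
        ngram: count
        for ngram, count in counts.items()
        if all(word in vocab for word in ngram)
    }


# Toy n-gram counts standing in for a very large count collection.
counts = {
    ("the",): 100,
    ("cat",): 7,
    ("zebra",): 2,
    ("the", "cat"): 5,
    ("the", "zebra"): 1,
}

# The vocabulary is taken from the (hypothetical) test data only.
vocab = vocab_from_text(["the cat sat", "the cat"])
small = limit_counts(counts, vocab)
print(small)
```

Only the entries that can ever be queried for the test data survive, which is what keeps the memory footprint manageable.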
Andreas

From marco.turchi at gmail.com Wed Feb 7 04:21:56 2007
From: marco.turchi at gmail.com (marco turchi)
Date: Wed, 7 Feb 2007 12:21:56 +0000
Subject: compilation problems
In-Reply-To: <45A29624.8000209@speech.sri.com>
References: <200701071951.LAA01733@tonga> <87wt3y6ugh.fsf@lld.sjsu.edu> <45A29624.8000209@speech.sri.com>
Message-ID: <79a042480702070421t2c9ec21cu9617b1c34029eb32@mail.gmail.com>

Dear Andreas,
I wrote to this mailing list a month ago about some compilation problems. You suggested that I install a newer gcc version. I did, and I was able to compile srilm.
Today I realized that some executable files are empty:

ngram
ngram-count
ngram-merge ...

I do have the object files for them.

I compiled srilm again and found these errors:

/usr/bin/g++4 -mtune=pentium3 -Wreturn-type -Wimplicit -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. -I../../include -c -g -O3 -o ../obj/i686/tclmain.o tclmain.cc
tclmain.cc:8:17: error: tcl.h: No such file or directory
make[2]: *** [../obj/i686/tclmain.o] Error 1

../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a -ltcl -lm 2>&1 | c++filt
g++4: ../../lib/i686/libmisc.a: No such file or directory
/usr/local/Moses/srilm/sbin/decipher-install 0555 ../bin/i686/ngram ../../bin/i686
ERROR: File to be installed (../bin/i686/ngram) does not exist.
ERROR: File to be installed (../bin/i686/ngram) is not a plain file.
Usage: decipher-install ...
mode: file permission mode, in octal
file1 ... fileN: files to be installed
directory: where the files should be installed

files = ../bin/i686/ngram
directory = ../../bin/i686
mode = 0555

make[2]: [../../bin/i686/ngram] Error 1 (ignored)
touch ../../bin/i686/ngram

and so on for the other files...

Please, can you help me?

thanks
Marco

From patryale at iro.umontreal.ca Wed Feb 7 05:32:18 2007
From: patryale at iro.umontreal.ca (Alexandre Patry)
Date: Wed, 07 Feb 2007 08:32:18 -0500
Subject: compilation problems
In-Reply-To: <79a042480702070421t2c9ec21cu9617b1c34029eb32@mail.gmail.com>
References: <200701071951.LAA01733@tonga> <87wt3y6ugh.fsf@lld.sjsu.edu> <45A29624.8000209@speech.sri.com> <79a042480702070421t2c9ec21cu9617b1c34029eb32@mail.gmail.com>
Message-ID: <1170855139.6266.4.camel@localhost.localdomain>

Hi,

the compiler does not seem to find the Tcl header file (tclmain.cc:8:17: error: tcl.h: No such file or directory). Did you set the TCL_INCLUDE and TCL_LIBRARY variables in the common/Makefile.machine.ARCH file?

Mine look like this (in common/Makefile.machine.i686):

8<----------------------------------
# Tcl support (standard in Linux)
TCL_INCLUDE = -I/usr/include/tcl8.4
TCL_LIBRARY = -L/usr/lib/tcl8.4 -ltcl
8<----------------------------------

Hope this helps,

Alexandre

From stolcke at speech.sri.com Mon Feb 12 09:41:23 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 12 Feb 2007 09:41:23 -0800
Subject: Perplexity
In-Reply-To: <20070212100115.81215.qmail@web36804.mail.mud.yahoo.com>
References: <20070212100115.81215.qmail@web36804.mail.mud.yahoo.com>
Message-ID: <45D0A6C3.3020800@speech.sri.com>

Martha Yifiru wrote:
> Hi,
>
> I want to compare morph-based language model with
> word-based one. To do this I have to do some
> manipulation on the calculation of perplexity for
> morph-based language model so as to have fair
> comparison. I was thinking that the source code for
> perplexity calculation is in ngram.cc but it does not
> seem that the actual perplexity calculation is in
> ngram.cc.
>
> Can anyone help me?

The source code for perplexity computation is in lm/src/TextStats.cc. However, there is no need to modify the code.
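
The renormalization involved is plain arithmetic and can be checked outside SRILM. The Python sketch below is not part of the toolkit; it assumes the usual convention that the reported logprob is a base-10 logarithm, that ppl is normalized by words - OOVs - zeroprobs + sentences, and that ppl1 is normalized by words - OOVs - zeroprobs. The figures are the ones from the sample output quoted in this message, and 25000 is the hypothetical word count used there.

```python
def perplexity(logprob, num_tokens):
    """Perplexity from a total base-10 log probability over num_tokens tokens."""
    return 10 ** (-logprob / num_tokens)


# Values from the sample perplexity output in this message.
logprob = -86334.6
sentences, words, oovs, zeroprobs = 5290, 38238, 681, 0

# Assumed normalizers: ppl counts end-of-sentence tokens, ppl1 does not.
ppl = perplexity(logprob, words - oovs - zeroprobs + sentences)
ppl1 = perplexity(logprob, words - oovs - zeroprobs)

# Same log probability, renormalized by a hypothetical count of 25000 real
# words (including sentence boundaries) instead of the morph count.
word_ppl = perplexity(logprob, 25000)

print(round(ppl, 3), round(ppl1, 3), round(word_ppl, 2))
```

Under these assumptions the first two values reproduce the reported ppl and ppl1, which is a useful sanity check before trusting the renormalized figure.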
When you have different token counts (words versus morphs) the perplexities are no longer comparable, but the log probabilities are. You can get the log probability from the perplexity output, e.g.:

file ../ngram-count-gt/eval97.text: 5290 sentences, 38238 words, 681 OOVs
0 zeroprobs, logprob= -86334.6 ppl= 103.502 ppl1= 198.958
                      ^^^^^^^^

Assume the "words" in this example are actually morphs, and the actual number of words (including sentence boundaries) is less, say, 25000. Then the word-perplexity is

    10^(-(-86334.6 / 25000)) = 2840.43

--Andreas

From Antoine.Ghaoui at jinny.ie Thu Feb 15 00:09:39 2007
From: Antoine.Ghaoui at jinny.ie (Antoine Ghaoui)
Date: Thu, 15 Feb 2007 10:09:39 +0200
Subject: Language Model output problem using FLM
Message-ID:

Hello,

I'm trying to use fngram-count to generate a language model based on morphology. As a first step, I'm generating a trigram model to become familiar with the tool.

The factor file is:

## word trigram
1
W : 2 W(-1) W(-2) ntextfile_99.flm.cnt ntextfile_99.flm.lm 3
W1W2 W2 kndiscount gtmin 1 interpolate
W1 W1 kndiscount gtmin 1 interpolate
0 0 kndiscount gtmin 1

The command line used is:

fngram-count -factor-file flm_spc.1 -text ntextfile_99.flm -lm ntextfile_99.flm.lm -vocab ntextfile.vocab.flm

The generated lm file looks a little strange. Part of it is shown below:

\data\
ngram 0x0=18119
ngram 0x1=2855740
ngram 0x2=0
ngram 0x3=6490198

\0x0-grams:
-2.313375 -99
.
.
\0x1-grams:
-0.9892201 W-LTN -1.629908
.
.
\\0x2-grams:
\0x3-grams:
-0.9725394 W-LTN -1.654503
.
.
\end\

Can you please help with this? Is it normal to have ngram 0x2=0? How can I get the old format?

Thanks for your help
Antoine

From amittai at mit.edu Thu Feb 15 07:49:46 2007
From: amittai at mit.edu (amittai e axelrod)
Date: Thu, 15 Feb 2007 15:49:46 +0000
Subject: Language Model output problem using FLM
In-Reply-To:
References:
Message-ID: <5734eadd0702150749r25ed8de5s6353bffc06845e2a@mail.gmail.com>

On 2/15/07, Antoine Ghaoui wrote:
> ## word trigram
> 1
> W : 2 W(-1) W(-2) ntextfile_99.flm.cnt ntextfile_99.flm.lm 3
> W1W2 W2 kndiscount gtmin 1 interpolate
> W1 W1 kndiscount gtmin 1 interpolate
> 0 0 kndiscount gtmin 1

> Can you please help on this? Is it normal to have ngram 0x2=0?

Yes (for a regular trigram LM in FLM format). The short answer is that this indicates that you have no histories that consist simply of W2.

> How can I get the old format?

You can't. This is the standard FLM file format, but it's really equivalent to the LM format; it's just labelled a bit differently.

Because a FLM allows you to select arbitrary combinations of factors to use as the ngram history, the header of the FLM file will contain a list of how many of each possible combination of factors you're using for your history. However, as your FLM specification narrows down which factor combinations are valid histories, some (or many) of the entries in the FLM header will have a count of zero.

For example, a FLM header corresponding to an FLM over a trigram with 3 factors per word might look something like this:

<<<
\data\
ngram 0x0=61628
ngram 0x1=1267167
ngram 0x2=278079
ngram 0x4=1136820
ngram 0x8=2021099
ngram 0x10=0
ngram 0x3=1352676
ngram 0x5=1267167
ngram 0x6=1339994
ngram 0x9=0
ngram 0xA=2824147
ngram 0xC=4578754
ngram 0x11=0
ngram 0x12=0
ngram 0x14=0
ngram 0x18=0
ngram 0x7=1352676
ngram 0xB=0
ngram 0xD=0
ngram 0xE=4702913
ngram 0x13=0
ngram 0x15=4497090
ngram 0x16=4534847
ngram 0x19=0
ngram 0x1A=2824147
ngram 0x1C=4578754
ngram 0xF=0
ngram 0x17=4542579
ngram 0x1B=0
ngram 0x1D=0
ngram 0x1E=425916
ngram 0x1F=325041
>>>

...and this is also normal.
While in a normal trigram LM you'd see "1-gram", "2-gram", etc, a FLM will just number all the nodes in the possible backoff graph and use each node's label in the header rather than write out which particular factor combination it represents. If you want to figure out which particular factor combination each hex label means, I think the counting mechanism is commented in the FLM code. In the case of a trigram model, though, there's only one combination of factors that's not used as a history and thus has zero entries (namely that of W2 alone), and therefore that's the one labelled 0x2 :) ~amittai From liangy at mail.rockefeller.edu Tue Feb 27 14:35:34 2007 From: liangy at mail.rockefeller.edu (Yupu Liang) Date: Tue, 27 Feb 2007 17:35:34 -0500 Subject: help on installing srilm on redhat Message-ID: Hi, I am new to the toolkit and want to install it on redhat. the command I ran is "make MACHINE_TYPE=i686 World" And I got the following error g++: ../../lib/i686/libmisc.a: No such file or directory I tried to read the make file to find out where the libmisc.a got generated but didn't get any luck. Could somebody help out? Thanks a lot, Yupu From hanisaf at gmail.com Wed Mar 7 16:17:47 2007 From: hanisaf at gmail.com (Hani Safadi) Date: Wed, 7 Mar 2007 19:17:47 -0500 Subject: cahce based models Message-ID: <990817d50703071617i46a7ed3t92813a9287edc0ea@mail.gmail.com> Hi, I would like to get more information on the cache-based models implemented in SRILM. and how to use them. The paper briefly mentions them, and there is no information in the man pages. Thanks -- Looking forward to hearing from you. 
Best wishes, Hani Safadi From j.ganitkevitch at googlemail.com Wed Mar 7 17:14:21 2007 From: j.ganitkevitch at googlemail.com (Juri Ganitkevitch) Date: Thu, 8 Mar 2007 02:14:21 +0100 Subject: cahce based models In-Reply-To: <990817d50703071617i46a7ed3t92813a9287edc0ea@mail.gmail.com> References: <990817d50703071617i46a7ed3t92813a9287edc0ea@mail.gmail.com> Message-ID: <3BE78265-2376-4D96-8AB4-547D82E15E92@gmail.com> Hi Hani, if I'm correctly interpreting your question, the LM subclass CacheLM provides a simple cache component implementation. Word probability is boosted if the very same word occurred in a window of the last N words (more occurrences yield higher probability). You can get ngram to interpolate whatever model you're using with a cache component by using -cache. The source code of this one is very straightforward if you're interested in the details. If you're looking for the original papers, Kuhn and De Mori published on this in 1990 (to my knowledge at least). Hope this helps. Cheers from Aachen, Juri On 8. Mar, 2007, at 01:17, Hani Safadi wrote: > Hi, > I would like to get more information on the cache-based models > implemented in SRILM. and how to use them. > The paper briefly mentions them, and there is no information in the > man pages. > Thanks > -- > Looking forward to hearing from you. > Best wishes, > Hani Safadi From stolcke at speech.sri.com Wed Mar 7 17:27:24 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Wed, 07 Mar 2007 17:27:24 PST Subject: cahce based models In-Reply-To: Your message of Thu, 08 Mar 2007 02:14:21 +0100. <3BE78265-2376-4D96-8AB4-547D82E15E92@gmail.com> Message-ID: <200703080127.l281ROA13913@huge> In message <3BE78265-2376-4D96-8AB4-547D82E15E92 at gmail.com>you wrote: > Hi Hani, > > if I'm correctly interpreting your question, the LM subclass CacheLM > provides a simple cache component implementation.
> > Word probability is boosted if the very same word occured in a window > of the last N words (more occurencies yield higher probability). You > get ngram to interpolate whatever model you're using with a cache > component using -cache. The source code of this one is very > straightforward if you're interested in the details. > > If you're looking for the original papers, Kuhn and De Mori published > on this in 1990 (as to my knowledge at least). > > Hope this helps. > > Cheers from Aachen, > > Juri Thanks for this dead-on response! At risk of stating the obvious, the code for CacheLM is in $SRILM/lm/src/CacheLM.cc, and is quite short and easy to follow. Best, Andreas > > On 8. Mar, 2007, at 01:17, Hani Safadi wrote: > > > Hi, > > I would like to get more information on the cache-based models > > implemented in SRILM. and how to use them. > > The paper briefly mentions them, and there is no information in the > > man pages. > > Thanks > > -- > > Looking forward to hearing from you. > > Best wishes, > > Hani Safadi > From hanisaf at gmail.com Wed Mar 7 20:28:32 2007 From: hanisaf at gmail.com (Hani Safadi) Date: Wed, 7 Mar 2007 23:28:32 -0500 Subject: cahce based models In-Reply-To: <200703080127.l281ROA13913@huge> References: <3BE78265-2376-4D96-8AB4-547D82E15E92@gmail.com> <200703080127.l281ROA13913@huge> Message-ID: <990817d50703072028o5a68049j5352af582ce007e9@mail.gmail.com> Hi there, Thank you for your answers, I would like to compare several language models, including the cache model defined in Kuhn and De Mori paper. I was using the CMU SLM toolkit, and moved recently to SRILM because of the richness of the implemented algorithm. The only obstacle I found is the sparse documentation of the project. I can infer from your answers that to use cache model, I can either: 1- Use the subclass CacheLM using a programming language. 2- use the option -cache with the ngram command. 
I still prefer to master the existing commands before using any API, so now, suppose I want to use ngram -cache 10 and I would like to define to word classes, The pdf paper says that "Word classes may be defined manually". I would like to know how to do that, and how to pass the classes file to ngram. Finally, I have a comment to the maintainers of this wonderful project. Why don't you provide a tutorial to use SRILM. This can help many new comers, given that the documentation is not complete. Thanks Looking forward to hearing from you regards Hani On 3/7/07, Andreas Stolcke wrote: > > In message <3BE78265-2376-4D96-8AB4-547D82E15E92 at gmail.com>you wrote: > > Hi Hani, > > > > if I'm correctly interpreting your question, the LM subclass CacheLM > > provides a simple cache component implementation. > > > > Word probability is boosted if the very same word occured in a window > > of the last N words (more occurencies yield higher probability). You > > get ngram to interpolate whatever model you're using with a cache > > component using -cache. The source code of this one is very > > straightforward if you're interested in the details. > > > > If you're looking for the original papers, Kuhn and De Mori published > > on this in 1990 (as to my knowledge at least). > > > > Hope this helps. > > > > Cheers from Aachen, > > > > Juri > > Thanks for this dead-on response! > > At risk of stating the obvious, the code for CacheLM is in > $SRILM/lm/src/CacheLM.cc, and is quite short and easy to follow. > > Best, > > Andreas > > > > > On 8. Mar, 2007, at 01:17, Hani Safadi wrote: > > > > > Hi, > > > I would like to get more information on the cache-based models > > > implemented in SRILM. and how to use them. > > > The paper briefly mentions them, and there is no information in the > > > man pages. > > > Thanks > > > -- > > > Looking forward to hearing from you. > > > Best wishes, > > > Hani Safadi > > > > -- Looking forward to hearing from you. 
Best wishes, Hani Safadi From j.ganitkevitch at googlemail.com Thu Mar 8 01:11:32 2007 From: j.ganitkevitch at googlemail.com (Juri Ganitkevitch) Date: Thu, 8 Mar 2007 10:11:32 +0100 Subject: cahce based models In-Reply-To: <990817d50703072028o5a68049j5352af582ce007e9@mail.gmail.com> References: <3BE78265-2376-4D96-8AB4-547D82E15E92@gmail.com> <200703080127.l281ROA13913@huge> <990817d50703072028o5a68049j5352af582ce007e9@mail.gmail.com> Message-ID: <4F7EF298-AAC4-44E7-AFB3-D84C7E81549F@gmail.com> Hi Hani, > I can infer from your answers that to use cache model, I can either: > 1- Use the subclass CacheLM using a programming language. > 2- use the option -cache with the ngram command. Actually, -cache uses the implementation given in the CacheLM class. If you want to extend functionality I figure your best bet would be to extend either the CacheLM or the LM class (I don't think any language other than C/C++ would be good here, as you'll get horrible performance from invoking wrappers for every word). You would then need to plug your class into ngram (possibly ngram-count as well if you have stuff to count/train). This is actually quite simple; you can best observe the necessary steps by searching for cache in ngram.cc. You'll find essentially two parts, one where command line parameters are defined and mapped to variables and a second where the model is instantiated and mixed into the current model in use. > I still prefer to master the existing commands before using any API, > so now, suppose I want to use ngram -cache 10 To my knowledge (this would vary with texts and languages of course) a value of 100 is a good starting point. > and I would like to define to word classes, > The pdf paper says that "Word classes may be defined manually". I > would like to know how to do that, and how to pass the classes file to > ngram. Given the current code, I figure you'll need to implement your own cache model, as this one does not incorporate any kind of word class support.
Either you map words to classes (and operate on those) in your model, or you have a LM wrapper (a bit like the classes that provide for combining LMs) that feeds the cache model with classes rather than words. Sadly I don't know if there is such an approach implemented in SRILM. Documentation is a bit sparse, true. As long as you don't want to code around in SRILM the manpages and -help options provide you with a bit of an overview. For coding I have found it helpful to follow the course of main() in either ngram or ngram-count to figure out how it works. Code's clean and the naming gives you a good insight about what's going on. Take care, Juri From joel.pinto at idiap.ch Thu Mar 8 06:33:52 2007 From: joel.pinto at idiap.ch (Joel Pinto) Date: Thu, 08 Mar 2007 15:33:52 +0100 Subject: ngram manipulation Message-ID: <45F01ED0.2030305@idiap.ch> Hello SRILM users, I have a question on the use of srilm toolkit for LM manipulation. The language model in the arpa format gives conditional probabilities, e.g. p(wd3|wd1, wd2). Can I compute the joint probability p(wd1, wd2, wd3) using any utility? I have a heavy LM with (ngram 1=50002, ngram 2=29077135, ngram 3=40083381). Any help would be greatly appreciated. Thanks, joel. arpa format: p(wd3|wd1,wd2) = if(trigram exists) p_3(wd1,wd2,wd3) else if(bigram wd1,wd2 exists) bo_wt_2(wd1,wd2)*p(wd3|wd2) else p(wd3|wd2) p(wd2|wd1)= if(bigram exists) p_2(wd1,wd2) else bo_wt_1(wd1)*p_1(wd2) From stolcke at speech.sri.com Thu Mar 8 08:21:42 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Thu, 08 Mar 2007 08:21:42 PST Subject: ngram manipulation In-Reply-To: Your message of Thu, 08 Mar 2007 15:33:52 +0100. <45F01ED0.2030305@idiap.ch> Message-ID: <200703081621.l28GLh101953@huge> There is a hack to do it. Remove from your LM any ngrams involving the <s> or </s> token (without changing the other probabilities and backoff weights). Then feed your ngrams to "ngram -debug 1 -ppl".
The "sentence" log probabilities will now correspond to joint ngram probabilities, since the initial word will back off to a unigram probability, and the final </s> will count as an OOV and not contribute to the total log probability. It would be easy to add an option somewhere to make this more convenient, without the need to hack the LM itself. --Andreas In message <45F01ED0.2030305 at idiap.ch> you wrote: > Hello SRILM users, > > I have a question on the use of srilm toolkit for LM manipulation. > > The language model in the arpa format gives conditional probabilities > e.g p(wd3|wd1, wd2) > Can I compute the joint probability p(wd1, wd2, wd3) using any utility. > > I have a heavy LM with (ngram 1=50002, ngram 2=29077135, ngram 3=40083381). > > > Any help would be greatly appreciated. > Thanks, > joel. > > > arpa format: > p(wd3|wd1,wd2) = if(trigram exists) p_3(wd1,wd2,wd3) > else if(bigram w1,w2 exists) bo_wt_2(w1,w2)*p(wd3|wd2) > else p(wd3|w2) > > p(wd2|wd1)= if(bigram exists) p_2(wd1,wd2) > else bo_wt_1(wd1)*p_1(wd2) > From jkurlandski at hotmail.com Sat Mar 10 11:36:44 2007 From: jkurlandski at hotmail.com (Kurlandski Jerry) Date: Sat, 10 Mar 2007 14:36:44 -0500 Subject: problems running tests Message-ID: Hello, I'm a newcomer to SRI LM and am having problems running the tests. Between a third and half the tests do not match the reference output. One example is the first test, adapt-marginals.
Here is the stderr output: ../ngram-count-gt/swbd.3bo.gz: line 8: ngram line has 1 fields (3 expected) format error in lm file ../ngram-count-gt/eval97.text: line 5293: 5290 sentences, 38238 words, 0 OOVs 0 zeroprobs, logprob= 0 ppl= 1 ppl1= 1 using WittenBell for 1-grams warning: distributing 0.0720362 left-over probability mass over all 3379 words writing 3380 1-grams ../ngram-count-gt/swbd.3bo.gz: line 8: ngram line has 1 fields (3 expected) format error in lm file The vocab-aliases test has very similar error output: reading 33110 1-grams ../ngram-count-gt/swbd.3bo.gz: line 8: ngram line has 1 fields (3 expected) format error in lm file And ngram-prune's output is: swbd.3bo.gz: line 7: ngram line has 1 fields (3 expected) format error in lm file pruned.gz: No such file or directory I am running SRI LM version 1.5.1 with the latest version of Cygwin on a Windows 2000 platform. Any help would be appreciated. Thanks. Further details: I wondered if the issue might have to do with gunzip. So I typed the following at the command line, and got the following output: $ gunzip -f swbd.3bo.gz gunzip: swbd.3bo.gz: invalid compressed data--format violated I tried unzipping with WinZip and got the following message: Invalid compressed data--unable to inflate. Still, Winzip did give me an apparently unzipped version of the file, so I ran just the adapt-marginals test against the unzipped file. However, I got the same output as described above. From stolcke at speech.sri.com Sat Mar 10 12:27:00 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Sat, 10 Mar 2007 12:27:00 PST Subject: problems running tests In-Reply-To: Your message of Sat, 10 Mar 2007 14:36:44 -0500.
Message-ID: <200703102027.l2AKR0g16858@huge> In message you wrote: > Hello, > > I'm a newcomer to SRI LM and am having problems running the tests. Between a > third and half the tests do not match the reference output. One example is > the first test, adapt-marginals. Here is the stderr output: > > ../ngram-count-gt/swbd.3bo.gz: line 8: ngram line has 1 fields (3 expected) > format error in lm file > ../ngram-count-gt/eval97.text: line 5293: 5290 sentences, 38238 words, 0 > OOVs > 0 zeroprobs, logprob= 0 ppl= 1 ppl1= 1 > using WittenBell for 1-grams > warning: distributing 0.0720362 left-over probability mass over all 3379 > words > writing 3380 1-grams > ../ngram-count-gt/swbd.3bo.gz: line 8: ngram line has 1 fields (3 expected) > format error in lm file > > > The vocab-aliases test has very similar error output: > > reading 33110 1-grams > ../ngram-count-gt/swbd.3bo.gz: line 8: ngram line has 1 fields (3 expected) > format error in lm file This indicates that either 1) there is some problem with your cygwin installation, or 2) the files were somehow corrupted in unpacking. If you have access to a unix or linux system you could unpack the tar.gz file there and make sure the swbd.3bo.gz file can be uncompressed. I suspect it's something having to do with the way Windows distinguishes "text" from "binary" files. Andreas PS. If you built SRILM for the "win32" platform compressed files won't be supported, and you should run the go.unzip script in the test directory before attempting to run the tests. However, this assumes you have a working gunzip in your cygwin installation. From bplank at science.uva.nl Mon Mar 12 10:05:49 2007 From: bplank at science.uva.nl (B. Plank) Date: Mon, 12 Mar 2007 18:05:49 +0100 (CET) Subject: tolower option Message-ID: <4224.146.50.144.82.1173719149.squirrel@webmail.science.uva.nl> Dear SRILM mailing list, I am wondering..
when I try to train a language model with ngram-count and the -tolower option, I'm getting the following error: assertion "i < maxWordLength" failed: file "Vocab.cc", line 97 The input corpus (-text) is an utf8 file. Might this cause the problem? I am grateful for any suggestion. Barbara From Antoine.Ghaoui at jinny.ie Tue Mar 13 08:52:49 2007 From: Antoine.Ghaoui at jinny.ie (Antoine Ghaoui) Date: Tue, 13 Mar 2007 17:52:49 +0200 Subject: Error in discount estimator Message-ID: Hello, When using fngram-count to generate a Language Model, i'm getting the following error: warning: one of required modified KneserNey count-of-count is zero error in discount estimator Can someone help? knowing that the factor file is ## root trigram 1 R : 2 R(-1) R(-2) ntextfile_99.flm.cnt ntextfile_99.flm.lm 3 R1R2 R2 kndiscount gtmin 1 interpolate R1 R1 kndiscount gtmin 1 interpolate 0 0 kndiscount gtmin 1 Thanks Antoine From stolcke at speech.sri.com Wed Mar 14 10:32:08 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Wed, 14 Mar 2007 10:32:08 -0700 Subject: tolower option In-Reply-To: <4224.146.50.144.82.1173719149.squirrel@webmail.science.uva.nl> References: <4224.146.50.144.82.1173719149.squirrel@webmail.science.uva.nl> Message-ID: <45F83198.9090704@speech.sri.com> B. Plank wrote: > Dear SRILM mailing list, > > I am wondering.. when I try to train a language model with ngram-count and > the -tolower option, > I'm getting the following error: > > assertion "i < maxWordLength" failed: file "Vocab.cc", line 97 > > The input corpus (-text) is an utf8 file. Might this cause the problem? > > I am grateful for any suggestion. > > -tolower is simply implemented by the C library tolower() function, which is controlled by the OS's locale settings. I am not sure if tolower() works correctly for UTF8, and if it does you probably have to set LC_CTYPE to something appropriate. In other words, this is all beyond the scope of what the SRILM code itself handles.
I would write a little test program that calls tolower() on some test data to make sure it does what you want. Andreas From stolcke at speech.sri.com Thu Mar 15 11:26:10 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Thu, 15 Mar 2007 11:26:10 -0700 Subject: Error in discount estimator In-Reply-To: References: Message-ID: <45F98FC2.8090207@speech.sri.com> Antoine Ghaoui wrote: > Hello, > > When using fngram-count to generate a Language Model, i'm getting the > following error: > > warning: one of required modified KneserNey count-of-count is zero > error in discount estimator > > > Can someone help? > > knowing that the factor file is This is a problem with the frequency distribution of factors in your data. You probably have no singleton ngrams for some factor-ngram. Try using a discounting method like wbdiscount instead of kndiscount. Andreas > > > ## root trigram > 1 > R : 2 R(-1) R(-2) ntextfile_99.flm.cnt ntextfile_99.flm.lm 3 > R1R2 R2 kndiscount gtmin 1 interpolate > R1 R1 kndiscount gtmin 1 interpolate > 0 0 kndiscount gtmin 1 > > Thanks > > Antoine > From stolcke at speech.sri.com Tue Mar 20 21:27:00 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Tue, 20 Mar 2007 20:27:00 -0800 Subject: SRILM beginning and end tokens? In-Reply-To: Your message of Tue, 20 Mar 2007 19:33:26 -0400. <20070320233327.E8AD478B51@epoch.cs> Message-ID: <200703210427.l2L4R0r24813@speech.sri.com> In message <20070320233327.E8AD478B51 at epoch.cs>you wrote: > Dear Andreas, > > I am very grateful to benefit from your work by using this toolkit. It's > great! > > I noticed it adds <s> and </s> tokens if they aren't there. However, I'm > modelling with trigrams, and it seems to add only one begin/end pair per > sentence. Is there an option I missed, or do I need to insert them myself? For </s>, there is never a reason to add more than one such token: the last ngram probability that goes into the sentence probability is p(</s> | ...).
For <s>, you also need no more than one token, since the backoff will establish that p(w1 | <s> ... <s>) = p(w1 | <s>). I know that some other implementations add additional higher-order ngrams by filling in multiple copies of <s>, but I believe that is not well motivated. It could also lead to unnatural count-of-count statistics for KN and GT smoothing. Andreas > > Thank you! > -Amber > > > \ L. Amber Wilcox-O'Hearn * http://www.cs.toronto.edu/~amber/ / > -\ Graduate student * Computational Linguistics Research Group /- > --\ Department of Computer Science * University of Toronto /-- From stolcke at speech.sri.com Mon Mar 26 10:38:12 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Mon, 26 Mar 2007 10:38:12 -0700 Subject: Question about using SRI with Large Data In-Reply-To: <845615.46861.qm@web38109.mail.mud.yahoo.com> References: <845615.46861.qm@web38109.mail.mud.yahoo.com> Message-ID: <46080504.50500@speech.sri.com> Ibrahim Zaghloul wrote: > Dear Eng. Andreas > > I am trying to use SRI LM with a counts file that is 5 GB, but I failed > with all the ways. I got this counts by using the vocab option to limit > the counts. I generated 8 sorted files as my data was divided to 8 > parts and then used ngram-merge to merge them. The result was the above > file 5 GB. > I tried to use the ordinal command: > ngram-count -read ngram-file -lm output-lm-file > but the result was a long error ending with Assertion 'body !=0' failed > I tried to use this command > make-big-lm -read ngrams-file -lm lm-file > but also the above error was the result. > Also I tried to use the -gtNmin option, but also recieved the above > error. Please check $SRILM/doc/FAQ for a list of measures to try. If none of them work then you just have too much data and too little memory, and need to get a larger machine. Note that you should ALWAYS succeed by raising the minimum counts sufficiently. The exact values will depend on your data and the amount of memory you have.
> > When I tried to use make-google-ngrams, the result was the error: > "/sri/bin/make-google-ngrams gzip=0 cna.ngrams > sort: invalid option -- 2" > make-google-ngrams is not the right tool for this problem. Andreas
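[Archive note] The advice above about raising the minimum counts can be illustrated with a minimal sketch. It assumes a counts file with one ngram per line, words separated by whitespace and the count as the last field; the function name filter_counts and the thresholds are hypothetical. This is not how SRILM applies its -gtNmin options internally, only an illustration of how per-order cutoffs shrink the set of ngrams that has to fit in memory.

```python
def filter_counts(lines, min_counts):
    """Yield count lines whose count meets the minimum for their ngram order.

    min_counts maps ngram order (1, 2, 3, ...) to a hypothetical cutoff;
    orders without an entry keep every ngram (cutoff 1).
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        words, count = fields[:-1], int(fields[-1])
        if count >= min_counts.get(len(words), 1):
            yield line.rstrip("\n")


# Example: keep all unigrams, drop bigrams and trigrams seen only once.
sample = ["a\t5", "a b\t1", "a b c\t2", "a b c\t1"]
kept = list(filter_counts(sample, {2: 2, 3: 2}))
print(kept)
```

With larger cutoffs for the higher orders, fewer ngrams survive, which is the effect the reply relies on when memory is the limiting factor.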