From amontalvo at cenatav.co.cu  Mon Apr  8 08:28:52 2013
From: amontalvo at cenatav.co.cu (Ana Montalvo Bereau)
Date: Mon, 08 Apr 2013 11:28:52 -0400
Subject: [SRILM User List] error compiling SRILM
Message-ID: <5162E234.5030607@cenatav.co.cu>

Hi all, I'm a beginner with SRILM; I'm trying to compile it, but I get an
error. I use 32-bit Ubuntu, so I only had to set the SRILM variable in the
Makefile and run "make World". Why do I get this error?

make[2]: Entering directory `/home/ana/pincha/srilm/misc/src'
gcc -m32 -mtune=pentium3 -Wall -Wno-unused-variable -Wno-uninitialized -D_FILE_OFFSET_BITS=64 -I. -I../../include -c -g -O3 -o ../obj/i686/option.o option.c
gcc -m32 -mtune=pentium3 -Wall -Wno-unused-variable -Wno-uninitialized -D_FILE_OFFSET_BITS=64 -I. -I../../include -c -g -O3 -o ../obj/i686/zio.o zio.c
gcc -m32 -mtune=pentium3 -Wall -Wno-unused-variable -Wno-uninitialized -D_FILE_OFFSET_BITS=64 -I. -I../../include -c -g -O3 -o ../obj/i686/fcheck.o fcheck.c
gcc -m32 -mtune=pentium3 -Wall -Wno-unused-variable -Wno-uninitialized -D_FILE_OFFSET_BITS=64 -I. -I../../include -c -g -O3 -o ../obj/i686/fake-rand48.o fake-rand48.c
g++ -m32 -mtune=pentium3 -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. -I../../include -c -g -O3 -o ../obj/i686/Debug.o Debug.cc
g++ -m32 -mtune=pentium3 -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. -I../../include -c -g -O3 -o ../obj/i686/File.o File.cc
g++ -m32 -mtune=pentium3 -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. -I../../include -c -g -O3 -o ../obj/i686/MStringTokUtil.o MStringTokUtil.cc
g++ -m32 -mtune=pentium3 -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. -I../../include -c -g -O3 -o ../obj/i686/tls.o tls.cc
g++ -m32 -mtune=pentium3 -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. -I../../include -c -g -O3 -o ../obj/i686/tserror.o tserror.cc
gcc -m32 -mtune=pentium3 -Wall -Wno-unused-variable -Wno-uninitialized -D_FILE_OFFSET_BITS=64 -I. -I../../include -c -g -O3 -o ../obj/i686/version.o version.c
g++ -m32 -mtune=pentium3 -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. -I../../include -c -g -O3 -o ../obj/i686/tclmain.o tclmain.cc
tclmain.cc:8:17: fatal error: tcl.h: No such file or directory
compilation terminated.
make[2]: *** [../obj/i686/tclmain.o] Error 1
make[2]: Leaving directory `/home/ana/pincha/srilm/misc/src'
make[1]: *** [release-libraries] Error 1
make[1]: Leaving directory `/home/ana/pincha/srilm'
make: *** [World] Error 2

best regards
ana
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From anuragrs at gmail.com  Mon Apr  8 09:22:50 2013
From: anuragrs at gmail.com (Anurag Singh)
Date: Mon, 8 Apr 2013 09:22:50 -0700
Subject: [SRILM User List] error compiling SRILM
In-Reply-To: <5162E234.5030607@cenatav.co.cu>
References: <5162E234.5030607@cenatav.co.cu>
Message-ID:

This might be similar to
http://www.speech.sri.com/pipermail/srilm-user/2007q1/000415.html

On Mon, Apr 8, 2013 at 8:28 AM, Ana Montalvo Bereau wrote:
> Hi all, I'm a beginner with SRILM; I'm trying to compile it, but I get an
> error.
> [...]

From amontalvo at cenatav.co.cu  Mon Apr  8 11:30:24 2013
From: amontalvo at cenatav.co.cu (Ana Montalvo Bereau)
Date: Mon, 08 Apr 2013 14:30:24 -0400
Subject: [SRILM User List] error compiling SRILM
In-Reply-To:
References: <5162E234.5030607@cenatav.co.cu>
Message-ID: <51630CC0.6060906@cenatav.co.cu>

I've added the NO_TCL = X variable to the Makefile.machine.i686 file and
set the TCL_INCLUDE and TCL_LIBRARY variables to empty values. It worked!

Thanks,
ana

On 04/08/2013 12:22 PM, Anurag Singh wrote:
> This might be similar to
> http://www.speech.sri.com/pipermail/srilm-user/2007q1/000415.html
> [...]

From arefeh_kazemi_65 at yahoo.com  Tue Apr 23 07:58:33 2013
From: arefeh_kazemi_65 at yahoo.com (arefeh kazemi)
Date: Tue, 23 Apr 2013 07:58:33 -0700 (PDT)
Subject: [SRILM User List] problem with building SRILM on Ubuntu 10.04 (32 bit)
Message-ID: <1366729113.57585.YahooMailNeo@web162406.mail.bf1.yahoo.com>

Hi everyone,
I'm trying to install SRILM on Ubuntu 10.04, but I get the following
error. Do you have any suggestions for fixing this problem?
BTW: I installed the tcl library and set its path in the makefile.

>> sudo make all
for subdir in misc dstruct lm flm lattice utils; do \
  (cd $subdir/src; make SRILM=/opt/tools/srilm MACHINE_TYPE=i686 OPTION= MAKE_PIC= all) || exit 1; \
done
make[1]: Entering directory `/opt/tools/srilm/misc/src'
make[1]: Nothing to be done for `all'.
make[1]: Leaving directory `/opt/tools/srilm/misc/src'
make[1]: Entering directory `/opt/tools/srilm/dstruct/src'
g++ -m32 -mtune=pentium3 -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I/usr/include/tcl8.4 -I. -I../../include -c -g -O3 -o ../obj/i686/testMap2.o testMap2.cc
testMap2.cc: In function 'int Delete(void*, Tcl_Interp*, int, char**)':
testMap2.cc:114: error: cannot convert 'Boolean' to 'char**' in assignment
make[1]: *** [../obj/i686/testMap2.o] Error 1
make[1]: Leaving directory `/opt/tools/srilm/dstruct/src'
make: *** [all] Error 1
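The NO_TCL workaround Ana describes earlier in this thread amounts to appending three variable settings to the machine-specific makefile. A sketch, assuming the usual SRILM layout where that file lives at $SRILM/common/Makefile.machine.i686 (adjust the path and the MACHINE_TYPE suffix to your tree; the mkdir -p is only there to make the sketch self-contained):

```shell
# Disable Tcl support in SRILM's machine makefile (variable names are the
# ones from this thread; the file location is an assumption -- check your
# own SRILM tree).
SRILM=${SRILM:-$HOME/srilm}
mkdir -p "$SRILM/common"
cat >> "$SRILM/common/Makefile.machine.i686" <<'EOF'
NO_TCL = X
TCL_INCLUDE =
TCL_LIBRARY =
EOF
```

After that, rerun "make World"; the build should no longer require tcl.h.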
From stolcke at icsi.berkeley.edu  Tue Apr 23 09:14:31 2013
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 23 Apr 2013 09:14:31 -0700
Subject: [SRILM User List] problem with building SRILM on Ubuntu 10.04 (32 bit)
In-Reply-To: <1366729113.57585.YahooMailNeo@web162406.mail.bf1.yahoo.com>
References: <1366729113.57585.YahooMailNeo@web162406.mail.bf1.yahoo.com>
Message-ID: <5176B367.7010404@icsi.berkeley.edu>

On 4/23/2013 7:58 AM, arefeh kazemi wrote:
> Hi everyone,
> I'm trying to install SRILM on Ubuntu 10.04, but I get the following
> error. Do you have any suggestions for fixing this problem?
> BTW: I installed the tcl library and set its path in the makefile.
> [...]

1) What happens if you do "make release-libraries release-programs"?

2) Try the latest "beta" release. There might be a problem building the
test programs that got fixed in the current code.

Andreas

From sgetachew92 at yahoo.com  Tue Apr 23 15:25:33 2013
From: sgetachew92 at yahoo.com (Solomon Getachew)
Date: Tue, 23 Apr 2013 15:25:33 -0700 (PDT)
Subject: [SRILM User List] problem with building SRILM on Ubuntu 10.04 (32 bit)
In-Reply-To: <5176B367.7010404@icsi.berkeley.edu>
References: <1366729113.57585.YahooMailNeo@web162406.mail.bf1.yahoo.com> <5176B367.7010404@icsi.berkeley.edu>
Message-ID: <1366755933.17602.YahooMailNeo@web126206.mail.ne1.yahoo.com>

Hi everyone,
I would like to get multiple pronunciation variation working in
knowledge-based Python code.

________________________________
From: Andreas Stolcke
To: arefeh kazemi
Cc: "srilm-user at speech.sri.com"
Sent: Tuesday, April 23, 2013 9:14 AM
Subject: Re: [SRILM User List] problem with building SRILM on Ubuntu 10.04 (32 bit)
[...]

From otheremailid at aol.com  Tue Apr 23 19:20:11 2013
From: otheremailid at aol.com (E)
Date: Tue, 23 Apr 2013 22:20:11 -0400 (EDT)
Subject: [SRILM User List] Computing nbest-error rate from HTK MLF files
Message-ID: <8D00EA9618D72DD-71C-611D@webmail-d166.sysops.aol.com>

Hello all,

I have n-best results from HTK for lots of wav files in a single MLF file.
I want to compute the nbest-error rate using the SRILM toolkit. I would
like to know:

1. Is there any direct way to convert HTK n-best MLF format into SRILM
n-best format
(http://www.speech.sri.com/projects/srilm/manpages/nbest-format.5.html),
or should I write a script for that?

2. What does the n-best error actually mean in SRILM? Suppose I have 10
hypotheses for each wav file. I am familiar with the usual way of
reporting errors using WER. Does nbest correctness mean the number of
words in the reference that occur ANYWHERE in the nbest list? Does nbest
deletions mean the number of words in the reference that occur NOWHERE in
the nbest list? What does nbest substitutions mean?

Thanks,
Ethan

From stolcke at icsi.berkeley.edu  Tue Apr 23 22:05:22 2013
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 23 Apr 2013 22:05:22 -0700
Subject: [SRILM User List] Computing nbest-error rate from HTK MLF files
In-Reply-To: <8D00EA9618D72DD-71C-611D@webmail-d166.sysops.aol.com>
References: <8D00EA9618D72DD-71C-611D@webmail-d166.sysops.aol.com>
Message-ID: <51776812.1060401@icsi.berkeley.edu>

On 4/23/2013 7:20 PM, E wrote:
> 1. Is there any direct way to convert HTK n-best MLF format into SRILM
> n-best format, or should I write a script for that?

Sorry, there is no standard conversion tool I am aware of. If you write
one you should share it with the list.

> 2. What does the n-best error actually mean in SRILM?

The nbest error is the lowest WER achievable by picking the best
hypothesis (the one giving the lowest number of errors) from each nbest
list. That's why it's also called the "oracle" error rate, as if an
oracle magically told you which hypothesis to pick to give the best
result.

The number of deletions, substitutions, etc. in this context is the
number of deleted, substituted, etc. words relative to the reference
found in that oracle hypothesis.

Andreas

From otheremailid at aol.com  Tue Apr 23 22:22:14 2013
From: otheremailid at aol.com (E)
Date: Wed, 24 Apr 2013 01:22:14 -0400 (EDT)
Subject: [SRILM User List] Computing nbest-error rate from HTK MLF files
In-Reply-To: <51776812.1060401@icsi.berkeley.edu>
References: <8D00EA9618D72DD-71C-611D@webmail-d166.sysops.aol.com> <51776812.1060401@icsi.berkeley.edu>
Message-ID: <8D00EC2D02BB615-71C-6A9E@webmail-d166.sysops.aol.com>

Thanks for the response, Andreas. I will share my script once it's ready.

This "oracle" WER seems like a very crude way of computing the nbest
error to me. Suppose a reference word is located in [0, 1] seconds; one
can look at all the alternatives in the nbest list (all words that
significantly overlap with the reference word) and choose the word that
best matches. So basically one would extract the "most accurate" segments
from each nbest hypothesis in order to get a new "oracle" hypothesis.

Do you know if people have done that kind of thing while computing nbest
error?

Thanks,
Ethan
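On the conversion question in this thread (no standard tool exists, per Andreas), here is a minimal awk sketch of an HTK n-best MLF to SRILM NBestList1.0 converter. Everything below is an assumption for illustration: the MLF conventions ("///" between alternatives, "." as utterance terminator, one "start end word score" label per line) are recollections of the HTK conventions, the sample file nbest.mlf is fabricated, and real HTK output will likely need adjusted field positions:

```shell
# Fabricated miniature n-best MLF (two alternatives for one utterance).
cat > nbest.mlf <<'EOF'
#!MLF!#
"*/utt1.rec"
0 10 w1 -100
10 20 w2 -100
///
0 20 w3 -250
.
EOF

# Hypothetical converter: writes one NBestList1.0 file per utterance,
# with the summed per-word label scores as the hypothesis score.
awk '
  /^#!MLF!#/ { next }
  /^"/ {                          # utterance header: open a new output file
      gsub(/["*\/]/, ""); sub(/\.rec$/, ".nbest")
      out = $0; print "NBestList1.0" > out
      score = 0; hyp = ""; next
  }
  /^\/\/\// || /^\.$/ {           # end of one alternative (/// or final .)
      if (hyp != "") printf "(%g)%s\n", score, hyp > out
      score = 0; hyp = ""; next
  }
  NF >= 4 { hyp = hyp " " $3; score += $4 }   # start end word score
' nbest.mlf
```

Note the scores are copied through unrescaled; if your downstream tools care about the score scale (SRILM's Decipher-style lists use bytelog scores), rescale accordingly.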
From otheremailid at aol.com  Tue Apr 23 23:57:56 2013
From: otheremailid at aol.com (E)
Date: Wed, 24 Apr 2013 02:57:56 -0400 (EDT)
Subject: [SRILM User List] Correct format for nbest files
In-Reply-To: <51776812.1060401@icsi.berkeley.edu>
References: <8D00EA9618D72DD-71C-611D@webmail-d166.sysops.aol.com> <51776812.1060401@icsi.berkeley.edu>
Message-ID: <8D00ED02F01112D-71C-6D89@webmail-d166.sysops.aol.com>

Hello,

I am confused about the correct format of nbest files
(http://www.speech.sri.com/projects/srilm/manpages/nbest-format.5.html).
I created a dummy nbest file called "nbesthyp". The reference file is
called "ref".

$ cat nbesthyp
NBestList1.0
nbesthyp w1 w2 w3
nbesthyp w3 w2 w3
nbesthyp w1 w4 w3
$ cat ref
nbesthyp w2 w3 w4

With the above files, I get the error below when computing the nbest error:

$ ./nbest-error nbesthyp ref
no reference for NBestList1.0
bad Decipher score: nbesthyp
nbesthyp: line 2: bad n-best hyp format
error in nbest list

If I change "nbesthyp" as below:

$ cat nbesthyp
NBestList1.0
(100) w1 w2 w3
(10) w3 w2 w3
(2) w1 w4 w3

I get this error:

$ ./nbest-error nbesthyp ref
no reference for NBestList1.0
no reference for 100
no reference for 10
no reference for 2

Please let me know the correct format of nbest files. Thank you very much.

From otheremailid at aol.com  Wed Apr 24 00:35:21 2013
From: otheremailid at aol.com (E)
Date: Wed, 24 Apr 2013 03:35:21 -0400 (EDT)
Subject: [SRILM User List] Correct format for nbest files
In-Reply-To: <8D00ED02F01112D-71C-6D89@webmail-d166.sysops.aol.com>
References: <8D00EA9618D72DD-71C-611D@webmail-d166.sysops.aol.com> <51776812.1060401@icsi.berkeley.edu> <8D00ED02F01112D-71C-6D89@webmail-d166.sysops.aol.com>
Message-ID: <8D00ED568BCD9D0-71C-6F1B@webmail-d166.sysops.aol.com>

Please don't bother. I realized that I was sending the wrong files to
./nbest-error. I should send a list of nbest files, not the nbest file
itself.

$ cat list
nbesthyp
$ cat ref
nbesthyp w1 w2 w3
$ cat nbesthyp
NBestList1.0
(100) w1 w5 w3
(10) w3 w2 w3
(2) w1 w4 w3
$ ./nbest-error list ref -wer
nbesthyp 1 sub 1 ins 0 del 0 words 3
1 sentences, 3 words, 1 errors (33.33%)

From nshmyrev at yandex.ru  Wed Apr 24 05:24:48 2013
From: nshmyrev at yandex.ru (Nickolay V. Shmyrev)
Date: Wed, 24 Apr 2013 15:24:48 +0300
Subject: [SRILM User List] Computing nbest-error rate from HTK MLF files
In-Reply-To: <8D00EC2D02BB615-71C-6A9E@webmail-d166.sysops.aol.com>
References: <8D00EA9618D72DD-71C-611D@webmail-d166.sysops.aol.com> <51776812.1060401@icsi.berkeley.edu> <8D00EC2D02BB615-71C-6A9E@webmail-d166.sysops.aol.com>
Message-ID: <1366806288.13282.5.camel@localhost.localdomain>

On 24/04/2013 at 01:22 -0400, E wrote:
> This "oracle" WER seems like a very crude way of computing the nbest
> error to me.
> [...]

You probably want to learn about and use the lattice oracle WER, which
finds the best path in the lattice and can switch to different word
variants at different times along the way.

Overall, n-best lists are not a very good structure unless you are
dealing with long-window rescoring with some advanced models, like
RNNLMs, which can't work on lattices. It's better to use lattices
instead of n-best lists wherever you can.

From stolcke at icsi.berkeley.edu  Wed Apr 24 15:43:36 2013
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 24 Apr 2013 15:43:36 -0700
Subject: [SRILM User List] Computing nbest-error rate from HTK MLF files
In-Reply-To: <8D00EC2D02BB615-71C-6A9E@webmail-d166.sysops.aol.com>
References: <8D00EA9618D72DD-71C-611D@webmail-d166.sysops.aol.com> <51776812.1060401@icsi.berkeley.edu> <8D00EC2D02BB615-71C-6A9E@webmail-d166.sysops.aol.com>
Message-ID: <51786018.6020702@icsi.berkeley.edu>

On 4/23/2013 10:22 PM, E wrote:
> So basically one would extract the "most accurate" segments from each
> nbest hypothesis in order to get a new "oracle" hypothesis.
> [...]

What you suggest is not how nbest WER is commonly defined. However,
taking different pieces from different hypotheses and gluing them
together for an overall better result is the idea behind "confusion
networks" (aka word sausages, or word meshes in SRILM terminology). You
can read more about confusion networks at http://arxiv.org/pdf/cs/0010012 .

The nbest-lattice tool in SRILM builds confusion networks from nbest
lists. It also has functionality to compute the lowest WER and the best
path through the network.

Andreas
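The oracle nbest error as defined earlier in this thread can be checked with a toy script. A sketch using the dummy ref/nbesthyp files from this thread and a plain dynamic-programming word edit distance in awk; this is an illustration only, SRILM's nbest-error is the real tool and does proper alignment and sub/ins/del accounting:

```shell
# Dummy files, taken from this thread.
cat > ref.txt <<'EOF'
w1 w2 w3
EOF
cat > nbesthyp <<'EOF'
NBestList1.0
(100) w1 w5 w3
(10) w3 w2 w3
(2) w1 w4 w3
EOF

# Oracle (lowest-achievable) error count: edit distance of each hypothesis
# against the reference, minimized over the list.
awk '
  NR == FNR { nref = split($0, ref); next }   # first file: the reference
  FNR == 1 { best = 1e9; next }               # skip the NBestList1.0 header
  {
      n = split($0, w); split("", hyp)
      for (j = 2; j <= n; j++) hyp[j - 1] = w[j]   # w[1] is the (score)
      nh = n - 1
      for (a = 0; a <= nref; a++) d[a, 0] = a
      for (b = 0; b <= nh; b++) d[0, b] = b
      for (a = 1; a <= nref; a++)
          for (b = 1; b <= nh; b++) {
              c = d[a - 1, b - 1] + (ref[a] != hyp[b])
              if (d[a - 1, b] + 1 < c) c = d[a - 1, b] + 1
              if (d[a, b - 1] + 1 < c) c = d[a, b - 1] + 1
              d[a, b] = c
          }
      if (d[nref, nh] < best) best = d[nref, nh]
  }
  END { printf "oracle errors: %d of %d words\n", best, nref }
' ref.txt nbesthyp | tee oracle.txt
```

For these files the result agrees with the nbest-error transcript above: 1 error in 3 words.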
From arefeh_kazemi_65 at yahoo.com  Mon Apr 29 04:08:27 2013
From: arefeh_kazemi_65 at yahoo.com (arefeh kazemi)
Date: Mon, 29 Apr 2013 04:08:27 -0700 (PDT)
Subject: [SRILM User List] Cannot find SRILM's library
Message-ID: <1367233707.3184.YahooMailNeo@web162404.mail.bf1.yahoo.com>

Hi everyone,
I want to install Moses on Ubuntu 10.04, i686, but I get the following
error:

checking for trigram_init in -loolm... no
configure: error: Cannot find SRILM's library in /opt/tools/srilm//lib/i686

These files are in /opt/tools/srilm//lib/i686:
libdstruct.a  libflm.a  liblattice.a  libmisc.a  libool

Do you have any suggestions to fix this problem?

Regards

From stolcke at icsi.berkeley.edu  Mon Apr 29 11:28:30 2013
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 29 Apr 2013 11:28:30 -0700
Subject: [SRILM User List] Cannot find SRILM's library
In-Reply-To: <1367233707.3184.YahooMailNeo@web162404.mail.bf1.yahoo.com>
References: <1367233707.3184.YahooMailNeo@web162404.mail.bf1.yahoo.com>
Message-ID: <517EBBCE.7020409@icsi.berkeley.edu>

On 4/29/2013 4:08 AM, arefeh kazemi wrote:
> checking for trigram_init in -loolm... no
> configure: error: Cannot find SRILM's library in
> /opt/tools/srilm//lib/i686
> [...]

The libraries might be empty. Did you build SRILM yourself? If so, run
the "make test" step to ensure everything was built correctly.

% nm $SRILM/lib/i686/liboolm.a | grep trigram_init

should output a line with "trigram_init". If that is the case, the
configure test might still fail because of some compiler or linker
problem on your system.

Andreas

From meskevich at computing.dcu.ie  Tue Apr 30 05:02:10 2013
From: meskevich at computing.dcu.ie (Maria Eskevich)
Date: Tue, 30 Apr 2013 13:02:10 +0100
Subject: [SRILM User List] problems with installation on macosx
Message-ID: <797ADC62-F98A-4813-A2B2-54E4B2889DD4@computing.dcu.ie>

Dear Andreas,

I downloaded the 1.7 version of SRILM and followed the instructions for
installation (checked against the INSTALL file details and
http://www1.icsi.berkeley.edu/~wooters/SRILM/3%20Install(07F18266).html).
My system is Mac OS X 10.8.3, processor 2.9 GHz Intel Core i7.

I don't get error messages, only a few warnings at the installation step:

In file included from ./File.cc:27:
./srilm_iconv.h:15:3: warning: #include_next with absolute path
 # include_next
   ^
1 warning generated.
In file included from File.cc:27:
./srilm_iconv.h:15:3: warning: #include_next with absolute path
 # include_next
   ^
1 warning generated.
/usr/bin/ranlib: file: ../obj/macosx/libmisc.a(fake-rand48.o) has no symbols
WARNING: creating directory ../../lib/macosx
/usr/bin/ranlib: file: ../obj/macosx/libdstruct.a(LHashTrie.o) has no symbols
LatticeIndex.cc:126:4: warning: data argument not used by format string [-Wformat-extra-args]
        (float)ngram[0].start);
        ^
LatticeIndex.cc:128:4: warning: data argument not used by format string [-Wformat-extra-args]
        (float)(ngram[len-1].start + ngram[len-1].duration));
        ^
2 warnings generated.
WARNING: creating directory ../../bin/macosx
fngram.cc:253:16: warning: use of logical '&&' with constant operand [-Wconstant-logical-operand]
    if (memuse && 0) {
               ^  ~
fngram.cc:253:16: note: use '&' for a bitwise operation
fngram.cc:253:16: note: remove constant to silence this warning
1 warning generated.
lattice-tool.cc:988:2: warning: delete called on 'VocabDistance' that is abstract but has non-virtual destructor [-Wdelete-non-virtual-dtor]
        delete wordDistance;
        ^
1 warning generated.

Afterwards the test fails for all conditions; I have DIFFERS for
everything. Could you please suggest, if possible, what might cause those
problems at installation?

Best,
Maria

--
Maria Eskevich
PhD student, L2.08, School of Computing
Dublin City University, Dublin 9, Ireland
http://nclt.computing.dcu.ie/~meskevich/

From arefeh_kazemi_65 at yahoo.com  Tue Apr 30 08:30:58 2013
From: arefeh_kazemi_65 at yahoo.com (arefeh kazemi)
Date: Tue, 30 Apr 2013 08:30:58 -0700 (PDT)
Subject: [SRILM User List] Cannot find SRILM's library
In-Reply-To: <517EBBCE.7020409@icsi.berkeley.edu>
References: <1367233707.3184.YahooMailNeo@web162404.mail.bf1.yahoo.com> <517EBBCE.7020409@icsi.berkeley.edu>
Message-ID: <1367335858.24881.YahooMailNeo@web162406.mail.bf1.yahoo.com>

Thank you, Andreas. I downloaded the new version of Moses and the problem
is fixed now.

Regards,
Arefeh

________________________________
From: Andreas Stolcke
To: arefeh kazemi
Cc: "srilm-user at speech.sri.com"
Sent: Monday, 29 April 2013, 21:58
Subject: Re: [SRILM User List] Cannot find SRILM's library
[...]

From stolcke at icsi.berkeley.edu  Tue Apr 30 15:25:33 2013
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 30 Apr 2013 15:25:33 -0700
Subject: [SRILM User List] problems with installation on macosx
In-Reply-To: <797ADC62-F98A-4813-A2B2-54E4B2889DD4@computing.dcu.ie>
References: <797ADC62-F98A-4813-A2B2-54E4B2889DD4@computing.dcu.ie>
Message-ID: <518044DD.4050502@icsi.berkeley.edu>

On 4/30/2013 5:02 AM, Maria Eskevich wrote:
> I downloaded the 1.7 version of SRILM and followed the instructions for
> installation.
> [...]

The compiler warnings are not a problem. Verify that the binaries in
$SRILM/bin/macosx are runnable, e.g., ngram -version. If that's not the
case then there is some problem with your compiler or linker, and you
should share your complete log output -- hopefully some macosx expert
can help.

The tests could be failing because you don't have gawk installed.

Andreas
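The two checks Andreas suggests can be scripted before chasing DIFFERS results in "make test". A sketch; the macosx MACHINE_TYPE is the one from this thread (substitute your own), and the ngram line is commented out because it only makes sense inside a built SRILM tree:

```shell
# SRILM's test scripts need gawk on the PATH (gzip is used for the
# compressed reference outputs).
for tool in gawk gzip; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found"
    else
        echo "$tool: MISSING -- install it before running make test"
    fi
done

# Then confirm the freshly built binaries run at all, e.g.:
# "$SRILM/bin/macosx/ngram" -version
```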
URL: From meskevich at computing.dcu.ie Wed May 1 17:04:01 2013 From: meskevich at computing.dcu.ie (Maria Eskevich) Date: Thu, 2 May 2013 01:04:01 +0100 Subject: [SRILM User List] problems with installation on macosx In-Reply-To: <518044DD.4050502@icsi.berkeley.edu> References: <797ADC62-F98A-4813-A2B2-54E4B2889DD4@computing.dcu.ie> <518044DD.4050502@icsi.berkeley.edu> Message-ID: <2E4F3A95-77B9-4CA9-8DA7-E617E0DCC8A5@computing.dcu.ie> Dear Andreas, You were right, I haven't had the gawk installed. Thanks for the help. Could I please ask another question. If I have a file with lattice in HTK format, is it possible to get the 1-best list with corresponding timing and probability information? As I understood the option -acoustic-mesh should keep this information, but I don't see any writing option that would combine the 2 things together. Basically I need the 1 best list with time/confidence scores information for each word. Maybe some additional changes with the following command line can help? ./lattice-tool -in-lattice file.lat -read-htk -viterbi-decode -acoustic-mesh Best, Maria On 30 Apr 2013, at 23:25, Andreas Stolcke wrote: > On 4/30/2013 5:02 AM, Maria Eskevich wrote: >> >> Dear Andreas, >> >> I downloaded the 1.7 version of SRILM and followed the instruction for installation (checked with INSTALL file details and http://www1.icsi.berkeley.edu/~wooters/SRILM/3%20Install(07F18266).html). >> >> My system is macosx 10.8.3, processor 2.9 Ghz Intel Core i7. > The compiler warnings are not a problem. Verify that the binaries in $SRILM/bin/macosx are runnable, e.g., ngram -version. > If that's not the case then there is some problem with your compiler or linker and you should shared your complete log output -- hopefully some macosx expert can help. > > The tests could be failing because you don't have gawk installed. > > Andreas > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From stolcke at icsi.berkeley.edu Thu May 2 13:08:32 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 02 May 2013 13:08:32 -0700 Subject: [SRILM User List] problems with installation on macosx In-Reply-To: <2E4F3A95-77B9-4CA9-8DA7-E617E0DCC8A5@computing.dcu.ie> References: <797ADC62-F98A-4813-A2B2-54E4B2889DD4@computing.dcu.ie> <518044DD.4050502@icsi.berkeley.edu> <2E4F3A95-77B9-4CA9-8DA7-E617E0DCC8A5@computing.dcu.ie> Message-ID: <5182C7C0.3060106@icsi.berkeley.edu> On 5/1/2013 5:04 PM, Maria Eskevich wrote: > Dear Andreas, > > You were right, I didn't have gawk installed. Thanks for the help. > > Could I please ask another question. If I have a file with a lattice in > HTK format, is it possible to get the 1-best list with corresponding > timing and probability information? > As I understood, the option -acoustic-mesh should keep this > information, but I don't see any writing option that would combine the > 2 things together. Basically I need the 1-best list with > time/confidence score information for each word. > > Maybe some additional changes to the following command line can help? > ./lattice-tool -in-lattice file.lat -read-htk -viterbi-decode > -acoustic-mesh There is currently no option to dump all the acoustic information (scores, alignments) out in nbest format, although the information is available internally. But for getting the 1-best version of that information there is a workaround. You can generate confusion networks with a low posterior scaling factor. That will force the 1-best in the CN to be the same as the 1-best in a Viterbi decoding. But using the -acoustic-mesh option you can then read off the time alignments and score information. Try lattice-tool -read-htk -in-lattice LATTICEFILE -acoustic-mesh -write-mesh CNFILE -posterior-scale 0.01 and postprocess CNFILE.
It will contain stuff like
align 3 he 1 we 0 ate 0 me 0 h 0 u 0 t 0 you're 0 if_you_have 0 say 0 pete 0 you 0 deep 0 aid 0 you'd 0 is 0 i 0 they 0 c 0 a 0 keep 0 q 0 t. 0 a. 0 these 0 p. 0 oh 0 uh 0 lee 0 hee 0 she 0 really 0 indeed 0 hehe 0 he'd 0 are_you 0 heat 0 eight 0 or 0 but_you 0 to_be 0 uhhuh 0 a._i. 0 [laugh] 0 p 0 it's 0 see 0 but 0 e 0 hate 0 but_he 0 re 0 i_mean 0 neat 0 i_see 0 and_he 0 ee 0 uh_you 0 need 0 yeah_you 0 maybe 0 and_it 0 v 0 okay 0 v. 0 eee 0 do_you 0 e. 0 hes 0 g 0 mm 0 he's 0 easy 0 may 0 any 0 pay 0 if 0 b. 0 they'd 0 you_you 0 hey 0 beep 0 it 0 c. 0 gee 0 if_you 0 be 0 three 0 if_he 0 b 0 is_it 0 eat 0 d 0 d. 0 eighty 0
info 3 he 0.04 0.26 -165.916 -2.62371 :#[hh]iy,0.08:hh[iy]#,0.18: :
info 3 we 0.07 0.23 -160.802 -2.66818 :#[w]iy,0.03:w[iy]#,0.20: :
info 3 ate 0.03 0.27 -173.698 -2.75711 :#[ey]t,0.15:ey[t]#,0.12: :
....
Because of the low posterior scaling, all the posterior probability is on the 1-best word ("he" in this case). Then you find the "info" record associated with that word and it will give you the start time, duration, acoustic and LM scores, and pronunciation and phone durations (the format is defined in the wlat-format(5) man page). You can safely add a pruning option if the CN construction takes too long, since you are only interested in the 1-best output. Andreas > > Best, > Maria > > On 30 Apr 2013, at 23:25, Andreas Stolcke > wrote: > >> On 4/30/2013 5:02 AM, Maria Eskevich wrote: >>> >>> Dear Andreas, >>> >>> I downloaded the 1.7 version of SRILM and followed the instructions >>> for installation (checked with INSTALL file details and >>> http://www1.icsi.berkeley.edu/~wooters/SRILM/3%20Install(07F18266).html >>> ). >>> >>> My system is macosx 10.8.3, processor 2.9 GHz Intel Core i7. >> The compiler warnings are not a problem. Verify that the binaries >> in $SRILM/bin/macosx are runnable, e.g., ngram -version.
>> If that's not the case then there is some problem with your compiler >> or linker and you should share your complete log output -- hopefully >> some macosx expert can help. >> >> The tests could be failing because you don't have gawk installed. >> >> Andreas >> >> > -------------- next part -------------- An HTML attachment was scrubbed...
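The "postprocess CNFILE" step in the confusion-network workaround above can be sketched in a few lines. This is a hypothetical helper (the function name and field interpretation are mine, not SRILM's); the field layout follows the sample output quoted in this thread, and wlat-format(5) is the authoritative reference:

```python
def best_words_with_info(lines):
    """From word-mesh (CN) lines, return the 1-best word per position
    together with its (start, duration, acoustic score, LM score)."""
    best = {}   # position -> word with the highest posterior in 'align'
    info = {}   # (position, word) -> info fields for that word
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "align":
            pos = int(parts[1])
            pairs = zip(parts[2::2], parts[3::2])  # (word, posterior) pairs
            word, _ = max(pairs, key=lambda p: float(p[1]))
            best[pos] = word
        elif parts[0] == "info":
            pos, word = int(parts[1]), parts[2]
            info[(pos, word)] = tuple(map(float, parts[3:7]))
    return [(p, w) + info.get((p, w), ()) for p, w in sorted(best.items())]
```

With the low posterior scale, the top word of each align line is the Viterbi 1-best, so pairing it with its info record yields the word/time/score sequence in one pass.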
URL: From rgb15 at aub.edu.lb Tue May 7 16:32:17 2013 From: rgb15 at aub.edu.lb (Ramy Baly) Date: Wed, 08 May 2013 02:32:17 +0300 Subject: [SRILM User List] Installing SRILM 1.7.0 on ubuntu 12.04 Message-ID: <20130508023217.124178ue6jjbzif5@imail.aub.edu.lb> Hello, My name is Ramy, I'm a PhD student at the American University of Beirut, working on Arabic Sentiment analysis. I tried to install the SRILM (both versions 1.6.0 and 1.7.0) on Ubuntu 12.04. I installed all prerequisite packages mentioned in the website (in addition to tcsh), and it doesn't work... I also tried to follow the following common guides I found in different forums:
1) Install tcsh if not already installed
2) Install all the TCL developer libraries: tcl8.4-dev, tcl-dev, tcl-lib, tclx8.4, tclx8.4-dev. This step may not be necessary, let me know what works for you.
3) Uncomment the "SRILM =" line in the top level Makefile and replace the existing path with the absolute path of the SRILM top-level directory on your system (where the Makefile resides)
4) Start the tcsh shell
5) Type "make NO_TCL=X MACHINE_TYPE=i686-gcc4 World > & make.log.txt" to begin the build and capture stderr and stdout in a file
6) If you can run "./bin/i686-gcc4/ngram-count -help", the build was probably a success
It doesn't work for both versions... and I'm getting fatal errors (cannot find .h files, exiting directory, .... make: *** [World] Error 2) Can you please help me with this problem? I appreciate your help --Ramy From rgb15 at aub.edu.lb Tue May 7 16:50:54 2013 From: rgb15 at aub.edu.lb (Ramy Baly) Date: Wed, 08 May 2013 02:50:54 +0300 Subject: [SRILM User List] install only Disambig for MADA Message-ID: <20130508025054.32104u23nlo8o47y@imail.aub.edu.lb> Hi again I need only to install the disambig executable, which is required by the MADA morphological analyzer. Can anyone help in that please? I'm new to Linux.
Thanks in advance --Ramy From stolcke at icsi.berkeley.edu Tue May 7 21:52:50 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 07 May 2013 21:52:50 -0700 Subject: [SRILM User List] Installing SRILM 1.7.0 on ubuntu 12.04 In-Reply-To: <20130508023217.124178ue6jjbzif5@imail.aub.edu.lb> References: <20130508023217.124178ue6jjbzif5@imail.aub.edu.lb> Message-ID: <5189DA22.8060500@icsi.berkeley.edu> On 5/7/2013 4:32 PM, Ramy Baly wrote: > Hello, > > My name is Ramy, I'm a PhD student at the American University of > Beirut, working on Arabic Sentiment analysis. > > I tried to install the SRILM (both versions 1.6.0 and 1.7.0) on Ubuntu > 12.04. > > I installed all prerequisite packages mentioned in the website (in > addition to tcsh), and it doesn't work... > > I also tried to follow the following common guides I found in > different forums: > > > 1) Install tcsh if not already installed tcsh is no longer required in recent versions of SRILM. > > 2) Install all the TCL developer libraries: tcl8.4-dev, tcl-dev, > tcl-lib, tclx8.4, tclx8.4-dev. This step may not be necessary, let me > know what works for you. > > 3) Uncomment the "SRILM =" line in the top level Makefile and replace > the existing path with the absolute path of the SRILM top-level > directory on your system (where the Makefile resides) > > 4) Start the tcsh shell > > 5) Type "make NO_TCL=X MACHINE_TYPE=i686-gcc4 World > & make.log.txt" > to begin the build and capture stderr and stdout in a file If you did the above then you can share the make.log.txt file. Otherwise we have no clue what might be wrong. Andreas From tarek.ahmed at rdi-eg.com Sun May 12 05:06:39 2013 From: tarek.ahmed at rdi-eg.com (tarek abuamer) Date: Sun, 12 May 2013 14:06:39 +0200 Subject: [SRILM User List] lattice-tool with HTK lattice with probabilities Message-ID: <000001ce4f09$29e33860$7da9a920$@rdi-eg.com> I want to use lattice-tool to rescore HTK format lattices.
I need HTK lattices to contain probabilities in the form : J=1 S=4 E=2 l=-1.1 J=2 S=4 E=3 l=-0.4 However, this seems not to have any effect when using lattice-tool. Any clue? -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Tue May 14 12:12:12 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 14 May 2013 12:12:12 -0700 Subject: [SRILM User List] lattice-tool with HTK lattice with probabilities In-Reply-To: <000001ce4f09$29e33860$7da9a920$@rdi-eg.com> References: <000001ce4f09$29e33860$7da9a920$@rdi-eg.com> Message-ID: <51928C8C.5040608@icsi.berkeley.edu> On 5/12/2013 5:06 AM, tarek abuamer wrote: > > I want to use lattice-tool to rescore HTK format lattices. I need HTK > lattices to contain probabilities in the form : J=1 S=4 E=2 l=-1.1 > > J=2 S=4 E=3 l=-0.4 > > However, this seems not to have any effect when using lattice-tool. > > Any clue? > You don't say what you tried that didn't work. If you invoke lattice-tool -in-lattice INPUTLATTICE -read-htk -lm LMFILE -order N -out-lattice OUTPUTLATTICE -write-htk the OUTPUTLATTICE will contain recomputed LM probabilities in the form you describe, assuming the input lattice is also in HTK format. Andreas -------------- next part -------------- An HTML attachment was scrubbed...
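One way to confirm that rescoring actually replaced the LM scores is to compare the "l=" fields of INPUTLATTICE and OUTPUTLATTICE link by link. A minimal sketch (the helper name is made up; it assumes the standard HTK SLF convention that link lines carry a J= id):

```python
import re

def lm_scores(slf_text):
    """Map each HTK SLF link id (J=...) to its LM score (l=...)."""
    scores = {}
    for line in slf_text.splitlines():
        link = re.search(r"\bJ=(\d+)", line)
        lm = re.search(r"\bl=(-?[\d.]+)", line)
        if link and lm:
            scores[int(link.group(1))] = float(lm.group(1))
    return scores
```

Running it on the lattice before and after rescoring and diffing the two dictionaries shows which link scores were recomputed.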
URL: From stolcke at icsi.berkeley.edu Wed May 15 10:03:41 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 15 May 2013 10:03:41 -0700 Subject: [SRILM User List] lattice-tool with HTK lattice with probabilities In-Reply-To: <000801ce5179$205224b0$60f66e10$@rdi-eg.com> References: <000001ce4f09$29e33860$7da9a920$@rdi-eg.com> <51928C8C.5040608@icsi.berkeley.edu> <000001ce515c$5d67d510$18377f30$@rdi-eg.com> <51939768.4090008@icsi.berkeley.edu> <000801ce5179$205224b0$60f66e10$@rdi-eg.com> Message-ID: <5193BFED.8060506@icsi.berkeley.edu> On 5/15/2013 7:33 AM, tarek abuamer wrote: > > What I really need is to combine the original scores and the LM scores > in the final score, I don't know whether this is rescoring or decoding? > I'm cc-ing the list since this might be of general interest. HTK lattices support an "ngram" score that is separate from the "LM" score. This ngram score is passed through unchanged in the rescoring process (just like the acoustic scores). So what you can do is
1) rewrite your input lattice to replace every "l=score" entry with "n=score" ("n" is the key for "ngram" scores).
2) Use lattice-tool -lm to compute new LM scores.
3) Decode the lattices giving nonzero weight to both the ngram and the LM score. So you will get a log-linear combination of the two LMs. lattice-tool -viterbi-decode -read-htk -htk-lmscale L -htk-ngscale N ... where L is the LM score weight and N is the ngram score weight.
If you want to combine the two scores in a different way you can postprocess the lattice and insert new scores. Since all the scores appear on one line it should be easy to do this with a gawk / perl / python etc. script. I hope this answers your question.
Andreas > From: Andreas Stolcke [mailto:stolcke at icsi.berkeley.edu] > Sent: Wednesday, May 15, 2013 4:11 PM > To: tarek abuamer > Subject: Re: [SRILM User List] lattice-tool with HTK lattice with > probabilities > > On 5/15/2013 4:07 AM, tarek abuamer wrote: > > This is exactly what I am talking about. The weights I put in the > input lattice have no effect on the weights in the output lattice > (i.e. if I remove the weights from the input lattice it still > gives me the same numbers in the output lattice) > > Yes, because the use of the -lm option implies "rescoring" the > lattice, i.e., recomputing the LM scores. > > If you simply want to extract the hypotheses with the highest score, > that's called "decoding" and there are several other options for that. > Read the man page. You probably want -viterbi-decode. > > Andreas > > > From: Andreas Stolcke [mailto:stolcke at icsi.berkeley.edu] > Sent: Tuesday, May 14, 2013 9:12 PM > To: tarek abuamer > Cc: srilm-user at speech.sri.com > Subject: Re: [SRILM User List] lattice-tool with HTK lattice with > probabilities > > On 5/12/2013 5:06 AM, tarek abuamer wrote: > > I want to use lattice-tool to rescore HTK format lattices. I need > HTK lattices to contain probabilities in the form : J=1 S=4 E=2 l=-1.1 > > J=2 S=4 E=3 l=-0.4 > > However, this seems not to have any effect when using lattice-tool. > > Any clue? > > > You don't say what you tried that didn't work. If you invoke > > lattice-tool -in-lattice INPUTLATTICE -read-htk -lm LMFILE -order > N -out-lattice OUTPUTLATTICE -write-htk > > the OUTPUTLATTICE will contain recomputed LM probabilities in the form > you describe, assuming the input lattice is also in HTK format. > > Andreas > -------------- next part -------------- An HTML attachment was scrubbed...
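Step 1 of the recipe above (moving the original "l=" scores into the "n=" ngram field so that -lm rescoring leaves them untouched) is a one-line text substitution. A sketch, assuming standard HTK SLF lattices where only link definitions carry a J= id (the helper name is invented for illustration):

```python
import re

def lm_to_ngram_scores(line):
    """Rename the l= (LM) score field to n= (ngram) on HTK SLF link lines."""
    if re.match(r"\s*J=", line):          # only touch link definitions
        return re.sub(r"\bl=", "n=", line)
    return line                           # headers and node lines pass through
```

Applying this to every line of the input lattice implements the rewrite; the equivalent gawk or sed one-liner works just as well, as the reply notes.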
URL: From S.N.Maijers at student.ru.nl Sat May 25 09:27:45 2013 From: S.N.Maijers at student.ru.nl (Sander Maijers) Date: Sat, 25 May 2013 18:27:45 +0200 Subject: [SRILM User List] ngram: sentence boundary markers in text file used with -ppl? Message-ID: <51A0E681.5080007@student.ru.nl> Hi, Should one surround the sentences in the sentences file for ngram's '-ppl' with sos and eos tokens? They are in the LM. I have tested it just now, and it seems that the sentence boundary markers are inferred by ngram when left out, and adopted when put in. Where is this documented? Best, Sander From S.N.Maijers at student.ru.nl Sat May 25 13:37:12 2013 From: S.N.Maijers at student.ru.nl (Sander Maijers) Date: Sat, 25 May 2013 22:37:12 +0200 Subject: [SRILM User List] ngram: sentence boundary markers in text file used with -ppl? [edit] Message-ID: <51A120F8.8050602@student.ru.nl> Hi, Should one surround the sentences in the sentences file for ngram's '-ppl' with sos and eos tokens? They are in the LM. I have tested it just now, and it seems that the sentence boundary markers are inferred by ngram when left out, and adopted when put in. Where is this documented? Best, Sander _______________________________________________ SRILM-User site list SRILM-User at speech.sri.com http://www.speech.sri.com/mailman/listinfo/srilm-user From stolcke at icsi.berkeley.edu Sat May 25 15:11:40 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sat, 25 May 2013 15:11:40 -0700 Subject: [SRILM User List] ngram: sentence boundary markers in text file used with -ppl? [edit] In-Reply-To: <51A120F8.8050602@student.ru.nl> References: <51A120F8.8050602@student.ru.nl> Message-ID: <51A1371C.2000902@icsi.berkeley.edu> On 5/25/2013 1:37 PM, Sander Maijers wrote: > Hi, > > Should one surround the sentences in the sentences file for ngram's > '-ppl' with sos and eos tokens? They are in the LM.
> > I have tested it just now, and it seems that the sentence boundary > markers are inferred by ngram when left out, and adopted when put in. > Where is this documented? In the man page. The relevant options are
-no-sos Disable the automatic insertion of start-of-sentence tokens for sentence probability computation. The probability of the initial word is thus computed with an empty context.
-no-eos Disable the automatic insertion of end-of-sentence tokens for sentence probability computation. End-of-sentence is thus excluded from the total probability.
Andreas > > Best, > Sander > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From cloudygooseg at gmail.com Mon May 27 00:19:11 2013 From: cloudygooseg at gmail.com (=?GB2312?B?utjM7NDQ?=) Date: Mon, 27 May 2013 15:19:11 +0800 Subject: [SRILM User List] Why does -addsmooth still has discounting effects? Message-ID: The manual wrote: -addsmooth D Smooth by adding D to each N-gram count. This is usually a poor smoothing method, included mainly for instructional purposes.
p(a_z) = (c(a_z) + D) / (c(a_) + D n(*))
My script is: ngram-count -write allcnt -order 3 -debug 2 -text test_htx.dat -addsmooth 0 -lm lmtest
Then the debug output said:
test_htx.dat: line 3: 2 sentences, 6 words, 0 OOVs
0 zeroprobs, logprob= 0 ppl= 1 ppl1= 1
using AddSmooth for 1-grams
using AddSmooth for 2-grams
using AddSmooth for 3-grams
discarded 1 2-gram contexts containing pseudo-events
discarded 2 3-gram contexts containing pseudo-events
discarded 6 3-gram probs discounted to zero
writing 6 1-grams
writing 8 2-grams
writing 0 3-grams
So there's still discounting; I'm confused about why -addsmooth still discounts. Thanks a lot! -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Mon May 27 10:48:03 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Mon, 27 May 2013 10:48:03 -0700 Subject: [SRILM User List] Why does -addsmooth still has discounting effects? In-Reply-To: References: Message-ID: <51A39C53.4020502@icsi.berkeley.edu> On 5/27/2013 12:19 AM, ??? wrote: > The manual wrote: > -addsmooth D > Smooth by adding D to each N-gram count. This is usually a poor > smoothing method, included mainly for instructional purposes. > p(a_z) = (c(a_z) + D) / (c(a_) + D n(*)) > My script is: > ngram-count -write allcnt -order 3 -debug 2 -text test_htx.dat > -addsmooth 0 -lm lmtest > Then the debug output said: > test_htx.dat: line 3: 2 sentences, 6 words, 0 OOVs > 0 zeroprobs, logprob= 0 ppl= 1 ppl1= 1 > using AddSmooth for 1-grams > using AddSmooth for 2-grams > using AddSmooth for 3-grams > discarded 1 2-gram contexts containing pseudo-events > discarded 2 3-gram contexts containing pseudo-events > discarded 6 3-gram probs discounted to zero > writing 6 1-grams > writing 8 2-grams > writing 0 3-grams > So there's still discounting; I'm confused about why -addsmooth still > has discounting effects.
You also have to change the mincount parameter to include all trigrams, even those that occur only once. ngram-count -write allcnt -order 3 -debug 2 -text test_htx.dat -addsmooth 0 -gt3min 1 -lm lmtest The default is -gt3min 2. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From cloudygooseg at gmail.com Tue May 28 00:42:49 2013 From: cloudygooseg at gmail.com (=?GB2312?B?utjM7NDQ?=) Date: Tue, 28 May 2013 15:42:49 +0800 Subject: [SRILM User List] If I use -kndiscount for order 2, does I get an uncorrect unigram model? Message-ID: Hello When I use order 2 kndiscount, I get a unigram model and a bigram model. Then when I use order 1 kndiscount, I also get a unigram. But these two unigrams are different. I read http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html and it seems that this has to do with some implementation issue. What I want to ask is: is the unigram I get in the order 2 kndiscount incorrect? Because if I use order 1 Katz discount and order 2 Katz discount, the two unigrams are the same, so I think I need to treat the kndiscount result with caution. Many thanks Goose -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Tue May 28 09:41:10 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 28 May 2013 09:41:10 -0700 Subject: [SRILM User List] If I use -kndiscount for order 2, does I get an uncorrect unigram model? In-Reply-To: References: Message-ID: <51A4DE26.7030801@icsi.berkeley.edu> On 5/28/2013 12:42 AM, ??? wrote: > Hello > When I use order 2 kndiscount, I get a unigram model and a bigram model > Then I use order 1 kndiscount, I also get a unigram > But these two unigrams are different, I read the > http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html > It seems that this has to do with some implementation issue, what I > want to ask is, is the unigram I get in the order 2 kndiscount incorrect?
> > Because if I use order 1 Katz discount and order 2 Katz discount, the > two unigrams are the same, so I think I need to treat the kndiscount > result with caution. It is one of the distinguishing features of KN discounting that the lower-order (backoff) distributions are estimated differently from the highest-order distribution. You are not supposed to use the unigram distribution in a KN-smoothed bigram by itself. So what you're seeing is completely expected and correct. For a detailed explanation see the Chen and Goodman paper. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From cloudygooseg at gmail.com Wed May 29 06:07:57 2013 From: cloudygooseg at gmail.com (=?GB2312?B?utjM7NDQ?=) Date: Wed, 29 May 2013 21:07:57 +0800 Subject: [SRILM User List] Please help me understand the debug info of the -interpolate -kndiscount Message-ID: Hello, I'm trying to understand how SRILM arrives at the output in the lm file, but I cannot figure out where these numbers come from. ngram-count -order 2 -gt1min 1 -gt2min 1 -gt3min 1 -text test_htx.dat -write1 cnt1 -write2 cnt2 -write3 cnt3 -kndiscount1 -kndiscount2 -kndiscount3 -debug 5 -lm lmtest2
test_htx.dat: line 22: 22 sentences, 67 words, 0 OOVs
0 zeroprobs, logprob= 0 ppl= 1 ppl1= 1
using ModKneserNey for 1-grams
modifying 1-gram counts for Kneser-Ney smoothing
Kneser-Ney smoothing 1-grams
n1 = 2
n2 = 4
n3 = 4
n4 = 4
D1 = 0.2
D2 = 1.4
D3+ = 2.2
using ModKneserNey for 2-grams
Kneser-Ney smoothing 2-grams
n1 = 34
n2 = 10
n3 = 3
n4 = 3
D1 = 0.62963
D2 = 1.43333
D3+ = 0.481481
CONTEXT WORD NUMER 9 DENOM 52 DISCOUNT 0.755556 LPROB -0.883494
CONTEXT WORD Alice NUMER 3 DENOM 52 DISCOUNT 0.266667 LPROB -1.81291
........
In the lm file:
-99 0.1888525
-1.309463 Alice -0.02817659
.........
I'm trying to understand the line
CONTEXT WORD Alice NUMER 3 DENOM 52 DISCOUNT 0.266667 LPROB -1.81291
I know the NUMER 3 means c(* Alice)=3. I can't figure out the other parameters, how they are calculated, and how the result
-1.309463 Alice -0.02817659
is calculated. I have referred to Chen's paper and the SRILM ngram-discount manual, but I still don't know what's going on. This is my cnt1 file:
22
9
Alice 3
loves 4
Bob 2
also 3
Kai 2
KaiKai 3
KK 3
hates 2
YY 5
Miss 4
MM 1
b3 4
a3 4
c3 1
d3 2
Thank you very much. -------------- next part -------------- An HTML attachment was scrubbed... URL: From cloudygooseg at gmail.com Wed May 29 07:23:09 2013 From: cloudygooseg at gmail.com (=?GB2312?B?utjM7NDQ?=) Date: Wed, 29 May 2013 22:23:09 +0800 Subject: [SRILM User List] Please help me understand the debug info of the -interpolate -kndiscount In-Reply-To: References: Message-ID: I'm terribly sorry: when I did the calculation following the manual, I mixed up the Ds, which is why I couldn't get the output right. Now I can get the g() for the unigram following the manual. Now my question becomes simple: when computing the bow() for the unigram, there are two ways in the manual:
Let Z1 be the set {z: c(a_z) > 0}. For highest order N-grams we have:
g(a_z) = max(0, c(a_z) - D) / c(a_)
bow(a_) = 1 - Sum_Z1 g(a_z)
= 1 - Sum_Z1 c(a_z) / c(a_) + Sum_Z1 D / c(a_)
= D n(a_*) / c(a_)
Let Z2 be the set {z: n(*_z) > 0}. For lower order N-grams we have:
g(_z) = max(0, n(*_z) - D) / n(*_*)
bow(_) = 1 - Sum_Z2 g(_z)
= 1 - Sum_Z2 n(*_z) / n(*_*) + Sum_Z2 D / n(*_*)
= D n(_*) / n(*_*)
I don't know which equation to use when computing the bow() for the unigram, and, for unigrams, what do 'a' and '_' mean respectively? Also, I still don't get hold of the -debug 5 output in my last mail.
Terribly sorry again for my mistake, hope didn't waste your time and many thanks Goose 2013/5/29 ??? > Hello, I'm trying to understand how does SRILM gives us the output in the > lm file, but I can not figure out how these numbers come from. > > ngram-count -order 2 -gt1min 1 -gt2min 1 -gt3min 1 -text test_htx.dat > -write1 cnt1 -write2 cnt2 -write3 cnt3 -kndiscount1 -kndiscount2 > -kndiscount3 -debug 5 -lm lmtest2 > test_htx.dat: line 22: 22 sentences, 67 words, 0 OOVs > 0 zeroprobs, logprob= 0 ppl= 1 ppl1= 1 > using ModKneserNey for 1-grams > modifying 1-gram counts for Kneser-Ney smoothing > Kneser-Ney smoothing 1-grams > n1 = 2 > n2 = 4 > n3 = 4 > n4 = 4 > D1 = 0.2 > D2 = 1.4 > D3+ = 2.2 > using ModKneserNey for 2-grams > Kneser-Ney smoothing 2-grams > n1 = 34 > n2 = 10 > n3 = 3 > n4 = 3 > D1 = 0.62963 > D2 = 1.43333 > D3+ = 0.481481 > CONTEXT WORD NUMER 9 DENOM 52 DISCOUNT 0.755556 LPROB -0.883494 > CONTEXT WORD Alice NUMER 3 DENOM 52 DISCOUNT 0.266667 LPROB -1.81291 > ........ > In the lm file: > -99 0.1888525 > -1.309463 Alice -0.02817659 > ......... > I'm trying to understand the line > CONTEXT WORD Alice NUMER 3 DENOM 52 DISCOUNT 0.266667 LPROB -1.81291 > I know the NUMBER 3 means > c(* Alice)=3 > I can't figure out the other parameters, and how are they calculated, and > how are the result > -1.309463 Alice -0.02817659 > calculated > > I have referred to Chen's paper and SRILM ngram-discount manual, but I > still don't know what's going on > > This is my cnt1 file > 22 > 9 > Alice 3 > loves 4 > Bob 2 > also 3 > Kai 2 > KaiKai 3 > KK 3 > hates 2 > YY 5 > Miss 4 > MM 1 > b3 4 > a3 4 > c3 1 > d3 2 > > Thank you very much. > > -------------- next part -------------- An HTML attachment was scrubbed... 
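For reference, the n1..n4 statistics and the D1/D2/D3+ constants in the -debug output in this thread are related by the modified Kneser-Ney discounting formulas given in the ngram-discount(7) man page: Y = n1/(n1 + 2*n2), D1 = 1 - 2*Y*n2/n1, D2 = 2 - 3*Y*n3/n2, D3+ = 3 - 4*Y*n4/n3. A quick sketch that reproduces the constants printed for this data:

```python
def mkn_discounts(n1, n2, n3, n4):
    """Modified Kneser-Ney discounts (D1, D2, D3+) from counts-of-counts."""
    y = n1 / (n1 + 2 * n2)
    return (1 - 2 * y * n2 / n1,   # D1
            2 - 3 * y * n3 / n2,   # D2
            3 - 4 * y * n4 / n3)   # D3+

# 2-gram statistics from the debug output: n1=34 n2=10 n3=3 n4=3
d1, d2, d3p = mkn_discounts(34, 10, 3, 3)
assert (round(d1, 5), round(d2, 5), round(d3p, 6)) == (0.62963, 1.43333, 0.481481)
```

The same function applied to the 1-gram statistics (n1=2, n2=4, n3=4, n4=4) yields the D1=0.2, D2=1.4, D3+=2.2 shown above, confirming that the constants are derived purely from the counts-of-counts.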
URL: From S.N.Maijers at student.ru.nl Thu May 30 08:38:24 2013 From: S.N.Maijers at student.ru.nl (Sander Maijers) Date: Thu, 30 May 2013 17:38:24 +0200 Subject: [SRILM User List] Interpreting ngram -ppl output in case of backoff Message-ID: <51A77270.2030109@student.ru.nl> Hi, I have trained a baseline N-gram LM like so: vocab %s -unk -map-unk '[unk]' -prune %s -debug 1 -order 3 -text %s -sort -lm %s Suppose I have the following output line from ngram -ppl -debug 3 -map-unk [unk] ... : ( Heijn | Albert ...) = 0.210084 [ -0.677607 ] This bigram is not in my LM. My pronunciation lexicon contains both words, but only in lower case. I believe that the bigram that would be looked up in this case by ngram is the one for "[unk] [unk]": -0.5549474 [unk] [unk] -0.2222121 I do not understand precisely how to confirm this with the logprob between brackets reported by ngram. When the applicable N-gram *is* in the LM, the logprobs do not match between the ARPA line and the ngram output either, but this must be due to discounting applied by default. The man page for ngram with arguments -debug 2 -ppl says: "Probabilities for each word, plus LM-dependent details about backoff used etc., are printed.". Where should I look for the backoff details in my ngram output to assess the role of backoff, including the backing off happening in LMs generated with the -skip option? Best, Sander From S.N.Maijers at student.ru.nl Thu May 30 08:42:38 2013 From: S.N.Maijers at student.ru.nl (Sander Maijers) Date: Thu, 30 May 2013 17:42:38 +0200 Subject: [SRILM User List] Interpreting ngram -ppl output in case of backoff Message-ID: <51A7736E.7080802@student.ru.nl> My comment about discounting being responsible for the discrepancy between the logprob in the LM and the one reported by ngram is incorrect in hindsight, because all discounting had already been applied during generation of the LM, of course.
From stolcke at icsi.berkeley.edu Thu May 30 11:14:54 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 30 May 2013 11:14:54 -0700 Subject: [SRILM User List] Interpreting ngram -ppl output in case of backoff In-Reply-To: <51A77270.2030109@student.ru.nl> References: <51A77270.2030109@student.ru.nl> Message-ID: <51A7971E.3080607@icsi.berkeley.edu> On 5/30/2013 8:38 AM, Sander Maijers wrote: > Hi, > > I have trained a baseline N-gram LM like so: > vocab %s -unk -map-unk '[unk]' -prune %s -debug 1 -order 3 -text %s > -sort -lm %s > > Suppose I have the following line to ngram -ppl -debug 3 -map-unk > [unk] ... : > ( Heijn | Albert ...) = 0.210084 [ -0.677607 ] > > This bigram is not in my LM. My pronunciation lexicon contains both > words, but only in lower case. I believe that the bigram that would be > looked up in this case by ngram is the one for "[unk] [unk]": > > -0.5549474 [unk] [unk] -0.2222121 > > I do not understand precisely how to confirm this with the logprob > between brackets reported by ngram. When the applicable N-gram *is* in > the LM, the logprobs do not match between the ARPA line and the ngram > output either, but this must be due to discounting applied by default. > The man page for ngram with arguments -debug 2 -ppl says: > "Probabilities for each word, plus LM-dependent details about backoff > used etc., are printed.". > > Where should I look for the backoff details in my ngram output to > asses the role of backoff, including the backing off as happening in > LMs generated with the -skip option? You won't see all the details of the backoff computation in the ppl output. If the word is 'a' and the last two words 'b' and 'c' (in that order), and you have a bigram hit (output says '[2gram]' ), you'd have to look up the bigram log probability for 'a c' and add to that the backoff weight for 'b c'. 
Unfortunately only one word of history is printed (to keep things brief), so for trigrams and higher models you need to extract the history from the complete sentence string. Andreas From S.N.Maijers at student.ru.nl Sat Jun 1 04:37:05 2013 From: S.N.Maijers at student.ru.nl (Sander Maijers) Date: Sat, 01 Jun 2013 13:37:05 +0200 Subject: [SRILM User List] Interpreting ngram -ppl output in case of backoff In-Reply-To: <51A7971E.3080607@icsi.berkeley.edu> References: <51A77270.2030109@student.ru.nl> <51A7971E.3080607@icsi.berkeley.edu> Message-ID: <51A9DCE1.4080401@student.ru.nl> On 30-05-13 20:14, Andreas Stolcke wrote: > On 5/30/2013 8:38 AM, Sander Maijers wrote: >> Hi, >> >> I have trained a baseline N-gram LM like so: >> vocab %s -unk -map-unk '[unk]' -prune %s -debug 1 -order 3 -text %s >> -sort -lm %s >> >> Suppose I have the following line to ngram -ppl -debug 3 -map-unk >> [unk] ... : >> ( Heijn | Albert ...) = 0.210084 [ -0.677607 ] >> >> This bigram is not in my LM. My pronunciation lexicon contains both >> words, but only in lower case. I believe that the bigram that would be >> looked up in this case by ngram is the one for "[unk] [unk]": >> >> -0.5549474 [unk] [unk] -0.2222121 >> >> I do not understand precisely how to confirm this with the logprob >> between brackets reported by ngram. When the applicable N-gram *is* in >> the LM, the logprobs do not match between the ARPA line and the ngram >> output either, but this must be due to discounting applied by default. >> The man page for ngram with arguments -debug 2 -ppl says: >> "Probabilities for each word, plus LM-dependent details about backoff >> used etc., are printed.". >> >> Where should I look for the backoff details in my ngram output to >> asses the role of backoff, including the backing off as happening in >> LMs generated with the -skip option? > > You won't see all the details of the backoff computation in the ppl output. 
> If the word is 'a' and the last two words 'b' and 'c' (in that order), > and you have a bigram hit (output says '[2gram]' ), you'd have to look > up the bigram log probability for 'a c' and add to that the backoff > weight for 'b c'. Unfortunately only one word of history is printed > (to keep things brief), so for trigrams and higher models you need to > extract the history from the complete sentence string. > > Andreas 
Thank you. Could you explain the following example? How do I interpret this snippet of ngram output: dat Albert Heijn het doet zou niet de aanleiding zijn p( dat | ) = 0.0438046 [ -1.35848 ] p( Albert | dat ...) = 0.0100695 [ -1.99699 ] What is the order of these first two N-grams retrieved? There is no line with [2gram] in the output. However, the first N-gram has no ellipsis in the history and the second line has. From S.N.Maijers at student.ru.nl Sat Jun 1 06:00:28 2013 From: S.N.Maijers at student.ru.nl (Sander Maijers) Date: Sat, 01 Jun 2013 15:00:28 +0200 Subject: [SRILM User List] Interpreting ngram -ppl output in case of backoff In-Reply-To: <51A7971E.3080607@icsi.berkeley.edu> References: <51A77270.2030109@student.ru.nl> <51A7971E.3080607@icsi.berkeley.edu> Message-ID: <51A9F06C.5030706@student.ru.nl> Andreas, you wrote: > If the word is 'a' and the last two words 'b' and 'c' (in that order), and you have a bigram hit (output says '[2gram]' ), you'd have to look up the bigram log probability for 'a c' and add to that the backoff weight for 'b c'. I interpret this as P(a | b c). If that is correct, then shouldn't I actually look up the line in the ARPA LM with "c a" (reverse)? Can you further comment to clear up my remaining confusion please ... 1. The word "Albert" is not in the word list/vocabulary. Nor are there any N-grams with "Albert". This confuses me. I cannot trace the appropriate N-gram that lead to the logprob that was reported by 'ngram'. 
It seems that I cannot directly see in the 'ngram' output if/when there had been any backing off during the LM lookups for this sentence. I assume that the word Albert was actually replaced with [unk], but in the ngram output, such is not displayed. There are also 0 OOVs reported, which strikes me as odd. All in all I believe that the following N-grams were looked up: -1.358477 <s> dat -0.6622628 -1.334724 dat [unk] -0.0686222 -0.6776069 dat [unk] [unk] for the 'ngram' output dat Albert Heijn het doet zou niet de aanleiding zijn p( dat | <s> ) = 0.0438046 [ -1.35848 ] p( Albert | dat ...) = 0.0100695 [ -1.99699 ] p( Heijn | Albert ...) = 0.210084 [ -0.677607 ] How did 'ngram' come to the -1.99699 logprob? Extra suggestions: A. Just now I saw the 'ngram' option '-limit-vocab': 'The default is that words used in the LM are automatically added to the vocabulary.' I would say that not doing this and restricting the known words to the ones in the vocabulary is a more sensible default, because this behaviour defeats an important point of specifying a vocabulary (controlling the lookup in the LM). But anyway, could you note this behavior under the description of '-vocab' as well? B. I think it would be useful if something like the line number in the ARPA LM of the N-gram that was retrieved is listed on each p( ... ) line in the 'ngram' output, if need be only at '-debug 4' level. Be it a line number, or some other definite key/index to the N-grams that is automatically parseable from the output. 
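[Archive note: the -1.99699 in the output above is exactly what the backoff rule described earlier in this thread predicts, assuming the first ARPA entry quoted above is the bigram "<s> dat" (its <s> token would have been stripped by the HTML scrubbing in this archive), so that -0.6622628 is the backoff weight of the history "<s> dat". A quick arithmetic check:]

```python
# Backoff check for the numbers quoted above. Assumption (labeled):
# "-1.358477 dat -0.6622628" is the ARPA bigram line "<s> dat", so
# -0.6622628 is bow(<s> dat), and -1.334724 is log10 P([unk] | dat).
bow_s_dat = -0.6622628
logp_unk_given_dat = -1.334724

# The trigram "<s> dat [unk]" is not in the LM, so ngram computes
#   log10 P([unk] | <s> dat) = bow(<s> dat) + log10 P([unk] | dat)
logp_backed_off = bow_s_dat + logp_unk_given_dat
print(round(logp_backed_off, 5))  # -1.99699, matching the ngram output
```

[So the reported logprob is consistent with a backoff from the missing trigram, even though the client-side output does not flag it.]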
From stolcke at icsi.berkeley.edu Tue Jun 4 22:58:14 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 04 Jun 2013 22:58:14 -0700 Subject: [SRILM User List] Interpreting ngram -ppl output in case of backoff In-Reply-To: <51A9F06C.5030706@student.ru.nl> References: <51A77270.2030109@student.ru.nl> <51A7971E.3080607@icsi.berkeley.edu> <51A9F06C.5030706@student.ru.nl> Message-ID: <51AED376.8090902@icsi.berkeley.edu> On 6/1/2013 6:00 AM, Sander Maijers wrote: > Andreas, you wrote: > > If the word is 'a' and the last two words 'b' and 'c' (in that > order), and you have a bigram hit (output says '[2gram]' ), you'd > have to look up the bigram log probability for 'a c' and add to that > the backoff weight for 'b c'. > I interpret this as P(a | b c). If that is correct, then shouldn't I > actually look up the line in the ARPA LM with "c a" (reverse)? You are correct, the backoff bigram would be 'c a' . > > Can you further comment to clear up my remaining confusion please ... > > 1. The word "Albert" is not in the word list/vocabulary. Nor are there > any N-grams with "Albert". This confuses me. I cannot trace the > appropriate N-gram that lead to the logprob that was reported by > 'ngram'. It seems that I cannot directly see in the 'ngram' output > if/when there had been any backing off during the LM lookups for this > sentence. I assume that the word Albert was actually replaced with > [unk], but in the ngram output, such is not displayed. There also 0 > OOVs reported, which strikes me as odd. All in all I believe that the > following N-grams were looked up: > > -1.358477 dat -0.6622628 > -1.334724 dat [unk] -0.0686222 > -0.6776069 dat [unk] [unk] > > for the 'n gram' output > > dat Albert Heijn het doet zou niet de aanleiding zijn > p( dat | ) = 0.0438046 [ -1.35848 ] > p( Albert | dat ...) = 0.0100695 [ -1.99699 ] > p( Heijn | Albert ...) = 0.210084 [ -0.677607 ] > > How did 'ngram' come to the -1.99699 logprob? 
I'm not sure how exactly you are invoking ngram. But if your ppl output shows the original words and not the [unk] word that means your OOV words were NOT mapped to [unk]. Also, I'm confused that your -debug 2 output doesn't indicate what order of ngram was hit. For reference, if you invoke ngram on the 3gram model that is packaged with SRILM: % ngram -unk -map-unk '@reject@' -lm $SRILM/lm/test/tests/ngram-count-gt/swbd.3bo.gz -debug 2 -ppl - and feed two sentences hi there hix there the output should look like this: hi there p( hi | <s> ) = [2gram] 0.000625719 [ -3.20362 ] p( there | hi ...) = [3gram] 0.0131176 [ -1.88215 ] p( </s> | there ...) = [2gram] 0.137458 [ -0.86183 ] 1 sentences, 2 words, 0 OOVs 0 zeroprobs, logprob= -5.9476 ppl= 96.0578 ppl1= 941.453 hix there p( @reject@ | <s> ) = [2gram] 0.00733392 [ -2.13466 ] p( there | @reject@ ...) = [3gram] 0.00498615 [ -2.30223 ] p( </s> | there ...) = [2gram] 0.0603144 [ -1.21958 ] 1 sentences, 2 words, 0 OOVs 0 zeroprobs, logprob= -5.65648 ppl= 76.8232 ppl1= 673.347 Notice the [2gram] [3gram] indicators, and notice how the OOV "hix" is mapped to the unknown word token "@reject@" (that's the label that this particular model uses). So you must be invoking ngram in a different way, and that may have something to do with the problems you have interpreting the output. 
Andreas From S.N.Maijers at student.ru.nl Wed Jun 5 04:23:11 2013 From: S.N.Maijers at student.ru.nl (Sander Maijers) Date: Wed, 05 Jun 2013 13:23:11 +0200 Subject: [SRILM User List] Interpreting ngram -ppl output in case of backoff In-Reply-To: <51AED376.8090902@icsi.berkeley.edu> References: <51A77270.2030109@student.ru.nl> <51A7971E.3080607@icsi.berkeley.edu> <51A9F06C.5030706@student.ru.nl> <51AED376.8090902@icsi.berkeley.edu> Message-ID: <51AF1F9F.4020505@student.ru.nl> On 05-06-13 07:58, Andreas Stolcke wrote: > On 6/1/2013 6:00 AM, Sander Maijers wrote: >> Andreas, you wrote: >> > If the word is 'a' and the last two words 'b' and 'c' (in that >> order), and you have a bigram hit (output says '[2gram]' ), you'd >> have to look up the bigram log probability for 'a c' and add to that >> the backoff weight for 'b c'. >> I interpret this as P(a | b c). If that is correct, then shouldn't I >> actually look up the line in the ARPA LM with "c a" (reverse)? > You are correct, the backoff bigram would be 'c a' . >> Can you further comment to clear up my remaining confusion please ... >> >> 1. The word "Albert" is not in the word list/vocabulary. Nor are >> there any N-grams with "Albert". This confuses me. I cannot trace the >> appropriate N-gram that lead to the logprob that was reported by >> 'ngram'. It seems that I cannot directly see in the 'ngram' output >> if/when there had been any backing off during the LM lookups for this >> sentence. I assume that the word Albert was actually replaced with >> [unk], but in the ngram output, such is not displayed. There also 0 >> OOVs reported, which strikes me as odd. All in all I believe that the >> following N-grams were looked up: >> >> -1.358477 dat -0.6622628 >> -1.334724 dat [unk] -0.0686222 >> -0.6776069 dat [unk] [unk] >> >> for the 'n gram' output >> >> dat Albert Heijn het doet zou niet de aanleiding zijn >> p( dat | ) = 0.0438046 [ -1.35848 ] >> p( Albert | dat ...) 
= 0.0100695 [ -1.99699 ] >> p( Heijn | Albert ...) = 0.210084 [ -0.677607 ] >> >> How did 'ngram' come to the -1.99699 logprob? > > I'm not sure how exactly you are invoking ngram. But if your ppl > output shows the original words and not the [unk] word that means > your OOV words were NOT mapped to [unk]. > Also, I'm confused by that had that your -debug 2 output doesn't > indicate what order of ngram was hit. > > For reference, if you invoke ngram on the 3gram model that is packaged > with SRILM: > > % ngram -unk -map-unk '@reject@' -lm > $SRILM/lm/test/tests/ngram-count-gt/swbd.3bo.gz -debug 2 -ppl - > > and feed two sentences > > hi there > hix there > > the output should look like this: > > hi there > p( hi | ) = [2gram] 0.000625719 [ -3.20362 ] > p( there | hi ...) = [3gram] 0.0131176 [ -1.88215 ] > p( | there ...) = [2gram] 0.137458 [ -0.86183 ] > 1 sentences, 2 words, 0 OOVs > 0 zeroprobs, logprob= -5.9476 ppl= 96.0578 ppl1= 941.453 > > hix there > p( @reject@ | ) = [2gram] 0.00733392 [ -2.13466 ] > p( there | @reject@ ...) = [3gram] 0.00498615 [ -2.30223 ] > p( | there ...) = [2gram] 0.0603144 [ -1.21958 ] > 1 sentences, 2 words, 0 OOVs > 0 zeroprobs, logprob= -5.65648 ppl= 76.8232 ppl1= 673.347 > > Notice the [2gram] [3gram] indicators, and notice how the OOV "hix" is > mapped to the unknown word token "@reject@" (that's the label that > this particular model uses). > > So you must be invoking ngram in a difference way and that may have > something to do with the problems you have interpreting the output. > > Andreas Yes, there is confusion here. I did not specify the complete command line that I used to invoke 'ngram', only the options that I believed were relevant. Actually, I used the "-use-server" parameter (for the ngram client invocation) ... I haven't read that the debug output is more limited in this sense if you use the 'ngram' server. And I had specified the same debug level with both the "'ngram' -server-port ..." and "ngram -use-server ..." 
invocations). Still, there are no logprobs of "-1.99699" in my LM, so I do not know where that comes from. (I tried to look for the values in the ARPA LM with fewer decimals: the 'ngram' output logprobs and the ARPA logprobs are not exactly the same in my output). There simply must be OOVs in that "Albert Heijn" sentence but they weren't detected, it seems? Could this be a problem that has to do with the 'ngram' server mode? If you are interested in it, I can give you the relevant ARPA LM and the command lines used for a test case. Sander From stolcke at icsi.berkeley.edu Wed Jun 5 14:03:50 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 05 Jun 2013 14:03:50 -0700 Subject: [SRILM User List] Interpreting ngram -ppl output in case of backoff In-Reply-To: <51AF1F9F.4020505@student.ru.nl> References: <51A77270.2030109@student.ru.nl> <51A7971E.3080607@icsi.berkeley.edu> <51A9F06C.5030706@student.ru.nl> <51AED376.8090902@icsi.berkeley.edu> <51AF1F9F.4020505@student.ru.nl> Message-ID: <51AFA7B6.5030502@icsi.berkeley.edu> On 6/5/2013 4:23 AM, Sander Maijers wrote: > Yes, there is confusion here. I did not specify the complete command > line that I used to invoke 'ngram', only the options that I believed > were relevant. Actually, I used the "-use-server" parameter (for the > ngram client invocation) ... I haven't read that the debug output is > more limited in this sense if you use the 'ngram' server. And I had > specified the same debug level with both the "'ngram' -server-port > ..." and "ngram -use-server ..." invocations). I suggest you first debug the probability computation without client/server. Indeed, the client side has no way of knowing some of the debugging output from the server end (like what ngram order was used). That's in the nature of the client/server interface. However, you can elicit some additional information by giving -debug 2 to the server side (which you said you did). 
For example, % ngram -unk -map-unk '@reject@' -lm $SRILM/lm/test/tests/ngram-count-gt/swbd.3bo.gz -debug 2 -server-port 8888 reading 33110 1-grams reading 425750 2-grams reading 268962 3-grams starting prob server on port 8888 client 54671 at 192.150.186.222: connection accepted client 54671 at 192.150.186.222: _R_E_M_O_T_E_L_M_V=2 client 54671 at 192.150.186.222: OK client 54671 at 192.150.186.222: W hi client 54671 at 192.150.186.222: [2gram]OK -3.20362 client 54671 at 192.150.186.222: W hi there client 54671 at 192.150.186.222: [3gram]OK -1.88215 client 54671 at 192.150.186.222: W hi there client 54671 at 192.150.186.222: [2gram]OK -0.86183 Notice the [2gram], [3gram] etc. Those are from the server looking up the ngrams. The client only sees the probabilities being sent back. Andreas From S.N.Maijers at student.ru.nl Mon Jun 10 11:55:12 2013 From: S.N.Maijers at student.ru.nl (Sander Maijers) Date: Mon, 10 Jun 2013 20:55:12 +0200 Subject: [SRILM User List] ngram-count: -skip in combination with -unk In-Reply-To: <51A77270.2030109@student.ru.nl> References: <51A77270.2030109@student.ru.nl> Message-ID: <51B62110.800@student.ru.nl> What is the interaction between the "-unk" and "-skip" parameters to 'ngram-count' when creating an LM given a word list that fully covers the training words? According to srilm-faq.7, the precise interaction in terms of backoff strategy when a test word sequence is looked up that has no corresponding N-gram in the LM depends on the particular backoff scheme. 
From stolcke at icsi.berkeley.edu Mon Jun 10 15:37:30 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Mon, 10 Jun 2013 18:37:30 -0400 Subject: [SRILM User List] ngram-count: -skip in combination with -unk In-Reply-To: <51B62110.800@student.ru.nl> References: <51A77270.2030109@student.ru.nl> <51B62110.800@student.ru.nl> Message-ID: <51B6552A.80803@icsi.berkeley.edu> On 6/10/2013 2:55 PM, Sander Maijers wrote: > What is the interaction between the "-unk" and "-skip" parameters to > 'ngram-count' when creating an LM given a word list that fully covers > the training words? > > According to srilm-faq.7, the precise interaction in terms of backoff > strategy when a test word sequence is looked up that has no > corresponding N-gram in the LM depends on the particular backoff scheme. The effect of -unk is very specific: it allows including ngrams involving the word in the LM. Without it, words not contained in the vocabulary are still mapped to but then discarded from the model. Using ngram-count -unk is usually used when -vocab is also specified. Otherwise all words are implicitly added to the vocabulary and you wouldn't see any occurrences. The same is true if your word list contains all the words in your training data: you won't see any ngrams containing the word (unless the input data already contains them, which is another way to structure your data processing). The way regular backoff and the "skip" ngram operate is really orthogonal to the above. Words are either mapped to themselves or to , but once that is done the model (in backing off, mixing regular and skip ngram estimates) doesn't nothing special with . If not clear, maybe you could give a specific example and we can walk through it. 
Andreas From stolcke at icsi.berkeley.edu Tue Jun 11 15:05:17 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 11 Jun 2013 18:05:17 -0400 Subject: [SRILM User List] ngram-count: -skip in combination with -unk In-Reply-To: <51B7838B.7070007@student.ru.nl> References: <51A77270.2030109@student.ru.nl> <51B62110.800@student.ru.nl> <51B6552A.80803@icsi.berkeley.edu> <51B7838B.7070007@student.ru.nl> Message-ID: <51B79F1D.2060002@icsi.berkeley.edu> On 6/11/2013 4:07 PM, Sander Maijers wrote: > On 11-06-13 00:37, Andreas Stolcke wrote: >> On 6/10/2013 2:55 PM, Sander Maijers wrote: >>> What is the interaction between the "-unk" and "-skip" parameters to >>> 'ngram-count' when creating an LM given a word list that fully covers >>> the training words? >>> >>> According to srilm-faq.7, the precise interaction in terms of backoff >>> strategy when a test word sequence is looked up that has no >>> corresponding N-gram in the LM depends on the particular backoff >>> scheme. >> >> The effect of -unk is very specific: it allows including ngrams >> involving the <unk> word in the LM. >> Without it, words not contained in the vocabulary are still mapped to >> <unk> but then discarded from the model. >> >> ngram-count -unk is usually used when -vocab is also specified. >> Otherwise all words are implicitly added to the vocabulary and you >> wouldn't see any <unk> occurrences. The same is true if your word list >> contains all the words in your training data: you won't see any ngrams >> containing the <unk> word >> (unless the input data already contains <unk> tokens, which is another way to >> structure your data processing). >> >> The way regular backoff and the "skip" ngram operate is really >> orthogonal to the above. Words are either mapped to themselves or to >> <unk>, but once that is done the model (in backing off, mixing regular >> and skip ngram estimates) does nothing special with <unk>. >> >> If not clear, maybe you could give a specific example and we can walk >> through it. 
>> >> Andreas > > Suppose ocurrences of "a b c" have added to the count for "a b " > in certain LM. Suppose that the 3-gram for "a b c" is looked up. It > would match "a b " and not back off (no need, because a matching > N-gram was found). Conversely, suppose "a b " is not in the LM. > Then backing off would be attempted. > > I think that my word list fully covers the training data in itself. > Then it wouldn't matter whether I created the LM with or without the > '-unk' parameter. But if there instead are a few OOV words in the > training data, then specifying '-unk' and '-skip' means that in cases > like in my previous "a b c" example no back off would be performed. In > fact, all test word sequences "a b (OOV)" will get the same > probability estimate, the probability estimate for "a b ". > > Is the above reasoning entirely correct, or not? This reasoning is > what made me ask this question. If this is the case, then it would be > better that I not use '-unk' as my goal is to compare two language > models, one with backoff and one without. Your reasoning is correct. Andreas From akmalcuet00 at yahoo.com Wed Jun 12 16:02:59 2013 From: akmalcuet00 at yahoo.com (Md. Akmal Haidar) Date: Wed, 12 Jun 2013 16:02:59 -0700 (PDT) Subject: [SRILM User List] SRILM command in Matlab Message-ID: <1371078179.6978.YahooMailNeo@web161002.mail.bf1.yahoo.com> Hi, Does anyone know how to execute srilm command in Matlab? For example, I need to compute the perplexityof an LM. I have to run the following command in Matlab c:\cygwin\srilm\bin\cygwin\ngram -lm LM -ppl test.txt -debug 2 is there any way to run this in Matlab? Thanks Best Regards Akmal -------------- next part -------------- An HTML attachment was scrubbed... 
From chenmengdx at gmail.com Thu Jun 13 08:23:44 2013 From: chenmengdx at gmail.com (Meng CHEN) Date: Thu, 13 Jun 2013 23:23:44 +0800 Subject: [SRILM User List] About wbdiscount and meta-tag options Message-ID: Hi, the make-big-lm command specifies -read-with-mincounts and -meta-tag by default. In the help page, it says "if -meta-tag is defined, these low-count N-grams will be converted to count-of-count N-grams, so that smoothing methods that need this information still work correctly". However, for wbdiscount, we don't need the count-of-count information to compute the discounting parameters. So, why does make-big-lm specify the -meta-tag option for wbdiscount by default? Is that necessary? Can I remove it? (I tried that, and found the N-grams are the same in the model, but the probabilities are different.) Thanks! Meng CHEN From stolcke at icsi.berkeley.edu Thu Jun 13 10:43:20 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 13 Jun 2013 10:43:20 -0700 Subject: [SRILM User List] About wbdiscount and meta-tag options In-Reply-To: References: Message-ID: <51BA04B8.9070703@icsi.berkeley.edu> On 6/13/2013 8:23 AM, Meng CHEN wrote: > Hi, the make-big-lm command specifies -read-with-mincounts and > -meta-tag by default. In the help page, it says "if -meta-tag is > defined, these low-count N-grams will be converted to count-of-count > N-grams, so that smoothing methods that need this information still > work correctly". However, for wbdiscount, we don't need the > count-of-count information to compute the discounting parameters. So, > why does make-big-lm specify the -meta-tag option for wbdiscount by > default? Is that necessary? Can I remove it? (I tried that, and found > the N-grams are the same in the model, but the probabilities are different.) > Thanks! WB discounting requires the count of the distinct word types for each context. 
That information can also be gotten from the meta-counts, and that's why you're getting different results without -meta-tag. BTW, I should update the man page to say that WB discounting is also supported in make-big-lm. Andreas > > > Meng CHEN > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From stolcke at icsi.berkeley.edu Thu Jun 13 10:54:05 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 13 Jun 2013 10:54:05 -0700 Subject: [SRILM User List] SRILM command in Matlab In-Reply-To: <1371078179.6978.YahooMailNeo@web161002.mail.bf1.yahoo.com> References: <1371078179.6978.YahooMailNeo@web161002.mail.bf1.yahoo.com> Message-ID: <51BA073D.4020502@icsi.berkeley.edu> On 6/12/2013 4:02 PM, Md. Akmal Haidar wrote: > c:\cygwin\srilm\bin\cygwin\ngram -lm LM -ppl test.txt -debug 2 system('c:\cygwin\srilm\bin\cygwin\ngram -lm LM -ppl test.txt -debug 2 > pploutput'); should work. You can then read the pploutput file. Since the output is not in a format that's very amenable to Matlab, you might want to write a filter script that reformats the output in a way that can be loaded more easily, e.g., as a matrix. 
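[Archive note: such a filter might look like the following sketch in Python (a hypothetical script, not part of SRILM; it only assumes the "p( w | h ...) = [Ngram] p [ logp ]" line shape shown elsewhere in this thread):]

```python
#!/usr/bin/env python
# Hypothetical 'reformat-ppl' filter (a sketch, not part of SRILM):
# pulls the word, the ngram-order tag, the probability, and the log
# probability out of each "p( w | h ...) = [Ngram] p [ logp ]" line of
# ngram -debug 2 -ppl output, one whitespace-separated row per word.
import re
import sys

PPL_LINE = re.compile(
    r"p\( (\S+) \| .*?\)\s*=\s*\[(\w+)\]\s*(\S+)\s*\[\s*(\S+)\s*\]")

def reformat(lines):
    for line in lines:
        m = PPL_LINE.search(line)
        if m:
            # e.g. "there 3gram 0.0131176 -1.88215"
            yield " ".join(m.groups())

if __name__ == "__main__":
    for row in reformat(sys.stdin):
        print(row)
```

[The resulting pploutput file could then be read on the Matlab side, e.g., with textscan.]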
Let's say that script is called 'reformat-ppl'; then you would run something like system('c:\cygwin\srilm\bin\cygwin\ngram -lm LM -ppl test.txt -debug 2 | reformat-ppl > pploutput'); Andreas From qzhou at ispeech.org Thu Jun 13 11:13:21 2013 From: qzhou at ispeech.org (Qiru Zhou) Date: Thu, 13 Jun 2013 14:13:21 -0400 Subject: [SRILM User List] SRILM command in Matlab In-Reply-To: <1371140507.97166.YahooMailNeo@web161003.mail.bf1.yahoo.com> References: <1371078179.6978.YahooMailNeo@web161002.mail.bf1.yahoo.com> <51B9DA51.6000502@ispeech.org> <1371140507.97166.YahooMailNeo@web161003.mail.bf1.yahoo.com> Message-ID: <51BA0BC1.2020306@ispeech.org> Works for me using an MSVC 2010 compiled ngram: [status,result] = system('D:\SLTools\SRI\srilm1.7\bin\x64\Release\ngram.exe -lm D:\...\eg.mad.arpa -ppl D:\...\en-us.trans.txt -debug 2') >> result reading 100001 1-grams reading 864169 2-grams reading 1739458 3-grams ... p( am | ...) = [1gram] 0.0033243 [ -2.4783 ] p( interested | am ...) = [2gram] 0.000292752 [ -3.5335 ] p( in | interested ...) = [3gram] 0.454779 [ -0.3422 ] p( learning | in ...) = [3gram] 0.00223203 [ -2.6513 ] p( more | learning ...) = [2gram] 0.00943844 [ -2.0251 ] p( | more ...) = [OOV] 0 [ -1.#INF ] p( restaurant | ...) = [1gram] 0.000581166 [ -3.2357 ] p( is | restaurant ...) = [2gram] 0.00385035 [ -2.4145 ] p( on | is ...) = [3gram] 0.0063841 [ -2.1949 ] p( the | on ...) = [3gram] 0.192486 [ -0.7156 ] p( fifth | the ...) = [3gram] 0.000952138 [ -3.0213 ] p( floor | fifth ...) = [2gram] 0.000440758 [ -3.3558 ] p( | floor ...) = [3gram] 0.304299 [ -0.5167 ] ... On 6/13/2013 12:21 PM, Md. Akmal Haidar wrote: > Hi Qiru, > > I tried it. It didn't work. The result shows ' '. > > Akmal > > ---------------------------------------------------------------------------------------------------- > *From:* Qiru Zhou > *To:* Md. 
Akmal Haidar > *Sent:* Thursday, June 13, 2013 10:42:26 AM > *Subject:* Re: [SRILM User List] SRILM command in Matlab > > Try: > [status,result] = system('c:\cygwin\srilm\bin\cygwin\ngram -lm LM -ppl test.txt -debug 2') > > -- Qiru > > On 6/12/2013 7:02 PM, Md. Akmal Haidar wrote: > >Hi, >> >> >>Does anyone know how to execute srilm command in Matlab? >> >> >>For example, I need to compute the perplexity of an LM. I have to run the following command in Matlab >> >>c:\cygwin\srilm\bin\cygwin\ngram -lm LM -ppl test.txt -debug 2 >> >> >>is there any way to run this in Matlab? >> >> >>Thanks >>Best Regards >>Akmal > >-- QIRU ZHOU | CHIEF SCIENTIST | iSPEECH, INC. | T 917.338.7723 | qzhou at iSpeech.org | www.iSpeech.org From akmalcuet00 at yahoo.com Thu Jun 13 12:10:06 2013 From: akmalcuet00 at yahoo.com (Md. Akmal Haidar) Date: Thu, 13 Jun 2013 12:10:06 -0700 (PDT) Subject: [SRILM User List] SRILM command in Matlab In-Reply-To: <51BA0BC1.2020306@ispeech.org> References: <1371078179.6978.YahooMailNeo@web161002.mail.bf1.yahoo.com> <51B9DA51.6000502@ispeech.org> <1371140507.97166.YahooMailNeo@web161003.mail.bf1.yahoo.com> <51BA0BC1.2020306@ispeech.org> Message-ID: <1371150606.79061.YahooMailNeo@web161004.mail.bf1.yahoo.com> Thanks Qiru.. But it does not work for me. I compiled the srilm using cygwin. I reinstalled the srilm again in cygwin...still it does not work. ________________________________ From: Qiru Zhou To: Md. 
Akmal Haidar ; srilm-user at speech.sri.com Sent: Thursday, June 13, 2013 2:13:21 PM Subject: Re: [SRILM User List] SRILM command in Matlab Works for me using an MSVC 2010 compiled ngram: [status,result] = system('D:\SLTools\SRI\srilm1.7\bin\x64\Release\ngram.exe -lm D:\...\eg.mad.arpa -ppl D:\...\en-us.trans.txt -debug 2') >> result reading 100001 1-grams reading 864169 2-grams reading 1739458 3-grams ... p( am | ...) = [1gram] 0.0033243 [ -2.4783 ] p( interested | am ...) = [2gram] 0.000292752 [ -3.5335 ] p( in | interested ...) = [3gram] 0.454779 [ -0.3422 ] p( learning | in ...) = [3gram] 0.00223203 [ -2.6513 ] p( more | learning ...) = [2gram] 0.00943844 [ -2.0251 ] p( | more ...) = [OOV] 0 [ -1.#INF ] p( restaurant | ...) = [1gram] 0.000581166 [ -3.2357 ] p( is | restaurant ...) = [2gram] 0.00385035 [ -2.4145 ] p( on | is ...) = [3gram] 0.0063841 [ -2.1949 ] p( the | on ...) = [3gram] 0.192486 [ -0.7156 ] p( fifth | the ...) = [3gram] 0.000952138 [ -3.0213 ] p( floor | fifth ...) = [2gram] 0.000440758 [ -3.3558 ] p( | floor ...) = [3gram] 0.304299 [ -0.5167 ] ... On 6/13/2013 12:21 PM, Md. Akmal Haidar wrote: Hi Qiru, > >I tried it. It didn't work. The result shows ' '. > > >Akmal > > > > >________________________________ > From: Qiru Zhou >To: Md. Akmal Haidar >Sent: Thursday, June 13, 2013 10:42:26 AM >Subject: Re: [SRILM User List] SRILM command in Matlab > > > >Try: >[status,result] = system('c:\cygwin\srilm\bin\cygwin\ngram -lm LM -ppl test.txt -debug 2') > >-- Qiru > > >On 6/12/2013 7:02 PM, Md. Akmal Haidar wrote: > >Hi, >> >> >>Does anyone know how to execute srilm command in Matlab? >> >> >>For example, I need to compute the perplexity of an LM. I have to run the following command in Matlab >> >>c:\cygwin\srilm\bin\cygwin\ngram -lm LM -ppl test.txt -debug 2 >> >> >>is there any way to run this in Matlab? 
>> >> >>Thanks >>Best Regards >>Akmal >> >> >>_______________________________________________ SRILM-User site list SRILM-User at speech.sri.com http://www.speech.sri.com/mailman/listinfo/srilm-user > >-- QIRU ZHOU | CHIEF SCIENTIST | iSPEECH, INC. | T 917.338.7723 | qzhou at iSpeech.org | www.iSpeech.org > > -- QIRU ZHOU | CHIEF SCIENTIST | iSPEECH, INC. | T 917.338.7723 | qzhou at iSpeech.org | www.iSpeech.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Thu Jun 13 13:03:17 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 13 Jun 2013 13:03:17 -0700 Subject: [SRILM User List] SRILM command in Matlab In-Reply-To: <1371150606.79061.YahooMailNeo@web161004.mail.bf1.yahoo.com> References: <1371078179.6978.YahooMailNeo@web161002.mail.bf1.yahoo.com> <51B9DA51.6000502@ispeech.org> <1371140507.97166.YahooMailNeo@web161003.mail.bf1.yahoo.com> <51BA0BC1.2020306@ispeech.org> <1371150606.79061.YahooMailNeo@web161004.mail.bf1.yahoo.com> Message-ID: <51BA2585.3080500@icsi.berkeley.edu> On 6/13/2013 12:10 PM, Md. Akmal Haidar wrote: > Thanks Qiru.. > > But it does not work for me. I compiled the srilm using cygwin. > I reinstalled the srilm again in cygwin...still it does not work. I tried it with the cygwin version and it works. You need to locate the windows path of your cygwin binaries. In your cygwin shell you could run cygpath -w $SRILM/bin/cygwin/ngram.exe to get the path to the executable (assuming the SRILM variable is set to point to the SRILM base directory). If you still have a problem you should copy the entire command line and all output (error messages etc.) to get more help. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From akmalcuet00 at yahoo.com Thu Jun 13 13:24:42 2013 From: akmalcuet00 at yahoo.com (Md. 
Akmal Haidar) Date: Thu, 13 Jun 2013 13:24:42 -0700 (PDT) Subject: [SRILM User List] SRILM command in Matlab In-Reply-To: <1371150606.79061.YahooMailNeo@web161004.mail.bf1.yahoo.com> References: <1371078179.6978.YahooMailNeo@web161002.mail.bf1.yahoo.com> <51B9DA51.6000502@ispeech.org> <1371140507.97166.YahooMailNeo@web161003.mail.bf1.yahoo.com> <51BA0BC1.2020306@ispeech.org> <1371150606.79061.YahooMailNeo@web161004.mail.bf1.yahoo.com> Message-ID: <1371155082.56202.YahooMailNeo@web161003.mail.bf1.yahoo.com> Works now by using the MS Visual Studio compiled srilm. Thanks ________________________________ From: Md. Akmal Haidar To: "qzhou at ispeech.org" ; "srilm-user at speech.sri.com" Sent: Thursday, June 13, 2013 3:10:06 PM Subject: Re: [SRILM User List] SRILM command in Matlab Thanks Qiru.. But it does not work for me. I compiled the srilm using cygwin. I reinstalled the srilm again in cygwin...still it does not work. ________________________________ From: Qiru Zhou To: Md. Akmal Haidar ; srilm-user at speech.sri.com Sent: Thursday, June 13, 2013 2:13:21 PM Subject: Re: [SRILM User List] SRILM command in Matlab Works for me using an MSVC 2010 compiled ngram:

[status,result] = system('D:\SLTools\SRI\srilm1.7\bin\x64\Release\ngram.exe -lm D:\...\eg.mad.arpa -ppl D:\...\en-us.trans.txt -debug 2')

>> result
reading 100001 1-grams
reading 864169 2-grams
reading 1739458 3-grams
...
p( am | ...) = [1gram] 0.0033243 [ -2.4783 ]
p( interested | am ...) = [2gram] 0.000292752 [ -3.5335 ]
p( in | interested ...) = [3gram] 0.454779 [ -0.3422 ]
p( learning | in ...) = [3gram] 0.00223203 [ -2.6513 ]
p( more | learning ...) = [2gram] 0.00943844 [ -2.0251 ]
p( | more ...) = [OOV] 0 [ -1.#INF ]
p( restaurant | ...) = [1gram] 0.000581166 [ -3.2357 ]
p( is | restaurant ...) = [2gram] 0.00385035 [ -2.4145 ]
p( on | is ...) = [3gram] 0.0063841 [ -2.1949 ]
p( the | on ...) = [3gram] 0.192486 [ -0.7156 ]
p( fifth | the ...) = [3gram] 0.000952138 [ -3.0213 ]
p( floor | fifth ...) = [2gram] 0.000440758 [ -3.3558 ]
p( | floor ...) = [3gram] 0.304299 [ -0.5167 ]
...

On 6/13/2013 12:21 PM, Md. Akmal Haidar wrote: Hi Qiru, > >I tried it. It didn't work. The result shows ' '. > > >Akmal > > > > >________________________________ > From: Qiru Zhou >To: Md. Akmal Haidar >Sent: Thursday, June 13, 2013 10:42:26 AM >Subject: Re: [SRILM User List] SRILM command in Matlab > > > >Try: > [status,result] = system('c:\cygwin\srilm\bin\cygwin\ngram -lm LM -ppl test.txt -debug 2') > >-- Qiru > > >On 6/12/2013 7:02 PM, Md. Akmal Haidar wrote: > >Hi, >> >> >>Does anyone know how to execute srilm command in Matlab? >> >> >>For example, I need to compute the perplexity of an LM. I have to run the following command in Matlab >> >>c:\cygwin\srilm\bin\cygwin\ngram -lm LM -ppl test.txt -debug 2 >> >> >>is there any way to run this in Matlab? >> >> >>Thanks >>Best Regards >>Akmal >> >> >>_______________________________________________ SRILM-User site list SRILM-User at speech.sri.com http://www.speech.sri.com/mailman/listinfo/srilm-user > >-- QIRU ZHOU | CHIEF SCIENTIST | iSPEECH, INC. | T 917.338.7723 | qzhou at iSpeech.org | www.iSpeech.org > > -- QIRU ZHOU | CHIEF SCIENTIST | iSPEECH, INC. | T 917.338.7723 | qzhou at iSpeech.org | www.iSpeech.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.N.Maijers at student.ru.nl Sat Jun 15 12:39:50 2013 From: S.N.Maijers at student.ru.nl (Sander Maijers) Date: Sat, 15 Jun 2013 21:39:50 +0200 Subject: [SRILM User List] ngram-count's ARPA N-gram LM extensions beyond "\end\" marker Message-ID: <51BCC306.3080906@student.ru.nl> In the case of an LM created with '-skip', what is the meaning of the values past "\end\"?
They are of the form:

a-team 0.5
a-teens 0.5
a-test 0.5

I do not understand their relation to these 'ngram-count' parameters:

-init-lm lmfile
    Load an LM to initialize the parameters of the skip-N-gram.
-skip-init value
    The initial skip probability for all words.
-em-iters n
    The maximum number of EM iterations.
-em-delta d
    The convergence criterion for EM: if the relative change in log likelihood falls below the given value, iteration stops.

From S.N.Maijers at student.ru.nl Sun Jun 16 09:29:43 2013 From: S.N.Maijers at student.ru.nl (Sander Maijers) Date: Sun, 16 Jun 2013 18:29:43 +0200 Subject: [SRILM User List] ngram-count's extensions to the ARPA LM Message-ID: <51BDE7F7.7010204@student.ru.nl> Does 'ngram' take into account the values that are in a quasi-ARPA LM with SRILM extensions? So, if I have created an LM with 'ngram-count' with '-skip' parameter and I use it with 'ngram', will the skipping be reflected in the perplexity results reported? Could you please document explicitly for each respective 'ngram-count' parameter, for instance '-skip', whether its use will result in non-ARPA LM output (an ARPA LM with SRILM extensions)? From stolcke at icsi.berkeley.edu Sun Jun 16 12:13:00 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sun, 16 Jun 2013 12:13:00 -0700 Subject: [SRILM User List] ngram-count's ARPA N-gram LM extensions beyond "\end\" marker In-Reply-To: <51BCC306.3080906@student.ru.nl> References: <51BCC306.3080906@student.ru.nl> Message-ID: <51BE0E3C.8030201@icsi.berkeley.edu> On 6/15/2013 12:39 PM, Sander Maijers wrote: > In the case of an LM created with '-skip', what is the meaning of the > values past "\end\"? > > They are of the form: > > a-team 0.5 > a-teens 0.5 > a-test 0.5 These are the skip probabilities estimated by the model. 0.5 is the default initial value, but after doing the EM estimation each word would have its individual probability of being skipped in the computation of condition probabilities.
With the above values you would get P(w | a b "a-team" ) = 0.5 P'(w | a b) + 0.5 P'(w | a b "a-team" ) and so on for all words. Here P' is the probability as determined by a standard n-gram LM. Note: "a-team" is the word right before the word being predicted (w). > > > I do not understand their relation to these 'ngram-count' parameters: > > -init-lm lmfile > Load an LM to initialize the parameters of the skip-N-gram. As it says, you can start the estimation process with a preexisting set of parameters, read from a model file "lmfile". > -skip-init value > The initial skip probability for all words. Alternatively, you can initialize all skip probabilities to the same fixed value. > -em-iters n > The maximum number of EM iterations. > -em-delta d > The convergence criterion for EM: if the relative change in log > likelihood falls below the given value, iteration stops. These are just standard parameters for an EM-type algorithm. Andreas From stolcke at icsi.berkeley.edu Sun Jun 16 12:15:33 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sun, 16 Jun 2013 12:15:33 -0700 Subject: [SRILM User List] ngram-count's extensions to the ARPA LM In-Reply-To: <51BDE7F7.7010204@student.ru.nl> References: <51BDE7F7.7010204@student.ru.nl> Message-ID: <51BE0ED5.5060509@icsi.berkeley.edu> On 6/16/2013 9:29 AM, Sander Maijers wrote: > Does 'ngram' take into account the values that are in a quasi-ARPA LM > with SRILM extensions? So, if I have created an LM with 'ngram-count' > with '-skip' parameter and I use it with 'ngram', will the skipping be > reflected in the perplexity results reported? Yes, if you run "ngram -skip" it will expect the skip probabilities following the \end\ line in the model file. > > Could you please document explicitly for each respective 'ngram-count' > parameter, for instance '-skip', whether its use will result in > non-ARPA LM output (an ARPA LM with SRILM extensions)? 
The ngram(1) man page says

-skip
    Interpret the LM as a ''skip'' N-gram model.

Andreas

From Joris.Pelemans at esat.kuleuven.be Mon Jun 17 01:03:44 2013 From: Joris.Pelemans at esat.kuleuven.be (Joris Pelemans) Date: Mon, 17 Jun 2013 10:03:44 +0200 Subject: [SRILM User List] How to model unseen words without N_1 Message-ID: <51BEC2E0.8080907@esat.kuleuven.be> Hello, I am trying to build a unigram model with only the 400k most frequent words (this is essential) out of a training set of 4M tokens. The language model has to be open, i.e. include the <unk> tag, because I want to assign probabilities to unseen words. However, I don't want it to base the probability for <unk> on that part of 4M minus 400k words, because then <unk> would get way too much probability mass (since there is a lot of data that I do not include in my LM). I simply want to ignore the other words and build a model based on the Good-Turing intuition of count-of-counts. However, since I limit the training data to 400k words, my training data does not contain any words with a frequency of 1 (i.e. N_1 = 0). How should I go about building this language model? Thanks in advance, Joris Pelemans From stolcke at icsi.berkeley.edu Mon Jun 17 10:16:42 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Mon, 17 Jun 2013 10:16:42 -0700 Subject: [SRILM User List] How to model unseen words without N_1 In-Reply-To: <51BEC2E0.8080907@esat.kuleuven.be> References: <51BEC2E0.8080907@esat.kuleuven.be> Message-ID: <51BF447A.4000005@icsi.berkeley.edu> On 6/17/2013 1:03 AM, Joris Pelemans wrote: > Hello, > > I am trying to build a unigram model with only the 400k most frequent > words (this is essential) out of a training set of 4M tokens. The > language model has to be open i.e. include the <unk> tag, because I > want to assign probabilities to unseen words.
However, I don't want it > to base the probability for <unk> on that part of 4M minus 400k words, > because then <unk> would get way too much probability mass (since > there is a lot of data that I do not include in my LM). I simply want > to ignore the other words and build a model based on the > Good-Turing intuition of count-of-counts. However, since I limit the > training data to 400k words, my training data does not contain any > words with a frequency of 1 (i.e. N_1 = 0). > > How should I go about building this language model? To work around the problem of missing N_1 for estimating GT parameters, you should run ngram-count twice. First, without vocabulary restriction, and saving the GT parameters to a file (with -gt1 FILE and no -lm option). Second, you run ngram-count again, with -vocab option, -lm and -gt1 FILE. This will read the smoothing parameters from FILE. (The make-big-lm wrapper script automates this two-step process.) I don't have a good solution for setting the unigram probability directly based on GT smoothing. I would recommend one of two practical solutions. 1) Replace rare words in your training data with <unk> ahead of running ngram-count (this also gives you ngrams that predict unseen words). 2) Interpolate your LM with an LM containing only <unk> and optimize the interpolation weight on a held-out set. Of course you can always edit the LM file to insert <unk> with whatever probability you want (and possibly use ngram -renorm to renormalize the model). Andreas From venkataraman.anand at gmail.com Mon Jun 17 11:12:50 2013 From: venkataraman.anand at gmail.com (Anand Venkataraman) Date: Mon, 17 Jun 2013 11:12:50 -0700 Subject: [SRILM User List] How to model unseen words without N_1 In-Reply-To: <51BF447A.4000005@icsi.berkeley.edu> References: <51BEC2E0.8080907@esat.kuleuven.be> <51BF447A.4000005@icsi.berkeley.edu> Message-ID: What Andreas suggests is probably the best.
But depending on the exact application you have in mind, one other option to consider is to simply pre-process your input corpus and either delete all non-vocab words, or replace them (or runs of them) with a special meta-word of your choice, e.g. @reject@. It may be that there's an option in ngram* to do these in-process, I must check the docs. Else, a simple pre-processing filter in awk, perl or python should do the trick. & On Mon, Jun 17, 2013 at 10:16 AM, Andreas Stolcke wrote: > On 6/17/2013 1:03 AM, Joris Pelemans wrote: > >> Hello, >> >> I am trying to build a unigram model with only the 400k most frequent >> words (this is essential) out of a training set of 4M tokens. The language >> model has to be open i.e. include the <unk> tag, because I want to assign >> probabilities to unseen words. However, I don't want it to base the >> probability for <unk> on that part of 4M minus 400k words, because then >> <unk> would get way too much probability mass (since there is a lot of data >> that I do not include in my LM). I simply want to ignore the other words >> and build a model based on the Good-Turing intuition of >> count-of-counts. However, since I limit the training data to 400k words, my >> training data does not contain any words with a frequency of 1 (i.e. N_1 = >> 0). >> >> How should I go about building this language model? >> > > To work around the problem of missing N_1 for estimating GT parameters, > you should run ngram-count twice. First, without vocabulary restriction, > and saving the GT parameters to a file (with -gt1 FILE and no -lm option). > Second, you run ngram-count again, with -vocab option, -lm and -gt1 > FILE. This will read the smoothing parameters from FILE. (The > make-big-lm wrapper script automates this two-step process.) > > I don't have a good solution for setting the unigram probability > directly based on GT smoothing. I would recommend one of two practical > solutions.
> 1) Replace rare words in your training data with <unk> ahead of running > ngram-count (this also gives you ngrams that predict unseen words). > 2) Interpolate your LM with an LM containing only <unk> and optimize the > interpolation weight on a held-out set. > > Of course you can always edit the LM file to insert <unk> with whatever > probability you want (and possibly use ngram -renorm to renormalize the > model). > > Andreas > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user > -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.N.Maijers at student.ru.nl Tue Jun 18 15:39:15 2013 From: S.N.Maijers at student.ru.nl (Sander Maijers) Date: Wed, 19 Jun 2013 00:39:15 +0200 Subject: [SRILM User List] ngram-count's ARPA N-gram LM extensions beyond "\end\" marker In-Reply-To: <51BE0E3C.8030201@icsi.berkeley.edu> References: <51BCC306.3080906@student.ru.nl> <51BE0E3C.8030201@icsi.berkeley.edu> Message-ID: <51C0E193.5070108@student.ru.nl> On 16-06-13 21:13, Andreas Stolcke wrote: > On 6/15/2013 12:39 PM, Sander Maijers wrote: >> In the case of an LM created with '-skip', what is the meaning of the >> values past "\end\"? >> >> They are of the form: >> >> a-team 0.5 >> a-teens 0.5 >> a-test 0.5 > > These are the skip probabilities estimated by the model. 0.5 is the > default initial value, but after doing the EM estimation each word would > have its individual probability of being skipped in the computation of > condition probabilities. With the above values you would get > > P(w | a b "a-team" ) = 0.5 P'(w | a b) + 0.5 P'(w | a b "a-team" ) > > and so on for all words. Here P' is the probability as determined by a > standard n-gram LM. > Note: "a-team" is the word right before the word being predicted (w).
> >> >> >> I do not understand their relation to these 'ngram-count' parameters: >> >> -init-lm lmfile >> Load an LM to initialize the parameters of the skip-N-gram. > As it says, you can start the estimation process with a preexisting set > of parameters, read from a model file "lmfile". > >> -skip-init value >> The initial skip probability for all words. > Alternatively, you can initialize all skip probabilities to the same > fixed value. >> -em-iters n >> The maximum number of EM iterations. >> -em-delta d >> The convergence criterion for EM: if the relative change in log >> likelihood falls below the given value, iteration stops. > These are just standard parameters for an EM-type algorithm. > > Andreas > 1. Can only the first preceding word ("a-team") be skipped in this kind of skip LM? I first believed all history words could be skipped, except for the very last (most distant from w_n), but now I am not sure anymore. 2. In this case, what kind of smoothing goes on under the hood of P'? I have created my skip LM with the following parameters to 'ngram-count': -vocab %s -prune %s -skip -debug 1 -order 3 -text %s -sort -lm %s -limit-vocab -tolower does that also incorporate backoff and Good-Turing discounting like it would without '-skip'? From stolcke at icsi.berkeley.edu Tue Jun 18 16:44:55 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 18 Jun 2013 16:44:55 -0700 Subject: [SRILM User List] ngram-count's ARPA N-gram LM extensions beyond "\end\" marker In-Reply-To: <51C0E193.5070108@student.ru.nl> References: <51BCC306.3080906@student.ru.nl> <51BE0E3C.8030201@icsi.berkeley.edu> <51C0E193.5070108@student.ru.nl> Message-ID: <51C0F0F7.7020406@icsi.berkeley.edu> On 6/18/2013 3:39 PM, Sander Maijers wrote: > On 16-06-13 21:13, Andreas Stolcke wrote: >> On 6/15/2013 12:39 PM, Sander Maijers wrote: >>> In the case of an LM created with '-skip', what is the meaning of the >>> values past "\end\"? 
>>> >>> They are of the form: >>> >>> a-team 0.5 >>> a-teens 0.5 >>> a-test 0.5 >> >> These are the skip probabilities estimated by the model. 0.5 is the >> default initial value, but after doing the EM estimation each word would >> have its individual probability of being skipped in the computation of >> condition probabilities. With the above values you would get >> >> P(w | a b "a-team" ) = 0.5 P'(w | a b) + 0.5 P'(w | a b "a-team" ) >> >> and so on for all words. Here P' is the probability as determined by a >> standard n-gram LM. >> Note: "a-team" is the word right before the word being predicted (w). >> >>> >>> >>> I do not understand their relation to these 'ngram-count' parameters: >>> >>> -init-lm lmfile >>> Load an LM to initialize the parameters of the skip-N-gram. >> As it says, you can start the estimation process with a preexisting set >> of parameters, read from a model file "lmfile". >> >>> -skip-init value >>> The initial skip probability for all words. >> Alternatively, you can initialize all skip probabilities to the same >> fixed value. >>> -em-iters n >>> The maximum number of EM iterations. >>> -em-delta d >>> The convergence criterion for EM: if the relative change in log >>> likelihood falls below the given value, iteration stops. >> These are just standard parameters for an EM-type algorithm. >> >> Andreas >> > > 1. Can only the first preceding word ("a-team") be skipped in this > kind of skip LM? I first believed all history words could be skipped, > except for the very last (most distant from w_n), but now I am not > sure anymore. No, to keep it simple, the current implementation only considers skipping the word immediately preceding the word being predicted. > 2. In this case, what kind of smoothing goes on under the hood of P'? 
> I have created my skip LM with the following parameters to 'ngram-count': > -vocab %s -prune %s -skip -debug 1 -order 3 -text %s -sort -lm %s > -limit-vocab -tolower > does that also incorporate backoff and Good-Turing discounting like it > would without '-skip'? Yes, the underlying estimation algorithm (the M-step of the EM algorithm) is a standard backoff ngram estimation. The only thing that's nonstandard is that the ngram counts going into the estimation are fractional counts, as computed in the E-step. Therefore, the same limitations as triggered by the ngram-count -float-counts option apply. Mainly, you can use only certain discounting methods, those that can deal with fractional counts. In particular, the methods based on counts-of-counts are out, so no GT or KN discounting. You should get an error message if you try to use them. Andreas > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From stolcke at icsi.berkeley.edu Thu Jun 20 19:02:37 2013 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 20 Jun 2013 19:02:37 -0700 Subject: [SRILM User List] ngram-count's ARPA N-gram LM extensions beyond "\end\" marker In-Reply-To: <51C240DE.5080300@student.ru.nl> References: <51BCC306.3080906@student.ru.nl> <51BE0E3C.8030201@icsi.berkeley.edu> <51C0E193.5070108@student.ru.nl> <51C0F0F7.7020406@icsi.berkeley.edu> <51C240DE.5080300@student.ru.nl> Message-ID: <51C3B43D.3000701@icsi.berkeley.edu> On 6/19/2013 4:38 PM, Sander Maijers wrote: > On 19-06-13 01:44, Andreas Stolcke wrote: >>> 2. In this case, what kind of smoothing goes on under the hood of P'? >>> I have created my skip LM with the following parameters to >>> 'ngram-count': >>> -vocab %s -prune %s -skip -debug 1 -order 3 -text %s -sort -lm %s >>> -limit-vocab -tolower >>> does that also incorporate backoff and Good-Turing discounting like it >>> would without '-skip'? 
>> Yes, the underlying estimation algorithm (the M-step of the EM >> algorithm) is a standard backoff ngram estimation. >> The only thing that's nonstandard is that the ngram counts going into >> the estimation are fractional counts, as computed in the E-step. >> Therefore, the same limitations as triggered by the ngram-count >> -float-counts option apply. Mainly, you can use only certain >> discounting methods, those that can deal with fractional counts. In >> particular, the methods based on counts-of-counts are out, so no GT or >> KN discounting. You should get an error message if you try to use them. > > I did not specify a discounting method in the command line I gave, and > if it can't be the default GT, then which discount method will be > applied to the counts prior to the E step? I had to review the code (written some 17 years ago) to remind myself how the smoothing is handled with skip-ngrams ... It looks like a short-cut is used: the discounting parameters are estimated on the standard counts, and then applied to the fractional EM counts without recomputing them at each iteration. This means you can use any method, but of course the results are probably suboptimal. It might be better to recompute discounts after each E-step, and you would do that by modifying the SkipNgram::estimateMstep() function and inserting calls to the discounts[]->estimate() function ahead of the Ngram::estimate() call. I also noticed there is a bug in ngram-count.cc that will keep things from working when you read counts from a file rather than computing them from text (i.e., if you're using ngram-count -read instead of ngram-count -text). The problem is that, to estimate a skip-ngram of order N, you need counts of order N+1. The attached patch will fix that, but you still need to make sure you extract the counts of order N+1 when you're doing that in a separate step. 
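The interpolation this thread revolves around, P(w | a b c) = s(c) P'(w | a b) + (1 - s(c)) P'(w | a b c) with s(c) the skip probability of the last history word c, can be sketched in a few lines outside of SRILM. This is a toy illustration only: the probability table and skip values are made up, and skip_ngram_prob/ngram_prob are hypothetical stand-ins, not SRILM functions.

```python
# Toy sketch of the skip-ngram interpolation discussed in this thread:
# P(w | a b c) = s(c) * P'(w | a b) + (1 - s(c)) * P'(w | a b c),
# where c is the last history word and s(c) its skip probability.
# All names and probabilities below are made up for illustration.

def skip_ngram_prob(w, history, ngram_prob, skip_prob, default_skip=0.5):
    """Interpolate the standard n-gram prediction with one that skips
    the most recent history word."""
    c = history[-1]
    s = skip_prob.get(c, default_skip)   # 0.5 is the default initial value
    return s * ngram_prob(w, history[:-1]) + (1 - s) * ngram_prob(w, history)

def ngram_prob(w, history):
    """Stand-in for a smoothed backoff n-gram model P' (toy values)."""
    table = {
        (("a", "b"), "w"): 0.20,            # P'(w | a b)
        (("a", "b", "a-team"), "w"): 0.40,  # P'(w | a b "a-team")
    }
    return table.get((tuple(history), w), 0.0)

skip_prob = {"a-team": 0.5}
print(skip_ngram_prob("w", ["a", "b", "a-team"], ngram_prob, skip_prob))
# prints roughly 0.3 (= 0.5 * 0.20 + 0.5 * 0.40)
```

With s("a-team") = 0 this reduces to the standard trigram P'(w | a b "a-team"); with s = 1 the last word is always ignored, which is why tuning per-word skip probabilities by EM can land anywhere between these extremes.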
Below is a little script that you can stick in $SRILM/lm/test/tests/ngram-count-skip/run-test and then exercise building and testing a skip-bigram from trigram counts. This actually doesn't produce lower perplexity than the regular bigram, but when I apply the same method to 4gram counts (which are not distributed with SRILM), the skip-trigram does have lower perplexity than the corresponding standard trigram. In any case, there are many possible variations on skip-ngrams and the SRILM implementation should be considered more as an exercise to inspire experimentation. Andreas

------------------ ngram-count-skip/run-test -------------------------------
#!/bin/sh

dir=../ngram-count-gt

if [ -f $dir/swbd.3grams.gz ]; then
	gz=.gz
else
	gz=
fi

smooth="-wbdiscount -gt3min 1 -gt4min 1"

order=2
counts=$dir/swbd.3grams$gz

# create LM from counts
ngram-count -debug 1 \
	-order $order \
	-skip -skip-init 0.0 \
	-em-iters 3 \
	$smooth \
	-read $counts \
	-vocab $dir/eval2001.vocab \
	-lm skiplm.${order}bo$gz

ngram -debug 0 -order $order \
	-skip -lm skiplm.${order}bo$gz \
	-ppl $dir/eval97.text

rm -f skiplm.${order}bo$gz
-------------- next part --------------
Index: lm/src/ngram-count.cc
===================================================================
RCS file: /home/srilm/CVS/srilm/lm/src/ngram-count.cc,v
retrieving revision 1.74
diff -c -r1.74 ngram-count.cc
*** lm/src/ngram-count.cc	1 Mar 2013 16:34:37 -0000	1.74
--- lm/src/ngram-count.cc	21 Jun 2013 01:29:23 -0000
***************
*** 434,453 ****
      if (readFile) {
	File file(readFile, "r");

	if (readWithMincounts) {
!	    makeArray(Count, minCounts, order);

	    /* construct min-counts array from -gtNmin options */
	    unsigned i;
!	    for (i = 0; i < order && i < maxorder; i ++) {
		minCounts[i] = gtmin[i + 1];
	    }
!	    for ( ; i < order; i ++) {
		minCounts[i] = gtmin[0];
	    }
!	    USE_STATS(readMinCounts(file, order, minCounts));
	} else {
!
	    USE_STATS(read(file, order, limitVocab));
	}
    }
--- 434,455 ----
      if (readFile) {
	File file(readFile, "r");

+	unsigned countOrder = USE_STATS(getorder());
+
	if (readWithMincounts) {
!	    makeArray(Count, minCounts, countOrder);

	    /* construct min-counts array from -gtNmin options */
	    unsigned i;
!	    for (i = 0; i < countOrder && i < maxorder; i ++) {
		minCounts[i] = gtmin[i + 1];
	    }
!	    for ( ; i < countOrder; i ++) {
		minCounts[i] = gtmin[0];
	    }
!	    USE_STATS(readMinCounts(file, countOrder, minCounts));
	} else {
!	    USE_STATS(read(file, countOrder, limitVocab));
	}
    }

From ellsamig at yahoo.fr Mon Jun 24 06:42:40 2013 From: ellsamig at yahoo.fr (samira ellouze) Date: Mon, 24 Jun 2013 14:42:40 +0100 (BST) Subject: [SRILM User List] problem with the installation of SRILM on ubuntu 13.04 Message-ID: <1372081360.85864.YahooMailNeo@web171602.mail.ir2.yahoo.com> Hello, I tried to install the SRILM (both versions 1.7.0) on Ubuntu 13.04. I installed all prerequisite packages mentioned in the website but it doesn't work... when I execute the command line : make NO_TCL=1 MACHINE_TYPE=i686-ubuntu World I get the following error:

mkdir -p include lib bin
make init
make[1]: Entering directory `/home/samira/Bureau/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
		(cd $subdir/src; make SRILM= MACHINE_TYPE=i686-ubuntu OPTION= MAKE_PIC= init) || exit 1; \
	done
make[2]: Entering directory `/home/samira/Bureau/srilm/misc/src'
Makefile:24: /common/Makefile.common.variables: No such file or directory
Makefile:152: /common/Makefile.common.targets: No such file or directory
make[2]: *** No rule to make target `/common/Makefile.common.targets'. Stop.
make[2]: Leaving directory `/home/samira/Bureau/srilm/misc/src'
make[1]: *** [init] Error 1
make[1]: Leaving directory `/home/samira/Bureau/srilm'
make: *** [World] Error 2

Please, can you help me to solve this error? I appreciate your help Best regards Samira -------------- next part -------------- An HTML attachment was scrubbed...
URL: From qzhou at ispeech.org Mon Jun 24 07:59:31 2013 From: qzhou at ispeech.org (Qiru Zhou) Date: Mon, 24 Jun 2013 10:59:31 -0400 Subject: [SRILM User List] problem with the installation of SRILM on ubuntu 13.04 In-Reply-To: <1372081360.85864.YahooMailNeo@web171602.mail.ir2.yahoo.com> References: <1372081360.85864.YahooMailNeo@web171602.mail.ir2.yahoo.com> Message-ID: <51C85ED3.4040007@ispeech.org> Samira, Try set

export MACHINE_TYPE=i686-ubuntu
export SRILM=/home/samira/Bureau/srilm
export PATH=$SRILM/bin/${MACHINE_TYPE}:$SRILM/bin:$PATH
export MANPATH=$SRILM/man:$MANPATH

before make. It works for me on ubuntu 12.04-12.10. -- Qiru On 6/24/2013 9:42 AM, samira ellouze wrote: > Hello, > I tried to install the SRILM (both versions 1.7.0) on Ubuntu 13.04. > I installed all prerequisite packages mentioned in the website but it doesn't work... > when I execute the command line : make NO_TCL=1 MACHINE_TYPE=i686-ubuntu World > I get the following error: > > mkdir -p include lib bin > make init > make[1]: Entering directory `/home/samira/Bureau/srilm' > for subdir in misc dstruct lm flm lattice utils; do \ > (cd $subdir/src; make SRILM= MACHINE_TYPE=i686-ubuntu OPTION= MAKE_PIC= init) || exit 1; \ > done > make[2]: Entering directory `/home/samira/Bureau/srilm/misc/src' > Makefile:24: /common/Makefile.common.variables: No such file or directory > Makefile:152: /common/Makefile.common.targets: No such file or directory > make[2]: *** No rule to make target `/common/Makefile.common.targets'. Stop. > make[2]: Leaving directory `/home/samira/Bureau/srilm/misc/src' > make[1]: *** [init] Error 1 > make[1]: Leaving directory `/home/samira/Bureau/srilm' > make: *** [World] Error 2 > > > Please, can you help me to solve this error?
> > I appreciate your help > Best regards > Samira > > > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -- QIRU ZHOU | CHIEF SCIENTIST | iSPEECH, INC. | T 917.338.7723 | qzhou at iSpeech.org | www.iSpeech.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.N.Maijers at student.ru.nl Mon Jun 24 10:41:29 2013 From: S.N.Maijers at student.ru.nl (Sander Maijers) Date: Mon, 24 Jun 2013 19:41:29 +0200 Subject: [SRILM User List] ngram-count's ARPA N-gram LM extensions beyond "\end\" marker In-Reply-To: <51C3B43D.3000701@icsi.berkeley.edu> References: <51BCC306.3080906@student.ru.nl> <51BE0E3C.8030201@icsi.berkeley.edu> <51C0E193.5070108@student.ru.nl> <51C0F0F7.7020406@icsi.berkeley.edu> <51C240DE.5080300@student.ru.nl> <51C3B43D.3000701@icsi.berkeley.edu> Message-ID: <51C884C9.4010205@student.ru.nl> On 21-06-13 04:02, Andreas Stolcke wrote: > On 6/19/2013 4:38 PM, Sander Maijers wrote: >> On 19-06-13 01:44, Andreas Stolcke wrote: >>>> 2. In this case, what kind of smoothing goes on under the hood of P'? >>>> I have created my skip LM with the following parameters to >>>> 'ngram-count': >>>> -vocab %s -prune %s -skip -debug 1 -order 3 -text %s -sort -lm %s >>>> -limit-vocab -tolower >>>> does that also incorporate backoff and Good-Turing discounting like it >>>> would without '-skip'? >>> Yes, the underlying estimation algorithm (the M-step of the EM >>> algorithm) is a standard backoff ngram estimation. >>> The only thing that's nonstandard is that the ngram counts going into >>> the estimation are fractional counts, as computed in the E-step. >>> Therefore, the same limitations as triggered by the ngram-count >>> -float-counts option apply. Mainly, you can use only certain >>> discounting methods, those that can deal with fractional counts. 
>>> In particular, the methods based on counts-of-counts are out, so no GT or
>>> KN discounting. You should get an error message if you try to use them.
>>
>> I did not specify a discounting method in the command line I gave, and
>> if it can't be the default GT, then which discounting method will be
>> applied to the counts prior to the E-step?
>
> I had to review the code (written some 17 years ago) to remind myself
> how the smoothing is handled with skip-ngrams ...
>
> It looks like a short-cut is used: the discounting parameters are
> estimated on the standard counts, and then applied to the fractional EM
> counts without recomputing them at each iteration. This means you
> can use any method, but of course the results are probably suboptimal.
> It might be better to recompute discounts after each E-step, and you
> would do that by modifying the SkipNgram::estimateMstep() function and
> inserting calls to the discounts[]->estimate() function ahead of the
> Ngram::estimate() call.
>
> I also noticed there is a bug in ngram-count.cc that will keep things
> from working when you read counts from a file rather than computing them
> from text (i.e., if you're using ngram-count -read instead of
> ngram-count -text). The problem is that, to estimate a skip-ngram of
> order N, you need counts of order N+1. The attached patch will fix
> that, but you still need to make sure you extract the counts of order
> N+1 when you're doing that in a separate step.
>
> Below is a little script that you can stick in
> $SRILM/lm/test/tests/ngram-count-skip/run-test and then exercise
> building and testing a skip-bigram from trigram counts. This actually
> doesn't produce lower perplexity than the regular bigram, but when I
> apply the same method to 4gram counts (which are not distributed with
> SRILM), the skip-trigram does have lower perplexity than the
> corresponding standard trigram.
>
> In any case, there are many possible variations on skip-ngrams and the
> SRILM implementation should be considered more as an exercise to inspire
> experimentation.
>
> Andreas

Thank you for your work! The default discounting method was used with both my baseline and the skip LM. I was also looking at the code; however, I have run my experiments already and need to wrap up quickly. After my thesis I may have another look at the SRILM code.

Based on the equations you described to me and the code, I do not see the fundamental difference between the skip N-gram model and Jelinek-Mercer smoothing / deleted interpolation (Chen & Goodman, 1999, eqn. 4, p. 364). In the skip LM the skip probabilities substitute for the lambda weights in the Jelinek-Mercer equation, and are estimated in the perhaps special way you explained. Is there something I miss?

> ------------------ ngram-count-skip/run-test -------------------------------
> #!/bin/sh
>
> dir=../ngram-count-gt
>
> if [ -f $dir/swbd.3grams.gz ]; then
>     gz=.gz
> else
>     gz=
> fi
>
> smooth="-wbdiscount -gt3min 1 -gt4min 1"
>
> order=2
> counts=$dir/swbd.3grams$gz
>
> # create LM from counts
> ngram-count -debug 1 \
>     -order $order \
>     -skip -skip-init 0.0 \
>     -em-iters 3 \
>     $smooth \
>     -read $counts \
>     -vocab $dir/eval2001.vocab \
>     -lm skiplm.${order}bo$gz
>
> ngram -debug 0 -order $order \
>     -skip -lm skiplm.${order}bo$gz \
>     -ppl $dir/eval97.text
>
> rm -f skiplm.${order}bo$gz

From stolcke at icsi.berkeley.edu Mon Jun 24 19:44:07 2013
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 24 Jun 2013 19:44:07 -0700
Subject: [SRILM User List] ngram-count's ARPA N-gram LM extensions beyond "\end\" marker
In-Reply-To: <51C884C9.4010205@student.ru.nl>
References: <51BCC306.3080906@student.ru.nl> <51BE0E3C.8030201@icsi.berkeley.edu> <51C0E193.5070108@student.ru.nl> <51C0F0F7.7020406@icsi.berkeley.edu> <51C240DE.5080300@student.ru.nl> <51C3B43D.3000701@icsi.berkeley.edu> <51C884C9.4010205@student.ru.nl>
Message-ID:
<51C903F7.4030000@icsi.berkeley.edu>

On 6/24/2013 10:41 AM, Sander Maijers wrote:
> Based on the equations you described to me and the code, I do not see
> the fundamental difference between the skip N-gram model and Jelinek-Mercer
> smoothing / deleted interpolation (Chen & Goodman, 1999, eqn. 4, p. 364).
> In the skip LM the skip probabilities substitute for the lambda
> weights in the Jelinek-Mercer equation, and are estimated in the
> perhaps special way you explained. Is there something I miss?

Jelinek-Mercer is a way to smooth N-gram probabilities by combining
estimates based on different suffixes of the history, e.g.

  p(w | w1 w2 w3) = l1 * p'(w | w1 w2 w3)
                  + l2 * p'(w | w1 w2)
                  + l3 * p'(w | w1)
                  + l4 * p'(w)
                  + l5 / N          (N = size of vocabulary)

where p'(.) is a maximum-likelihood estimate.

In skip-ngram modeling, by contrast, you combine different histories
that differ by skipping a word, e.g.

  p(w | w1 w2 w3 w4) = l1 * p'(w | w1 w2 w3) + l2 * p'(w | w2 w3 w4)

where p'(.) now is a smoothed estimate.

The only similarity is that they both use linear interpolation of an
underlying probability estimator to arrive at a better estimator.
That's not saying much. Linear interpolation is extremely widely used
in all sorts of probability models.

Andreas

From yuan at ks.cs.titech.ac.jp Tue Jun 25 06:51:37 2013
From: yuan at ks.cs.titech.ac.jp (yuan liang)
Date: Tue, 25 Jun 2013 22:51:37 +0900
Subject: [SRILM User List] About language model rescoring output
Message-ID:

Hi all,

I want to rescore a bigram lattice using a trigram language model. I tried:

  lattice-tool -in-lattice INPUTLATTICE -read-htk -lm TRIGRAM_LM -order 3 -old-expansion -out-lattice -write-htk OUTPUTLATTICE

The problem is: in the output lattice, there is no acoustic model score on each arc; each arc only has the new language model score. Did I miss some parameters?

Regards,
Jasmine
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
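[Editor's note] Andreas's remark earlier in this thread, that both Jelinek-Mercer smoothing and skip-ngrams reduce to linear interpolation of simpler estimators, can be made concrete with a toy calculation. The probabilities below are made-up numbers for illustration only; the ngram command in the comment (with placeholder file names) is how SRILM exposes the same idea as static model-level interpolation.

```shell
#!/bin/sh
# Toy example of linear interpolation: p = lambda*p1 + (1-lambda)*p2.
# SRILM applies the same idea between whole models, e.g.
#   ngram -lm lm1 -mix-lm lm2 -lambda 0.6 -ppl test.txt
# (lm1, lm2, test.txt are placeholders, not files from this thread).
lambda=0.6
p1=0.10     # estimate of p(w|h) from model 1 (made-up number)
p2=0.02     # estimate of p(w|h) from model 2 (made-up number)
awk -v l="$lambda" -v a="$p1" -v b="$p2" \
    'BEGIN { printf "%.3f\n", l * a + (1 - l) * b }'
```

This prints 0.068: the interpolated estimate always lies between the two component estimates, whatever those components are (suffix histories in Jelinek-Mercer, skipped histories in a skip LM).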