From dugast at systran.fr Mon Jan 6 07:45:35 2014
From: dugast at systran.fr (DUGAST Loic)
Date: Mon, 6 Jan 2014 15:45:35 +0000
Subject: [SRILM User List] class based model

Hi,

In the FAQ (http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html) you advise to:

    c) Lower the minimum counts for N-grams included in the LM, i.e., the values of the
       options -gt2min, -gt3min, -gt4min, etc. The higher order N-grams typically get
       higher minimum counts.

Do you not mean *raise* the minimum counts instead?

Also, I am not sure I understand why gt2min should be set higher than gt1min, and so on.
Higher-order ngrams are naturally less frequent, so the same cutoff value (gt2min equal to
gt1min) is harsher on bigrams than on unigrams. Can you explain?

Thank you!
Loic

From stolcke at icsi.berkeley.edu Mon Jan 6 11:31:42 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 06 Jan 2014 11:31:42 -0800
Subject: [SRILM User List] class based model

On 1/6/2014 7:45 AM, DUGAST Loic wrote:
> Do you not mean *raise* the minimum counts instead?

You are correct. It should say raise the min counts. We'll fix the documentation ASAP.

> Also, I am not sure I understand why gt2min should be set higher than gt1min, and so on.
> Higher-order ngrams are naturally less frequent, so the same cutoff value will be
> harsher on bigrams than on unigrams. Can you explain?

The minimum counts are a crude way to trade off performance for space, and since there are
a lot more long ngrams than short ngrams you get more space savings with the higher order
ngrams. It is typically not worth it to eliminate unigrams and bigrams, but it is a decent
tradeoff to remove singleton trigrams and fourgrams. The default values were chosen based
on historical practice (I think they might even have been inherited from the CMU LM
toolkit).

The better and more principled way to remove ngrams is entropy-based pruning (the
ngram/ngram-count -prune option). So the best strategy given limited memory is to make the
gtmin values as low as you can afford to fit into memory, then use -prune (you can do this
in the same invocation of ngram-count or make-big-lm).

Andreas
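A minimal sketch of the strategy Andreas describes (editorial illustration, not from the
original thread; the corpus name, LM name, and pruning threshold are placeholder values):

    # keep all unigrams/bigrams, drop singleton trigrams/4-grams, then entropy-prune,
    # all in one ngram-count invocation
    ngram-count -order 4 -text train.txt \
        -gt1min 1 -gt2min 1 -gt3min 2 -gt4min 2 \
        -kndiscount -interpolate \
        -prune 1e-8 \
        -lm pruned.lm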
From asad_12204 at yahoo.com Tue Jan 7 12:14:06 2014
From: asad_12204 at yahoo.com (Asad A.Malik)
Date: Tue, 7 Jan 2014 12:14:06 -0800 (PST)
Subject: [SRILM User List] Cannot Execute SRILM

Hi All,

I am trying to install MOSES, and for that I have to install SRILM. I have downloaded
SRILM version 1.7.0 and extracted it, and have run:

    sudo apt-get install libc6-dev-i386

The MOSES installation guide also suggests that "if your machine is of x86_64 type, you
should edit the sbin/machine-type file" to change

    else if (`uname -m` == x86_64) then
            set MACHINE_TYPE = i686

to

    else if (`uname -m` == x86_64) then
            set MACHINE_TYPE = i686-m64

But my file already had i686-m64, so I did not have to change it.

For the build I enter:

    make MAKE_PIC=1 SRILM=`pwd` NO_TCL=X World

It gives me the following error:

    make: pwd/sbin/machine-type: Command not found
    Makefile:13: pwd/common/Makefile.common.variables: No such file or directory
    make: *** No rule to make target `pwd/common/Makefile.common.variables'.  Stop.

I've attached a screenshot as well. Kindly tell me what I should do.

Regards,
Asad A.Malik

From stolcke at icsi.berkeley.edu Tue Jan 7 14:53:59 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 07 Jan 2014 14:53:59 -0800
Subject: [SRILM User List] Cannot Execute SRILM

On 1/7/2014 12:14 PM, Asad A.Malik wrote:
>     make MAKE_PIC=1 SRILM=`pwd` NO_TCL=X World
>
>     make: pwd/sbin/machine-type: Command not found

You must have used SRILM='pwd' (forward quotes) instead of SRILM=`pwd`.

Alternatively, use SRILM=$PWD

Andreas
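A quick way to see the difference (an illustrative sketch, not part of the thread; the
SRILM path is a placeholder):

    cd /path/to/srilm
    echo 'pwd'      # single quotes: prints the literal string pwd
    echo `pwd`      # backquotes: prints the current directory
    # so either of these passes the right value to make:
    make SRILM=`pwd` MAKE_PIC=1 NO_TCL=X World
    make SRILM=$PWD MAKE_PIC=1 NO_TCL=X World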
From alexx.tudor at gmail.com Sun Jan 26 22:01:31 2014
From: alexx.tudor at gmail.com (alex tudor)
Date: Mon, 27 Jan 2014 08:01:31 +0200
Subject: [SRILM User List] lattice-tool for FLM

Hi,

I wish to use FLM in speech recognition, so I have built a factored language model for a
small text:

    fngram-count -factor-file flm_config_kn.txt -text textLM.txt -write-counts textFLM_kn.count -lm textFLM_kn.lm

flm_config_kn.txt:

    W : 2 W(-1) P(-1) textFLM_kn.count textFLM_kn.lm 3
      W1,P1 W1 kndiscount gtmin 1 interpolate
      P1 P1 kndiscount gtmin 1
      0 0 kndiscount gtmin 1

The resulting textFLM_kn.lm is:

    \data\
    ngram 0x0=510
    ngram 0x1=0
    ngram 0x2=1378
    ngram 0x3=1978

    \0x0-grams:
    -1.746398 </s>
    -99 <s>
    -2.252263 W-A
    -3.047544 W-ABUNDENTE
    ....

Afterwards I tried to rescore an HTK word lattice for a bigram using lattice-tool and an
HTK word lattice (bigram.lat):

    lattice-tool -read-htk -write-htk -in-lattice bigram.lat -htk-lmscale 10 -posterior-scale 10 -factored -lm textFLM_kn.lm -out-lattice bigramFLM_kn.lat

But I get an error:

    textFLM_kn.lm: line 2: error: couldn't form int for number of factored LMs in when reading FLM spec file

What's wrong? I would appreciate any advice.

Sincerely yours,
Alex

From omm.tayma at yahoo.fr Mon Jan 27 10:18:07 2014
From: omm.tayma at yahoo.fr (omm tayma)
Date: Mon, 27 Jan 2014 18:18:07 +0000 (GMT)
Subject: [SRILM User List] install srilm

Hi,

I am trying to install SRILM. These are the steps I took:

1- I ran sudo ./sbin/machine-type and my machine type is i686-m64.

2- In the file /home/lenovo/Documents/srilm/common/Makefile.machine.i686-m64 I replaced

    GCC_FLAGS = -mtune=pentium3 -Wreturn-type -Wimplicit
    CC = /usr/local/lang/gcc-3.4.3/bin/gcc $(GCC_FLAGS)
    CXX = /usr/local/lang/gcc-3.4.3/bin/g++ -Wno-deprecated $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES

with:

    # Use the GNU C compiler.
    GCC_FLAGS = -march=i686-m64 -Wreturn-type -Wimplicit
    CC = /usr/bin/gcc $(GCC_FLAGS)
    CXX = /usr/bin/g++ -Wno-deprecated $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES

and

    #TCL_INCLUDE =
    #TCL_LIBRARY = -ltcl

with:

    # Tcl support (standard in Linux)
    TCL_INCLUDE = -I/usr/include/tcl8.5
    TCL_LIBRARY = -ltcl8.5

3- I ran make SRILM=/home/lenovo/Documents/srilm World, but I get this error:

    cc1plus: error: bad value (i686-m64) for -march= switch

Can someone help me, please?
From stolcke at icsi.berkeley.edu Mon Jan 27 11:08:21 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 27 Jan 2014 11:08:21 -0800
Subject: [SRILM User List] lattice-tool for FLM

Alex,

I haven't looked at your example in detail, but two hints:

1) Debug your LM evaluation using ngram -factored first (before moving on to lattice
rescoring).

2) Follow the steps in $SRILM/flm/test/tests/ngram-factored and make sure you are
(a) inputting and producing files in a similar format and (b) supplying the right options.

Once you have that working, move on to lattice-tool. The word strings in your lattices
should be the same as the input to ngram -factored -ppl ...

Andreas
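A sketch of that debugging order, reusing Alex's file names (editorial illustration, not
Andreas's commands; test_factored.txt is a placeholder for held-out text in the same
factored W-...:P-... tuple format used for training). Note that with -factored the -lm
option is expected to name the FLM specification file (here flm_config_kn.txt) rather than
the ARPA file it produces, which would be consistent with the "couldn't form int ... FLM
spec file" message above:

    # step 1: check that the FLM loads and scores plain (factored) text
    ngram -debug 2 -factored -lm flm_config_kn.txt -ppl test_factored.txt

    # step 2: only then rescore the lattices with the same spec file
    lattice-tool -read-htk -write-htk -in-lattice bigram.lat \
        -htk-lmscale 10 -posterior-scale 10 \
        -factored -lm flm_config_kn.txt -out-lattice bigramFLM_kn.lat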
From omm.tayma at yahoo.fr Mon Jan 27 11:21:56 2014
From: omm.tayma at yahoo.fr (omm tayma)
Date: Mon, 27 Jan 2014 19:21:56 +0000 (GMT)
Subject: [SRILM User List] install srilm

Hi,

I am trying to install SRILM. These are the steps I took:

1- I ran sudo ./sbin/machine-type and my machine type is i686-m64.

2- In the file /home/lenovo/Documents/srilm/common/Makefile.machine.i686-m64 I replaced
the GCC_FLAGS, CC, and CXX settings with

    # Use the GNU C compiler.
    GCC_FLAGS = -march=i686-m64 -Wreturn-type -Wimplicit
    CC = /usr/bin/gcc $(GCC_FLAGS)
    CXX = /usr/bin/g++ -Wno-deprecated $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES

and the Tcl settings with

    # Tcl support (standard in Linux)
    TCL_INCLUDE = -I/usr/include/tcl8.5
    TCL_LIBRARY = -ltcl8.5

3- I ran make SRILM=/home/lenovo/Documents/srilm World, but I get this error:

    make[2]: Entering directory `/home/lenovo/Documents/srilm/utils/src'
    rm -f Dependencies.i686-m64
    /home/lenovo/Documents/srilm/sbin/generate-program-dependencies ../bin/i686-m64 ../obj/i686-m64 "" | sed -e "s&\.o&.o&g" >> Dependencies.i686-m64
    make[2]: Leaving directory `/home/lenovo/Documents/srilm/utils/src'
    make[1]: Leaving directory `/home/lenovo/Documents/srilm'
    make release-libraries
    make[1]: Entering directory `/home/lenovo/Documents/srilm'
    for subdir in misc dstruct lm flm lattice utils; do \
        (cd $subdir/src; make SRILM=/home/lenovo/Documents/srilm MACHINE_TYPE=i686-m64 OPTION= MAKE_PIC= release-libraries) || exit 1; \
    done
    make[2]: Entering directory `/home/lenovo/Documents/srilm/misc/src'
    /usr/bin/gcc -arch=i686-m64 -Wreturn-type -Wimplicit -I/usr/include/tcl8.5 -I. -I../../include -c -g -O3 -o ../obj/i686-m64/option.o option.c
    gcc: error: unrecognized option '-arch=i686-m64'
    make[2]: *** [../obj/i686-m64/option.o] Error 1
    make[2]: Leaving directory `/home/lenovo/Documents/srilm/misc/src'
    make[1]: *** [release-libraries] Error 1
    make[1]: Leaving directory `/home/lenovo/Documents/srilm'
    make: *** [World] Error 2

Can someone help me, please?
From stolcke at icsi.berkeley.edu Mon Jan 27 12:23:12 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 27 Jan 2014 12:23:12 -0800
Subject: [SRILM User List] install srilm

On 1/27/2014 11:21 AM, omm tayma wrote:
>     # Use the GNU C compiler.
>     GCC_FLAGS = -march=i686-m64 -Wreturn-type -Wimplicit
>     CC = /usr/bin/gcc $(GCC_FLAGS)
>     CXX = /usr/bin/g++ -Wno-deprecated $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES

Just remove -march=i686-m64 from the GCC_FLAGS.

Andreas

PS. Please don't send your posts multiple times. I received 4 copies of it.

From omm.tayma at yahoo.fr Wed Jan 29 02:32:45 2014
From: omm.tayma at yahoo.fr (omm tayma)
Date: Wed, 29 Jan 2014 10:32:45 +0000 (GMT)
Subject: [SRILM User List] error in Makefile.machine.i686-m64

Hi,

My machine type is i686-m64, so I changed the file Makefile.machine.i686-m64 as follows:

    # Use the GNU C compiler.
    GCC_FLAGS = -m64 -Wall -Wno-unused-variable -Wno-uninitialized
    CC = $(GCC_PATH)gcc $(GCC_FLAGS) -Wimplicit-int
    CXX = $(GCC_PATH)g++ $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES

and

    # Tcl support (standard in Linux)
    TCL_INCLUDE = -I/usr/include/tcl8.5
    TCL_LIBRARY = -ltcl8.5
    NO_TCL = 1

but I think this is wrong because I still get errors. Please, can someone help me?

From stolcke at icsi.berkeley.edu Wed Jan 29 10:02:10 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 29 Jan 2014 10:02:10 -0800
Subject: [SRILM User List] Error in usage of make-batch-counts

On 1/28/2014 12:40 AM, Rajen Chatterjee wrote:
> Hello,
> I want to pass the options "-order 5 -interpolate -kndiscount" to make-batch-counts.
> How can I do this?
>
> When I give the command "./make-batch-counts /home/rajen/file_name 1
> /home/rajen/count-dir -order 5" I get this error: mkdir: invalid option -- 'o'
> Try `mkdir --help' for more information.
>
> Can you help me fix this problem?

There are two problems:

1. According to the training-scripts(1) man page, the usage is

    make-batch-counts file-list [batch-size [filter [count-dir [options ...]]]]

so you need to pass 4 parameters before any options that are passed to ngram-count. For
the "filter" parameter you can use the "cat" program.

2. The options -interpolate and -kndiscount are inappropriate since make-batch-counts does
not create LMs, it only collects the counts.

Andreas
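A sketch of the full large-corpus workflow implied here (editorial illustration; the file
list, batch size, directory, and merged-count file names are placeholders, and the merged
file is whatever merge-batch-counts actually produces):

    # four positional arguments, then options that are passed through to ngram-count
    make-batch-counts filelist.txt 10 cat counts -order 5

    # merge the per-batch count files into one
    merge-batch-counts counts

    # estimate the LM from the merged counts; smoothing options belong in this step
    make-big-lm -read counts/merged-counts.gz -order 5 -kndiscount -interpolate -lm big.lm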
From asooryampasya at gmail.com Fri Feb 7 05:14:06 2014
From: asooryampasya at gmail.com (Asooryampasya)
Date: Fri, 7 Feb 2014 14:14:06 +0100
Subject: [SRILM User List] Errors while trying to use fngram-count, fngram with Estonian Tagged data

Dear fellow users,

I am trying to build a factored model for Estonian (which is morphologically tagged using
TreeTagger). The fngram-count program seems to run without issues. However, when I use the
fngram program to estimate the perplexity of a test sample, I get an error. I found the
same question asked here before
(http://www.speech.sri.com/pipermail/srilm-user/2011q3/001088.html), but I could not find
a response to that email, hence I am posting it to the list again.

Below is the error I get while running fngram, and also the contents of the factor file
that I used with both fngram-count and fngram. Please let me know if any more information
is needed.

The error:

    w_g4_w1w2m1m2.count.gz: line 14172: malformed N-gram count or more than 100 words per line
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    s_g4_w1w2m1m2.lm.gz: line 21: error, ngram line has invalid number (1) of fields, expecting either 2 or 3
    format error in lm file

I am still new to using factored models, and so far I am only using the example settings
given in the Kirchhoff, Bilmes and Duh tutorial. Here is my factor file:

    ## word given word-1 word-2 morph-1 morph-2
    1
    W : 4 W(-1) W(-2) M(-1) M(-2) w_g4_w1w2m1m2.count.gz s_g4_w1w2m1m2.lm.gz 5
      0b0111 0b0010 wbdiscount gtmin 4 interpolate
      0b1101 0b1000 wbdiscount gtmin 3 interpolate
      0b0101 0b0001 wbdiscount gtmin 2 interpolate
      0b0100 0b0100 wbdiscount gtmin 1 interpolate
      0b0000 0b0000 wbdiscount gtmin 1

My training data look like this:

    W-Eksamitöö:M-S.com.pl.nom W-I.:M-Y.nominal.?
    W-Pange:M-V.main.imper.pres W-sulgudes:M-S.com.pl.in W-olevad:M-A.pos.pl.nom W-sõnad:M-S.com.pl.nom W-õigesse:M-A.pos.sg.ill W-vormi:M-S.com.sg.adit W-!:M-Z.Exc
    W-Piret:M-S.prop.sg.nom W-Toomet:M-S.prop.sg.abl W-on:M-V.main.indic.pres.ps3 W-ettevõtlik:M-A.pos.sg.nom W-naine:M-S.com.sg.nom W-.:M-Z.Fst

Thanks,
Pasya.

From junfei.guo at gmail.com Tue Mar 4 02:31:14 2014
From: junfei.guo at gmail.com (Junfei Guo)
Date: Tue, 4 Mar 2014 11:31:14 +0100
Subject: [SRILM User List] Does the calculation of Back Off Weight use the special property of different Discount?

Hi All,

From the source code I can see that, to calculate the lower-order weight for
interpolation, SRILM uses special properties of the different discount functions. For
example, for modified KN it uses the fact that the discount is a constant for all trigrams
that occur more than 3 times.

My question is whether the calculation of the back-off weight also uses this sort of
information, or whether it only assumes a general discount function.
Thanks,
Jeff

From stolcke at icsi.berkeley.edu Tue Mar 4 09:58:15 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 04 Mar 2014 09:58:15 -0800
Subject: [SRILM User List] Does the calculation of Back Off Weight use the special property of different Discount?

On 3/4/2014 2:31 AM, Junfei Guo wrote:
> My question is whether the calculation of the back-off weight also uses this sort of
> information, or whether it only assumes a general discount function.

The computation of backoff weights is independent of the discounting method. It is only
determined by the requirement that the sum of all probabilities for a given history sum
to 1.

Andreas
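As a worked illustration of that normalization requirement (editorial addition, written in
generic Katz-style backoff notation rather than SRILM source terms; p_lower denotes the
next-lower-order model, h' the history with its earliest word dropped, and the sums run
over the words w that have an explicit, already-discounted entry for history h):

    bow(h) = \frac{1 - \sum_{w:\,c(h,w)>0} p(w \mid h)}
                  {1 - \sum_{w:\,c(h,w)>0} p_{lower}(w \mid h')}

Choosing bow(h) this way makes \sum_w p(w \mid h) = 1 no matter how the explicit
probabilities p(w \mid h) were discounted.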
From kruza at ufal.mff.cuni.cz Tue Mar 18 08:20:05 2014
From: kruza at ufal.mff.cuni.cz (Oldrich Kruza)
Date: Tue, 18 Mar 2014 16:20:05 +0100
Subject: [SRILM User List] ngram crashing

Hello everybody,

I'm trying to reduce the vocabulary of a huge (64GB) trigram language model.

I ran the script change-lm-vocab, and the ngram process died with this error message:

    include/LHash.cc:141: void LHash::alloc(unsigned int) [with KeyT = unsigned int,
    DataT = Trie]: Assertion `body != 0' failed.

I'm positive this is not due to insufficient memory.

Thanks for any insights,
Oldrich Kruza

From stolcke at icsi.berkeley.edu Tue Mar 18 11:21:04 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 18 Mar 2014 11:21:04 -0700
Subject: [SRILM User List] ngram crashing

On 3/18/2014 8:20 AM, Oldrich Kruza wrote:
> I ran the script change-lm-vocab, and the ngram process died with this error message:
>
>     include/LHash.cc:141: void LHash::alloc(unsigned int) [with KeyT = unsigned int,
>     DataT = Trie]: Assertion `body != 0' failed.
>
> I'm positive this is not due to insufficient memory.

This is the error message when SRILM fails to allocate more memory. The reasons could be:

- you are using a 32bit binary and running up against the 4GB limit of the architecture
- you have a memory resource limit in force (set by you or your sysadmin) - check the
  ulimit or limit (csh) command
- your system is actually, really out of memory (which also depends on what other users
  are doing)

By running top or some similar tool concurrently you can see how big your ngram process
actually grows before crashing, and this can give you additional clues.

Andreas

From tsuki_stefy at yahoo.com Tue Mar 18 12:44:56 2014
From: tsuki_stefy at yahoo.com (Stefy D.)
Date: Tue, 18 Mar 2014 12:44:56 -0700 (PDT)
Subject: [SRILM User List] compute perplexity

Dear all,

I have some questions regarding perplexity. I am very thankful for your time and answers.

Settings:
- one language model LM_A estimated using training corpus A
- one language model LM_B estimated using training corpus B (B = corpus_A + corpus_X)

My intention is to prove that model B is better than model A, so I thought I should show
that the perplexity decreased (which can be seen from the ppl files).

Commands used to estimate ppl:

    $NGRAM_FILE -order 3 -lm $WORKING_DIR"lm_A/lmodel.lm" -ppl $WORKING_DIR"test.lowercased."$TARGET > $WORKING_DIR"ppl_A.ppl"
    $NGRAM_FILE -order 3 -lm $WORKING_DIR"lm_B/lmodel.lm" -ppl $WORKING_DIR"test.lowercased."$TARGET > $WORKING_DIR"ppl_B.ppl"

The contents of the two ppl files are (A, then B):

    1000 sentences, 21450 words, 0 OOVs
    0 zeroprobs, logprob= -57849.4 ppl= 377.407 ppl1= 497.67
    -------------------------------------------------------
    1000 sentences, 21450 words, 0 OOVs
    0 zeroprobs, logprob= -55535.3 ppl= 297.67 ppl1= 388.204

Questions:

1. Why do I get 0 OOVs?
I checked, using the compute-oov-rate script, how many OOVs there are in the test data
compared to the training data, and it gave me the result "OOV tokens: 393 / 21450 (1.83%)
excluding fragments: 390 / 21442 (1.82%)".

2. I read in the srilm-faq that "Note that perplexity comparisons are only ever meaningful
if the vocabularies of all LMs are the same." Since I want to compare the perplexities of
two LMs, I am wondering if my settings and commands were right. The two LMs were estimated
on different training corpora, so the vocabularies are not identical, right? Please tell
me what I am doing wrong.

3. If those two perplexities were computed correctly, could you please tell me whether
their difference means that the LM has really been improved, and whether there is a
measure that says if this improvement is significant?

Thank you very much for your time.

From stolcke at icsi.berkeley.edu Tue Mar 18 22:02:22 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 18 Mar 2014 22:02:22 -0700
Subject: [SRILM User List] compute perplexity

On 3/18/2014 12:44 PM, Stefy D. wrote:
> 1. Why do I get 0 OOVs?

You didn't say how you trained the LMs. Did you include an unknown-word probability? The
exact options used for LM training matter here.

> 2. The two LMs were estimated on different training corpora, so the vocabularies are
> not identical, right? Please tell me what I am doing wrong.

Again, we don't know how you trained the LMs, hence we don't know the vocabularies. The
best way to make the perplexities comparable would be to extract the vocabulary from
corpus A + corpus X, and then specify that for training LM_A (using -vocab).

> 3. If those two perplexities were computed correctly, could you please tell me whether
> their difference means that the LM has really been improved, and whether there is a
> measure that says if this improvement is significant?

The perplexities look quite different. Differences of 10-20% are usually considered
non-negligible.

For statistical significance there are a number of tests you can apply, although none are
built into SRILM. The most straightforward tests would be nonparametric ones that compare
the probabilities output by the two LMs for corresponding words or sentences.

Generate a table of word-level probabilities for LM_A and then LM_B, on the same test set.
Then ask: how many words had lower/same/greater probability under LM_B? From those
statistics you can apply either the Sign test or the stronger Wilcoxon test (for the
latter you need the differences of the probabilities, not just their sign).

The Sign test is extremely simple and can be computed with a small helper script included
in SRILM. For example, if LM_B gives higher probability for 1080 out of 2000 words (and
there are no ties), then the significance levels are computed by

    % $SRILM/bin/cumbin 2000 1080
    One-tailed:   P(k >= 1080 | n=2000, p=0.5) = 0.00018750253721029
    Two-tailed: 2*P(k >= 1080 | n=2000, p=0.5) = 0.00037500507442058

Doing this at the word level assumes that all the words in a sentence are assigned
probabilities independently, which is plainly not true (the same word occurs in several
ngrams). So a more conservative approach would be to compare the sentence-level
probabilities.

Andreas

From kruza at ufal.mff.cuni.cz Wed Mar 19 04:52:15 2014
From: kruza at ufal.mff.cuni.cz (Oldrich Kruza)
Date: Wed, 19 Mar 2014 12:52:15 +0100
Subject: [SRILM User List] ngram crashing

Eh, yes. It was a 32-bit executable. Thank you.

On Tue, Mar 18, 2014 at 7:21 PM, Andreas Stolcke wrote:
> This is the error message when SRILM fails to allocate more memory. The reasons could be:
> - you are using a 32bit binary and running up against the 4GB limit of the architecture
From asosimi at unilag.edu.ng Sat Mar 22 05:12:33 2014
From: asosimi at unilag.edu.ng (Adeyanju Sosimi)
Date: Sat, 22 Mar 2014 13:12:33 +0100 (WAT)
Subject: [SRILM User List] Using HTK LM Score and externally computed Tone n-gram score

I am currently working on a tone-based language. The language has a CV and V syllabic
structure. I have decided to adopt both a tone n-gram and a word n-gram as the prior
probability in developing the ASR system, that is, to combine the HTK LM score with a tone
n-gram score. I don't know how to accomplish this with HTK. I have developed a routine for
computing the tone n-gram in MATLAB, but how to use the computed tone n-gram together with
the HTK LM score has remained a challenge.

Also, I need your assistance with regard to tutorial materials/manuals on the SRILM
toolkit, or scripting files, for easier usage of the scripts.

From rimlaatar at yahoo.fr Thu Mar 27 07:38:10 2014
From: rimlaatar at yahoo.fr (Laatar Rim)
Date: Thu, 27 Mar 2014 14:38:10 +0000 (GMT)
Subject: [SRILM User List] Calculate perplexity

Dear Andreas,

To calculate perplexity I do this:

    lenovo at ubuntu:~/Documents/srilm$ ngram -lm class_based_model '/home/lenovo/Documents/srilm/ML_N_Class/IN_SRILM' -ppl '/home/lenovo/Documents/srilm/ML_N_Class/titi.txt'

titi.txt is my training data.

1- Should I also calculate perplexity on my test data?

2- How can I interpret this result:

    file /home/lenovo/Documents/srilm/ML_N_Class/titi.txt: 18657 sentences, 66817 words, 5285 OOVs
    0 zeroprobs, logprob= -259950 ppl= 1744.69 ppl1= 16773.8

What is the difference between ppl and ppl1?

Best regards,
Rim LAATAR
Computer engineer, École Nationale d'Ingénieurs de Sfax (ENIS)
Research master's student, Information Systems & New Technologies, FSEGS - NLP (TALN) track

From stolcke at icsi.berkeley.edu Thu Mar 27 10:25:35 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 27 Mar 2014 10:25:35 -0700
Subject: [SRILM User List] compute perplexity

On 3/19/2014 10:57 AM, Stefy D. wrote:
> Dear Andreas,
>
> thank you very much for replying.
>
> I trained both LMs using the "-unk" option like this:
>
>     $NGRAMCOUNT_FILE -order 3 -interpolate -kndiscount -unk -text $WORKING_DIR$OUT_CORPUS"lowercased."$TARGET -lm $WORKING_DIR"lm_a/lmodel.lm"

That explains why you are not getting OOVs reported in the ppl output.
Unknown words are mapped to <unk> and thus the LM has a probability for <unk>.

> For the OOV rate I created a vocabulary list for the training data and I used the
> unigram counts of the test set and the compute-oov-rate script like this:
>
>     $NGRAMCOUNT_FILE -order 1 -write-vocab "vocabularyTargetUnigram.txt" -text $WORKING_DIR$OUT_CORPUS"lowercased."$TARGET -sort
>     $NGRAMCOUNT_FILE -order 1 -text $WORKING_DIR"test.lowercased."$TARGET -write "unigramCounts_testdata.txt" -sort
>     $OOVCOUNT_FILE vocabularyTargetUnigram.txt unigramCounts_testdata.txt
>
> This is how I got the OOV rate mentioned in the first mail. Could you please let me
> know if I used the right commands to compute that?

You did it right.

> You said I should train LM_A using the vocabulary of corpus A + corpus X so that the
> perplexities can be compared. So I should train LM_A using only corpus A but the
> vocabulary of A + X? Sorry to be confused, but I thought that for estimating the LM the
> vocabulary should come from the same corpus used for estimation. I am using these LMs
> in SMT systems (a baseline and an adapted one). If I influence the baseline LM with
> vocabulary from the adaptation data, then the baseline is not really a baseline. Please
> tell me if I am thinking incorrectly.

You are right. What this illustrates is that perplexity alone is not a sufficient metric
for comparing LMs. In your scenario (LM adaptation) the expansion of the vocabulary is a
key component of the adaptation process, but LMs with different vocabularies are no longer
comparable by ppl. My suggestion to unify the vocabularies was a workaround to allow you
to still use a perplexity comparison.

> Thank you for introducing me to statistical significance. To generate a table of
> word-level probabilities on the same test set, should I use get-unigram-probs? But
> where do I specify the test set?
>
>     $UNIGRAMPROBS_FILE linear=1 $WORKING_DIR"lm_a/lmodel.arpa."$TARGET > table_A.out

No, you get the word probabilities from the output of ngram -debug 2 -ppl (you need to
write some perl or whatever script to extract the probabilities).

> To get how many words had lower/same/greater probability in LM_B, is using the
> compare-ppls script ok? For example, I get this output when applying it to my 2 LMs
> (ngram -debug 2 on the same test set as in the previous commands):
>
>     $COMPARE_PPLS $WORKING_DIR"ppl_files/ppl_A_detail.ppl" $WORKING_DIR"ppl_files/ppl_B_detail.ppl"
>     output: total 22450, equal 0, different 22450, greater 11447

Yes, it seems compare-ppls extracts exactly the statistics I was talking about. I had
forgotten about it ...

Andreas
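Putting those last two answers together, the end-to-end significance check looks roughly
like this (an editorial sketch reusing the file names and counts already quoted above;
compare-ppls and cumbin are the SRILM helper tools referred to in the thread):

    # per-word probabilities for both models on the same test set
    ngram -debug 2 -order 3 -lm lm_A/lmodel.lm -ppl test.lowercased.$TARGET > ppl_A_detail.ppl
    ngram -debug 2 -order 3 -lm lm_B/lmodel.lm -ppl test.lowercased.$TARGET > ppl_B_detail.ppl

    # count for how many tokens model B assigns higher probability
    compare-ppls ppl_A_detail.ppl ppl_B_detail.ppl
    # -> total 22450, equal 0, different 22450, greater 11447

    # Sign test on those counts
    cumbin 22450 11447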
From stolcke at icsi.berkeley.edu Thu Mar 27 10:44:54 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 27 Mar 2014 10:44:54 -0700
Subject: [SRILM User List] Calculate perplexity

On 3/27/2014 7:38 AM, Laatar Rim wrote:
> titi.txt is my training data.
>
> 1- Should I also calculate perplexity on my test data?

Yes; in fact, perplexity is usually reported on test data (data not used in training the
model), since otherwise you get a very biased estimate.

> 2- How can I interpret this result:
>
>     file /home/lenovo/Documents/srilm/ML_N_Class/titi.txt: 18657 sentences, 66817 words, 5285 OOVs
>     0 zeroprobs, logprob= -259950 ppl= 1744.69 ppl1= 16773.8
>
> What is the difference between ppl and ppl1?

OOVs is the count of words that don't occur in the vocabulary (technically, that are
mapped to <unk>) and have zero probability. zeroprobs refers to any other words that have
zero probability. These counts are reported because they are not included in the
perplexity computation.

ppl is the standard perplexity, where end-of-sentence tokens (</s>) are counted in the
denominator. ppl1 is the same thing, but </s> tokens are not counted in the denominator.

Andreas
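In formulas (an editorial addition based on the definitions above and the srilm-faq; with
N = sentences, W = words, O = OOVs, Z = zeroprobs, L = logprob, all logs base 10):

    ppl  = 10^{-L / (W - O - Z + N)}
    ppl1 = 10^{-L / (W - O - Z)}

Plugging in the numbers quoted above (L = -259950, W = 66817, O = 5285, Z = 0, N = 18657)
gives 10^{259950/80189} and 10^{259950/61532}, which reproduce the reported 1744.69 and
16773.8 up to the rounding of logprob.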
From stolcke at icsi.berkeley.edu Fri Mar 28 10:18:11 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 28 Mar 2014 10:18:11 -0700
Subject: [SRILM User List] Calculate perplexity

On 3/28/2014 2:05 AM, Laatar Rim wrote:
> Thanks.
> So my replace-words-with-classes file should not contain the words from the test data?

Knowing which words should be in which class should be considered part of the training
process, or come from prior knowledge. If your application gives you the class membership
of the words in the test data then you can add it; otherwise it would be "training on test
data".

Andreas

From stolcke at icsi.berkeley.edu Mon Mar 31 17:44:48 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 31 Mar 2014 17:44:48 -0700
Subject: [SRILM User List] perplexity

On 3/31/2014 4:26 AM, Laatar Rim wrote:
> Dear Andreas,
>
> Please, I have a question. You say: "Knowing which words should be in which class should
> be considered part of the training process, or come from prior knowledge. If your
> application gives you the class membership of the words in the test data then you can
> add it; otherwise it would be 'training on test data'."
>
> Do you mean that my IN_SRILM file (in classes-format, the file format for word class
> definitions: class [p] word1 word2 ...) should contain both the words that exist in my
> training data and those in my test data, or should it contain only words from the
> training data?

You should only use words in the training data, plus any other knowledge sources or
databases that are different from the test data.

In many application domains that involve semantic knowledge you have additional
information about the task domain from which you can infer class membership. For example,
if you are working in the air travel domain, you probably have a list of all airport
cities, and you can create a word class from that.

Andreas
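To make the classes-format concrete, here is a hypothetical class-definition file and the
usual way it is applied (an editorial sketch; the class names, words, probabilities, and
file names are invented for illustration, not taken from the thread):

    airtravel.classes (classes-format: CLASS [prob] expansion words):

        CITY 0.5 boston
        CITY 0.3 denver
        CITY 0.2 seattle
        AIRLINE lufthansa
        AIRLINE united airlines

    # replace words by their classes in the training text, train, then evaluate
    replace-words-with-classes classes=airtravel.classes train.txt > train.classes.txt
    ngram-count -order 3 -text train.classes.txt -lm class.lm
    ngram -lm class.lm -classes airtravel.classes -ppl test.txt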