From A20766 at motorola.com Sat Jan 3 18:34:11 2009
From: A20766 at motorola.com (Cai Hongbin-A20766)
Date: Sun, 4 Jan 2009 10:34:11 +0800
Subject: Format of LMs
Message-ID: <4EBACE1519E1C3418FBC204BC85BF87A05ED471B@zmy16exm68.ds.mot.com>

Hi,
In recent days I am doing some evaluation on some SRI training tools.
I met problems when I tried to use skipping LMs and factored LMs.
What is the format of these models?
As for skipping LMs, what is the meaning of the last part at the end
of the LM file?
\end\ ## the end of a normal LM file
-pau- 0.5
0.5
0
0.0041594 (how to apply these coef. to some beam-search engine?)
As for the factored LMs, I trained a bigram, and got a result that
there seemed to be no backing-off coef. in the unigram section.
And what is the meaning of the coefficients right after the 2-gram probs?
...
\0x0-grams:
-1.071043
-1.281587 (where is the backing-off coef.? )
...
\0x1-grams:
-2.178066 86AA B2BB -0.7455529(what is the meaning of these coef.?)
-0.9450388 86AA B6BA_BAC5
-1.72854 86AA CBF4
-1.281777 86AA CECA_BAC5
-6.393295 -0.9474632
Anyone can show me some helpful reference? Thanks a lot.

Best Regards,
Rick Cai

From stolcke at speech.sri.com Sat Jan 3 19:01:24 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 03 Jan 2009 19:01:24 -0800
Subject: Format of LMs
In-Reply-To: <4EBACE1519E1C3418FBC204BC85BF87A05ED471B@zmy16exm68.ds.mot.com>
References: <4EBACE1519E1C3418FBC204BC85BF87A05ED471B@zmy16exm68.ds.mot.com>
Message-ID: <49602684.6090606@speech.sri.com>

Cai Hongbin-A20766 wrote:
> Hi,
> In recent days I am doing some evaluation on some SRI training tools.
> I met problems when I tried to use skipping LMs and factored LMs.
> What is the format of these models?
> As for skipping LMs, what is the meaning of the last part at the end
> of the LM file?
> \end\ ## the end of a normal LM file
> -pau- 0.5
> 0.5
> 0
> 0.0041594 (how to apply these coef.
> to some beam-search engine?)

I cannot answer the last question, but the numbers in the word list
following \end\ represent the probabilities with which a word in the
history is "skipped". So if the skip probability of a word x is p and
x occurs in a history before a word w, the probability of w is
estimated as (1-p) times the regular ngram probability + p times the
ngram probability with x removed from the history.

> As for the factored LMs, I trained a bigram, and got a result that
> there seemed to be no backing-off coef. in the unigram section.
> And what is the meaning of the coefficients right after the 2-gram probs?
> ...
> \0x0-grams:
> -1.071043
> -1.281587 (where is the backing-off coef.? )
> ...
> \0x1-grams:
> -2.178066 86AA B2BB -0.7455529(what is the meaning of these coef.?)
> -0.9450388 86AA B6BA_BAC5
> -1.72854 86AA CBF4
> -1.281777 86AA CECA_BAC5
> -6.393295 -0.9474632
> Anyone can show me some helpful reference? Thanks a lot.

The best documentation of FLMs can be found at
http://ssli.ee.washington.edu/people/duh/papers/flm-manual.pdf,
but I don't see an explanation of the modified backoff model file
format there. It is probably best to either read the code, or contact
bilmes at ee.washington.edu, who wrote most of the code.

Andreas

From deliverable at gmail.com Sat Jan 3 19:30:01 2009
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Sat, 3 Jan 2009 22:30:01 -0500
Subject: Kneser-Ney context counts
Message-ID:

Greetings Andreas -- I'd like to access the number of contexts for any
given ngram, used in Kneser-Ney computation (those with the fat dot).
What's a good way to get at them via the C++ API?
Cheers,
Alexy

From stolcke at speech.sri.com Sat Jan 3 20:14:05 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 03 Jan 2009 20:14:05 -0800
Subject: Kneser-Ney context counts
In-Reply-To:
References:
Message-ID: <4960378D.2050705@speech.sri.com>

Alexy Khrabrov wrote:
> Greetings Andreas -- I'd like to access the number of contexts for any
> given ngram, used in Kneser-Ney computation (those with the fat dot).
> What's a good way to get at them via the C++ API?

You create a ModKneserNey object (Discount.h). Be sure to leave the
countsAreModified parameter at the default value (false).

Then invoke ModKneserNey::estimate() on your counts. As a side effect,
the lower-order counts will be modified to reflect the context type
counts. Note that the counts of ngrams starting with <s> are unchanged
since there are no preceding words for them.

Andreas

From mr.spoon21 at gmail.com Mon Jan 5 04:29:14 2009
From: mr.spoon21 at gmail.com (Mr.SpOOn)
Date: Mon, 5 Jan 2009 13:29:14 +0100
Subject: Installation: failing the tests
Message-ID: <8f67b6f80901050429y2df1d196td839635509f6a189@mail.gmail.com>

Hi,
I'm new here.

I'm trying to install SRILM on an Ubuntu 8.04. I thought I built
everything fine, but I get errors in the test phase (point 7 of the
INSTALL file).

When I give the command:

make all

it starts the tests and appear a lot of things like this:

*** Running test hidden-ngram ***
Command exited with non-zero status 127
0.00user 0.00system 0:00.00elapsed 66%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+8outputs (0major+600minor)pagefaults 0swaps
hidden-ngram: stdout output DIFFERS.
hidden-ngram: stderr output DIFFERS.

When it finishes and I check the output files, I find that all .stdout
files are empty, while in the .stderr ones there are errors like
these:

./run-test: 16: ngram: not found
./run-test: 19: ngram-count: not found
./run-test: 22: ngram: not found

Now I don't really know what to do. What can be the problem?
I think I set the environment variables right. I put them in the ~/.bashrc file.

I did so:

SRILM=/home/carlo/ordinami/srilm

PATH3=$SRILM/bin/i686:$SRILM/bin

PATH=$PATH:$PATH1:$PATH2:$PATH3
PATH=$PATH:"~/bin/"
MANPATH=$MANPATH:$SRILM/man

export PATH ALIGN_BASE MANPATH

Can you help me?

Thanks.

From stolcke at speech.sri.com Mon Jan 5 20:55:00 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 05 Jan 2009 20:55:00 -0800
Subject: Installation: failing the tests
In-Reply-To: <8f67b6f80901050429y2df1d196td839635509f6a189@mail.gmail.com>
References: <8f67b6f80901050429y2df1d196td839635509f6a189@mail.gmail.com>
Message-ID: <4962E424.1030407@speech.sri.com>

Mr.SpOOn wrote:
> Hi,
> I'm new here.
>
> I'm trying to install SRILM on an Ubuntu 8.04. I thought I built
> everything fine, but I get errors in the test phase (point 7 of the
> INSTALL file).
>
> When I give the command:
>
> make all
>
> it starts the tests and appear a lot of things like this:
>
> *** Running test hidden-ngram ***
> Command exited with non-zero status 127
> 0.00user 0.00system 0:00.00elapsed 66%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+8outputs (0major+600minor)pagefaults 0swaps
> hidden-ngram: stdout output DIFFERS.
> hidden-ngram: stderr output DIFFERS.
>
> When it finishes and I check the output files, I find that all .stdout
> files are empty, while in the .stderr ones there are errors like
> these:
>
> ./run-test: 16: ngram: not found
> ./run-test: 19: ngram-count: not found
> ./run-test: 22: ngram: not found
>
> Now I don't really know what to do. What can be the problem?
>

It looks like the executables (ngram etc.) weren't generated. Make sure
$SRILM/bin/i686 actually contains these files. If not then examine the
make output for error messages from the compiler or linker.

> I think I set the environment variables right. I put them in the ~/.bashrc file.
>
> I did so:
>
> SRILM=/home/carlo/ordinami/srilm
>
> PATH3=$SRILM/bin/i686:$SRILM/bin
>
> PATH=$PATH:$PATH1:$PATH2:$PATH3
> PATH=$PATH:"~/bin/"
> MANPATH=$MANPATH:$SRILM/man
>
> export PATH ALIGN_BASE MANPATH
>
>
> Can you help me?
>

If it looks like the executables were generated, try 'which ngram' and
see if they are found.

Andreas

> Thanks.
>

From mr.spoon21 at gmail.com Tue Jan 6 01:31:43 2009
From: mr.spoon21 at gmail.com (Mr.SpOOn)
Date: Tue, 6 Jan 2009 10:31:43 +0100
Subject: Installation: failing the tests
In-Reply-To: <4962E424.1030407@speech.sri.com>
References: <8f67b6f80901050429y2df1d196td839635509f6a189@mail.gmail.com>
 <4962E424.1030407@speech.sri.com>
Message-ID: <8f67b6f80901060131i27858557ye0504856305a0c2f@mail.gmail.com>

2009/1/6 Andreas Stolcke :
> It looks like the executables (ngram etc.) weren't generated. Make sure
> $SRILM/bin/i686 actually contains
> these files. If not then examine the make output for error messages from
> the compiler or linker.

That directory doesn't contain any ngram executable. There is a
context-ngrams, or continuous-ngram-count, but not ngram.

How shall I examine messages from compiler and linker?

> If it looks like the executables were generated, try 'which ngram' and see if
> they are found.

The command gives no output.

From stolcke at speech.sri.com Tue Jan 6 09:54:00 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 06 Jan 2009 09:54:00 -0800
Subject: Installation: failing the tests
In-Reply-To: <8f67b6f80901060131i27858557ye0504856305a0c2f@mail.gmail.com>
References: <8f67b6f80901050429y2df1d196td839635509f6a189@mail.gmail.com>
 <4962E424.1030407@speech.sri.com>
 <8f67b6f80901060131i27858557ye0504856305a0c2f@mail.gmail.com>
Message-ID: <49639AB8.4000003@speech.sri.com>

Mr.SpOOn wrote:
> 2009/1/6 Andreas Stolcke :
>
>> It looks like the executables (ngram etc.) weren't generated. Make sure
>> $SRILM/bin/i686 actually contains
>> these files.
>> If not then examine the make output for error messages from
>> the compiler or linker.
>>
>
> That directory doesn't contain any ngram executable. There is a
> context-ngrams, or continuous-ngram-count, but not ngram.
>
> How shall I examine messages from compiler and linker?
>

They would be in the output from the "make" command, after the command
lines that invoke the compiler (typically something starting with "gcc").
If you cannot make sense of your situation you should probably consult
with a local person who has experience building software.

Andreas

From mr.spoon21 at gmail.com Tue Jan 6 17:03:03 2009
From: mr.spoon21 at gmail.com (Mr.SpOOn)
Date: Wed, 7 Jan 2009 02:03:03 +0100
Subject: Installation: failing the tests
In-Reply-To: <49639AB8.4000003@speech.sri.com>
References: <8f67b6f80901050429y2df1d196td839635509f6a189@mail.gmail.com>
 <4962E424.1030407@speech.sri.com>
 <8f67b6f80901060131i27858557ye0504856305a0c2f@mail.gmail.com>
 <49639AB8.4000003@speech.sri.com>
Message-ID: <8f67b6f80901061703o6469c1bfg5255b470d1dc0988@mail.gmail.com>

2009/1/6 Andreas Stolcke :
> They would be in the output from the "make" command, after the command lines
> that invoke the compiler (typically something starting with "gcc"). If you
> cannot make sense of your situation you should probably consult with a local
> person who has experience building software.

I did this:

make World > output

So, in the "output" file I've found this, that may be the problem:

make[2]: Leaving directory `/home/carlo/ordinami/srilm/dstruct/src'
make[2]: Entering directory `/home/carlo/ordinami/srilm/lm/src'
g++ -mtune=pentium3 -Wreturn-type -Wimplicit -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I/usr/include/tcl8.5 -I.
-I../../include -u matherr -L../../lib/i686 -g -O3 -o ../bin/i686/ngram ../obj/i686/ngram.o ../obj/i686/liboolm.a -lm -ldl ../../lib/i686/libflm.a ../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a -L/usr/lib/tcl8.5 -ltcl -lm 2>&1 | c++filt
/usr/bin/ld: cannot find -ltcl
collect2: ld returned 1 exit status
/home/carlo/ordinami/srilm/sbin/decipher-install 0555 ../bin/i686/ngram ../../bin/i686
ERROR: File to be installed (../bin/i686/ngram) does not exist.
ERROR: File to be installed (../bin/i686/ngram) is not a plain file.
Usage: decipher-install ...
mode: file permission mode, in octal
file1 ... fileN: files to be installed
directory: where the files should be installed

files = ../bin/i686/ngram
directory = ../../bin/i686
mode = 0555

touch -c ../../bin/i686/ngram

But I don't understand what's the problem.

Besides this, in the terminal (I think it's the standard error) appeared this:

mkdir: cannot create directory `include': File exists
mkdir: cannot create directory `lib': File exists
mkdir: cannot create directory `bin': File exists
make: [dirs] Error 1 (ignored)
make[2]: [../../bin/i686/ngram] Error 1 (ignored)
make[2]: [../../bin/i686/ngram-count] Error 1 (ignored)
make[2]: [../../bin/i686/ngram-merge] Error 1 (ignored)
make[2]: [../../bin/i686/ngram-class] Error 1 (ignored)
make[2]: [../../bin/i686/disambig] Error 1 (ignored)
make[2]: [../../bin/i686/anti-ngram] Error 1 (ignored)
make[2]: [../../bin/i686/nbest-lattice] Error 1 (ignored)
make[2]: [../../bin/i686/nbest-mix] Error 1 (ignored)
make[2]: [../../bin/i686/nbest-optimize] Error 1 (ignored)
make[2]: [../../bin/i686/nbest-pron-score] Error 1 (ignored)
make[2]: [../../bin/i686/segment] Error 1 (ignored)
make[2]: [../../bin/i686/segment-nbest] Error 1 (ignored)
make[2]: [../../bin/i686/hidden-ngram] Error 1 (ignored)
make[2]: [../../bin/i686/multi-ngram] Error 1 (ignored)
make[2]: [../../bin/i686/fngram-count] Error 1 (ignored)
make[2]: [../../bin/i686/fngram] Error 1 (ignored)
make[2]:
[../../bin/i686/lattice-tool] Error 1 (ignored)

Any suggestions?

From receiving07 at ckoei.com Wed Jan 7 16:03:00 2009
From: receiving07 at ckoei.com (Chris Oei)
Date: Wed, 7 Jan 2009 16:03:00 -0800
Subject: Installation: failing the tests
In-Reply-To: <8f67b6f80901061703o6469c1bfg5255b470d1dc0988@mail.gmail.com>
References: <8f67b6f80901050429y2df1d196td839635509f6a189@mail.gmail.com>
 <4962E424.1030407@speech.sri.com>
 <8f67b6f80901060131i27858557ye0504856305a0c2f@mail.gmail.com>
 <49639AB8.4000003@speech.sri.com>
 <8f67b6f80901061703o6469c1bfg5255b470d1dc0988@mail.gmail.com>
Message-ID: <89b1251a0901071603w4e4eca50yd2781c1bf278039b@mail.gmail.com>

I had the same problem when I tried to build it on Ubuntu 8.04.
First, you have to make sure that you have tcl/tk installed (sudo apt-get
install tcl-dev), and then you'll have to tell the compiler where the tcl
headers and libraries are, since the Makefile doesn't look in the right
place for Ubuntu 8.04 distros. Something like

ADDITIONAL_INCLUDES = -I/usr/include/tcl8.4

ought to do it.

There's probably other little tweaks I had to do (it's been a while), so if
you want to download my tarball-ed build (so you can compare it with your
setup and see the diffs), just let me know.

Best of luck,
Chris

On Tue, Jan 6, 2009 at 5:03 PM, Mr.SpOOn wrote:

> 2009/1/6 Andreas Stolcke :
> > They would be in the output from the "make" command, after the command
> > lines that invoke the compiler (typically something starting with "gcc").
> > If you cannot make sense of your situation you should probably consult
> > with a local person who has experience building software.
>
> I did this:
>
> make World > output
>
> So, in the "output" file I've found this, that may be the problem:
>
>
> make[2]: Leaving directory `/home/carlo/ordinami/srilm/dstruct/src'
> make[2]: Entering directory `/home/carlo/ordinami/srilm/lm/src'
> g++ -mtune=pentium3 -Wreturn-type -Wimplicit -DINSTANTIATE_TEMPLATES
> -D_FILE_OFFSET_BITS=64 -I/usr/include/tcl8.5 -I. -I../../include
> -u matherr -L../../lib/i686 -g -O3 -o ../bin/i686/ngram
> ../obj/i686/ngram.o ../obj/i686/liboolm.a -lm -ldl
> ../../lib/i686/libflm.a ../../lib/i686/libdstruct.a
> ../../lib/i686/libmisc.a -L/usr/lib/tcl8.5 -ltcl -lm 2>&1 | c++filt
> /usr/bin/ld: cannot find -ltcl
> collect2: ld returned 1 exit status
> /home/carlo/ordinami/srilm/sbin/decipher-install 0555
> ../bin/i686/ngram ../../bin/i686
> ERROR: File to be installed (../bin/i686/ngram) does not exist.
> ERROR: File to be installed (../bin/i686/ngram) is not a plain file.
> Usage: decipher-install ...
> mode: file permission mode, in octal
> file1 ... fileN: files to be installed
> directory: where the files should be installed
>
> files = ../bin/i686/ngram
> directory = ../../bin/i686
> mode = 0555
>
> touch -c ../../bin/i686/ngram
>
>
> But I don't understand what's the problem.
>
> Besides this, in the terminal (I think it's the standard error) appeared
> this:
>
> mkdir: cannot create directory `include': File exists
> mkdir: cannot create directory `lib': File exists
> mkdir: cannot create directory `bin': File exists
> make: [dirs] Error 1 (ignored)
> make[2]: [../../bin/i686/ngram] Error 1 (ignored)
> make[2]: [../../bin/i686/ngram-count] Error 1 (ignored)
> make[2]: [../../bin/i686/ngram-merge] Error 1 (ignored)
> make[2]: [../../bin/i686/ngram-class] Error 1 (ignored)
> make[2]: [../../bin/i686/disambig] Error 1 (ignored)
> make[2]: [../../bin/i686/anti-ngram] Error 1 (ignored)
> make[2]: [../../bin/i686/nbest-lattice] Error 1 (ignored)
> make[2]: [../../bin/i686/nbest-mix] Error 1 (ignored)
> make[2]: [../../bin/i686/nbest-optimize] Error 1 (ignored)
> make[2]: [../../bin/i686/nbest-pron-score] Error 1 (ignored)
> make[2]: [../../bin/i686/segment] Error 1 (ignored)
> make[2]: [../../bin/i686/segment-nbest] Error 1 (ignored)
> make[2]: [../../bin/i686/hidden-ngram] Error 1 (ignored)
> make[2]: [../../bin/i686/multi-ngram] Error 1 (ignored)
> make[2]: [../../bin/i686/fngram-count] Error 1 (ignored)
> make[2]: [../../bin/i686/fngram] Error 1 (ignored)
> make[2]: [../../bin/i686/lattice-tool] Error 1 (ignored)
>
> Any suggestions?
>

From deliverable at gmail.com Wed Jan 7 16:43:13 2009
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Wed, 7 Jan 2009 19:43:13 -0500
Subject: Kneser-Ney context counts
In-Reply-To: <4960378D.2050705@speech.sri.com>
References: <4960378D.2050705@speech.sri.com>
Message-ID: <20136C93-0C28-4055-9715-457EB705D3EF@gmail.com>

On Jan 3, 2009, at 11:14 PM, Andreas Stolcke wrote:

> Alexy Khrabrov wrote:
>> Greetings Andreas -- I'd like to access the number of contexts for
>> any given ngram, used in Kneser-Ney computation (those with the fat
>> dot). What's a good way to get at them via the C++ API?
> You create a ModKneserNey object (Discount.h). Be sure to leave the
> countsAreModified parameter at the default value (false).
>
> Then invoke ModKneserNey::estimate() on your counts. As a side
> effect, the lower-order counts will be modified to reflect the
> context type counts. Note that the counts of ngrams starting with
> <s> are unchanged since there are no preceding words for them.

OK. I am also wondering whether the number of contexts can be
reverse-engineered from the kncounts file -- since we have both counts
and kncounts?

Cheers,
Alexy

From mr.spoon21 at gmail.com Thu Jan 8 03:03:56 2009
From: mr.spoon21 at gmail.com (Mr.SpOOn)
Date: Thu, 8 Jan 2009 12:03:56 +0100
Subject: Installation: failing the tests
In-Reply-To: <89b1251a0901071603w4e4eca50yd2781c1bf278039b@mail.gmail.com>
References: <8f67b6f80901050429y2df1d196td839635509f6a189@mail.gmail.com>
 <4962E424.1030407@speech.sri.com>
 <8f67b6f80901060131i27858557ye0504856305a0c2f@mail.gmail.com>
 <49639AB8.4000003@speech.sri.com>
 <8f67b6f80901061703o6469c1bfg5255b470d1dc0988@mail.gmail.com>
 <89b1251a0901071603w4e4eca50yd2781c1bf278039b@mail.gmail.com>
Message-ID: <8f67b6f80901080303l43f72492i9918349789a8838d@mail.gmail.com>

I solved the problem. First I just tried removing the option -ltcl from the line:

TCL_LIBRARY = -L/usr/lib/tcl8.5 -ltcl

It worked.
But then I tried installing tcl with:

sudo apt-get install tcl

I put the option -ltcl back and it works the same. I wonder what's
the difference between the packages tcl8.5 and tcl.

Anyway, thanks.

Now I can run the tests. One last doubt: is it normal that in some
tests the output differs?
For example here:

*** Running test disambig ***
0.46user 0.06system 0:00.74elapsed 71%CPU (0avgtext+0avgdata 0maxresident)k
16inputs+800outputs (0major+3623minor)pagefaults 0swaps
disambig: stdout output DIFFERS.
disambig: stderr output IDENTICAL.

From mjglab at googlemail.com Tue Jan 13 06:43:54 2009
From: mjglab at googlemail.com (Matt Green)
Date: Tue, 13 Jan 2009 14:43:54 +0000
Subject: can srilm cope with xml tagged corpora?
Message-ID: <6aae75000901130643p38335ef1w6b7eda59224add37@mail.gmail.com>

I'd like to use srilm to generate bigram counts from the British National
Corpus in XML format. I see that the paper
"SRILM - An Extensible Language Modeling Toolkit", in Proc. Intl. Conf.
Spoken Language Processing, Denver, Colorado, September 2002
mentions that support for SGML-tagged formats is regarded as desirable: has
this support been implemented in the toolkit at this time please?

thanks,
--matt

From stolcke at speech.sri.com Tue Jan 13 08:27:00 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 13 Jan 2009 21:57:00 +0530
Subject: can srilm cope with xml tagged corpora?
In-Reply-To: <6aae75000901130643p38335ef1w6b7eda59224add37@mail.gmail.com>
References: <6aae75000901130643p38335ef1w6b7eda59224add37@mail.gmail.com>
Message-ID: <496CC0D4.3030106@speech.sri.com>

Matt Green wrote:
> I'd like to use srilm to generate bigram counts from the British
> National Corpus in XML format. I see that the paper
> "SRILM - An Extensible Language Modeling Toolkit", in Proc. Intl.
> Conf. Spoken Language Processing, Denver, Colorado, September 2002
> mentions that support for SGML-tagged formats is regarded as
> desirable: has this support been implemented in the toolkit at this
> time please?
>

There's been a conscious decision to leave all text processing,
filtering, conditioning, etc. out of SRILM as it tends to be too
application-specific. So you'll have to use other available tools or
your own to convert SGML to a pure ascii format, with words separated
by whitespace.
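
As an illustration of the conversion step described above, here is a
minimal Python sketch that is not part of SRILM. It assumes the corpus is
well-formed XML and that each sentence sits in an element named "s"; the
element name, the sample markup, and the function name sentences_from_xml
are all assumptions made for the example, not details taken from this
thread. It drops the tags and writes one sentence per line with words
separated by single spaces. Tokenization of punctuation and case folding
are left out.

```python
import xml.etree.ElementTree as ET

# Tiny made-up sample; the tag and attribute names are illustrative only.
SAMPLE = (
    "<doc><s n='1'><w pos='ART'>The </w><w pos='SUBST'>cat </w>"
    "<w pos='VERB'>sat</w><c>.</c></s>"
    "<s n='2'><w pos='PRON'>It </w><w pos='VERB'>purred</w><c>.</c></s></doc>"
)


def sentences_from_xml(xml_text, sentence_tag="s"):
    """Yield one plain-text line per sentence element, with markup removed."""
    root = ET.fromstring(xml_text)
    for sent in root.iter(sentence_tag):
        # itertext() visits every nested text node, so word-level tags vanish.
        words = "".join(sent.itertext()).split()
        if words:  # skip sentence elements that contain no text
            yield " ".join(words)


if __name__ == "__main__":
    for line in sentences_from_xml(SAMPLE):
        print(line)
```

Redirecting that output to a file (say corpus.txt, a hypothetical name)
gives text that could then be counted with something like
"ngram-count -text corpus.txt -order 2 -write counts.txt".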

Andreas

> thanks,
> --matt

From stolcke at speech.sri.com Tue Mar 31 16:19:13 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 31 Mar 2009 16:19:13 PDT
Subject: srilm-user mailing list fixed
Message-ID: <200903312319.n2VNJDo29013@ns2>

It seems that a mail server software upgrade in late January had
effectively kept the srilm-user mailing list from working at all.
It should be fixed now.

I apologize to all those who sent mail that ended up in the bit bucket.
Please resend your messages if possible.

--Andreas