From bassam_qatab at hotmail.com Mon Jul 9 17:20:35 2012
From: bassam_qatab at hotmail.com (Bassam Al_Qatab)
Date: Tue, 10 Jul 2012 03:20:35 +0300
Subject: [SRILM User List] Confidence Measure
Message-ID: 

Dear all,

I want to develop a pronunciation verification system. I have already developed an automatic speech recognition (ASR) system using the HTK toolkit. I used SRILM to convert the HTK lattice to a confusion network, and then the word posterior probabilities were calculated with SRILM. My understanding is that, first, I have to save the word posterior probabilities for the words (based on the given sentences). Next, I obtain the word posterior probabilities for the given utterance, which should be among the saved sentences. Finally, I divide the obtained word posterior probability by the saved one; the output should be between 0 and 1, and this value is compared against a threshold for accepting or rejecting the word. My question: is that all we need, or do I also have to calculate a confidence measure? For the confidence measure I want to know how to calculate it (if there is any tutorial). Can anyone help, or send me a link or a paper describing the procedure? Thank you in advance.

Bassam Ali Qasem Al-Qatab
Master Of Software Engineering
Faculty Of Computer Science and Information Technology
University of Malaya
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pranavshriram at gmail.com Mon Jul 9 20:59:34 2012
From: pranavshriram at gmail.com (Pranav Jawale)
Date: Tue, 10 Jul 2012 09:29:34 +0530
Subject: [SRILM User List] Confidence Measure
In-Reply-To: References: Message-ID: 

> Or should I have to calculate the confidence measure?

Word posterior probability computed using confusion network itself IS a confidence measure. But it is not the only one, there are many others too. e.g. see
[] H. Jiang, "Confidence measures for speech recognition: A survey", Speech Communication, 2005
[] F. Wessel, R. Schlüter, K. Macherey and H. Ney, "Confidence measures for large vocabulary continuous speech recognition", IEEE Transactions on Speech and Audio Processing, 2001
For pronunciation evaluation at phone-level, you may need to compute phone posterior probability using a phone decoder.
--
The best way to get something done is to begin. ~Author Unknown
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From bassam_qatab at hotmail.com Mon Jul 9 23:09:10 2012
From: bassam_qatab at hotmail.com (Bassam Al_Qatab)
Date: Tue, 10 Jul 2012 09:09:10 +0300
Subject: [SRILM User List] Confidence Measure
In-Reply-To: References: Message-ID: 

Dear Pranav, first, thank you for your reply. I have read the two papers (especially the second one), but I will read them again to figure out the other things to include. For the phone decoder, you mean that the output of the recognizer (decoder) will be in the phone level (not a word level)? Thank you.

Bassam.

From: pranavshriram at gmail.com
Date: Tue, 10 Jul 2012 09:29:34 +0530
Subject: Re: [SRILM User List] Confidence Measure
To: bassam_qatab at hotmail.com
CC: srilm-user at speech.sri.com

Or should I have to calculate the confidence measure?
Word posterior probability computed using confusion network itself IS a confidence measure.
But it is not the only one, there are many others too. e.g. see
[] H. Jiang, "Confidence measures for speech recognition: A survey", Speech Communication, 2005
[] F. Wessel, R. Schlüter, K. Macherey and H. Ney, "Confidence measures for large vocabulary continuous speech recognition", IEEE Transactions on Speech and Audio Processing, 2001
For pronunciation evaluation at phone-level, you may need to compute phone posterior probability using a phone decoder.
--
The best way to get something done is to begin. ~Author Unknown
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pranavshriram at gmail.com Mon Jul 9 23:54:52 2012
From: pranavshriram at gmail.com (Pranav Jawale)
Date: Tue, 10 Jul 2012 12:24:52 +0530
Subject: [SRILM User List] Confidence Measure
In-Reply-To: References: Message-ID: 

> For the phone decoder, you mean that the output of the recognizer (decoder)
> will be in the phone level (not a word level)?

Yes. For example, see S.M. Witt, S.J. Young, "Phone-level pronunciation scoring and assessment for interactive language learning", Speech Communication, Volume 30, Issues 2-3, February 2000
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lqin at cs.cmu.edu Wed Jul 11 15:15:11 2012
From: lqin at cs.cmu.edu (Long Qin)
Date: Wed, 11 Jul 2012 18:15:11 -0400
Subject: [SRILM User List] lattice-tool error while loading mesh
Message-ID: <4FFDFAEF.8070208@cs.cmu.edu>

Hi,

I tried to load a confusion network in the mesh format using the lattice-tool. The mesh contains word level time mark and scores. The command I used is

"lattice-tool -read-mesh -in-lattice mesh"

And the lattice-tool outputs the following error message:

mesh: line 5: invalid word info
error reading mesh

The mesh file was generated using the "nbest-lattice" tool. So I guess the format should be correct. Then the question is can lattice-tool work with mesh file with word level time mark and scores?

Another question I want to ask is how can I convert nbest hypotheses into a word lattice with time marks, acoustic and LM scores? It seems the nbest-lattice tool can only produce a lattice without those word level information.

Thanks,
Long Qin

From stolcke at icsi.berkeley.edu Wed Jul 11 17:25:37 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 11 Jul 2012 17:25:37 -0700
Subject: [SRILM User List] lattice-tool error while loading mesh
In-Reply-To: <4FFDFAEF.8070208@cs.cmu.edu>
References: <4FFDFAEF.8070208@cs.cmu.edu>
Message-ID: <4FFE1981.7010403@icsi.berkeley.edu>

On 7/11/2012 3:15 PM, Long Qin wrote:
> Hi,
>
> I tried to load a confusion network in the mesh format using the
> lattice-tool. The mesh contains word level time mark and scores. The
> command I used is "lattice-tool -read-mesh -in-lattice mesh" And the
> lattice-tool outputs the following error message:
>
> mesh: line 5: invalid word info
> error reading mesh
>
> The mesh file was generated using the "nbest-lattice" tool. So I guess
> the format should be correct. Then the question is can lattice-tool
> work with mesh file with word level time mark and scores?

Did you generate the mesh using nbest-lattice -use-mesh? Can you send a small sample file? It should not happen.

> Another question I want to ask is how can I convert nbest hypotheses
> into a word lattice with time marks, acoustic and LM scores? It seems
> the nbest-lattice tool can only produce a lattice without those word
> level information.
Actually, nbest-lattice will do this when (1) the -nbest-backtrace option is given and (2) the nbest lists are in the 'NBestList2.0' format. See the nbest-format(5) man page. The format is awkward if you don't happen to be using the SRI Decipher recognizer, but it should work if you carefully convert your data into this format. Andreas > > Thanks, > Long Qin > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From stolcke at icsi.berkeley.edu Wed Jul 11 19:03:13 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 11 Jul 2012 19:03:13 -0700 Subject: [SRILM User List] lattice-tool error while loading mesh In-Reply-To: <4FFE22AE.7080501@cs.cmu.edu> References: <4FFDFAEF.8070208@cs.cmu.edu> <4FFE1981.7010403@icsi.berkeley.edu> <4FFE22AE.7080501@cs.cmu.edu> Message-ID: <4FFE3061.9000305@icsi.berkeley.edu> On 7/11/2012 6:04 PM, Long Qin wrote: > Hi Andreas, > > Thanks for answering my question. > > Yes, I did use the nbest-lattice -use-mesh -nbest-backtrace. The > attachment files are the nbest file and the mesh file. Is there > anything wrong with it? Your nbest file is fine, but there was a bug in the nbest list parser that would lead to incorrect mesh files when no pronunciation information was given (as in your case). The attached patch should fix this. You will have to rebuild nbest-lattice and then regenerate the mesh file. Andreas -------------- next part -------------- diff -c -r1.83 NBest.cc *** lm/src/NBest.cc 6 Jul 2012 06:43:26 -0000 1.83 --- lm/src/NBest.cc 12 Jul 2012 01:57:40 -0000 *************** *** 620,626 **** /* * save pronunciation info for previous word */ ! if (prevWordInfo) { prevWordInfo->phones = strdup(phones); assert(prevWordInfo->phones != 0); --- 620,626 ---- /* * save pronunciation info for previous word */ ! if (prevWordInfo && phones[0] != '\0') { prevWordInfo->phones = strdup(phones); assert(prevWordInfo->phones != 0); *************** *** 695,701 **** /* * save pronunciation info for last word */ ! if (prevWordInfo) { prevWordInfo->phones = strdup(phones); assert(prevWordInfo->phones != 0); --- 695,701 ---- /* * save pronunciation info for last word */ ! if (prevWordInfo && phones[0] != '\0') { prevWordInfo->phones = strdup(phones); assert(prevWordInfo->phones != 0); From chenmengdx at gmail.com Wed Jul 18 03:40:37 2012 From: chenmengdx at gmail.com (Meng Chen) Date: Wed, 18 Jul 2012 18:40:37 +0800 Subject: [SRILM User List] How to train LM fast with large corpus Message-ID: Hi, I want to ask how to train N-gram language model with SRILM if the corpus is very large (100GB). Should I still use the command of *ngram-count *? Or use *make-big-lm* instead? I also want to know if there is any limitation of training corpus in vocabulary and size with SRILM? Thanks! Meng CHEN -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Wed Jul 18 05:13:23 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 18 Jul 2012 05:13:23 -0700 Subject: [SRILM User List] How to train LM fast with large corpus In-Reply-To: References: Message-ID: <5006A863.9030903@icsi.berkeley.edu> On 7/18/2012 3:40 AM, Meng Chen wrote: > Hi, I want to ask how to train N-gram language model with SRILM if the > corpus is very large (100GB). Should I still use the command of > *ngram-count*? Or use *make-big-lm* instead? 
I also want to know if > there is any limitation of training corpus in vocabulary and size with > SRILM? > Thanks! Definitely make-big-lm. Read the FAQ on handling large data. You are limited by computer memory but it is not possible to give a hard limit, it depends on the properties of your data. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From shinichiro.hamada at gmail.com Thu Jul 19 16:47:19 2012 From: shinichiro.hamada at gmail.com (shinichiro.hamada) Date: Fri, 20 Jul 2012 08:47:19 +0900 Subject: [SRILM User List] counts in ngram-count output Message-ID: <2EEDEC4377C14CA694D50D5DD5ACCCAC@f91> Hi, I have a question if my outputs of ngram-count are correct or not. I made a fractional word-count file by my own program and executed ngram-count command with wb discount. The header of outputs were bellow: -------------------------- [4gram wb float-count] ngram-count -read countfile_float -float-counts -order 4 -lm outfile \ -wbdiscount -wbdiscount1 -wbdiscount2 -wbdiscount3 ngram 1=780387 ngram 2=20321 ngram 3=2692 ngram 4=2622 .. -------------------------- I thought higher order models have always more counts than lower order ones, but the above result wasn't so. Does this result designate that my word-count file has bug? ---------------------------------------------------------------------- For further investigation, I made a integer word-count file by scaling and truncating (I know it is inappropriate approximation) and executed ngram-count with other discount methods. But higher order models doesn't have always more counts than lower order ones in this result too. -------------------------- [4gram none int-count] ngram-count -read countfile_int -order 3 -lm outfile \ -gt1min 0 -gt1max 0 -gt2min 0 -gt2max 0 -gt3min 0 -gt3max 0 ngram 1=780387 ngram 2=871835 ngram 3=1310979 ngram 4=1038980 -------------------------- [4gram gt int-count] ngram-count -read countfile_int -order 3 -lm outfile \ ngram 1=780387 ngram 2=871835 ngram 3=1170462 ngram 4=1038980 -------------------------- [4gram natural int-count] ngram-count -read countfile_int -order 3 -lm outfile \ -ndiscount -ndiscount1 -ndiscount2 -ndiscount3 ngram 1=780387 ngram 2=871835 ngram 3=1170339 ngram 4=1038858 Any advices will help me very much. Thank you in advance. -- Shincihiro Hamada From nouf.alharbi at yahoo.com Fri Jul 20 05:04:55 2012 From: nouf.alharbi at yahoo.com (Nouf Al-Harbi) Date: Fri, 20 Jul 2012 13:04:55 +0100 (BST) Subject: [SRILM User List] Predicting words Message-ID: <1342785895.7010.YahooMailNeo@web171304.mail.ir2.yahoo.com> Hello, I am new to language modeling and was hoping that someone can help me with the following. I try to predict a word given an input sentence. For example, I would like to get a word replacing the ... that has the highest probability in sentences such as ' A man is ...' (e.g. sitting). I try to use disambig tool but I couldn't found any example illustrate how to use it especially how how I can create the map file and what is the type of this file ( e.g. txt, arpa, ...). Many thanks in advance, Nouf -------------- next part -------------- An HTML attachment was scrubbed... 
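Regarding the word-prediction question above, one simple way to rank candidate completions with standard SRILM tools is sketched here. It is only an illustration, not necessarily what disambig does; lm.arpa and candidates.txt are placeholder names, and a trained language model is assumed.

  # candidates.txt holds one full candidate sentence per line, e.g.
  #   A man is sitting
  #   A man is running
  ngram -lm lm.arpa -order 3 -ppl candidates.txt -debug 1

With -debug 1, ngram prints the log probability of each sentence, and the candidate whose sentence gets the highest log probability is the model's preferred completion.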
URL: From stolcke at icsi.berkeley.edu Fri Jul 20 08:55:42 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 20 Jul 2012 10:55:42 -0500 Subject: [SRILM User List] counts in ngram-count output In-Reply-To: <2EEDEC4377C14CA694D50D5DD5ACCCAC@f91> References: <2EEDEC4377C14CA694D50D5DD5ACCCAC@f91> Message-ID: <50097F7E.5090206@icsi.berkeley.edu> On 7/19/2012 6:47 PM, shinichiro.hamada wrote: > Hi, I have a question if my outputs of ngram-count are correct or not. > > I made a fractional word-count file by my own program and executed > ngram-count command with wb discount. The header of outputs were > bellow: > > -------------------------- > [4gram wb float-count] > ngram-count -read countfile_float -float-counts -order 4 -lm outfile \ > -wbdiscount -wbdiscount1 -wbdiscount2 -wbdiscount3 > > ngram 1=780387 > ngram 2=20321 > ngram 3=2692 > ngram 4=2622 > .. > -------------------------- > > I thought higher order models have always more counts than lower > order ones, but the above result wasn't so. Does this result > designate that my word-count file has bug? This is probably because the defaults for minimum count frequency are higher for trigrams and 4grams than for bigrams. For bigrams it is 1, whereas for 3grams and higher it is 2. You should see the expected behavior if you add -gt3min 1 -gt4min 1 to the options. (As explained in the man page, -gtXmin options apply to all discounting methods, not just GT.) Andreas From saraelec at yahoo.com Sun Jul 22 20:05:56 2012 From: saraelec at yahoo.com (sara) Date: Sun, 22 Jul 2012 20:05:56 -0700 (PDT) Subject: [SRILM User List] create LM for one topic Message-ID: <1343012756.90490.YahooMailClassic@web162303.mail.bf1.yahoo.com> Hi, I am new to SRILM and I want to create language model for one topic. I have used the online tool. The results shows two different probabilities. could you please help me how I can build the language model for one topic? Many thanks, Sara The results: \1-grams: -1.2884 -0.3010 -1.2884 -0.2781 -1.6564 A -0.2913 -2.1335 ABBREVIATIONS -0.2978 -2.1335 ACRONYMS -0.2978 -2.1335 AN -0.2946 -2.1335 AND -0.2978 -2.1335 ARE -0.2978 -1.8325 AS -0.2881 -2.1335 BE -0.2978 -2.1335 BEST -0.2978 -2.1335 BUT -0.2978 -2.1335 CAN -0.2978 -2.1335 CETERA -0.2781 -2.1335 EACH -0.2978 -2.1335 ENTERED -0.2946 -2.1335 ET -0.2978 -1.8325 EXAMPLE -0.2747 -2.1335 FEW -0.2978 -2.1335 FOR -0.2946 -2.1335 HUNDRED -0.2978 -1.6564 IS -0.2848 -2.1335 LETTERS -0.2978 -2.1335 LIMIT -0.2781 -2.1335 LINE -0.2913 -2.1335 NUMBERS -0.2978 -2.1335 OUGHT -0.2946 -2.1335 OUT -0.2978 -2.1335 PRONOUNCED -0.2946 -2.1335 RECOGNIZE -0.2781 -2.1335 SENTENCE -0.2781 -------------- next part -------------- An HTML attachment was scrubbed... URL: From shinichiro.hamada at gmail.com Tue Jul 24 11:05:08 2012 From: shinichiro.hamada at gmail.com (shinichiro.hamada) Date: Wed, 25 Jul 2012 03:05:08 +0900 Subject: [SRILM User List] counts in ngram-count output In-Reply-To: <50097F7E.5090206@icsi.berkeley.edu> References: <2EEDEC4377C14CA694D50D5DD5ACCCAC@f91> <50097F7E.5090206@icsi.berkeley.edu> Message-ID: <1A01F11B0513446E84A5D179F713FF26@f91> I haven't understood the specifications of the options. Thank you very much for pointing it out. I'll try it. 
Best regards, Shinichiro > -----Original Message----- > From: Andreas Stolcke [mailto:stolcke at icsi.berkeley.edu] > Sent: Saturday, July 21, 2012 12:56 AM > To: shinichiro.hamada > Cc: srilm-user at speech.sri.com > Subject: Re: [SRILM User List] counts in ngram-count output > > On 7/19/2012 6:47 PM, shinichiro.hamada wrote: > > Hi, I have a question if my outputs of ngram-count are > correct or not. > > > > I made a fractional word-count file by my own program and executed > > ngram-count command with wb discount. The header of outputs were > > bellow: > > > > -------------------------- > > [4gram wb float-count] > > ngram-count -read countfile_float -float-counts -order 4 > -lm outfile \ > > -wbdiscount -wbdiscount1 -wbdiscount2 -wbdiscount3 > > > > ngram 1=780387 > > ngram 2=20321 > > ngram 3=2692 > > ngram 4=2622 > > .. > > -------------------------- > > > > I thought higher order models have always more counts than > lower order > > ones, but the above result wasn't so. Does this result > designate that > > my word-count file has bug? > This is probably because the defaults for minimum count > frequency are higher for trigrams and 4grams than for bigrams. > For bigrams it is 1, whereas for 3grams and higher it is 2. > You should see the expected behavior if you add > > -gt3min 1 -gt4min 1 > > to the options. (As explained in the man page, -gtXmin > options apply to all discounting methods, not just GT.) > > Andreas From ma.farajian at gmail.com Tue Jul 24 23:40:09 2012 From: ma.farajian at gmail.com (amin farajian) Date: Wed, 25 Jul 2012 11:10:09 +0430 Subject: [SRILM User List] Error in compiling SRILM Message-ID: Hi all, I recently changed my machine, and I'm now trying to install the latest version of SRILM on it. I installed all the required tools and libraries (at least I hope so). but I couldn't finish the installation correctly. I checked everything that I thought could cause the problem, but I couldn't find anything. Some information about my new machine are: Machine Type: i686 (according to output of this script: srilm/sbin/machine-type) OS: kubuntu 12.04 (output of uname: 36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux) compiler version (output of "gcc -v"): gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) I changed the contents of srilm/common/Makefile.machine.i686 as described in installation instruction: CC = /usr/bin/gcc $(GCC_FLAGS) CXX = /usr/bin/g++ $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES and added this line to the file: NO_TCL = X but nothing changed in installation procedure. I also attached the output of make command. As could be seen in the file, the first error occurs in line 158: ERROR: File to be installed (../bin/i686/maxalloc) does not exist. Usage: decipher-install [-p] ... mode: file permission mode, in octal file1 ... fileN: files to be installed directory: where the files should be installed files = ../bin/i686/maxalloc directory = ../../bin/i686 mode = 0555 May I ask you to help me in this problem? Thank you in advance. Regards, M. Amin -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: make.output Type: application/octet-stream Size: 36535 bytes Desc: not available URL: From stolcke at icsi.berkeley.edu Thu Jul 26 00:07:36 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 26 Jul 2012 00:07:36 -0700 Subject: [SRILM User List] Error in compiling SRILM In-Reply-To: References: Message-ID: <5010ECB8.1010409@icsi.berkeley.edu> On 7/24/2012 11:40 PM, amin farajian wrote: > Hi all, > > I recently changed my machine, and I'm now trying to install the > latest version of SRILM on it. I installed all the required tools and > libraries (at least I hope so). but I couldn't finish the installation > correctly. I checked everything that I thought could cause the > problem, but I couldn't find anything. > Some information about my new machine are: > > Machine Type: i686 (according to output of this script: > srilm/sbin/machine-type) > OS: kubuntu 12.04 (output of uname: 36-Ubuntu SMP Tue Apr 10 20:39:51 > UTC 2012 x86_64 x86_64 x86_64 GNU/Linux) > compiler version (output of "gcc -v"): gcc version 4.6.3 > (Ubuntu/Linaro 4.6.3-1ubuntu5) > > I changed the contents of srilm/common/Makefile.machine.i686 as > described in installation instruction: > CC = /usr/bin/gcc $(GCC_FLAGS) > CXX = /usr/bin/g++ $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES > and added this line to the file: > NO_TCL = X > but nothing changed in installation procedure. > I also attached the output of make command. As could be seen in the > file, the first error occurs in line 158: > > ERROR: File to be installed (../bin/i686/maxalloc) does not exist. > Usage: decipher-install [-p] ... > mode: file permission mode, in octal > file1 ... fileN: files to be installed > directory: where the files should be installed > > files = ../bin/i686/maxalloc > directory = ../../bin/i686 > mode = 0555 > > May I ask you to help me in this problem? Based on the error message from the linker > /usr/bin/g++ -m32 -mtune=pentium3 -Wall -Wno-unused-variable > -Wno-uninitialized -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_B > ITS=64 -I. -I../../include -u matherr -L../../lib/i686 -g -O3 -o > ../bin/i686/lattice-tool ../obj/i686/lattice-tool > .o ../obj/i686/liblattice.a -lm -ldl ../../lib/i686/libflm.a > ../../lib/i686/liboolm.a ../../lib/i686/libdstruct.a ../../ > lib/i686/libmisc.a -lm 2>&1 | c++filt > /usr/bin/ld: skipping incompatible > /usr/lib/gcc/x86_64-linux-gnu/4.6/libstdc++.so when searching for -lstdc++ > /usr/bin/ld: skipping incompatible > /usr/lib/gcc/x86_64-linux-gnu/4.6/libstdc++.a when searching for -lstdc++ > /usr/bin/ld: cannot find -lstdc++ you don't have the 32bit version of libstdc++ installed. Try building 64bit binaries: make MACHINE_TYPE=i686-m64 World If that shows similar problem seek the advice of someone familiar with your Ubuntu installation. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From ma.farajian at gmail.com Thu Jul 26 01:29:51 2012 From: ma.farajian at gmail.com (Amin Farajian) Date: Thu, 26 Jul 2012 12:59:51 +0430 Subject: [SRILM User List] Error in compiling SRILM In-Reply-To: <5010ECB8.1010409@icsi.berkeley.edu> References: <5010ECB8.1010409@icsi.berkeley.edu> Message-ID: <5010FFFF.7080805@gmail.com> An HTML attachment was scrubbed... URL: From ee07b282 at gmail.com Thu Jul 26 14:00:59 2012 From: ee07b282 at gmail.com (xinrui yu) Date: Thu, 26 Jul 2012 14:00:59 -0700 Subject: [SRILM User List] Question about lattice-tool -nbest-decode Message-ID: Hi All, I'm new to srilm and I have some questions about finding nbest result by using srilm. 
I try Srilm by the command "./lattice-tool -read-htk -in-lattice test.lat -nbest-decode 10 -out-nbest-dir my_nbest_dir". I indeed get 10 results. But are the result placed in order? I read from manual page said that they are placed in order by default, I think they should placed according to the score (combine the acoustic and lm score) in front of it. But from what i have get, it's not. Am I misunderstanding something? I also notice that there are another command callled "nbest-lattice". I try to use it as well but it seems it does not accept HTK lattice. So could it be used to find nbest result. And what's the different bettwen lattice-tool and nbest-lattice? How to decide which one should be used? Another question is that I read from manual page there is one option called *-nbest-backtrace *which could preserve word-level timemarks and scores. Is there similar option for lattice tool? What if I want to keep those information while using lattice-tool? Thanks! Liz -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Thu Jul 26 15:35:00 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 26 Jul 2012 15:35:00 -0700 Subject: [SRILM User List] Question about lattice-tool -nbest-decode In-Reply-To: References: Message-ID: <5011C614.9090505@icsi.berkeley.edu> On 7/26/2012 2:00 PM, xinrui yu wrote: > Hi All, > I'm new to srilm and I have some questions about finding nbest result > by using srilm. > I try Srilm by the command "./lattice-tool -read-htk -in-lattice > test.lat -nbest-decode 10 -out-nbest-dir my_nbest_dir". I indeed get > 10 results. But are the result placed in order? I read from manual > page said that they are placed in order by default, I think they > should placed according to the score (combine the acoustic and lm > score) in front of it. But from what i have get, it's not. Am I > misunderstanding something? The output is sorted by score. You are probably not considering the way that the combined score is computed. You need to take the acoustic score, and added the weighted LM score and word insertion penalty. The LM weight and insertion penalties might have default values encoded in the lattices. You can override them on the command line. You might get the output you expect by using -htk-lm-scale 1 and -htk-wdpenalty 0, but that will probably not be the best result in terms of word error. > > I also notice that there are another command callled "nbest-lattice". > I try to use it as well but it seems it does not accept HTK lattice. > So could it be used to find nbest result. And what's the different > bettwen lattice-tool and nbest-lattice? How to decide which one should > be used? nbest-lattice takes nbest lists as INPUT and constructs a special type of lattice representing word posterior probabilities. So it is not what you need. > > Another question is that I read from manual page there is one option > called *-nbest-backtrace *which could preserve word-level timemarks > and scores. Is there similar option for lattice tool? What if I want > to keep those information while using lattice-tool? No, sorry. Andreas > > Thanks! > > Liz > > > > > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed... 
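A small worked example of the score combination described above (the numbers are made up for illustration): suppose a hypothesis has acoustic log score -3200.5, LM log score -45.2 and 10 words, and the lattice or command line specifies an LM scale of 12 and a word insertion penalty of -0.5. The value used for ranking is then

  -3200.5 + 12 * (-45.2) + 10 * (-0.5) = -3747.9

so hypotheses are ordered by this combined value, not by the raw acoustic or LM scores printed in the N-best entries, which is why the output can look unsorted if only one component score is inspected.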
URL: 

From alex.dan.tomescu at gmail.com Sat Jul 28 03:09:18 2012
From: alex.dan.tomescu at gmail.com (Alex Tomescu)
Date: Sat, 28 Jul 2012 13:09:18 +0300
Subject: [SRILM User List] Fwd: Batch no-sos and no-eos
In-Reply-To: References: Message-ID: 

Hi,

I need to make a language model from a set of 5000+ texts. The texts are separated into one sentence per line, so there are a lot of sentence boundary tokens which I need to get rid of.

I used make-batch-counts and merge-batch-counts to count the ngrams, and make-big-lm with -vocab -limit-vocab -no-sos -no-eos -prune, but still sentence boundaries were included.

I also tried make-batch-counts file_list | xargs -no-eos -no-sos, with the same results.

Removing '\n' from the text files resulted in "line 1: line too long".

I tried ngram-count with -no-eos -no-sos on one of the files and it worked, but on a batch it didn't seem to work.

Any ideas on what I should try next?

Thanks
--
Alexandru Tomescu, undergraduate Computer Science student at Polytechnic University of Bucharest

From tonyr at cantabresearch.com Sat Jul 28 04:16:20 2012
From: tonyr at cantabresearch.com (Tony Robinson)
Date: Sat, 28 Jul 2012 12:16:20 +0100
Subject: [SRILM User List] Fwd: Batch no-sos and no-eos
In-Reply-To: References: Message-ID: <5013CA04.8000907@cantabResearch.com>

Hi Alex,

<s> and </s> are not really "sentence boundary" tokens, even though that's what everyone calls them and that's how they are used most of the time. They are for the start and end of utterance contexts. So for your problem pick a suitably large chunk - let's say we decode a chapter at a time and have a <s> at the start and a </s> at the end and replace the rest with .

I'm back, so mail me if this doesn't make sense.

Tony

On 07/28/2012 11:09 AM, Alex Tomescu wrote:
> Hi
>
> I need to make a language model from a set of 5000+ texts. The texts
> are separated into one sentence per line so there are a lot of
> sentence boundary tokens which I need to get rid of.
>
> I used make-batch-counts and merge-batch-counts to count the ngrams,
> and make-big-lm with -vocab -limit-vocab -no-sos -no-eos -prune, but
> still sentence boundaries were included.
>
> I also tried make-batch-counts file_list | xargs -no-eos -no-sos, with
> the same results.
>
> Removing '\n' from the text files resulted in "line 1: line too long".
>
> I tried ngram-count with -no-eos -no-sos on one of the files and it
> worked, but on a batch it didn't seem to work.
>
> Any ideas on what I should try next?
>
> Thanks
> --
> Alexandru Tomescu, undergraduate Computer Science student at
> Polytechnic University of Bucharest
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

--
Dr A J Robinson, Founder and Director of Cantab Research Limited.
St Johns Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK.
Company reg no 05697423 (England and Wales), VAT reg no 925606030.

From stolcke at icsi.berkeley.edu Sat Jul 28 09:46:05 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sat, 28 Jul 2012 09:46:05 -0700
Subject: [SRILM User List] Fwd: Batch no-sos and no-eos
In-Reply-To: References: Message-ID: <5014174D.1080701@icsi.berkeley.edu>

On 7/28/2012 3:09 AM, Alex Tomescu wrote:
> Hi
>
> I need to make a language model from a set of 5000+ texts. The texts
> are separated into one sentence per line so there are a lot of
> sentence boundary tokens which I need to get rid of.
> > I used make-batch-counts and merge-batch counts to count the ngrams, > and make-big-lm with -vocab -limit-vocab -no-sos -no-eos -prune, but > still sentence boundaries we're included. I don't see this behavior. With make-big-lm -no-sos -no-eos it's true that and appear in the unigram section of the LM (they are still part of the vocabulary, similar to other words that might occur in your vocab file but don't occur in your training data), but there are not higher-order order N-gram involving or in the resulting LM. The same is true if you run ngram-count -no-sos -no-eos, so the two ways of building the LM are consistent in this regard. Presently, -no-sos -no-eos just affect the way ngrams are generated from text. After counts are extracted, they don't affect any part of the LM building process. It might make sense for these options to also modify the default vocab membership or and . Having the tags in the vocab without N-grams should be fine for most LM uses, but I can see an argument for removing them. Is that the behavior you are looking for? Andreas > > I also tried make-batch-counts file_list | xargs -no-eos -no-sos, with > the same results. > > Removing '\n' from the text files resulted in "line 1: line too long". > > I tried ngram-count with -no-eos -no-sos on one of the files and it > worked, but on a batch it didn't seem to work. > > Any ideas on what I should try next ? > > Thanks > -- > Alexandru Tomescu, undergraduate Computer Science student at > Polytechnic University of Bucharest > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From alex.dan.tomescu at gmail.com Sun Jul 29 03:46:21 2012 From: alex.dan.tomescu at gmail.com (Alex Tomescu) Date: Sun, 29 Jul 2012 13:46:21 +0300 Subject: [SRILM User List] Fwd: Batch no-sos and no-eos In-Reply-To: <5014174D.1080701@icsi.berkeley.edu> References: <5014174D.1080701@icsi.berkeley.edu> Message-ID: Hello, > I don't see this behavior. With make-big-lm -no-sos -no-eos it's true that and appear in the unigram section of the LM (they are still part of the vocabulary, similar to other words that might occur in your vocab file but don't occur in your training data), but there are not higher-order order N-gram involving or in the resulting LM. These are the exact parameters I passed to make-big-lm, and still I looked through the LM and there are ngrams containing ("-0.0009011862 ") make-big-lm -name biglm -read merge-iter9-1.ngrams.gz -lm gut.lm -no-eos -no-sos -prune 1e-8 -vocab ../gut.vocab -limit-vocab using existing gtcounts warning: discount coeff 1 is out of range: 1.1758 warning: discount coeff 3 is out of range: 1.11643 warning: discount coeff 5 is out of range: 1.17202 warning: discount coeff 7 is out of range: 1.12503 + ngram-count -read - -read-with-mincounts -order 3 -gt1 biglm.gt1 -gt2 biglm.gt2 -gt3 biglm.gt3 -lm gut.lm -no-eos -no-sos -prune 1e-8 -vocab ../gut.vocab -limit-vocab -meta-tag __meta__ It's really weird because when I tried ngram-count on a single file (very similar to the one triggered by make-big-lm), eos and sos tokens were only included in the unigrams. > Presently, -no-sos -no-eos just affect the way ngrams are generated from text. After counts are extracted, they don't affect any part of the LM building process. It might make sense for these options to also modify the default vocab membership or and . 
Having the tags in the vocab without N-grams should be fine for most LM uses, but I can see an argument for removing them. Is that the behavior you are looking for? It's ok if they are included as unigrams. I am going to make some more tests and if I find the problem I will post it. For the moment I can work around this by making bigger paragraphs so that there are not so many eos and sos tags. Thank you, Alex On Sat, Jul 28, 2012 at 7:46 PM, Andreas Stolcke wrote: > > On 7/28/2012 3:09 AM, Alex Tomescu wrote: >> >> Hi >> >> I need to make a language model from a set of 5000+ texts. The texts >> are separated into one sentence per line so there are a lot of >> sentence boundary tokens which I need to get rid of. >> >> I used make-batch-counts and merge-batch counts to count the ngrams, >> and make-big-lm with -vocab -limit-vocab -no-sos -no-eos -prune, but >> still sentence boundaries we're included. > > I don't see this behavior. With make-big-lm -no-sos -no-eos it's true that and appear in the unigram section of the LM (they are still part of the vocabulary, similar to other words that might occur in your vocab file but don't occur in your training data), but there are not higher-order order N-gram involving or in the resulting LM. > > The same is true if you run ngram-count -no-sos -no-eos, so the two ways of building the LM are consistent in this regard. > > Presently, -no-sos -no-eos just affect the way ngrams are generated from text. After counts are extracted, they don't affect any part of the LM building process. It might make sense for these options to also modify the default vocab membership or and . Having the tags in the vocab without N-grams should be fine for most LM uses, but I can see an argument for removing them. Is that the behavior you are looking for? > > Andreas > > >> >> I also tried make-batch-counts file_list | xargs -no-eos -no-sos, with >> the same results. >> >> Removing '\n' from the text files resulted in "line 1: line too long". >> >> I tried ngram-count with -no-eos -no-sos on one of the files and it >> worked, but on a batch it didn't seem to work. >> >> Any ideas on what I should try next ? >> >> Thanks >> -- >> Alexandru Tomescu, undergraduate Computer Science student at >> Polytechnic University of Bucharest >> _______________________________________________ >> SRILM-User site list >> SRILM-User at speech.sri.com >> http://www.speech.sri.com/mailman/listinfo/srilm-user > > -- Alexandru Tomescu, undergraduate Computer Science student at Polytechnic University of Bucharest From stolcke at icsi.berkeley.edu Sun Jul 29 08:55:34 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sun, 29 Jul 2012 08:55:34 -0700 Subject: [SRILM User List] Fwd: Batch no-sos and no-eos In-Reply-To: References: <5014174D.1080701@icsi.berkeley.edu> Message-ID: <50155CF6.4080901@icsi.berkeley.edu> On 7/29/2012 3:46 AM, Alex Tomescu wrote: > Hello, > >> I don't see this behavior. With make-big-lm -no-sos -no-eos it's true that and appear in the unigram section of the LM (they are still part of the vocabulary, similar to other words that might occur in your vocab file but don't occur in your training data), but there are not higher-order order N-gram involving or in the resulting LM. 
> > These are the exact parameters I passed to make-big-lm, and still I > looked through the LM and there are ngrams containing > ("-0.0009011862 ") > > make-big-lm -name biglm -read merge-iter9-1.ngrams.gz -lm gut.lm > -no-eos -no-sos -prune 1e-8 -vocab ../gut.vocab -limit-vocab That just means that those ngrams are in the input count file (merge-iter9-1.ngrams.gz). You need to also include -no-eos -no-sos when generating the counts (e.g., with make-batch-counts or directly with ngram-count). Andreas From shahramk at gmail.com Mon Jul 30 21:46:17 2012 From: shahramk at gmail.com (Shahram) Date: Tue, 31 Jul 2012 14:46:17 +1000 Subject: [SRILM User List] installation problem Message-ID: Hi, I have a problem installing SRILM on my linux machine. When I install it with "NO-TCL=X" it works fine, however it seems it does not install ngram and ngram-count. I have tclsh installed on my machine. SRILM installation seems to need ltcl. I actually do not know much about tcl. Are tclsh and ltcl the same? If so, how can I make the SRILM installation use tclsh instead of ltcl? -- --- Regards Shahram -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Tue Jul 31 11:15:31 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 31 Jul 2012 11:15:31 -0700 Subject: [SRILM User List] installation problem In-Reply-To: References: Message-ID: <501820C3.4040103@icsi.berkeley.edu> On 7/30/2012 9:46 PM, Shahram wrote: > Hi, > > I have a problem installing SRILM on my linux machine. > When I install it with "NO-TCL=X" it works fine, however it seems it > does not install ngram and ngram-count. > I have tclsh installed on my machine. SRILM installation seems to need > ltcl. I actually do not know much about tcl. Are tclsh and ltcl the same? To remove the dependency on -ltcl you also need to set the variable TCL_LIBRARY= (to empty) . > If so, how can I make the SRILM installation use tclsh instead of ltcl? tclsh and -ltcl are for different purposes. One is a command shell, the other a library you link your programs with . However, if tclsh is installed on your system then chances are that somewhere in /usr/lib there is a version of -ltcl . Try ls /usr/lib/libtcl*.so and if you see something like /usr/lib/libtcl8.4.so then set TCL_LIBRARY=-ltcl8.4 (and leave NO_TCL= empty). Andreas From chenmengdx at gmail.com Thu Aug 2 02:30:56 2012 From: chenmengdx at gmail.com (Meng Chen) Date: Thu, 2 Aug 2012 17:30:56 +0800 Subject: [SRILM User List] Why modified Kneser-Ney much slower than Good-Turing using make-big-lm? Message-ID: Hi, I am training LM using *make-batch-counts*, *merge-batch-counts* and * make-big-lm*. I compared the modified Kneser-Ney and Good-Turing smoothing algorithm in *make-big-lm*, and found that the training speed is much slower by modified Kneser-Ney. I checked the debug information, and found that it run *make-kn-counts* and *merge-batch-counts*, which cost most of the time. I wonder if the extra two steps could run in *make-batch-counts*, so it could save much time. Thanks! Meng CHEN -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Thu Aug 2 09:40:30 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 02 Aug 2012 09:40:30 -0700 Subject: [SRILM User List] Why modified Kneser-Ney much slower than Good-Turing using make-big-lm? 
In-Reply-To: References: Message-ID: <501AAD7E.2070708@icsi.berkeley.edu> On 8/2/2012 2:30 AM, Meng Chen wrote: > Hi, I am training LM using *make-batch-counts*, *merge-batch-counts* > and *make-big-lm*. I compared the modified Kneser-Ney and Good-Turing > smoothing algorithm in *make-big-lm*, and found that the training > speed is much slower by modified Kneser-Ney. I checked the debug > information, and found that it run *make-kn-counts* and > *merge-batch-counts*, which cost most of the time. I wonder if the > extra two steps could run in *make-batch-counts*, so it could save > much time. KN is slower because it has to first compute the regular ngram counts, then, in a second pass, make-kn-counts, which takes the merged ngram counts as input. Because the counts have to be merged first (you are counting the ngram types, not the token frequencies) you need to do it in this order. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From chenmengdx at gmail.com Fri Aug 3 03:18:37 2012 From: chenmengdx at gmail.com (Meng Chen) Date: Fri, 3 Aug 2012 18:18:37 +0800 Subject: [SRILM User List] What's the limitation to memory in make-batch-counts ? Message-ID: Hi, in *make-batch-counts*, we need to set the batch-size in order to count faster. it says "For maximum performance, batch-size should be as large as possible without triggering paging". However, sometimes I found it would crash if I set it too large (eg. 500). So I want to ask if there is any limitation to batch-size. Suppose every text in file list is *a* MB, the memory of server is *b* MB,the batch-size should not be larger than *b/a*, is it right? Or some other limitations? Thanks! Meng CHEN -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Fri Aug 3 16:39:50 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 03 Aug 2012 16:39:50 -0700 Subject: [SRILM User List] What's the limitation to memory in make-batch-counts ? In-Reply-To: References: Message-ID: <501C6146.5040506@icsi.berkeley.edu> On 8/3/2012 3:18 AM, Meng Chen wrote: > Hi, in *make-batch-counts*, we need to set the batch-size in order to > count faster. it says "For maximum performance, batch-size should be > as large as possible without triggering paging". However, sometimes I > found it would crash if I set it too large (eg. 500). So I want to ask > if there is any limitation to batch-size. Suppose every text in file > list is *a* MB, the memory of server is *b* MB,the batch-size should > not be larger than *b/a*, is it right? Or some other limitations? make-batch-counts actually works sequentially, so you can devote all of a machine's memory to computing counts, unless you have other things running. If you want to parallelize the counting you have to devise your own method for that. Of course in general there other things running on a machine, and some systems start randomly killing processes when you exhaust their memory. I suspect that's what is happening in your case. There is no built-in limitation in make-batch-counts, other than the limits imposed by the system. Another reason your job might have crashed is that you are using 32bit binaries and you were hitting against the 2 or 4 GB limit inherent in 32bit memory addresses. Andreas -------------- next part -------------- An HTML attachment was scrubbed... 
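For reference, the batch-counting pipeline discussed in this thread looks roughly as follows. This is only a sketch: filelist.txt, counts and big.lm are placeholder names, the batch size of 50 and the cat filter are arbitrary choices, and the exact arguments are described in the SRILM training-scripts man page.

  make-batch-counts filelist.txt 50 cat counts -order 3
  merge-batch-counts counts
  make-big-lm -read counts/MERGED-COUNTS.gz -name biglm -order 3 -kndiscount -lm big.lm

merge-batch-counts reports the name of the final merged count file; that file (MERGED-COUNTS.gz above is just a stand-in) is what gets passed to make-big-lm -read.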
URL: From saraelec at yahoo.com Sun Aug 5 16:02:17 2012 From: saraelec at yahoo.com (sara) Date: Sun, 5 Aug 2012 16:02:17 -0700 (PDT) Subject: [SRILM User List] ngram command not found Message-ID: <1344207737.29305.YahooMailClassic@web162301.mail.bf1.yahoo.com> Hi, I complied SRILM in Linux and? got this message : "ngram command not found" . Please help me why i got this error and what I shoud do? Thanks, Sara -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Sun Aug 5 17:47:39 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sun, 05 Aug 2012 17:47:39 -0700 Subject: [SRILM User List] ngram command not found In-Reply-To: <1344207737.29305.YahooMailClassic@web162301.mail.bf1.yahoo.com> References: <1344207737.29305.YahooMailClassic@web162301.mail.bf1.yahoo.com> Message-ID: <501F142B.2010401@icsi.berkeley.edu> On 8/5/2012 4:02 PM, sara wrote: > Hi, > > I complied SRILM in Linux and got this message : "ngram command not > found" . Please help me why i got this error and what I shoud do? > Go through the first question in the FAQ item (A1) and check each possible problem described there. http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From saraelec at yahoo.com Mon Aug 6 12:13:51 2012 From: saraelec at yahoo.com (sara) Date: Mon, 6 Aug 2012 12:13:51 -0700 (PDT) Subject: [SRILM User List] complie on 32-bit system Message-ID: <1344280431.50069.YahooMailClassic@web162301.mail.bf1.yahoo.com> Hi, How Can I compile SRILM on 32-bit system? Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From saraelec at yahoo.com Mon Aug 6 17:35:23 2012 From: saraelec at yahoo.com (sara) Date: Mon, 6 Aug 2012 17:35:23 -0700 (PDT) Subject: [SRILM User List] clmain.cc:8:17: error Message-ID: <1344299723.30666.YahooMailClassic@web162303.mail.bf1.yahoo.com> Hi I compile SRILM in Linux but I got these errors: tclmain.cc:8:17: error: tcl.h: No such file or directory make[2]: *** [../obj/i686/tclmain.o] Error 1 make[2]: Leaving directory `/root/Desktop/srilm/misc/src' make[1]: *** [release-libraries] Error 1 make[1]: Leaving directory `/root/Desktop/srilm' make: *** [World] Error 2 Please help! -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Mon Aug 6 21:10:19 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Mon, 06 Aug 2012 21:10:19 -0700 Subject: [SRILM User List] clmain.cc:8:17: error In-Reply-To: <1344299723.30666.YahooMailClassic@web162303.mail.bf1.yahoo.com> References: <1344299723.30666.YahooMailClassic@web162303.mail.bf1.yahoo.com> Message-ID: <5020952B.4010508@icsi.berkeley.edu> On 8/6/2012 5:35 PM, sara wrote: > Hi I compile SRILM in Linux but I got these errors: > > tclmain.cc:8:17: error: tcl.h: No such file or directory > make[2]: *** [../obj/i686/tclmain.o] Error 1 > make[2]: Leaving directory `/root/Desktop/srilm/misc/src' > make[1]: *** [release-libraries] Error 1 > make[1]: Leaving directory `/root/Desktop/srilm' > make: *** [World] Error 2 > Look in the FAQ file http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html and search for "tcl" to find your answer. Andreas -------------- next part -------------- An HTML attachment was scrubbed... 
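For the tcl.h and -ltcl problems in this thread, the workaround pointed to by the FAQ is, roughly, to disable Tcl support before building. The following is only a sketch assembled from the options mentioned in these messages (adjust the machine type to your own platform): edit common/Makefile.machine.<machine-type> and set

  NO_TCL = X
  TCL_INCLUDE =
  TCL_LIBRARY =

then rebuild from the top-level SRILM directory with "make World".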
URL: From ayse.serbetci at hotmail.com Mon Aug 6 23:53:26 2012 From: ayse.serbetci at hotmail.com (=?windows-1254?B?QXn+ZSDeZXJiZXTnaQ==?=) Date: Tue, 7 Aug 2012 09:53:26 +0300 Subject: [SRILM User List] build problem, nothing under bin directory In-Reply-To: References: Message-ID: Hi, I am trying to build SRILM on cygwin installed on a Windows 7 environment. When I run the makefile I obtain .h and .cc files under /include, .a files under /lib/cygwin but nothing under /bin. My gcc version : 4.5.3 Machine type : CYGWIN_NT-6.1 Make file output is as follows. Any help is really appreciated. Thanks in advance, -- Ayse mkdir -p include lib bin make init make[1]: Entering directory `/home/aserbetci/srilm' for subdir in misc dstruct lm flm lattice utils; do \ (cd $subdir/src; make SRILM=/home/aserbetci/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= init) || exit 1; \ done make[2]: Entering directory `/home/aserbetci/srilm/misc/src' cd ..; /home/aserbetci/srilm/sbin/make-standard-directories make ../obj/cygwin/STAMP ../bin/cygwin/STAMP make[3]: Entering directory `/home/aserbetci/srilm/misc/src' make[3]: `../obj/cygwin/STAMP' is up to date. make[3]: `../bin/cygwin/STAMP' is up to date. make[3]: Leaving directory `/home/aserbetci/srilm/misc/src' make[2]: Leaving directory `/home/aserbetci/srilm/misc/src' make[2]: Entering directory `/home/aserbetci/srilm/dstruct/src' cd ..; /home/aserbetci/srilm/sbin/make-standard-directories make ../obj/cygwin/STAMP ../bin/cygwin/STAMP make[3]: Entering directory `/home/aserbetci/srilm/dstruct/src' make[3]: `../obj/cygwin/STAMP' is up to date. make[3]: `../bin/cygwin/STAMP' is up to date. make[3]: Leaving directory `/home/aserbetci/srilm/dstruct/src' make[2]: Leaving directory `/home/aserbetci/srilm/dstruct/src' make[2]: Entering directory `/home/aserbetci/srilm/lm/src' cd ..; /home/aserbetci/srilm/sbin/make-standard-directories make ../obj/cygwin/STAMP ../bin/cygwin/STAMP make[3]: Entering directory `/home/aserbetci/srilm/lm/src' make[3]: `../obj/cygwin/STAMP' is up to date. make[3]: `../bin/cygwin/STAMP' is up to date. make[3]: Leaving directory `/home/aserbetci/srilm/lm/src' make[2]: Leaving directory `/home/aserbetci/srilm/lm/src' make[2]: Entering directory `/home/aserbetci/srilm/flm/src' cd ..; /home/aserbetci/srilm/sbin/make-standard-directories make ../obj/cygwin/STAMP ../bin/cygwin/STAMP make[3]: Entering directory `/home/aserbetci/srilm/flm/src' make[3]: `../obj/cygwin/STAMP' is up to date. make[3]: `../bin/cygwin/STAMP' is up to date. make[3]: Leaving directory `/home/aserbetci/srilm/flm/src' make[2]: Leaving directory `/home/aserbetci/srilm/flm/src' make[2]: Entering directory `/home/aserbetci/srilm/lattice/src' cd ..; /home/aserbetci/srilm/sbin/make-standard-directories make ../obj/cygwin/STAMP ../bin/cygwin/STAMP make[3]: Entering directory `/home/aserbetci/srilm/lattice/src' make[3]: `../obj/cygwin/STAMP' is up to date. make[3]: `../bin/cygwin/STAMP' is up to date. make[3]: Leaving directory `/home/aserbetci/srilm/lattice/src' make[2]: Leaving directory `/home/aserbetci/srilm/lattice/src' make[2]: Entering directory `/home/aserbetci/srilm/utils/src' cd ..; /home/aserbetci/srilm/sbin/make-standard-directories make ../obj/cygwin/STAMP ../bin/cygwin/STAMP make[3]: Entering directory `/home/aserbetci/srilm/utils/src' make[3]: `../obj/cygwin/STAMP' is up to date. make[3]: `../bin/cygwin/STAMP' is up to date. 
make[3]: Leaving directory `/home/aserbetci/srilm/utils/src' make[2]: Leaving directory `/home/aserbetci/srilm/utils/src' make[1]: Leaving directory `/home/aserbetci/srilm' make release-headers make[1]: Entering directory `/home/aserbetci/srilm' for subdir in misc dstruct lm flm lattice utils; do \ (cd $subdir/src; make SRILM=/home/aserbetci/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= release-headers) || exit 1; \ done make[2]: Entering directory `/home/aserbetci/srilm/misc/src' make[2]: Nothing to be done for `release-headers'. make[2]: Leaving directory `/home/aserbetci/srilm/misc/src' make[2]: Entering directory `/home/aserbetci/srilm/dstruct/src' make[2]: Nothing to be done for `release-headers'. make[2]: Leaving directory `/home/aserbetci/srilm/dstruct/src' make[2]: Entering directory `/home/aserbetci/srilm/lm/src' make[2]: Nothing to be done for `release-headers'. make[2]: Leaving directory `/home/aserbetci/srilm/lm/src' make[2]: Entering directory `/home/aserbetci/srilm/flm/src' make[2]: Nothing to be done for `release-headers'. make[2]: Leaving directory `/home/aserbetci/srilm/flm/src' make[2]: Entering directory `/home/aserbetci/srilm/lattice/src' make[2]: Nothing to be done for `release-headers'. make[2]: Leaving directory `/home/aserbetci/srilm/lattice/src' make[2]: Entering directory `/home/aserbetci/srilm/utils/src' make[2]: Nothing to be done for `release-headers'. make[2]: Leaving directory `/home/aserbetci/srilm/utils/src' make[1]: Leaving directory `/home/aserbetci/srilm' make depend make[1]: Entering directory `/home/aserbetci/srilm' for subdir in misc dstruct lm flm lattice utils; do \ (cd $subdir/src; make SRILM=/home/aserbetci/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= depend) || exit 1; \ done make[2]: Entering directory `/home/aserbetci/srilm/misc/src' rm -f Dependencies.cygwin gcc -Wall -Wno-unused-variable -Wno-uninitialized -Wimplicit-int -I. -I../../include -MM ./option.c ./zio.c ./fcheck.c ./fake-rand48.c ./version.c ./ztest.c | sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -MM ./Debug.cc ./File.cc ./MStringTokUtil.cc ./tclmain.cc ./testFile.cc | sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin /home/aserbetci/srilm/sbin/generate-program-dependencies ../bin/cygwin ../obj/cygwin ".exe" ztest testFile | sed -e "s&\.o&.o&g" >> Dependencies.cygwin make[2]: Leaving directory `/home/aserbetci/srilm/misc/src' make[2]: Entering directory `/home/aserbetci/srilm/dstruct/src' rm -f Dependencies.cygwin gcc -Wall -Wno-unused-variable -Wno-uninitialized -Wimplicit-int -I. -I../../include -MM ./qsort.c ./BlockMalloc.c ./maxalloc.c | sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. 
-I../../include -MM ./MemStats.cc ./LHashTrie.cc ./SArrayTrie.cc ./Array.cc ./IntervalHeap.cc ./Map.cc ./SArray.cc ./LHash.cc ./Map2.cc ./Trie.cc ./CachedMem.cc ./testArray.cc ./testMap.cc ./benchHash.cc ./testHash.cc ./testSizes.cc ./testCachedMem.cc ./testBlockMalloc.cc ./testMap2.cc ./testTrie.cc | sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin /home/aserbetci/srilm/sbin/generate-program-dependencies ../bin/cygwin ../obj/cygwin ".exe" maxalloc testArray testMap benchHash testHash testSizes testCachedMem testBlockMalloc testMap2 testTrie | sed -e "s&\.o&.o&g" >> Dependencies.cygwin make[2]: Leaving directory `/home/aserbetci/srilm/dstruct/src' make[2]: Entering directory `/home/aserbetci/srilm/lm/src' rm -f Dependencies.cygwin gcc -Wall -Wno-unused-variable -Wno-uninitialized -Wimplicit-int -I. -I../../include -MM ./matherr.c | sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -MM ./Prob.cc ./Counts.cc ./XCount.cc ./Vocab.cc ./VocabMap.cc ./VocabMultiMap.cc ./VocabDistance.cc ./SubVocab.cc ./MultiwordVocab.cc ./TextStats.cc ./LM.cc ./LMClient.cc ./LMStats.cc ./RefList.cc ./Bleu.cc ./NBest.cc ./NBestSet.cc ./NgramLM.cc ./NgramStatsInt.cc ./NgramStatsShort.cc ./NgramStatsLong.cc ./NgramStatsLongLong.cc ./NgramStatsFloat.cc ./NgramStatsDouble.cc ./NgramStatsXCount.cc ./NgramCountLM.cc ./Discount.cc ./ClassNgram.cc ./SimpleClassNgram.cc ./DFNgram.cc ./SkipNgram.cc ./HiddenNgram.cc ./HiddenSNgram.cc ./VarNgram.cc ./DecipherNgram.cc ./TaggedVocab.cc ./TaggedNgram.cc ./TaggedNgramStats.cc ./StopNgram.cc ./StopNgramStats.cc ./MultiwordLM.cc ./NonzeroLM.cc ./BayesMix.cc ./LoglinearMix.cc ./AdaptiveMix.cc ./AdaptiveMarginals.cc ./CacheLM.cc ./DynamicLM.cc ./HMMofNgrams.cc ./WordAlign.cc ./WordLattice.cc ./WordMesh.cc ./simpleTrigram.cc ./NgramStats.cc ./Trellis.cc ./testBinaryCounts.cc ./testHash.cc ./testProb.cc ./testXCount.cc ./testParseFloat.cc ./testVocabDistance.cc ./testNgram.cc ./testNgramAlloc.cc ./testMultiReadLM.cc ./hoeffding.cc ./tolower.cc ./testLattice.cc ./testError.cc ./testNBest.cc ./testMix.cc ./testTaggedVocab.cc ./testVocab.cc ./ngram.cc ./ngram-count.cc ./ngram-merge.cc ./ngram-class.cc ./disambig.cc ./anti-ngram.cc ./nbest-lattice.cc ./nbest-mix.cc ./nbest-optimize.cc ./nbest-pron-score.cc ./segment.cc ./segment-nbest.cc ./hidden-ngram.cc ./multi-ngram.cc | sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin /home/aserbetci/srilm/sbin/generate-program-dependencies ../bin/cygwin ../obj/cygwin ".exe" testBinaryCounts testHash testProb testXCount testParseFloat testVocabDistance testNgram testNgramAlloc testMultiReadLM hoeffding tolower testLattice testError testNBest testMix testTaggedVocab testVocab ngram ngram-count ngram-merge ngram-class disambig anti-ngram nbest-lattice nbest-mix nbest-optimize nbest-pron-score segment segment-nbest hidden-ngram multi-ngram | sed -e "s&\.o&.o&g" >> Dependencies.cygwin make[2]: Leaving directory `/home/aserbetci/srilm/lm/src' make[2]: Entering directory `/home/aserbetci/srilm/flm/src' rm -f Dependencies.cygwin g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. 
-I../../include -MM ./FDiscount.cc ./FNgramStats.cc ./FNgramStatsInt.cc ./FNgramSpecs.cc ./FNgramSpecsInt.cc ./FactoredVocab.cc ./FNgramLM.cc ./ProductVocab.cc ./ProductNgram.cc ./wmatrix.cc ./pngram.cc ./fngram-count.cc ./fngram.cc | sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin /home/aserbetci/srilm/sbin/generate-program-dependencies ../bin/cygwin ../obj/cygwin ".exe" pngram fngram-count fngram | sed -e "s&\.o&.o&g" >> Dependencies.cygwin make[2]: Leaving directory `/home/aserbetci/srilm/flm/src' make[2]: Entering directory `/home/aserbetci/srilm/lattice/src' rm -f Dependencies.cygwin g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -MM ./Lattice.cc ./LatticeAlign.cc ./LatticeExpand.cc ./LatticeIndex.cc ./LatticeNBest.cc ./LatticeNgrams.cc ./LatticeReduce.cc ./HTKLattice.cc ./LatticeLM.cc ./LatticeDecode.cc ./testLattice.cc ./lattice-tool.cc | sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin /home/aserbetci/srilm/sbin/generate-program-dependencies ../bin/cygwin ../obj/cygwin ".exe" testLattice lattice-tool | sed -e "s&\.o&.o&g" >> Dependencies.cygwin make[2]: Leaving directory `/home/aserbetci/srilm/lattice/src' make[2]: Entering directory `/home/aserbetci/srilm/utils/src' rm -f Dependencies.cygwin /home/aserbetci/srilm/sbin/generate-program-dependencies ../bin/cygwin ../obj/cygwin ".exe" | sed -e "s&\.o&.o&g" >> Dependencies.cygwin make[2]: Leaving directory `/home/aserbetci/srilm/utils/src' make[1]: Leaving directory `/home/aserbetci/srilm' make release-libraries make[1]: Entering directory `/home/aserbetci/srilm' for subdir in misc dstruct lm flm lattice utils; do \ (cd $subdir/src; make SRILM=/home/aserbetci/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= release-libraries) || exit 1; \ done make[2]: Entering directory `/home/aserbetci/srilm/misc/src' make[2]: Nothing to be done for `release-libraries'. make[2]: Leaving directory `/home/aserbetci/srilm/misc/src' make[2]: Entering directory `/home/aserbetci/srilm/dstruct/src' make[2]: Nothing to be done for `release-libraries'. make[2]: Leaving directory `/home/aserbetci/srilm/dstruct/src' make[2]: Entering directory `/home/aserbetci/srilm/lm/src' make[2]: Nothing to be done for `release-libraries'. make[2]: Leaving directory `/home/aserbetci/srilm/lm/src' make[2]: Entering directory `/home/aserbetci/srilm/flm/src' make[2]: Nothing to be done for `release-libraries'. make[2]: Leaving directory `/home/aserbetci/srilm/flm/src' make[2]: Entering directory `/home/aserbetci/srilm/lattice/src' make[2]: Nothing to be done for `release-libraries'. make[2]: Leaving directory `/home/aserbetci/srilm/lattice/src' make[2]: Entering directory `/home/aserbetci/srilm/utils/src' make[2]: Nothing to be done for `release-libraries'. make[2]: Leaving directory `/home/aserbetci/srilm/utils/src' make[1]: Leaving directory `/home/aserbetci/srilm' make release-programs make[1]: Entering directory `/home/aserbetci/srilm' for subdir in misc dstruct lm flm lattice utils; do \ (cd $subdir/src; make SRILM=/home/aserbetci/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= release-programs) || exit 1; \ done make[2]: Entering directory `/home/aserbetci/srilm/misc/src' make[2]: Nothing to be done for `release-programs'. make[2]: Leaving directory `/home/aserbetci/srilm/misc/src' make[2]: Entering directory `/home/aserbetci/srilm/dstruct/src' g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. 
-I../../include -L../../lib/cygwin -g -O2 -o ../bin/cygwin/maxalloc.exe ../obj/cygwin/maxalloc.o ../obj/cygwin/libdstruct.a -lm ../../lib/cygwin/libmisc.a -ltcl84 -lm /usr/lib/gcc/i686-pc-cygwin/4.5.3/../../../../i686-pc-cygwin/bin/ld: cannot find -ltcl84 collect2: ld returned 1 exit status /home/aserbetci/srilm/common/Makefile.common.targets:108: recipe for target `../bin/cygwin/maxalloc.exe' failed make[2]: *** [../bin/cygwin/maxalloc.exe] Error 1 make[2]: Leaving directory `/home/aserbetci/srilm/dstruct/src' Makefile:105: recipe for target `release-programs' failed make[1]: *** [release-programs] Error 1 make[1]: Leaving directory `/home/aserbetci/srilm' Makefile:54: recipe for target `World' failed make: *** [World] Error 2 -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Tue Aug 7 09:31:44 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 07 Aug 2012 09:31:44 -0700 Subject: [SRILM User List] build problem, nothing under bin directory In-Reply-To: References: Message-ID: <502142F0.2070805@icsi.berkeley.edu> On 8/6/2012 11:53 PM, Ay?e ?erbet?i wrote: > > Hi, > > I am trying to build SRILM on cygwin installed on a Windows 7 environment. > > When I run the makefile I obtain .h and .cc files under /include, .a > files under /lib/cygwin but nothing under /bin. > > My gcc version : 4.5.3 > > Machine type : CYGWIN_NT-6.1 > Check the first question and list of remedies in the FAQ file! http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From shinichiro.hamada at gmail.com Tue Aug 7 10:32:43 2012 From: shinichiro.hamada at gmail.com (shinichiro.hamada) Date: Wed, 8 Aug 2012 02:32:43 +0900 Subject: [SRILM User List] WBDiscount backoff weights Message-ID: <545624FF362B4E4F9FEC494EFE119B27@f91> Hi. I did a small test described as below to understand SRILM behavior of WBDiscount backoff weights (bow), and got a question. The values of bows of " context", "context word1", "context word2" (2grams) are set to zero. Why? They are the prefix of " context word1" (or " context word2"), "context word1 ", "context word2 " respetively, so I think they are qualified to have bow values. I read the explanation of WBDiscount and "Warning5" in the ngram-discount manual (*1), but I couln't get it's answer. Any advices will help me very much. Thank you. 
(*1) ngram-discount manual http://www-speech.sri.com/projects/srilm/manpages/ngram-discount.7.html ---------------------------------------------------------------------- $ cat > smp.txt << EOF context word1 context word2 EOF $ ngram-count -order 3 -wbdiscount -text smp.txt -gtmin 0 -gt1min0 -gt2min 0 -gt3min 0 -lm lm.arpa $ cat lm.arpa \data\ ngram 1=5 ngram 2=5 ngram 3=4 \1-grams: -0.5228788 -99 -0.3222193 -0.5228788 context -0.07918124 -0.69897 word1 -0.146128 -0.69897 word2 -0.146128 \2-grams: -0.1760913 context 0 -0.60206 context word1 0 -0.60206 context word2 0 -0.30103 word1 -0.30103 word2 \3-grams: -0.60206 context word1 -0.60206 context word2 -0.30103 context word1 -0.30103 context word2 \end\ -- Shinichiro Hamada From stolcke at icsi.berkeley.edu Tue Aug 7 11:14:55 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 07 Aug 2012 11:14:55 -0700 Subject: [SRILM User List] WBDiscount backoff weights In-Reply-To: <545624FF362B4E4F9FEC494EFE119B27@f91> References: <545624FF362B4E4F9FEC494EFE119B27@f91> Message-ID: <50215B1F.4080403@icsi.berkeley.edu> On 8/7/2012 10:32 AM, shinichiro.hamada wrote: > Hi. > > I did a small test described as below to understand SRILM behavior > of WBDiscount backoff weights (bow), and got a question. > > The values of bows of " context", "context word1", "context > word2" (2grams) are set to zero. Why? > > They are the prefix of " context word1" (or " context word2"), > "context word1 ", "context word2 " respetively, so I think > they are qualified to have bow values. > > I read the explanation of WBDiscount and "Warning5" in the > ngram-discount manual (*1), but I couln't get it's answer. > > Any advices will help me very much. Thank you. Backoff log weight zero (= 1 in the probability domain) means that the bigram probs don't need to be modified when used for backoff purposes. This is because, in your example, the probability mass left over from the explicit trigrams is the same as the probability mass of the corresponding bigrams. And this, in turn, is because your trigrams -0.60206 context word1 -0.60206 context word2 have the same probabilities as the corresponding bigrams: -0.60206 context word1 0 -0.60206 context word2 0 So there is nothing mysterious going on, it just happens to follow from the bigram and trigrams in your data. You will not likely find this situation in realistic data sets. 
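As a minimal sketch of that arithmetic (in Python, assuming the standard ARPA backoff-weight formula bow(h) = (1 - sum of the explicit higher-order probabilities for history h) / (1 - sum of the same words' lower-order probabilities), with the quoted log10 value -0.60206, i.e. 0.25, plugged in):

p_tri = [10 ** -0.60206, 10 ** -0.60206]   # the two explicit trigram probabilities above (0.25 each)
p_bi  = [10 ** -0.60206, 10 ** -0.60206]   # the matching bigram probabilities (0.25 each)
bow = (1 - sum(p_tri)) / (1 - sum(p_bi))   # left-over trigram mass / left-over bigram mass
print(bow)                                 # 1.0, i.e. log10 bow = 0, the "0" printed in the 2-grams section

The same cancellation happens for the other two bigram histories in the question, which is why all three backoff weights come out as 0.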
Andreas > > (*1) ngram-discount manual > http://www-speech.sri.com/projects/srilm/manpages/ngram-discount.7.html > > > ---------------------------------------------------------------------- > $ cat > smp.txt << EOF > context word1 > context word2 > EOF > $ ngram-count -order 3 -wbdiscount -text smp.txt -gtmin 0 -gt1min0 -gt2min 0 > -gt3min 0 -lm lm.arpa > $ cat lm.arpa > > \data\ > ngram 1=5 > ngram 2=5 > ngram 3=4 > > \1-grams: > -0.5228788 > -99 -0.3222193 > -0.5228788 context -0.07918124 > -0.69897 word1 -0.146128 > -0.69897 word2 -0.146128 > > \2-grams: > -0.1760913 context 0 > -0.60206 context word1 0 > -0.60206 context word2 0 > -0.30103 word1 > -0.30103 word2 > > \3-grams: > -0.60206 context word1 > -0.60206 context word2 > -0.30103 context word1 > -0.30103 context word2 > > \end\ > > -- > Shinichiro Hamada > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From chenmengdx at gmail.com Wed Aug 8 03:31:03 2012 From: chenmengdx at gmail.com (Meng Chen) Date: Wed, 8 Aug 2012 18:31:03 +0800 Subject: [SRILM User List] Question about -prune-lowprobs and -text-has-weights Message-ID: Hi, the* -prune-lowprobs* option in* ngram* will "prune N-gram probabilities that are lower than the corresponding backed-off estimates". This option would be useful especially when the back-off-weight (bow) value is positive. However, I want to ask if I could simply replace the positive bow value with 0 instead of using prune-lowprobs. Are there any differences? Or replace simply is not correct? Another question: When training LM, we could use* -text-has-weights* option for the corpus with sentence frequency. I want to ask what we should do with the*duplicated sentences * in large corpus. Should I delete the duplicated sentences? Or should I calculate the sentence frequency first and use the -text-has-weights option instead? Or do nothing, just throw all the corpus into training? Thanks! Meng CHEN -------------- next part -------------- An HTML attachment was scrubbed... URL: From shinichiro.hamada at gmail.com Wed Aug 8 08:00:42 2012 From: shinichiro.hamada at gmail.com (shinichiro.hamada) Date: Thu, 9 Aug 2012 00:00:42 +0900 Subject: [SRILM User List] WBDiscount backoff weights In-Reply-To: <50215B1F.4080403@icsi.berkeley.edu> References: <545624FF362B4E4F9FEC494EFE119B27@f91> <50215B1F.4080403@icsi.berkeley.edu> Message-ID: <42E86E835B3F497FA4E68890F183E435@f91> Dear Mr. Stolcke, I understood very well owning to your detail explanation with concrete examples. Thank you for always being so kind. Best Regards, Shinichiro > -----Original Message----- > From: Andreas Stolcke [mailto:stolcke at icsi.berkeley.edu] > Sent: Wednesday, August 08, 2012 3:15 AM > To: shinichiro.hamada > Cc: srilm-user at speech.sri.com > Subject: Re: [SRILM User List] WBDiscount backoff weights > > On 8/7/2012 10:32 AM, shinichiro.hamada wrote: > > Hi. > > > > I did a small test described as below to understand SRILM > > behavior of WBDiscount backoff weights (bow), and got a question. > > > > The values of bows of " context", "context word1", "context > > word2" (2grams) are set to zero. Why? > > > > They are the prefix of " context word1" (or " context > > word2"), "context word1 ", "context word2 " respetively, > > so I think they are qualified to have bow values. > > > > I read the explanation of WBDiscount and "Warning5" in the > > ngram-discount manual (*1), but I couln't get it's answer. 
> > > > Any advices will help me very much. Thank you. > > Backoff log weight zero (= 1 in the probability domain) means that > the bigram probs don't need to be modified when used for backoff > purposes. > This is because, in your example, the probability mass left over > from the explicit trigrams is the same as the probability mass of > the corresponding bigrams. And this, in turn, is because your > trigrams > > -0.60206 context word1 > -0.60206 context word2 > > have the same probabilities as the corresponding bigrams: > > -0.60206 context word1 0 > -0.60206 context word2 0 > > So there is nothing mysterious going on, it just happens to follow > from the bigram and trigrams in your data. You will not likely > find this situation in realistic data sets. > > Andreas > > > > > > > > (*1) ngram-discount manual > > http://www-speech.sri.com/projects/srilm/manpages/ngram-discount. > > 7.html > > > > > > ----------------------------------------------------------------- > > $ cat > smp.txt << EOF > > context word1 > > context word2 > > EOF > > $ ngram-count -order 3 -wbdiscount -text smp.txt -gtmin 0 > > -gt1min0 -gt2min 0 -gt3min 0 -lm lm.arpa $ cat lm.arpa > > > > \data\ > > ngram 1=5 > > ngram 2=5 > > ngram 3=4 > > > > \1-grams: > > -0.5228788 > > -99 -0.3222193 > > -0.5228788 context -0.07918124 > > -0.69897 word1 -0.146128 > > -0.69897 word2 -0.146128 > > > > \2-grams: > > -0.1760913 context 0 > > -0.60206 context word1 0 > > -0.60206 context word2 0 > > -0.30103 word1 > > -0.30103 word2 > > > > \3-grams: > > -0.60206 context word1 > > -0.60206 context word2 > > -0.30103 context word1 > > -0.30103 context word2 > > > > \end\ > > > > -- > > Shinichiro Hamada From stolcke at icsi.berkeley.edu Wed Aug 8 11:57:27 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 08 Aug 2012 11:57:27 -0700 Subject: [SRILM User List] Question about -prune-lowprobs and -text-has-weights In-Reply-To: References: Message-ID: <5022B697.1080908@icsi.berkeley.edu> On 8/8/2012 3:31 AM, Meng Chen wrote: > Hi, the*-prune-lowprobs* option in*ngram* will "prune N-gram > probabilities that are lower than the corresponding backed-off > estimates". This option would be useful especially when the > back-off-weight (bow) value is positive. However, I want to ask if I > could simply replace the positive bow value with 0 instead of using > prune-lowprobs. Are there any differences? Or replace simply is not > correct? It's not correct. If you modify the backoff weight you end up with an LM that is no longer normalized (word probs for a given context don't sum to 1). > > Another question: > When training LM, we could use*-text-has-weights* option for the > corpus with sentence frequency. I want to ask what we should do with > the*duplicated sentences* in large corpus. Should I delete the > duplicated sentences? Or should I calculate the sentence frequency > first and use the -text-has-weights option instead? Or do nothing, > just throw all the corpus into training? You can do either. Have a duplicated sentence 1.0 a b c 1.0 a b c is equivalent to having the sentence once with added weights: 2.0 a b c Andreas -------------- next part -------------- An HTML attachment was scrubbed... 
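The normalization argument can be made concrete with a small Python sketch, assuming the standard ARPA backoff scheme p(w|h) = p_explicit(w|h) if the n-gram is listed, and bow(h) * p(w|h') otherwise; the two mass values below are made-up numbers chosen only for illustration:

# For every history h:  sum_listed p(w|h) + bow(h) * (1 - sum_listed p(w|h')) = 1
explicit_mass = 0.30   # hypothetical sum of the listed p(w|h)
lower_mass    = 0.50   # hypothetical sum of the same words' lower-order p(w|h')
bow = (1 - explicit_mass) / (1 - lower_mass)      # 1.4, i.e. a positive log10 backoff weight
print(explicit_mass + bow * (1 - lower_mass))     # 1.0: properly normalized
print(explicit_mass + 1.0 * (1 - lower_mass))     # 0.8: mass is lost if bow is simply forced to 1

-prune-lowprobs avoids this by removing the explicit entries that fall below their backed-off estimates and then, as far as I understand, recomputing the backoff weights so the sums return to 1.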
URL: From stolcke at icsi.berkeley.edu Wed Aug 8 22:09:35 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 08 Aug 2012 22:09:35 -0700 Subject: [SRILM User List] Predicting words In-Reply-To: <1342785895.7010.YahooMailNeo@web171304.mail.ir2.yahoo.com> References: <1342785895.7010.YahooMailNeo@web171304.mail.ir2.yahoo.com> Message-ID: <5023460F.5050301@icsi.berkeley.edu> On 7/20/2012 5:04 AM, Nouf Al-Harbi wrote: > Hello, > > I am new to language modeling and was hoping that someone can help me > with the following. > > I try to predict a word given an input sentence. For example, I would > like to get a word replacing the ... that has the > highest probability in sentences such as ' A man is ...' (e.g. sitting). > > I try to use disambig tool but I couldn't found any example illustrate > how to use it especially how how I can create the map file and what is > the type of this file ( e.g. txt, arpa, ...). Indeed you can use disambig, at least in theory to solve this problem. 1. prepare a map file of the form: a a man man ... [for all words occurring in your data] UNKNOWN_WORD word1 word2 .... [list all words in the vocabulary here] 2. train an LM of word sequences. 3. prepare disambig input of the form a man is sitting UNKNOWN_WORD You can also add known words to the right of UKNOWN_WORD if you have that information (see the note about -fw-only below). 4. run disambig disambig -map MAPFILE -lm LMFILE -text INPUTFILE If you want to use only the left context of the UNKNOWN_WORD use the -fw-only option. This is in theory. If your vocabulary is large it may be very slow and take too much memory. I haven't tried it, so let me know if it works for you. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From chenmengdx at gmail.com Thu Aug 16 04:07:13 2012 From: chenmengdx at gmail.com (Meng Chen) Date: Thu, 16 Aug 2012 19:07:13 +0800 Subject: [SRILM User List] How to interpolate big LMs? Message-ID: Hi, suppose I have trained three big LMs: LM1 LM2 and LM3, each of which has more than billions of ngrams. I wonder to know how to interpolate such big LMs together. I found that the ngram command in SRILM would load all the LMs in memory firstly, so it will reach the limitation of server's memory. In such situation, how can I get the interpolation of big LMs? Another question about training LM with large corpus. There are two methods: 1) I can pool all data to train a big LM0. 2) I can split the data into several parts, and train small LMs (eg. LM1 and LM2). Then interpolate them with average weight (eg. 0.5 X LM1 + 0.5 X LM2 ) to get the final LM3. All the cut-offs and smoothing algorithm are the same for both methods. So does LM3 the same with LM0? Thanks! Meng CHRN -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Thu Aug 16 11:06:55 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 16 Aug 2012 11:06:55 -0700 Subject: [SRILM User List] How to interpolate big LMs? In-Reply-To: References: Message-ID: <502D36BF.6080804@icsi.berkeley.edu> On 8/16/2012 4:07 AM, Meng Chen wrote: > Hi, suppose I have trained three big LMs: LM1 LM2 and LM3, each of > which has more than billions of ngrams. I wonder to know how to > interpolate such big LMs together. I found that the ngram command in > SRILM would load all the LMs in memory firstly, so it will reach the > limitation of server's memory. In such situation, how can I get the > interpolation of big LMs? 
> > Another question about training LM with large corpus. There are two > methods: > 1) I can pool all data to train a big LM0. > 2) I can split the data into several parts, and train small LMs (eg. > LM1 and LM2). Then interpolate them with average weight (eg. 0.5 X LM1 > + 0.5 X LM2 ) to get the final LM3. > All the cut-offs and smoothing algorithm are the same for both > methods. So does LM3 the same with LM0? > > I'm assuming you are merging ngram LMs into one big LM (-mix-lm etc. WITHOUT the -bayes option). In that case the LMs are merged destructively into the first LM, one by one. This means at any given time only the partially merged LM and the next LM to be merged are kept in memory. So when you're running ngram -lm LM1 -mix-lm LM2 -mix-lm2 LM3 it is NOT the case that LM1, LM2 and LM3 are in memory at the same time. Instead, the result of merging LM1 and LM2, plus LM3 need to fit into memory. Of course, depending on how much overlap in ngrams there is, that might be almost the same in terms of total memory. Try building your binaries with OPTION=_c (compact memory). Also, try using the latest beta version off the web site. It contains an optimized memory allocator that leads to significant memory savings. Finally, if all else fails, prune your large component LMs prior to merging. Andreas From stolcke at icsi.berkeley.edu Thu Aug 16 11:30:28 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 16 Aug 2012 11:30:28 -0700 Subject: [SRILM User List] How to interpolate big LMs? In-Reply-To: <502D36BF.6080804@icsi.berkeley.edu> References: <502D36BF.6080804@icsi.berkeley.edu> Message-ID: <502D3C44.3090903@icsi.berkeley.edu> On 8/16/2012 11:06 AM, Andreas Stolcke wrote: > > > Try building your binaries with OPTION=_c (compact memory). Also, try > using the latest beta version off the web site. It contains an > optimized memory allocator that leads to significant memory savings. > Finally, if all else fails, prune your large component LMs prior to > merging. Correction: the improved memory allocator is already in the 1.6.0 release, which is the current stable release. But do make sure you have that version, and not some older one. Andreas From kcananda at gmail.com Thu Aug 16 21:11:52 2012 From: kcananda at gmail.com (Ananda K.C.) Date: Fri, 17 Aug 2012 09:56:52 +0545 Subject: [SRILM User List] (no subject) Message-ID: Dear all, I am doing dissertation of my Master's degree in computer science.I want to calculate the bigram and trigram probability table as in attachment.,from back off N-gram language models in ARPA format. Also when i use this command "ngram-count -order 3 -read /home/ananda/Desktop/work/countoutput.txt -vocab /home/ananda/Desktop/work/corpusvocab.txt -lm /home/ananda/Desktop/work/anandamodeling",which discounting is use for backoff smothing. I am new in the language modeling and thanks in advance. Regards, Ananda K.C. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: trigram probability table.jpg Type: image/jpeg Size: 19638 bytes Desc: not available URL: From kcananda at gmail.com Thu Aug 16 21:14:05 2012 From: kcananda at gmail.com (Ananda K.C.) 
Date: Fri, 17 Aug 2012 09:59:05 +0545 Subject: [SRILM User List] bigram and trigram probability table Message-ID: Dear all, I am doing dissertation of my Master's degree in computer science.I want to calculate the bigram and trigram probability table as in attachment,from back off N-gram language models in ARPA format. Also when i use this command "ngram-count -order 3 -read /home/ananda/Desktop/work/countoutput.txt -vocab /home/ananda/Desktop/work/corpusvocab.txt -lm /home/ananda/Desktop/work/anandamodeling",which discounting is use for backoff smothing. I am new in the language modeling and thanks in advance. Regards, Ananda K.C. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: trigram probability table.jpg Type: image/jpeg Size: 19638 bytes Desc: not available URL: From shahramk at gmail.com Thu Aug 16 21:24:48 2012 From: shahramk at gmail.com (Shahram) Date: Fri, 17 Aug 2012 14:24:48 +1000 Subject: [SRILM User List] Topic Dependent Audio Date set Message-ID: Hi all, I am looking for an audio data set for my thesis in the area of topic dependent spoken term detection and I need to create topic dependent language models. Does any one know any audio data set with its textual transcription which is tagged by topics? Topics should preferably categorized as Politics, Sports, ... -- --- Regards Shahram Kalantari -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Fri Aug 17 11:09:11 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 17 Aug 2012 11:09:11 -0700 Subject: [SRILM User List] bigram and trigram probability table In-Reply-To: References: Message-ID: <502E88C7.9010708@icsi.berkeley.edu> Ananda, the easiest way to have the toolkit compute your bigram and trigram probabilities once you have the model trained is: ngram -lm /home/ananda/Desktop/work/anandamodeling -debug 2 -counts NGRAMS where NGRAMS is a file you prepare that lists all the bigrams and trigrams you need, followed by a "1". For example: i i 1 i want 1 i to 1 want want 1 to to 1 etc. Andreas On 8/16/2012 9:14 PM, Ananda K.C. wrote: > Dear all, > > I am doing dissertation of my Master's degree in computer science.I > want to calculate the bigram and trigram probability table as in > attachment,from back off N-gram language models in ARPA format. > > Also when i use this command "ngram-count -order 3 -read > /home/ananda/Desktop/work/countoutput.txt -vocab > /home/ananda/Desktop/work/corpusvocab.txt -lm > /home/ananda/Desktop/work/anandamodeling",which discounting is use for > backoff smothing. > > I am new in the language modeling and thanks in advance. > > > Regards, > Ananda K.C. > > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From kcananda at gmail.com Mon Aug 20 20:08:23 2012 From: kcananda at gmail.com (Ananda K.C.) Date: Tue, 21 Aug 2012 08:53:23 +0545 Subject: [SRILM User List] bigram and trigram probability table In-Reply-To: <502E88C7.9010708@icsi.berkeley.edu> References: <502E88C7.9010708@icsi.berkeley.edu> Message-ID: Dear Andreas, Thanks for your help. 
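As a rough sketch of how the output of the ngram command quoted below could be turned into an actual probability table (in Python; probs.txt is an illustrative name for a file obtained by redirecting the ngram output, and the regular expression assumes probability lines of the form "p( He | <s> ) = [2gram] 0.0348584 [ -1.45769 ]" as printed by ngram -debug 2):

import re
from collections import defaultdict

# e.g.  ngram -lm anandamodeling -debug 2 -counts NGRAMS > probs.txt
prob_line = re.compile(r'p\( (\S+) \| (.*?)\)\s*=\s*\[\dgram\] (\S+)')

table = defaultdict(dict)          # table[history][word] = probability
with open('probs.txt') as f:
    for line in f:
        m = prob_line.search(line)
        if m:
            word, history, prob = m.group(1), m.group(2).strip(), float(m.group(3))
            # note: for longer histories ngram may abbreviate the context with "..."
            table[history][word] = prob

for history in sorted(table):
    print(history, table[history])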
On Fri, Aug 17, 2012 at 11:54 PM, Andreas Stolcke wrote: > Ananda, > > the easiest way to have the toolkit compute your bigram and trigram > probabilities once you have the model trained is: > > ngram -lm /home/ananda/Desktop/work/anandamodeling -debug 2 -counts NGRAMS > > where NGRAMS is a file you prepare that lists all the bigrams and trigrams > you need, followed by a "1". > For example: > > i i 1 > i want 1 > i to 1 > want want 1 > to to 1 > etc. > > Andreas > > > > On 8/16/2012 9:14 PM, Ananda K.C. wrote: > > Dear all, > > I am doing dissertation of my Master's degree in computer science.I > want to calculate the bigram and trigram probability table as in > attachment,from back off N-gram language models in ARPA format. > > Also when i use this command "ngram-count -order 3 -read > /home/ananda/Desktop/work/countoutput.txt -vocab > /home/ananda/Desktop/work/corpusvocab.txt -lm > /home/ananda/Desktop/work/anandamodeling",which discounting is use for > backoff smothing. > > I am new in the language modeling and thanks in advance. > > > Regards, > Ananda K.C. > > > > _______________________________________________ > SRILM-User site listSRILM-User at speech.sri.comhttp://www.speech.sri.com/mailman/listinfo/srilm-user > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lluis.formiga at upc.edu Tue Aug 21 14:43:05 2012 From: lluis.formiga at upc.edu (=?iso-8859-1?Q?Llu=EDs_Formiga_i_Fanals?=) Date: Tue, 21 Aug 2012 23:43:05 +0200 Subject: [SRILM User List] Does keep-unk work with lattice-tool and htk format? In-Reply-To: <4FBC29CD.3010601@icsi.berkeley.edu> References: <4FB9B671.9080604@mit.edu> <4FBA8AE3.2070709@icsi.berkeley.edu> <9D363DB8-11DF-4534-AEB5-058E96E3A74C@upc.edu> <4FBC29CD.3010601@icsi.berkeley.edu> Message-ID: <7507354C-E6F2-4F91-8B9D-D75B1118023D@upc.edu> Hi Andreas, Sorry to bother you with this old issue. The two-step lattice-tool process worked perfectly. First the rescoring and second the conversion to CN. But, unfortunately I have seen a few unks while rescoring the lattice (not as many as writing the mesh). The command I use to rescore is: lattice-tool -lm ../../lm/interpolated-lm.en -in-lattice wordlattice0.slf -read-htk -out-lattice out.slf -write-htk -keep-unk -print-sent-tags -htk-logbase 2.71828 And I find lines like these: (Whithin these lines the tag should be queit) J=26 S=19 E=24 W=qu a=0 l=-13.8261 J=27 S=19 E=25 W=que a=0 l=-11.4986 J=28 S=19 E=26 W= a=0 l=-2.76367 J=29 S=19 E=27 W=quest a=0 l=-10.831 J=30 S=19 E=28 W=quiet a=0 l=-10.57 J=31 S=19 E=29 W=quit a=0 l=-10.4455 J=32 S=20 E=21 W=row a=0 l=-10.1076 J=33 S=21 E=24 W=qu a=0 l=-14.9448 J=34 S=21 E=25 W=que a=0 l=-12.6173 J=35 S=21 E=26 W= a=0 l=-3.88236 J=36 S=21 E=27 W=quest a=0 l=-11.9497 J=37 S=21 E=28 W=quiet a=0 l=-11.6887 J=38 S=21 E=29 W=quit a=0 l=-11.0153 J=39 S=22 E=19 W=arrow a=0 l=-12.6258 I have to say that I use the rescoring to give probabilities to the archs from misspelling corrections. So I do not have any acoustic scores. (I set all them equal). Regards, Llu?s El 23/05/2012, a les 2:05, Andreas Stolcke va escriure: > On 5/22/2012 10:56 AM, Llu?s Formiga i Fanals wrote: >> >> Hi, >> >> I was trying to execute the following command: >> >> >> lattice-tool -in-lattice-list lattice_lists.txt -read-htk -lm >> /veu4/usuaris24/lluisf/EMS/misspelling2012/lm/interpolated-lm.en >> -write-mesh-dir out -keep-unk >> >> but I find that unks ("") are still on the written CN (-write-mesh). >> >> Does -keep-unk option work only for lattices output? 
Am I doing something wrong? > No, the code is working as intended. > > The option is described as > -keep-unk > Treat out-of-vocabulary words as but preserve their labels in lattice output. > > What you are outputting is confusion networks, not lattices. In the CN building process, lattice nodes that are mapped to are treated as equivalent, and the word information is lost in the process. > > I would suggest that you simple do your lattice rescoring with -keep-unk, output the rescored lattices, and then run lattice-tool a second time without -keep-unk and without the -vocab option, so all word labels are preserved (all words are implicitly added to the vocabulary). > > Andreas > > >> >> Thanks, >> >> Llu?s >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.jpg Type: image/jpeg Size: 8771 bytes Desc: not available URL: From stolcke at icsi.berkeley.edu Fri Aug 24 00:07:40 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 24 Aug 2012 00:07:40 -0700 Subject: [SRILM User List] Does keep-unk work with lattice-tool and htk format? In-Reply-To: <7507354C-E6F2-4F91-8B9D-D75B1118023D@upc.edu> References: <4FB9B671.9080604@mit.edu> <4FBA8AE3.2070709@icsi.berkeley.edu> <9D363DB8-11DF-4534-AEB5-058E96E3A74C@upc.edu> <4FBC29CD.3010601@icsi.berkeley.edu> <7507354C-E6F2-4F91-8B9D-D75B1118023D@upc.edu> Message-ID: <5037283C.9070005@icsi.berkeley.edu> Congratulations, you found a bug! The patch attached to this message (to HTKLattice.cc) should fix this problem. Andreas On 8/21/2012 2:43 PM, Llu?s Formiga i Fanals wrote: > Hi Andreas, > > Sorry to bother you with this old issue. > > The two-step lattice-tool process worked perfectly. First the > rescoring and second the conversion to CN. > > But, unfortunately I have seen a few unks while rescoring the lattice > (not as many as writing the mesh). > > The command I use to rescore is: > > lattice-tool -lm ../../lm/interpolated-lm.en -in-lattice > wordlattice0.slf -read-htk -out-lattice out.slf-write-htk -keep-unk > -print-sent-tags -htk-logbase 2.71828 > > And I find lines like these: (Whithin these lines the tag should > be queit) > > J=26 S=19 E=24 W=qu a=0 l=-13.8261 J=27 S=19 E=25 W=que a=0 l=-11.4986 > J=28 S=19 E=26 W= a=0 l=-2.76367 J=29 S=19 E=27 W=quest a=0 > l=-10.831 J=30 S=19 E=28 W=quiet a=0 l=-10.57 J=31 S=19 E=29 W=quit > a=0 l=-10.4455 J=32 S=20 E=21 W=row a=0 l=-10.1076 J=33 S=21 E=24 W=qu > a=0 l=-14.9448 J=34 S=21 E=25 W=que a=0 l=-12.6173 J=35 S=21 E=26 > W= a=0 l=-3.88236 J=36 S=21 E=27 W=quest a=0 l=-11.9497 J=37 S=21 > E=28 W=quiet a=0 l=-11.6887 J=38 S=21 E=29 W=quit a=0 l=-11.0153 J=39 > S=22 E=19 W=arrow a=0 l=-12.6258 > > I have to say that I use the rescoring to give probabilities to the > archs from misspelling corrections. So I do not have any acoustic > scores. (I set all them equal). > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- *** lattice/src/HTKLattice.cc 3 Aug 2012 01:11:34 -0000 1.60 --- lattice/src/HTKLattice.cc 24 Aug 2012 07:02:40 -0000 *************** *** 1769,1776 **** toNode->word == vocab.seIndex()) || toNode->word == Vocab_None) ? HTK_null_word : ! (node->htkinfo && node->htkinfo->wordLabel ? ! node->htkinfo->wordLabel : vocab.getWord(toNode->word)), htkheader.useQuotes); } --- 1769,1776 ---- toNode->word == vocab.seIndex()) || toNode->word == Vocab_None) ? HTK_null_word : ! 
(toNode->htkinfo && toNode->htkinfo->wordLabel ? ! toNode->htkinfo->wordLabel : vocab.getWord(toNode->word)), htkheader.useQuotes); } From wrested at hotmail.de Tue Aug 28 05:06:36 2012 From: wrested at hotmail.de (hic et nunc) Date: Tue, 28 Aug 2012 12:06:36 +0000 Subject: [SRILM User List] (no subject) Message-ID: hello. i'm a newbie of srilm toolkit. when i used wb, kn, or mkn smoothing methods for lm making, i realized that some of ngrams are not in lm file (albeit they exists in count file).i checked for ngrams (4-5-6 ordered) and saw that, 3 and 3+grams which have 1 count are not included in lm file. is it possible to ignore this feature in srilm? if yes, could you tell me which part of the code should be changed? thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Tue Aug 28 10:32:05 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 28 Aug 2012 10:32:05 -0700 Subject: [SRILM User List] (no subject) In-Reply-To: References: Message-ID: <503D0095.9050201@icsi.berkeley.edu> On 8/28/2012 5:06 AM, hic et nunc wrote: > hello. i'm a newbie of srilm toolkit. > when i used wb, kn, or mkn smoothing methods for lm making, i > realized that some of ngrams are not in lm file (albeit they exists in > count file). > i checked for ngrams (4-5-6 ordered) and saw that, 3 and 3+grams which > have 1 count are not included in lm file. > is it possible to ignore this feature in srilm? if yes, could you tell > me which part of the code should be changed? This should be a FAQ. The answer to your question is at http://www.speech.sri.com/pipermail/srilm-user/2012q3/001276.html . Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From federico.sangati at gmail.com Sun Sep 2 03:59:16 2012 From: federico.sangati at gmail.com (Federico Sangati) Date: Sun, 2 Sep 2012 11:59:16 +0100 Subject: [SRILM User List] Predicting words Message-ID: Hi, Regarding next word prediction, I have tried the solution suggested by Andreas, but it doesn't seem to work: it predicts the same word in different contexts, and it always assumes that the prefix starts from the beginning of the sentence (see below). MAPFILE: shock shock 1961 1961 ? [same for all words occurring in vocabulary] UNK_NEXT_WORD maturing analyzing attended ? [list of all words occurring in vocabulary] INPUTFILE: No , UNK_NEXT_WORD No , UNK_NEXT_WORD But while , UNK_NEXT_WORD But while , UNK_NEXT_WORD The 49 stock specialist UNK_NEXT_WORD OUTPUTFILE: No , talent No , talent But while , talent But while , talent The 49 stock specialist talent Btw, I'm wondering why there is no way to use 'ngram' for this: it has this nice '-gen-prefixes file' option which is almost what we need, except that instead of a random word sequence conditioned on the prefix we need the most probable one (or just the most probable following word given the prefix for what it matters). It would be nice to know if there is any solution for this. Best, Federico Sangati University of Edinburgh > On Wed Aug 8 22:09:35 PDT 2012 Andreas Stolcke wrote: > Indeed you can use disambig, at least in theory to solve this problem. > > 1. prepare a map file of the form: > > a a > man man > ... [for all words occurring in your data] > UNKNOWN_WORD word1 word2 .... [list all words in the vocabulary > here] > > 2. train an LM of word sequences. > > 3. 
prepare disambig input of the form > > a man is sitting UNKNOWN_WORD > > You can also add known words to the right of UKNOWN_WORD if you have > that information (see the note about -fw-only below). > > 4. run disambig > > disambig -map MAPFILE -lm LMFILE -text INPUTFILE > > If you want to use only the left context of the UNKNOWN_WORD use the > -fw-only option. > > This is in theory. If your vocabulary is large it may be very slow and > take too much memory. I haven't tried it, so let me know if it works > for you. > > Andreas >> On 7/20/2012 5:04 AM, Nouf Al-Harbi wrote: >> Hello, >> I am new to language modeling and was hoping that someone can help me with the following. >> I try to predict a word given an input sentence. For example, I would like to get a word replacing the ... that has the highest probability in sentences such as ' A man is ...' (e.g. sitting). >> I try to use disambig tool but I couldn't found any example illustrate how to use it especially how how I can create the map file and what is the type of this file ( e.g. txt, arpa, ...). From wrested at hotmail.de Sun Sep 2 22:10:39 2012 From: wrested at hotmail.de (hic et nunc) Date: Mon, 3 Sep 2012 05:10:39 +0000 Subject: [SRILM User List] (no subject) In-Reply-To: References: Message-ID: hello again. i have a new question about lm ngram probs. as you know well, in lm file, the log probs are calculated like this: log [(count[n-gram]*d/count[(n-1)-gram] - count[(n-1)-gram_]] sometimes 1 is added to denominator, but sometimes not. what is the reason of this? thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Mon Sep 3 05:37:16 2012 From: njs at pobox.com (Nathaniel Smith) Date: Mon, 3 Sep 2012 13:37:16 +0100 Subject: [SRILM User List] Predicting words In-Reply-To: References: Message-ID: FYI, for others on the list and the archives-- After talking to Federico offline, I think he ended up solving his problem by using the Python bindings I wrote a while back to query the ngram model directly. Since they might be useful to others I went ahead and uploaded them to github as well: https://github.com/njsmith/pysrilm Download snapshot: https://github.com/njsmith/pysrilm/zipball/master -- Nathaniel Smith University of Edinburgh On Sun, Sep 2, 2012 at 11:59 AM, Federico Sangati wrote: > Hi, > > Regarding next word prediction, I have tried the solution suggested by Andreas, but it doesn't seem to work: it predicts the same word in different contexts, and it always assumes that the prefix starts from the beginning of the sentence (see below). > > MAPFILE: > shock shock > 1961 1961 > ? [same for all words occurring in vocabulary] > UNK_NEXT_WORD maturing analyzing attended ? [list of all words occurring in vocabulary] > > INPUTFILE: > No , UNK_NEXT_WORD > No , UNK_NEXT_WORD > But while , UNK_NEXT_WORD > But while , UNK_NEXT_WORD > The 49 stock specialist UNK_NEXT_WORD > > OUTPUTFILE: > No , talent > No , talent > But while , talent > But while , talent > The 49 stock specialist talent > > Btw, I'm wondering why there is no way to use 'ngram' for this: it has this nice '-gen-prefixes file' option which is almost what we need, except that instead of a random word sequence conditioned on the prefix we need the most probable one (or just the most probable following word given the prefix for what it matters). > It would be nice to know if there is any solution for this. 
> > Best, > Federico Sangati > University of Edinburgh > > >> On Wed Aug 8 22:09:35 PDT 2012 Andreas Stolcke wrote: >> Indeed you can use disambig, at least in theory to solve this problem. >> >> 1. prepare a map file of the form: >> >> a a >> man man >> ... [for all words occurring in your data] >> UNKNOWN_WORD word1 word2 .... [list all words in the vocabulary >> here] >> >> 2. train an LM of word sequences. >> >> 3. prepare disambig input of the form >> >> a man is sitting UNKNOWN_WORD >> >> You can also add known words to the right of UKNOWN_WORD if you have >> that information (see the note about -fw-only below). >> >> 4. run disambig >> >> disambig -map MAPFILE -lm LMFILE -text INPUTFILE >> >> If you want to use only the left context of the UNKNOWN_WORD use the >> -fw-only option. >> >> This is in theory. If your vocabulary is large it may be very slow and >> take too much memory. I haven't tried it, so let me know if it works >> for you. >> >> Andreas > >>> On 7/20/2012 5:04 AM, Nouf Al-Harbi wrote: >>> Hello, >>> I am new to language modeling and was hoping that someone can help me with the following. >>> I try to predict a word given an input sentence. For example, I would like to get a word replacing the ... that has the highest probability in sentences such as ' A man is ...' (e.g. sitting). >>> I try to use disambig tool but I couldn't found any example illustrate how to use it especially how how I can create the map file and what is the type of this file ( e.g. txt, arpa, ...). > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From stolcke at icsi.berkeley.edu Tue Sep 4 00:46:32 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 04 Sep 2012 00:46:32 -0700 Subject: [SRILM User List] (no subject) In-Reply-To: References: Message-ID: <5045B1D8.50803@icsi.berkeley.edu> On 9/2/2012 10:10 PM, hic et nunc wrote: > hello again. i have a new question about lm ngram probs. > as you know well, in lm file, the log probs are calculated like this: > log [(count[n-gram]*d/count[(n-1)-gram] - count[(n-1)-gram_]] > sometimes 1 is added to denominator, but sometimes not. what is the > reason of this? One is added to the denominator only a last resort when the smoothing results in n-gram probabilities that sum to 1. The following comment in NgramLM.cc explains why: > /* > * This is a hack credited to Doug Paul (by Roni Rosenfeld in > * his CMU tools). It may happen that no probability mass > * is left after totalling all the explicit probs, typically > * because the discount coefficients were out of range and > * forced to 1.0. Unless we have seen all vocabulary words in > * this context, to arrive at some non-zero backoff mass, > * we try incrementing the denominator in the estimator by 1. > * Another hack: If the discounting method uses interpolation > * we first try disabling that because interpolation removes > * probability mass. > */ This happens occasionally with GT smoothing due to degenerate count-of-counts statistics. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Tue Sep 4 17:10:13 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 04 Sep 2012 17:10:13 -0700 Subject: [SRILM User List] Predicting words In-Reply-To: References: Message-ID: <50469865.2010309@icsi.berkeley.edu> I suspect there were some problems with the construction of the map file. 
For one thing, when you have a word that is also a valid numeric string (like the second line in your example) you cannot leave out the explicit mapping probability. Also, it turns out that it is much more convenient to use the disambig -classes option instead of -map to supply the mapping information (this allows you to give the mapping one-word-at-a-time for the "unknown" token). Anyway, here is a short example that demonstrates that my instructions worked in principle ;-). It uses the trigram LM supplied with SRILM. # construct the map file in classes format ngram -order 1 -lm $SRILM/lm/test/tests/ngram-count-gt/swbd.3bo.gz -write-vocab - | \ gawk '{ print $1, 1, $1; print $1, 1, "UNKNOWN-WORD" }' lm.vocab > test.mapfile # fill in the blanks (uses both left and right word context). Note -order 2 is default so specify -order 3 disambig -order 3 -classes test.mapfile -lm $SRILM/lm/test/tests/ngram-count-gt/swbd.3bo.gz -text - INPUT: what a great UNKNOWN-WORD OUTPUT: what a great time INPUT: that is the stupidest UNKNOWN-WORD i've heard OUTPUT: that is the stupidest thing i've heard Seems to work ;-) Andreas On 9/2/2012 3:59 AM, Federico Sangati wrote: > Hi, > > Regarding next word prediction, I have tried the solution suggested by Andreas, but it doesn't seem to work: it predicts the same word in different contexts, and it always assumes that the prefix starts from the beginning of the sentence (see below). > > MAPFILE: > shock shock > 1961 1961 > ? [same for all words occurring in vocabulary] > UNK_NEXT_WORD maturing analyzing attended ? [list of all words occurring in vocabulary] > > INPUTFILE: > No , UNK_NEXT_WORD > No , UNK_NEXT_WORD > But while , UNK_NEXT_WORD > But while , UNK_NEXT_WORD > > OUTPUTFILE: > No , talent > No , talent > But while , talent > But while , talent > > > Btw, I'm wondering why there is no way to use 'ngram' for this: it has this nice '-gen-prefixes file' option which is almost what we need, except that instead of a random word sequence conditioned on the prefix we need the most probable one (or just the most probable following word given the prefix for what it matters). > It would be nice to know if there is any solution for this. > > Best, > Federico Sangati > University of Edinburgh > > >> On Wed Aug 8 22:09:35 PDT 2012 Andreas Stolcke wrote: >> Indeed you can use disambig, at least in theory to solve this problem. >> >> 1. prepare a map file of the form: >> >> a a >> man man >> ... [for all words occurring in your data] >> UNKNOWN_WORD word1 word2 .... [list all words in the vocabulary >> here] >> >> 2. train an LM of word sequences. >> >> 3. prepare disambig input of the form >> >> a man is sitting UNKNOWN_WORD >> >> You can also add known words to the right of UKNOWN_WORD if you have >> that information (see the note about -fw-only below). >> >> 4. run disambig >> >> disambig -map MAPFILE -lm LMFILE -text INPUTFILE >> >> If you want to use only the left context of the UNKNOWN_WORD use the >> -fw-only option. >> >> This is in theory. If your vocabulary is large it may be very slow and >> take too much memory. I haven't tried it, so let me know if it works >> for you. >> >> Andreas >>> On 7/20/2012 5:04 AM, Nouf Al-Harbi wrote: >>> Hello, >>> I am new to language modeling and was hoping that someone can help me with the following. >>> I try to predict a word given an input sentence. For example, I would like to get a word replacing the ... that has the highest probability in sentences such as ' A man is ...' (e.g. sitting). 
>>> I try to use disambig tool but I couldn't found any example illustrate how to use it especially how how I can create the map file and what is the type of this file ( e.g. txt, arpa, ...). From stolcke at icsi.berkeley.edu Tue Sep 4 18:55:36 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 04 Sep 2012 18:55:36 -0700 Subject: [SRILM User List] Predicting words In-Reply-To: Your message of Tue, 04 Sep 2012 17:10:13 -0700. <50469865.2010309@icsi.berkeley.edu> Message-ID: <201209050155.q851ta91013037@fruitcake.ICSI.Berkeley.EDU> In message <50469865.2010309 at icsi.berkeley.edu>I wrote: > > Anyway, here is a short example that demonstrates that my instructions > worked in principle ;-). > It uses the trigram LM supplied with SRILM. > > # construct the map file in classes format > ngram -order 1 -lm $SRILM/lm/test/tests/ngram-count-gt/swbd.3bo.gz > -write-vocab - | \ > gawk '{ print $1, 1, $1; print $1, 1, "UNKNOWN-WORD" }' lm.vocab > > test.mapfile Copy-and-paste error. The above command should be ngram -order 1 -lm $SRILM/lm/test/tests/ngram-count-gt/swbd.3bo.gz -write-vocab - | \ gawk '{ print $1, 1, $1; print $1, 1, "UNKNOWN-WORD" }' > test.mapfile Andreas From chenmengdx at gmail.com Wed Sep 5 03:06:51 2012 From: chenmengdx at gmail.com (Meng Chen) Date: Wed, 5 Sep 2012 18:06:51 +0800 Subject: [SRILM User List] Question about select-vocab Message-ID: Hi, I am using the *select-vocab* command to choose vocabulary from corpus A and B in a Chinese speech recognition task, the command is as follows: *select-vocab -heldout dev A B > vocab_with_weight* Then I saw the prompts below: *Iter 0: lambdas = (0.5 0.5)* *Iter 1: lambdas = (0.443075 0.556925) log P(held-out) = -374805.0047 PPL = 6937.8495* *Iter 2: lambdas = (0.399799 0.600201) log P(held-out) = -374319.5890 PPL = 6858.8301* *Iter 3: lambdas = (0.366822 0.633178) log P(held-out) = -374032.9165 PPL = 6812.5869* *Iter 4: lambdas = (0.341533 0.658467) log P(held-out) = -373860.8231 PPL = 6784.9764* I want to ask what's the meaning of PPL. Does the command train a LM with corpus A and B first, then calculate the PPL of heldout data with the LM? If corpus A and B are 10GB each, how much the heldout data should be at least in order to choose a reasonable vocabulary? Thanks! Meng CHEN -------------- next part -------------- An HTML attachment was scrubbed... URL: From venkataraman.anand at gmail.com Wed Sep 5 13:05:04 2012 From: venkataraman.anand at gmail.com (Anand Venkataraman) Date: Wed, 5 Sep 2012 13:05:04 -0700 Subject: [SRILM User List] Question about select-vocab Message-ID: I realized I was off the list and just rejoined (thanks Andreas). Meng - In response to your questions about select-vocab: 1. Yes, you're right about the PPL. The program trains separate unigram LMs for the given corpora (A & B) and the diagnostic output prints the PPL of the held-out set according to the _best_ word-level mixture of A.1bo and B.1bo. 2. Hard to say how big the held-out set ought to be for given A and B sizes. My only suggestion is to ensure that the held-out set contains a representative sample of words that you expect to see in the domain. If in doubt, you can always extract the domain vocabulary and ensure that the held-out set covers the top N% (by freq) of the domain words (for some suitable N) Hope this helps. & -------------- next part -------------- An HTML attachment was scrubbed... 
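A brief note on the PPL column, under the usual convention that PPL = 10 ** (-logP / N) where N is the number of scored tokens (exactly which tokens are counted is not shown in the output): the iteration figures above are internally consistent, and one can back out the approximate held-out size, e.g. from Iter 4:

from math import log10
logP, ppl = -373860.8231, 6784.9764    # the Iter 4 figures quoted above
N = -logP / log10(ppl)
print(N)                               # ~ 9.76e4, i.e. a held-out set of roughly 100k scored tokens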
URL: From stolcke at icsi.berkeley.edu Wed Sep 5 13:36:29 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 05 Sep 2012 13:36:29 -0700 Subject: [SRILM User List] Question about select-vocab In-Reply-To: References: Message-ID: <5047B7CD.1060406@icsi.berkeley.edu> On 9/5/2012 1:05 PM, Anand Venkataraman wrote: > I realized I was off the list and just rejoined (thanks Andreas). > > Meng - In response to your questions about select-vocab: > > 1. Yes, you're right about the PPL. The program trains separate > unigram LMs for the given corpora (A & B) and the diagnostic > output prints the PPL of the held-out set according to the _best_ > word-level mixture of A.1bo and B.1bo. > 2. Hard to say how big the held-out set ought to be for given A and B > sizes. My only suggestion is to ensure that the held-out set > contains a representative sample of words that you expect to see > in the domain. If in doubt, you can always extract the domain > vocabulary and ensure that the held-out set covers the top N% (by > freq) of the domain words (for some suitable N) > > Hope this helps. > > & > Thanks Anand. Good to have you back on the list. Meng: in case this wasn't clear, "PPL" is short for "perplexity". Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From kcananda at gmail.com Thu Sep 6 06:58:56 2012 From: kcananda at gmail.com (Ananda K.C.) Date: Thu, 6 Sep 2012 19:43:56 +0545 Subject: [SRILM User List] Regarding ngram Message-ID: hi, how to print the output probability calculation of the command "ngram -lm /home/ananda/Desktop/reporting/probability -debug 2 -counts /home/ananda/Desktop/reporting/countoutput.txt" in a file. Anadna -------------- next part -------------- An HTML attachment was scrubbed... URL: From kcananda at gmail.com Sat Sep 15 08:01:12 2012 From: kcananda at gmail.com (Ananda K.C.) Date: Sat, 15 Sep 2012 20:46:12 +0545 Subject: [SRILM User List] Regarding backoff using Message-ID: hi all of you, I have send my test file containing corpus,vocab,and final output bigram probability.Also i have send you all the command in command file. My main problem is when we use Backoff with Good Turing discounting.Then p( He | ) = [2gram] 0.0348584 [ -1.45769 ] p( I | ) = [2gram] 0.0348584 [ -1.45769 ] p( this | ) = [2gram] 0.0348584 [ -1.45769 *2 ] is only find out. But it should find the probabilty with all the words in the vocabulary,if bigram count is zero then it should move towards unigram count to assign some probabilty to bigram. like p( am | ) p(going| ) p( kath | ) and so on with all the word in the vocabulary,which is not calculated. Since we know that when the bigram count is zero ,we should get probability from unigram count.May be i have done some mistake in commands. Please help me to solve my problem. regards, Ananda -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- this is ananda this is bhawana I am going to kath He is going to kath -------------- next part -------------- A non-text attachment was scrubbed... 
Name: command Type: application/octet-stream Size: 384 bytes Desc: not available URL: -------------- next part -------------- 4 this 2 I 1 He 1 this 2 this is 2 is 3 is ananda 1 is bhawana 1 is going 1 ananda 1 ananda 1 4 bhawana 1 bhawana 1 I 1 I am 1 am 1 am going 1 going 2 going to 2 to 2 to kath 2 kath 2 kath 2 He 1 He is 1 -------------- next part -------------- p( | ) = [1gram] 0.2 [ -0.69897 *4 ] p( | ) = [1gram] 0 [ -inf *4 ] p( He | ) = [1gram] 0.05 [ -1.30103 ] p( I | ) = [1gram] 0.05 [ -1.30103 ] p( am | ) = [1gram] 0.05 [ -1.30103 ] p( ananda | ) = [1gram] 0.05 [ -1.30103 ] p( bhawana | ) = [1gram] 0.05 [ -1.30103 ] p( going | ) = [1gram] 0.1 [ -1 *2 ] p( is | ) = [1gram] 0.15 [ -0.823909 *3 ] p( kath | ) = [1gram] 0.1 [ -1 *2 ] p( this | ) = [1gram] 0.1 [ -1 *2 ] p( to | ) = [1gram] 0.1 [ -1 *2 ] p( He | ) = [2gram] 0.2 [ -0.69897 ] p( I | ) = [2gram] 0.2 [ -0.69897 ] p( this | ) = [2gram] 0.4 [ -0.39794 *2 ] p( is | He ) = [2gram] 0.5 [ -0.30103 ] p( am | I ) = [2gram] 0.5 [ -0.30103 ] p( going | am ) = [2gram] 0.5 [ -0.30103 ] p( | ananda ) = [2gram] 0.5 [ -0.30103 ] p( | bhawana ) = [2gram] 0.5 [ -0.30103 ] p( to | going ) = [2gram] 0.666667 [ -0.176091 *2 ] p( ananda | is ) = [2gram] 0.25 [ -0.60206 ] p( bhawana | is ) = [2gram] 0.25 [ -0.60206 ] p( going | is ) = [2gram] 0.25 [ -0.60206 ] p( | kath ) = [2gram] 0.666667 [ -0.176091 *2 ] p( is | this ) = [2gram] 0.666667 [ -0.176091 *2 ] p( kath | to ) = [2gram] 0.666667 [ -0.176091 *2 ] 8 sentences, 36 words, 0 OOVs 4 zeroprobs, logprob= -26.6866 ppl= 4.64693 ppl1= 6.82272 file /home/ananda/Desktop/countout.txt: 8 sentences, 36 words, 0 OOVs 4 zeroprobs, logprob= -26.6866 ppl= 4.64693 ppl1= 6.82272 -------------- next part -------------- \data\ ngram 1=12 ngram 2=15 \1-grams: -0.69897 -99 -0.60206 -1.30103 He -0.2304489 -1.30103 I -0.2787536 -1.30103 am -0.2552725 -1.30103 ananda -0.20412 -1.30103 bhawana -0.20412 -1 going -0.4313638 -0.8239087 is -0.5051499 -1 kath -0.3802113 -1 this -0.4065402 -1 to -0.4313638 \2-grams: -0.69897 He -0.69897 I -0.39794 this -0.30103 He is -0.30103 I am -0.30103 am going -0.30103 ananda -0.30103 bhawana -0.1760913 going to -0.60206 is ananda -0.60206 is bhawana -0.60206 is going -0.1760913 kath -0.1760913 this is -0.1760913 to kath \end\ -------------- next part -------------- this is ananda bhawana I am going to kath He From julia_hancke at yahoo.com Sat Sep 22 17:50:44 2012 From: julia_hancke at yahoo.com (Julia Hancke) Date: Sat, 22 Sep 2012 17:50:44 -0700 (PDT) Subject: [SRILM User List] Hi! Message-ID: <1348361444.18417.BPMail_high_noncarrier@web113519.mail.gq1.yahoo.com> http://grange-aux-ormes.com/work.at.home.online.php?owmarket=9yov0