From dmytro.prylipko at ovgu.de Mon Oct 1 08:34:28 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Mon, 01 Oct 2012 17:34:28 +0200
Subject: [SRILM User List] Strange log probabilities
Message-ID: <5069B804.5080107@ovgu.de>
Hi,
I am sorry for such a long e-mail, but I found some strange behavior in
the log probability calculation of the unigrams.
I have two language models trained on two text sets. Actually, those
sets are just two different sentences, repeated 100 times each:
ACTION_REJECT_003.train.txt:
der gewünschte artikel ist nicht im koffer enthalten (x 100)
ACTION_REJECT_004.train.txt:
ihre aussage kann nicht verarbeitet werden (x 100)
Also, I have defined a few specific categories to build a class-based LM.
One class is numbers (ein, eine, eins, einundachtzig, etc.), the second
one comprises names of specific items related to the task domain
(achselshirt, blusen), and the last one consists of just two words:
'wurde' and 'wurden'.
So, I am building two expanded class-based LMs using Witten-Bell
discounting (I also tried the default Good-Turing, but with the same result):
replace-words-with-classes classes=wizard.class.defs ACTION_REJECT_003.train.txt > ACTION_REJECT_003.train.class.txt

ngram-count -text ACTION_REJECT_003.train.class.txt -lm ACTION_REJECT_003.lm -order 3 -wbdiscount1 -wbdiscount2 -wbdiscount3

ngram -lm ACTION_REJECT_003.lm -write-lm ACTION_REJECT_003.expanded.lm -order 3 -classes wizard.class.defs -expand-classes 3 -expand-exact 3 -vocab wizard.wlist
The second LM (ACTION_REJECT_004) is built using the same approach. But
these two models are pretty different.
ACTION_REJECT_003.expanded.lm has reasonable smoothed log probabilities
for the unseen unigrams:
\data\
ngram 1=924
ngram 2=9
ngram 3=8
\1-grams:
-0.9542425
-10.34236
-99 -99
-10.34236 ab
-10.34236 abgeben
[...]
-10.34236 überschritten
-10.34236 übertragung
\2-grams:
0 der 0
0 artikel ist 0
0 der gewünschte 0
0 enthalten
0 gewünschte artikel 0
0 im koffer 0
0 ist nicht 0
0 koffer enthalten 0
0 nicht im 0
\3-grams:
0 gewünschte artikel ist
0 der gewünschte
0 koffer enthalten
0 der gewünschte artikel
0 nicht im koffer
0 artikel ist nicht
0 im koffer enthalten
0 ist nicht im
\end\
Whereas in ACTION_REJECT_004.expanded.lm all unseen unigrams have a zero
probability:
\data\
ngram 1=924
ngram 2=7
ngram 3=6
\1-grams:
-0.845098
-99
-99 -99
-99 ab
-99 abgeben
[...]
-0.845098 aussage -99
[...]
-99 überschritten
-99 übertragung
\2-grams:
0 ihre 0
0 aussage kann 0
0 ihre aussage 0
0 kann nicht 0
0 nicht verarbeitet 0
0 sagen
0 verarbeitet sagen 0
\3-grams:
0 ihre aussage kann
0 ihre aussage
0 aussage kann nicht
0 kann nicht verarbeitet
0 verarbeitet sagen
0 nicht verarbeitet sagen
\end\
None of the words in either training sentence belongs to any class.
Also, I found that removing the last word from the second training
sentence fixes the problem.
Thus, for the following sentence:
ihre aussage kann nicht
the corresponding LM has correctly discounted probabilities (also around
-10). Replacing 'werden' with any other word (I tried 'sagen', 'abgeben'
and 'beer') causes the same problem again.
Is it a bug, or am I doing something wrong?
I would appreciate any advice. I can also provide all the necessary
data and scripts if needed.
Sincerely yours,
Dmytro Prylipko.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From dmytro.prylipko at ovgu.de Tue Oct 2 02:48:16 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Tue, 02 Oct 2012 11:48:16 +0200
Subject: [SRILM User List] Strange log probabilities
In-Reply-To:
References: <5069B804.5080107@ovgu.de>
Message-ID: <506AB860.2010606@ovgu.de>
Hi,
Thank you for the quick feedback.
I found out something else remarkable: I tried to run the script on our
cluster under CentOS (my workstation is running Kubuntu 12.04) and
discovered that on the cluster all the LMs have zero probabilities for
unseen 1-grams. No smoothing at all!
The setup is of course different. Output of uname -a on the cluster:
Linux frontend1.service 2.6.18-164.11.1.el5 #1 SMP Wed Jan 20 07:32:21
EST 2010 x86_64 x86_64 x86_64 GNU/Linux
On the workstation:
Linux KS-PC113 3.2.0-31-generic-pae #50-Ubuntu SMP Fri Sep 7 16:39:45
UTC 2012 i686 i686 i386 GNU/Linux
SRILM on the cluster was built with MACHINE_TYPE=i686-m64 (with and
without the _C option; both give the same result), on the workstation with
MACHINE_TYPE=i686-gcc4.
The LANG variable is en_US.UTF-8 on both machines. Replacing umlauts with
regular characters made no difference.
What exactly do you mean by 'behavior of your local awk installation
when it encounters extended chars'?
So, I am sending you the minimal dataset for replicating it. The shell
script buildtaglm.sh does all the work.
Yours,
Dmytro Prylipko.
On Tue 02 Oct 2012 02:24:00 AM CEST, Anand Venkataraman wrote:
>
> On a first reading of your email I'm indeed surprised that the results
> differ between the two texts. Have you tried replacing the umlaut in
> the first corpus with a regular "u" and checked if you still get the
> same behavior? Check the LANG environment variable and the behavior of
> your local awk installation when it encounters extended chars.
>
> If the problem persists, please send me the two corpora, along with
> the class file and I'll be glad to take a look for you.
>
> &
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testbed.zip
Type: application/zip
Size: 21408 bytes
Desc: not available
URL:
From venkataraman.anand at gmail.com Tue Oct 2 13:35:07 2012
From: venkataraman.anand at gmail.com (Anand Venkataraman)
Date: Tue, 2 Oct 2012 13:35:07 -0700
Subject: [SRILM User List] Strange log probabilities
In-Reply-To: <506AB860.2010606@ovgu.de>
References: <5069B804.5080107@ovgu.de>
<506AB860.2010606@ovgu.de>
Message-ID:
The problem is that your final vocabulary is introduced as a surprise in
the last step (to ngram). When the class expansion likelihoods sum to
exactly 1.0, there is no room for novelty in the backoff orders at this
stage.
To get the correct behavior you must prime the initial language model with
a vocabulary of either all the class tags or the individual words themselves.
E.g.:
awk '{print $1}' wizard.class.defs | sort -u > wizard.classnames.txt

cat $datafile \
  | replace-words-with-classes classes=wizard.class.defs - \
  | ngram-count -text - -lm - -order 1 -wbdiscount \
      -vocab wizard.classnames.txt \
  > your-lm.1bo

# Expanding classes in your-lm.1bo now will give you the desired behavior.
HTH
&
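For anyone replaying this fix outside the thread, here is a minimal, self-contained sketch of the vocabulary-priming step. Only the awk | sort -u pipeline is taken from the recipe above; the class-definition entries below are invented for illustration, since the real wizard.class.defs from the thread is not shown.

```shell
# Build a toy class-definitions file in the style SRILM expects:
# each line names the class, an (optional) expansion probability,
# and the member word(s).  These entries are made up.
cat > wizard.class.defs <<'EOF'
NUMBER 0.5 ein
NUMBER 0.5 eins
ITEM 0.5 achselshirt
ITEM 0.5 blusen
WURDE 0.5 wurde
WURDE 0.5 wurden
EOF

# The priming step from the fix: collect the class names
# themselves into a vocabulary file, one per line, deduplicated.
awk '{print $1}' wizard.class.defs | sort -u > wizard.classnames.txt

cat wizard.classnames.txt
# -> ITEM
#    NUMBER
#    WURDE
```

The resulting wizard.classnames.txt is what gets passed to ngram-count via -vocab, so the class tags are in the model's vocabulary from the start rather than appearing as a surprise at expansion time.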
On Tue, Oct 2, 2012 at 2:48 AM, Dmytro Prylipko wrote:
> Hi,
>
> Thank you for the quick feedback.
>
> I found out something else remarkable: I tried to run the script on our
> cluster under CentOS (my workstation is running Kubuntu 12.04) and
> discovered that on the cluster all the LMs have zero probabilities for
> unseen 1-grams. No smoothing at all!
>
> The setup is of course different. Output of the uname -a on the cluster:
>
> Linux frontend1.service 2.6.18-164.11.1.el5 #1 SMP Wed Jan 20 07:32:21 EST
> 2010 x86_64 x86_64 x86_64 GNU/Linux
>
> On the workstation:
>
> Linux KS-PC113 3.2.0-31-generic-pae #50-Ubuntu SMP Fri Sep 7 16:39:45 UTC
> 2012 i686 i686 i386 GNU/Linux
>
> SRILM on the cluster was built with MACHINE_TYPE=i686-m64 (with and
> without the _C option; both give the same result), on the workstation with
> MACHINE_TYPE=i686-gcc4
>
> LANG variable is en_US.UTF-8 on both machines. Replacing umlauts with
> regular characters made no difference.
>
> What exactly do you mean by 'behavior of your local awk installation
> when it encounters extended chars'?
>
> So, I am sending you the minimal dataset for replicating it. Shell script
> buildtaglm.sh does all the work.
>
> Yours,
> Dmytro Prylipko.
>
>
> On Tue 02 Oct 2012 02:24:00 AM CEST, Anand Venkataraman wrote:
>
>
> On a first reading of your email I'm indeed surprised that the results
> differ between the two texts. Have you tried replacing the umlaut in
> the first corpus with a regular "u" and checked if you still get the
> same behavior. Check the LANG environment variable and the behavior of
> your local awk installation when it encounters extended chars.
>
> If the problem persists, please send me the two corpora, along with
> the class file and I'll be glad to take a look for you.
>
> &
From bibek9500 at gmail.com Thu Oct 4 09:11:14 2012
From: bibek9500 at gmail.com (bibek kc)
Date: Thu, 4 Oct 2012 21:56:14 +0545
Subject: [SRILM User List] help regarding Katz backoff bigram and trigram
model
Message-ID:
Hi all,
I am new to the SRILM toolkit.
I want to build a Katz backoff bigram and trigram model where the value
of K=5, and also calculate the Katz backoff bigram and trigram
probabilities.
If possible, please list the steps to build the model and calculate the
probabilities.
Regards,
bibek
From dyuret at ku.edu.tr Sun Oct 7 01:05:31 2012
From: dyuret at ku.edu.tr (Deniz Yuret)
Date: Sun, 7 Oct 2012 11:05:31 +0300
Subject: [SRILM User List] finding likely substitutes quickly
In-Reply-To:
References:
Message-ID:
Dear SRILM users,
I have developed an algorithm (FASTSUBS) that can quickly generate the
most likely word substitutes from an n-gram model. We have used
FASTSUBS to achieve state-of-the-art results in unsupervised
part-of-speech induction (EMNLP 2012). The paper, the code, and a dataset
with the top 100 substitutes of each token in the WSJ section of the
Penn Treebank are available at http://goo.gl/jzKH0.
best,
deniz
From gregor.donaj at uni-mb.si Thu Oct 11 07:59:23 2012
From: gregor.donaj at uni-mb.si (Gregor Donaj)
Date: Thu, 11 Oct 2012 16:59:23 +0200
Subject: [SRILM User List] Why does 'ngram -factored' need the countfile
Message-ID: <5076DECB.4080301@uni-mb.si>
Hi,
I'm trying to rescore factored hypotheses with ngram with the
-factored option. I realized that the program requires the countfile to
be present as specified in the flm definition file and that it also
seems to be loaded into memory. The same happens with fngram. Why is this so?
Since for calculating probabilities and perplexities I only need the
actual language model file and not the counts, this is a bit annoying, as
my countfiles are sometimes larger than my RAM.
I kind of "solved" the problem by creating an empty countfile. I tested
this on a small example and saw that it calculates the rescored
probabilities fine. Is there any way to tell ngram not to look for the
countfile? I guess that would be a better solution than just giving the
program a dummy countfile that doesn't correspond to the language model
file.
Thanks
--
Gregor Donaj, univ. dipl. inž. el., univ. dipl. mat.
Laboratorij za digitalno procesiranje signalov
Fakulteta za elektrotehniko, računalništvo in informatiko
Smetanova ulica 17, 2000 Maribor
Tel.: 02/220 72 05
E-mail: gregor.donaj at uni-mb.si
Digital Signal Processing Laboratory
Faculty of Electrical Engineering and Computer Science
Smetanova ulica 17, 2000 Maribor, Slovenia
Tel.: +386 2 220 72 05
From stolcke at icsi.berkeley.edu Thu Oct 11 09:41:48 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 11 Oct 2012 09:41:48 -0700
Subject: [SRILM User List] Why does 'ngram -factored' need the countfile
In-Reply-To: <5076DECB.4080301@uni-mb.si>
References: <5076DECB.4080301@uni-mb.si>
Message-ID: <5076F6CC.80807@icsi.berkeley.edu>
On 10/11/2012 7:59 AM, Gregor Donaj wrote:
> Hi,
>
> I'm trying to rescore factored hypotheses with ngram with the
> -factored option. I realized that the program requires the countfile
> to be present as specified in the flm definition file and that it also
> seems to be loaded into memory. Same with using fngram. Why is this so?
>
> Since for calculating probabilities and perplexities I only need the
> actual language model file and not the counts, this is a bit annoying
> as my countfiles are sometimes larger than my RAM.
>
> I kind of "solved" the problem by creating an empty countfile. I
> tested this on a small example and saw that it calculates the rescored
> probabilities fine. Is there any way to tell ngram not to look for the
> countfile? I guess that would be a better solution than just giving
> the program a dummy countfile that doesn't correspond to the language
> model file.
>
> Thanks
>
>
I would agree with you, but I'm cc-ing Jeff Bilmes, who wrote the
original code and might know of other reasons for handling the
countfiles the way it is done now.
If empty countfiles work for you then a quick workaround is to write a
few lines of perl that replace the count files with /dev/null (no need
to create actual empty files) in any given FLM model file.
Andreas
From stolcke at icsi.berkeley.edu Thu Oct 11 11:15:22 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 11 Oct 2012 11:15:22 -0700
Subject: [SRILM User List] Why does 'ngram -factored' need the countfile
In-Reply-To: <5076F6CC.80807@icsi.berkeley.edu>
References: <5076DECB.4080301@uni-mb.si> <5076F6CC.80807@icsi.berkeley.edu>
Message-ID: <50770CBA.5050707@icsi.berkeley.edu>
FYI, here is Jeff's response, which didn't get propagated to the list
since he isn't subscribed:
On 10/11/2012 10:37 AM, Jeff Bilmes wrote:
> For some backoff strategies (which can only be determined based on the
> options associated with the backoff graph), one does need the count
> file to determine how to do backoff. If I remember correctly, I think
> that the check for existence of count file is done at a stage in the
> code far different than when it is determined if it is needed or not
> which might be the reason why it just, by default, always asks for
> one. But if you are certain that in your backoff options associated
> with the backoff graph it is not necessary to have a count file, then
> it should be safe to use the /dev/null solution mentioned by Andreas
> below ...
Andreas
From gregor.donaj at uni-mb.si Fri Oct 12 01:25:38 2012
From: gregor.donaj at uni-mb.si (Gregor Donaj)
Date: Fri, 12 Oct 2012 10:25:38 +0200
Subject: [SRILM User List] Why does 'ngram -factored' need the countfile
In-Reply-To: <50770CBA.5050707@icsi.berkeley.edu>
References: <5076DECB.4080301@uni-mb.si> <5076F6CC.80807@icsi.berkeley.edu>
<50770CBA.5050707@icsi.berkeley.edu>
Message-ID: <5077D402.9080905@uni-mb.si>
Thank you for your answers. I already suspected it had something to do with
backoff strategies. I am currently experimenting only on models with
fixed backoff paths, so I will use /dev/null.
Gregor
On 10/11/2012 08:15 PM, Andreas Stolcke wrote:
> FYI, here is Jeff's response, which didn't get propagated to the list
> since he isn't subscribed:
>
> On 10/11/2012 10:37 AM, Jeff Bilmes wrote:
>> For some backoff strategies (which can only be determined based on
>> the options associated with the backoff graph), one does need the
>> count file to determine how to do backoff. If I remember correctly, I
>> think that the check for existence of count file is done at a stage
>> in the code far different than when it is determined if it is needed
>> or not which might be the reason why it just, by default, always asks
>> for one. But if you are certain that in your backoff options
>> associated with the backoff graph it is not necessary to have a count
>> file, then it should be safe to use the /dev/null solution mentioned
>> by Andreas below ...
>
> Andreas
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
>
From alphabnu at gmail.com Fri Oct 12 10:17:52 2012
From: alphabnu at gmail.com (Xiaolin Xie)
Date: Fri, 12 Oct 2012 12:17:52 -0500
Subject: [SRILM User List] ask help with calculating word conditional
probability
Message-ID:
Hi SRILM users,
I am working on a project that needs to calculate the conditional
probability of each word given its previous two words in a paragraph. A
language model has been trained on a training set. Do you guys have any
idea how to directly calculate the conditional probability
p(W_k | W_k-1, W_k-2) using the SRILM toolkit and the trained language
model? Thanks a lot! I really appreciate any help you can offer.
Xiaolin.
From stolcke at icsi.berkeley.edu Fri Oct 12 10:45:41 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 12 Oct 2012 10:45:41 -0700
Subject: [SRILM User List] ask help with calculating word conditional
probability
In-Reply-To:
References:
Message-ID: <50785745.8090909@icsi.berkeley.edu>
On 10/12/2012 10:17 AM, Xiaolin Xie wrote:
>
> Hi SRILM users.
>
> I am working on a project that needs to calculate the conditional
> probability of each word given its previous two words in a paragraph.
> A language model has been trained from a training set. Do you guys
> have any idea about how to directly calculate the conditional
> probability p(W_k|W_k-1, W_k-2), using the SRILM toolkit and the
> trained language model? Thanks a lot! I really appreciate any help you
> can offer.
>
One method is described in
http://www.speech.sri.com/pipermail/srilm-user/2012q3/001314.html .
Hope that helps,
Andreas
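As a complement to the linked post, a common way to get per-word conditional probabilities out of SRILM is ngram -ppl with -debug 2, which prints p(word | preceding context) for every word of the input. This is only a sketch: trained.lm is a placeholder for the already-trained model, and the snippet deliberately skips the ngram call when SRILM is not on the PATH.

```shell
# One sentence per line, as ngram -ppl expects.
cat > paragraph.txt <<'EOF'
this is a small test paragraph
EOF

# With -debug 2, ngram -ppl reports the conditional probability of
# each word given its context under the loaded model -- for a
# trigram model, exactly p(W_k | W_k-1, W_k-2).
# trained.lm is a placeholder for your trained trigram model.
if command -v ngram >/dev/null 2>&1; then
    ngram -lm trained.lm -order 3 -ppl paragraph.txt -debug 2
else
    echo "SRILM's ngram not found on PATH; skipping"
fi
```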
From yuan at ks.cs.titech.ac.jp Sat Oct 13 02:37:39 2012
From: yuan at ks.cs.titech.ac.jp (yuan liang)
Date: Sat, 13 Oct 2012 18:37:39 +0900
Subject: [SRILM User List] lattice rescoring with conventional LM and FLM
Message-ID:
Hi srilm users,
Now I'm using 'lattice-tool' to rescore lattices; my goal is to use a
Factored Language Model (FLM) score to replace the original language
model score in the word lattice.
1) First, in the baseline system, I used a conventional bigram LM to do
speech recognition and generate the HTK word lattice (we name it "Lattice_1").
Then I tried to use a conventional trigram LM to rescore "Lattice_1",
using:
"lattice-tool -in-lattice Lattice_1 -unk -vocab [voc_file] -read-htk
-no-nulls -no-htk-nulls -lm [Trigram_file] -htk-lmscale 15 -htk-logbase
2.71828183 -posterior-scale 15 -write-htk -out-lattice Lattice_2"
I just want the new trigram LM scores to replace the old LM scores in
"Lattice_1", so I think "Lattice_2" and "Lattice_1" should have the same
size; only each word's LM score should differ. But I found that
"Lattice_2" is larger than "Lattice_1". Did I miss something? How can I
replace only the LM scores without expanding the size of the lattice?
2) I used a trigram in FLM format to rescore "Lattice_1":
First I converted all word nodes (HTK format) to the FLM representation;
then I rescored with:
"lattice-tool -in-lattice Lattice_1 -unk -vocab [voc_file]
-read-htk -no-nulls -no-htk-nulls -factored -lm
[FLM_specification_file] -htk-lmscale 15 -htk-logbase 2.71828183
-posterior-scale 15 -write-htk -out-lattice Lattice_3"
I think "Lattice_2" and "Lattice_3" should be the same, since the
perplexity of the trigram and of the trigram in FLM format is the same.
However, they are different. Did I miss something?
3) I also checked the accuracy of the decoding results obtained with
"Lattice_2" and "Lattice_3":
the Viterbi decoding result is the same;
the n-best lists are almost the same, but using "Lattice_2" is
better than using "Lattice_3";
the posterior decoding result is quite different, and using
"Lattice_2" is better than using "Lattice_3".
Did I miss something when using the FLM to rescore the lattice?
Thank you very much!
Yuan
From alphabnu at gmail.com Sun Oct 14 10:15:42 2012
From: alphabnu at gmail.com (Xiaolin Xie)
Date: Sun, 14 Oct 2012 12:15:42 -0500
Subject: [SRILM User List] ask help with calculating word conditional
probability
In-Reply-To: <50785745.8090909@icsi.berkeley.edu>
References:
<50785745.8090909@icsi.berkeley.edu>
Message-ID:
Hi Andreas
Thank you very much. This information is very helpful.
Xiaolin.
On Fri, Oct 12, 2012 at 12:45 PM, Andreas Stolcke wrote:
> On 10/12/2012 10:17 AM, Xiaolin Xie wrote:
>
>>
>> Hi SRILM users.
>>
>> I am working on a project that needs to calculate the conditional
>> probability of each word given its previous two words in a paragraph. A
>> language model has been trained from a training set. Do you guys have any
>> idea about how to directly calculate the conditional probability
>> p(W_k|W_k-1, W_k-2), using the SRILM toolkit and the trained language
>> model? Thanks a lot! I really appreciate any help you can offer.
>>
>>
> One method is described in
> http://www.speech.sri.com/pipermail/srilm-user/2012q3/001314.html
> Hope that helps,
>
> Andreas
>
>
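For intuition about what the linked method computes, here is a minimal sketch (not SRILM code) of how a backoff model yields p(w | w-2 w-1) from an ARPA file: use the explicit trigram if present, otherwise add the context's backoff weight and recurse on a shorter context. The tiny ARPA model below is invented for illustration.

```python
# Minimal sketch (not SRILM itself) of conditional probability lookup
# in an ARPA-format backoff model. All log values are log10.
ARPA = r"""
\data\
ngram 1=4
ngram 2=2
ngram 3=1

\1-grams:
-0.8 a -0.3
-0.8 b -0.3
-0.8 c -0.3
-0.8 </s>

\2-grams:
-0.5 a b -0.2
-0.5 b c

\3-grams:
-0.4 a b c

\end\
"""

def parse_arpa(text):
    """Return (probs, bows): log10 probabilities and backoff weights."""
    probs, bows, order = {}, {}, 0
    for line in text.splitlines():
        line = line.strip()
        if line.endswith("-grams:"):            # e.g. "\2-grams:"
            order = int(line[1])
            continue
        if not line or line.startswith("\\") or "=" in line:
            continue                            # \data\, \end\, counts
        parts = line.split()
        ngram = tuple(parts[1:1 + order])
        probs[ngram] = float(parts[0])
        if len(parts) > 1 + order:              # trailing backoff weight
            bows[ngram] = float(parts[-1])
    return probs, bows

def cond_logprob(ngram, probs, bows):
    """log10 p(ngram[-1] | ngram[:-1]), with backoff."""
    if ngram in probs:
        return probs[ngram]
    if len(ngram) == 1:
        return float("-inf")                    # OOV; SRILM would use <unk>
    bow = bows.get(ngram[:-1], 0.0)             # absent bow counts as 0
    return bow + cond_logprob(ngram[1:], probs, bows)

probs, bows = parse_arpa(ARPA)
print(cond_logprob(("a", "b", "c"), probs, bows))     # explicit trigram
print(cond_logprob(("b", "c", "</s>"), probs, bows))  # backs off twice
```

This is the same quantity `ngram -debug 2 -ppl` prints per word, just stripped to its essentials.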
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From stolcke at icsi.berkeley.edu Tue Oct 16 09:59:46 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 16 Oct 2012 09:59:46 -0700
Subject: [SRILM User List] lattice rescoring with conventional LM and FLM
In-Reply-To:
References:
Message-ID: <507D9282.1040306@icsi.berkeley.edu>
On 10/13/2012 2:37 AM, yuan liang wrote:
> Hi srilm users,
>
> Now I'm using the 'lattice-tool' to rescore the lattice, my goal is
> using a Factor Language Model(FLM) score to replace the original
> language model score in the word lattice.
>
> 1) First in the baseline system, I used conventional Bigram LM to do
> speech recognition and generate the htk word lattice (we name it
> "Lattice_1"). Then I try to use a conventional Trigram LM to rescore
> the "Lattice_1", using:
>
> "lattice-tool -in-lattice Lattice_1 -unk -vocab [voc_file]
> -read-htk -no-nulls -no-htk-nulls -lm [Trigram_file] -htk-lmscale 15
> -htk-logbase 2.71828183 -posterior-scale 15 -write-htk -out-lattice
> Lattice_2"
Two factors come into play here:
1) When you apply a trigram model to a bigram lattice, the lattice is
expanded so that trigram contexts (i.e., the last two words) are encoded
uniquely at each node. Hence the size increase.
2) The options -no-nulls -no-htk-nulls actually imply a size increase
all on their own because of the way HTK lattices are represented
internally (arcs are encoded as nodes, and then mapped back to arcs on
output). You should not use them.
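Point 1) can be illustrated with a toy example (a sketch of the idea only, not lattice-tool's actual algorithm): to attach a trigram probability to each arc, every node must be split per distinct predecessor word, so that each expanded node encodes a unique two-word history. The lattice below is made up.

```python
# Toy bigram lattice: word-labeled nodes and arcs.
# <s> -> the -> {cat, dog} -> sat -> </s>
words = {0: "<s>", 1: "the", 2: "cat", 3: "dog", 4: "sat", 5: "</s>"}
edges = [(0, 1), (1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]

def expand_trigram(words, edges):
    """Split each node into (node, predecessor-word) copies so every
    expanded node encodes a unique two-word history."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
    start = (0, None)
    new_nodes, new_edges = set(), set()
    frontier, seen = [start], {start}
    while frontier:
        node, prev_word = frontier.pop()
        new_nodes.add((node, prev_word))
        for nxt in succ.get(node, []):
            tgt = (nxt, words[node])   # history entering tgt is now unique
            new_edges.add(((node, prev_word), tgt))
            if tgt not in seen:
                seen.add(tgt)
                frontier.append(tgt)
    return new_nodes, new_edges

nodes2, edges2 = expand_trigram(words, edges)
print(len(words), "nodes ->", len(nodes2), "nodes after expansion")
```

Here the "sat" node is duplicated (one copy reached via "cat", one via "dog"), because p(sat | the cat) and p(sat | the dog) must live on different arcs; on real lattices this duplication compounds, which is why "Lattice_2" grows.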
>
> I just want to use the new Trigram LM score to replace the old LM
> score in "Lattice_1", so I think "Lattice_2" and "Lattice_1" should
> have the same size, with only each word's LM score being different. But
> I found that the size of "Lattice_2" is larger than "Lattice_1". Did I
> miss something? How can I only replace the LM score without expanding
> the size of the lattice?
>
>
>
> 2) I used a Trigram in FLM format to rescore "Lattice_1":
>
> First I converted all word nodes (HTK format) to FLM representation;
>
> Then rescored with:
>
> " lattice-tool -in-lattice Lattice_1 -unk -vocab [voc_file]
> -read-htk -no-nulls -no-htk-nulls -factored -lm
> [FLM_specification_file] -htk-lmscale 15 -htk-logbase 2.71828183
> -posterior-scale 15 -write-htk -out-lattice Lattice_3"
>
> I think "Lattice_2" and "Lattice_3" should be the same, since the
> perplexities of the Trigram and of the Trigram in FLM format are the same.
> However, they are different. Did I miss something?
This is a question about the equivalent encoding of standard word-based
LMs as FLMs, and I'm not an expert here.
However, as a sanity check, I would first do a simple perplexity
computation (ngram -debug 2 -ppl) with both models on some test set and
make sure you get the same word-for-word conditional probabilities. If
not, you can spot where the differences are and present a specific case
of different probabilities to the group for debugging.
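One way to carry out this check mechanically is to parse the two `-debug 2` outputs and diff the per-word probabilities. This is only a sketch: the regex assumes lines of roughly the form `p( the | <s> ) = [2gram] 0.0937 [ -1.028 ]`; verify it against your actual ngram output, and the two sample outputs below are invented.

```python
import re

# Matches the word and the bracketed log10 probability at line end.
LINE = re.compile(r"p\( (\S+) \|.*?\[\s*(-?[0-9.eE+-]+)\s*\]")

def word_logprobs(debug_output):
    """Return (word, log10 prob) pairs from `ngram -debug 2 -ppl` output."""
    return [(m.group(1), float(m.group(2)))
            for m in (LINE.search(l) for l in debug_output.splitlines())
            if m]

# Invented fragments standing in for the two models' outputs:
out_a = """\
p( the | <s> ) \t= [2gram] 0.0937 [ -1.028 ]
p( cat | the ...) \t= [3gram] 0.0012 [ -2.921 ]
"""
out_b = """\
p( the | <s> ) \t= [2gram] 0.0937 [ -1.028 ]
p( cat | the ...) \t= [2gram] 0.0011 [ -2.959 ]
"""

diffs = [(w, pa, pb) for (w, pa), (_, pb)
         in zip(word_logprobs(out_a), word_logprobs(out_b))
         if abs(pa - pb) > 1e-6]
print(diffs)   # each entry is a word where the two models disagree
```

Each surviving entry is a concrete case of differing probabilities that can be posted to the list for debugging.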
>
>
>
> 3) Also I checked the accuracy of the decoding results using
> "Lattice_2" and "Lattice_3"; the results are:
>
>             the Viterbi decoding result is the same;
>             the n-best lists are almost the same, but using "Lattice_2"
> is better than using "Lattice_3";
>             the posterior decoding results are quite different; using
> "Lattice_2" is better than using "Lattice_3".
>
> Did I miss something when I use the FLM to rescore the lattice?
You need to resolve question 2 above first before tackling this one.
Andreas
From yuan at ks.cs.titech.ac.jp Tue Oct 16 17:33:03 2012
From: yuan at ks.cs.titech.ac.jp (yuan liang)
Date: Wed, 17 Oct 2012 09:33:03 +0900
Subject: [SRILM User List] lattice rescoring with conventional LM and FLM
In-Reply-To: <507D9282.1040306@icsi.berkeley.edu>
References:
<507D9282.1040306@icsi.berkeley.edu>
Message-ID:
Hi Andreas,
Thank you very much!
>> 2) I used a Trigram in FLM format to rescore "Lattice_1":
>>
>> First I converted all word nodes (HTK format) to FLM representation;
>>
>> Then rescored with:
>>
>> " lattice-tool -in-lattice Lattice_1 -unk -vocab [voc_file]
>> -read-htk -no-nulls -no-htk-nulls -factored -lm
>> [FLM_specification_file] -htk-lmscale 15 -htk-logbase 2.71828183
>> -posterior-scale 15 -write-htk -out-lattice Lattice_3"
>>
>> I think "Lattice_2" and "Lattice_3" should be the same, since the
>> perplexities of the Trigram and of the Trigram in FLM format are the same.
>> However, they are different. Did I miss something?
>>
>
> This is a question about the equivalent encoding of standard word-based
> LMs as FLMs, and I'm not an expert here.
> However, as a sanity check, I would first do a simple perplexity
> computation (ngram -debug 2 -ppl) with both models on some test set and
> make sure you get the same word-for-word conditional probabilities. If
> not, you can spot where the differences are and present a specific case of
> different probabilities to the group for debugging.
>
>
Actually I did the perplexity test on a test set of 6564 sentences (72854
words). The total perplexity is the same with the standard word-based
Trigram LM as with the FLM Trigram. I also checked the details of the
word-for-word conditional probabilities: of these 72854 words, only 442
words' conditional probabilities are not exactly the same, and even there
the difference is negligible (e.g. 0.00531048 vs. 0.00531049, or
5.38809e-07 vs. 5.38808e-07). So I think we can say both models produce
the same word-for-word conditional probabilities.
I also considered that it may be due to the FLM format: lattice expansion
with the standard Trigram seems to differ from the FLM Trigram. With the
FLM Trigram the expanded lattice is around 300 times larger than with the
standard Trigram, so maybe the expansion works differently. I'm not sure;
I still need to investigate more.
Thank you very much for your advice!
Regards,
Yuan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From stolcke at icsi.berkeley.edu Tue Oct 16 21:52:44 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 16 Oct 2012 21:52:44 -0700
Subject: [SRILM User List] lattice rescoring with conventional LM and FLM
In-Reply-To:
References:
<507D9282.1040306@icsi.berkeley.edu>
Message-ID: <507E399C.8030809@icsi.berkeley.edu>
On 10/16/2012 5:33 PM, yuan liang wrote:
> Hi Andreas,
>
> Thank you very much!
>
>
> 2) I used a Trigram in FLM format to rescore "Lattice_1":
>
> First I converted all word nodes (HTK format) to FLM
> representation;
>
> Then rescored with:
>
> " lattice-tool -in-lattice Lattice_1 -unk -vocab
> [voc_file] -read-htk -no-nulls -no-htk-nulls -factored
> -lm [FLM_specification_file] -htk-lmscale 15 -htk-logbase
> 2.71828183 -posterior-scale 15 -write-htk -out-lattice
> Lattice_3"
>
> I think "Lattice_2" and "Lattice_3" should be the same,
> since the perplexities of the Trigram and of the Trigram in FLM
> format are the same. However, they are different. Did I miss
> something?
>
>
> This is a question about the equivalent encoding of standard
> word-based LMs as FLMs, and I'm not an expert here.
> However, as a sanity check, I would first do a simple perplexity
> computation (ngram -debug 2 -ppl) with both models on some test
> set and make sure you get the same word-for-word conditional
> probabilities. If not, you can spot where the differences are and
> present a specific case of different probabilities to the group
> for debugging.
>
>
> Actually I did the perplexity test on a test set of 6564 sentences
> (72854 words). The total perplexity is the same with the standard
> word-based Trigram LM as with the FLM Trigram. I also checked the
> details of the word-for-word conditional probabilities: of these 72854
> words, only 442 words' conditional probabilities are not exactly the
> same, and even there the difference is negligible (e.g. 0.00531048 vs.
> 0.00531049, or 5.38809e-07 vs. 5.38808e-07). So I think we can say both
> models produce the same word-for-word conditional probabilities.
>
> I also considered that it may be due to the FLM format: lattice
> expansion with the standard Trigram seems to differ from the FLM
> Trigram. With the FLM Trigram the expanded lattice is around 300 times
> larger than with the standard Trigram, so maybe the expansion works
> differently. I'm not sure; I still need to investigate more.
The lattice expansion algorithm makes use of the backoff structure of
the standard LM to minimize the number of nodes that need to be
duplicated to correctly apply the probabilities. The FLM makes more
conservative assumptions and always assumes you need two words of
context, leading to more nodes after expansion. That would explain the
size difference.
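A back-of-the-envelope illustration of this point (all data invented; this is not the actual expansion code): two-word histories without an explicit trigram can share an expanded state, because their probability comes from the backed-off bigram anyway, while a context-blind expander pays for every distinct history.

```python
# Contexts for which the (hypothetical) LM stores explicit trigrams:
explicit_trigram_contexts = {("the", "cat"), ("the", "dog")}
# Distinct two-word histories reaching some lattice node:
histories = [("the", "cat"), ("the", "dog"),
             ("a", "cat"), ("a", "dog"), ("one", "cat")]

# Conservative (FLM-like) expansion: one state per distinct history.
full = len(set(histories))

# Backoff-aware expansion: a history without an explicit trigram
# collapses to its last word, since p(w | h) = bow(h) * p(w | h[-1]).
aware = len({h if h in explicit_trigram_contexts else h[-1:]
             for h in histories})

print("states needed:", full, "(conservative) vs", aware, "(backoff-aware)")
```

With real LMs, where most possible trigrams are absent, the gap between the two counts becomes dramatic, consistent with the ~300x size difference observed.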
You can also check the probabilities in expanded lattices. The command
lattice-tool -in-lattice LATTICE -ppl TEXT -debug 2 ...
will compute the probabilities assigned to the words in TEXT by
traversing the lattice. It is worth checking first that expansion with
FLMs yields the right probabilities.
You say that viterbi decoding gives almost the same results (this
suggests the expansion works correctly), but posterior (confusion
network) decoding doesn't. It is possible there is a problem with
building CNs from lattices with factored vocabularies. I don't think I
ever tried that. It would help to find a minimal test case that shows
the problem.
Andreas
>
>
> Thank you very much for your advice!
>
> Regards,
> Yuan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From yuan at ks.cs.titech.ac.jp Wed Oct 17 03:04:53 2012
From: yuan at ks.cs.titech.ac.jp (yuan liang)
Date: Wed, 17 Oct 2012 19:04:53 +0900
Subject: [SRILM User List] lattice rescoring with conventional LM and FLM
In-Reply-To: <507E399C.8030809@icsi.berkeley.edu>
References:
<507D9282.1040306@icsi.berkeley.edu>
<507E399C.8030809@icsi.berkeley.edu>
Message-ID:
Hi Andreas,
Thank you very much!
I will test more.
Regards,
Yuan
On Wed, Oct 17, 2012 at 1:52 PM, Andreas Stolcke
wrote:
> On 10/16/2012 5:33 PM, yuan liang wrote:
>
> Hi Andreas,
>
> Thank you very much!
>
>
>>> 2) I used a Trigram in FLM format to rescore "Lattice_1":
>>>
>>> First I converted all word nodes (HTK format) to FLM representation;
>>>
>>> Then rescored with:
>>>
>>> " lattice-tool -in-lattice Lattice_1 -unk -vocab [voc_file]
>>> -read-htk -no-nulls -no-htk-nulls -factored -lm
>>> [FLM_specification_file] -htk-lmscale 15 -htk-logbase 2.71828183
>>> -posterior-scale 15 -write-htk -out-lattice Lattice_3"
>>>
>>> I think "Lattice_2" and "Lattice_3" should be the same, since the
>>> perplexities of the Trigram and of the Trigram in FLM format are the same.
>>> However, they are different. Did I miss something?
>>>
>>
>> This is a question about the equivalent encoding of standard word-based
>> LMs as FLMs, and I'm not an expert here.
>> However, as a sanity check, I would first do a simple perplexity
>> computation (ngram -debug 2 -ppl) with both models on some test set and
>> make sure you get the same word-for-word conditional probabilities. If
>> not, you can spot where the differences are and present a specific case of
>> different probabilities to the group for debugging.
>>
>>
> Actually I did the perplexity test on a test set of 6564 sentences
> (72854 words). The total perplexity is the same with the standard
> word-based Trigram LM as with the FLM Trigram. I also checked the
> details of the word-for-word conditional probabilities: of these 72854
> words, only 442 words' conditional probabilities are not exactly the
> same, and even there the difference is negligible (e.g. 0.00531048 vs.
> 0.00531049, or 5.38809e-07 vs. 5.38808e-07). So I think we can say both
> models produce the same word-for-word conditional probabilities.
>
> I also considered that it may be due to the FLM format: lattice
> expansion with the standard Trigram seems to differ from the FLM
> Trigram. With the FLM Trigram the expanded lattice is around 300 times
> larger than with the standard Trigram, so maybe the expansion works
> differently. I'm not sure; I still need to investigate more.
>
>
> The lattice expansion algorithm makes use of the backoff structure of the
> standard LM to minimize the number of nodes that need to be duplicated to
> correctly apply the probabilities. The FLM makes more conservative
> assumptions and always assumes you need two words of context, leading to
> more nodes after expansion. That would explain the size difference.
>
> You can also check the probabilities in expanded lattices. The command
>
> lattice-tool -in-lattice LATTICE -ppl TEXT -debug 2 ...
>
> will compute the probabilities assigned to the words in TEXT by traversing
> the lattice. It is worth checking first that expansion with FLMs yields
> the right probabilities.
>
> You say that viterbi decoding gives almost the same results (this suggests
> the expansion works correctly), but posterior (confusion network) decoding
> doesn't. It is possible there is a problem with building CNs from lattices
> with factored vocabularies. I don't think I ever tried that. It would
> help to find a minimal test case that shows the problem.
>
> Andreas
>
>
>
>
> Thank you very much for your advice!
>
> Regards,
> Yuan
>
>
>
From stolcke at icsi.berkeley.edu Tue Oct 23 09:52:47 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 23 Oct 2012 09:52:47 -0700
Subject: [SRILM User List] Commands not found
In-Reply-To: <1350977625.53790.YahooMailNeo@web194705.mail.sg3.yahoo.com>
References: <1350977625.53790.YahooMailNeo@web194705.mail.sg3.yahoo.com>
Message-ID: <5086CB5F.5000109@icsi.berkeley.edu>
On 10/23/2012 12:33 AM, Don Erick Bonus wrote:
> Hi everyone.
>
> I'm new to SRILM and I have SRILM installed on an Ubuntu machine.
> Based on what I found on the Internet, make World and make test did
> work, displaying a lot of information. However, when I try to run the
> commands in the bin/i686 folder for testing, I always get a COMMAND
> NOT FOUND message. When I try to run man to display the manual for the
> commands, it says "No manual entry ...". I've been searching the
> Internet for a solution and can't find one.
>
> Please help me with this one... I need to build a statistics-based
> spell and grammar checker for Tagalog as a project. You may also
> suggest steps for how I can do this, since I'm new to statistical NLP.
>
> Your help will be highly appreciated. Thanks.
> Erick
Try invoking ./bin/i686/ngram -version. Assuming that works, the only
problem is that your shell's executable search path is not set to
include $SRILM/bin and $SRILM/bin/i686 . This is item 6 in the INSTALL
instructions. Please consult a local Linux/Unix expert if needed.
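For reference, item 6 of the INSTALL file amounts to something like the following (a sketch assuming SRILM is unpacked at $HOME/srilm and the machine type is i686; adjust both to your setup, e.g. to whatever `sbin/machine-type` reports):

```shell
# Add to ~/.bashrc or ~/.profile, then open a new shell:
export SRILM=$HOME/srilm
export PATH=$PATH:$SRILM/bin:$SRILM/bin/i686
# Afterwards `ngram -version` should work from any directory.
# For the man pages, also add:
export MANPATH=$MANPATH:$SRILM/man
```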
Andreas
From lisepul at gmail.com Fri Oct 26 05:36:44 2012
From: lisepul at gmail.com (Lianet Sepulveda Torres)
Date: Fri, 26 Oct 2012 10:36:44 -0200
Subject: [SRILM User List] SRILM install problem
Message-ID:
Hi,
I tried to install SRILM on Windows 7 using Cygwin.
The following errors show up when I run make World:
mkdir -p include lib bin
make init
make[1]: Entering directory `/cygdrive/c/cygwin/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
    (cd $subdir/src; make SRILM=/cygdrive/c/cygwin/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= init) || exit 1; \
done
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/misc/src'
cd ..; /cygdrive/c/cygwin/srilm/sbin/make-standard-directories
make ../obj/cygwin/STAMP ../bin/cygwin/STAMP
make[3]: Entering directory `/cygdrive/c/cygwin/srilm/misc/src'
make[3]: `../obj/cygwin/STAMP' is up to date.
make[3]: `../bin/cygwin/STAMP' is up to date.
make[3]: Leaving directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/dstruct/src'
cd ..; /cygdrive/c/cygwin/srilm/sbin/make-standard-directories
make ../obj/cygwin/STAMP ../bin/cygwin/STAMP
make[3]: Entering directory `/cygdrive/c/cygwin/srilm/dstruct/src'
make[3]: `../obj/cygwin/STAMP' is up to date.
make[3]: `../bin/cygwin/STAMP' is up to date.
make[3]: Leaving directory `/cygdrive/c/cygwin/srilm/dstruct/src'
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/dstruct/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/lm/src'
cd ..; /cygdrive/c/cygwin/srilm/sbin/make-standard-directories
make ../obj/cygwin/STAMP ../bin/cygwin/STAMP
make[3]: Entering directory `/cygdrive/c/cygwin/srilm/lm/src'
make[3]: `../obj/cygwin/STAMP' is up to date.
make[3]: `../bin/cygwin/STAMP' is up to date.
make[3]: Leaving directory `/cygdrive/c/cygwin/srilm/lm/src'
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/lm/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/flm/src'
cd ..; /cygdrive/c/cygwin/srilm/sbin/make-standard-directories
make ../obj/cygwin/STAMP ../bin/cygwin/STAMP
make[3]: Entering directory `/cygdrive/c/cygwin/srilm/flm/src'
make[3]: `../obj/cygwin/STAMP' is up to date.
make[3]: `../bin/cygwin/STAMP' is up to date.
make[3]: Leaving directory `/cygdrive/c/cygwin/srilm/flm/src'
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/flm/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/lattice/src'
cd ..; /cygdrive/c/cygwin/srilm/sbin/make-standard-directories
make ../obj/cygwin/STAMP ../bin/cygwin/STAMP
make[3]: Entering directory `/cygdrive/c/cygwin/srilm/lattice/src'
make[3]: `../obj/cygwin/STAMP' is up to date.
make[3]: `../bin/cygwin/STAMP' is up to date.
make[3]: Leaving directory `/cygdrive/c/cygwin/srilm/lattice/src'
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/lattice/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/utils/src'
cd ..; /cygdrive/c/cygwin/srilm/sbin/make-standard-directories
make ../obj/cygwin/STAMP ../bin/cygwin/STAMP
make[3]: Entering directory `/cygdrive/c/cygwin/srilm/utils/src'
make[3]: `../obj/cygwin/STAMP' is up to date.
make[3]: `../bin/cygwin/STAMP' is up to date.
make[3]: Leaving directory `/cygdrive/c/cygwin/srilm/utils/src'
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/utils/src'
make[1]: Leaving directory `/cygdrive/c/cygwin/srilm'
make release-headers
make[1]: Entering directory `/cygdrive/c/cygwin/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
    (cd $subdir/src; make SRILM=/cygdrive/c/cygwin/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= release-headers) || exit 1; \
done
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Nothing to be done for `release-headers'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/dstruct/src'
make[2]: Nothing to be done for `release-headers'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/dstruct/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/lm/src'
make[2]: Nothing to be done for `release-headers'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/lm/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/flm/src'
make[2]: Nothing to be done for `release-headers'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/flm/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/lattice/src'
make[2]: Nothing to be done for `release-headers'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/lattice/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/utils/src'
make[2]: Nothing to be done for `release-headers'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/utils/src'
make[1]: Leaving directory `/cygdrive/c/cygwin/srilm'
make depend
make[1]: Entering directory `/cygdrive/c/cygwin/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
    (cd $subdir/src; make SRILM=/cygdrive/c/cygwin/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= depend) || exit 1; \
done
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/misc/src'
rm -f Dependencies.cygwin
C:/cygwin/bin -I. -I../../include -MM ./option.c ./zio.c ./fcheck.c
./fake-rand48.c ./version.c ./ztest.c |
sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin
/bin/sh: C:/cygwin/bin: is a directory
C:/cygwin/bin -I. -I../../include -MM ./Debug.cc ./File.cc
./MStringTokUtil.cc ./testFile.cc |
sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin
/bin/sh: C:/cygwin/bin: is a directory
/cygdrive/c/cygwin/srilm/sbin/generate-program-dependencies ../bin/cygwin
../obj/cygwin ".exe" ztest testFile |
sed -e "s&\.o&.o&g" >> Dependencies.cygwin
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/dstruct/src'
rm -f Dependencies.cygwin
C:/cygwin/bin -I. -I../../include -MM ./qsort.c ./BlockMalloc.c
./maxalloc.c |
sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin
/bin/sh: C:/cygwin/bin: is a directory
C:/cygwin/bin -I. -I../../include -MM ./MemStats.cc ./LHashTrie.cc
./SArrayTrie.cc ./Array.cc ./IntervalHeap.cc ./Map.cc ./SArray.cc
./LHash.cc ./Map2.cc ./Trie.cc ./CachedMem.cc ./testArray.cc ./testMap.cc
./benchHash.cc ./testHash.cc ./testSizes.cc ./testCachedMem.cc
./testBlockMalloc.cc |
sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin
/bin/sh: C:/cygwin/bin: is a directory
/cygdrive/c/cygwin/srilm/sbin/generate-program-dependencies ../bin/cygwin
../obj/cygwin ".exe" maxalloc testArray testMap benchHash testHash
testSizes testCachedMem testBlockMalloc |
sed -e "s&\.o&.o&g" >> Dependencies.cygwin
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/dstruct/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/lm/src'
rm -f Dependencies.cygwin
C:/cygwin/bin -I. -I../../include -MM ./matherr.c |
sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin
/bin/sh: C:/cygwin/bin: is a directory
C:/cygwin/bin -I. -I../../include -MM ./Prob.cc ./Counts.cc ./XCount.cc
./Vocab.cc ./VocabMap.cc ./VocabMultiMap.cc ./VocabDistance.cc
./SubVocab.cc ./MultiwordVocab.cc ./TextStats.cc ./LM.cc ./LMClient.cc
./LMStats.cc ./RefList.cc ./Bleu.cc ./NBest.cc ./NBestSet.cc ./NgramLM.cc
./NgramStatsInt.cc ./NgramStatsShort.cc ./NgramStatsLong.cc
./NgramStatsLongLong.cc ./NgramStatsFloat.cc ./NgramStatsDouble.cc
./NgramStatsXCount.cc ./NgramCountLM.cc ./Discount.cc ./ClassNgram.cc
./SimpleClassNgram.cc ./DFNgram.cc ./SkipNgram.cc ./HiddenNgram.cc
./HiddenSNgram.cc ./VarNgram.cc ./DecipherNgram.cc ./TaggedVocab.cc
./TaggedNgram.cc ./TaggedNgramStats.cc ./StopNgram.cc ./StopNgramStats.cc
./MultiwordLM.cc ./NonzeroLM.cc ./BayesMix.cc ./LoglinearMix.cc
./AdaptiveMix.cc ./AdaptiveMarginals.cc ./CacheLM.cc ./DynamicLM.cc
./HMMofNgrams.cc ./WordAlign.cc ./WordLattice.cc ./WordMesh.cc
./simpleTrigram.cc ./NgramStats.cc ./Trellis.cc ./testBinaryCounts.cc
./testHash.cc ./testProb.cc ./testXCount.cc ./testParseFloat.cc
./testVocabDistance.cc ./testNgram.cc ./testNgramAlloc.cc
./testMultiReadLM.cc ./hoeffding.cc ./tolower.cc ./testLattice.cc
./testError.cc ./testNBest.cc ./testMix.cc ./ngram.cc ./ngram-count.cc
./ngram-merge.cc ./ngram-class.cc ./disambig.cc ./anti-ngram.cc
./nbest-lattice.cc ./nbest-mix.cc ./nbest-optimize.cc
./nbest-pron-score.cc ./segment.cc ./segment-nbest.cc ./hidden-ngram.cc
./multi-ngram.cc |
sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin
/bin/sh: C:/cygwin/bin: is a directory
/cygdrive/c/cygwin/srilm/sbin/generate-program-dependencies ../bin/cygwin
../obj/cygwin ".exe" testBinaryCounts testHash testProb testXCount
testParseFloat testVocabDistance testNgram testNgramAlloc testMultiReadLM
hoeffding tolower testLattice testError testNBest testMix ngram
ngram-count ngram-merge ngram-class disambig anti-ngram nbest-lattice
nbest-mix nbest-optimize nbest-pron-score segment segment-nbest
hidden-ngram multi-ngram |
sed -e "s&\.o&.o&g" >> Dependencies.cygwin
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/lm/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/flm/src'
rm -f Dependencies.cygwin
C:/cygwin/bin -I. -I../../include -MM ./FDiscount.cc ./FNgramStats.cc
./FNgramStatsInt.cc ./FNgramSpecs.cc ./FNgramSpecsInt.cc
./FactoredVocab.cc ./FNgramLM.cc ./ProductVocab.cc ./ProductNgram.cc
./wmatrix.cc ./pngram.cc ./fngram-count.cc ./fngram.cc |
sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin
/bin/sh: C:/cygwin/bin: is a directory
/cygdrive/c/cygwin/srilm/sbin/generate-program-dependencies ../bin/cygwin
../obj/cygwin ".exe" pngram fngram-count fngram |
sed -e "s&\.o&.o&g" >> Dependencies.cygwin
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/flm/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/lattice/src'
rm -f Dependencies.cygwin
C:/cygwin/bin -I. -I../../include -MM ./Lattice.cc ./LatticeAlign.cc
./LatticeExpand.cc ./LatticeIndex.cc ./LatticeNBest.cc ./LatticeNgrams.cc
./LatticeReduce.cc ./HTKLattice.cc ./LatticeLM.cc ./LatticeDecode.cc
./testLattice.cc ./lattice-tool.cc |
sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin
/bin/sh: C:/cygwin/bin: is a directory
/cygdrive/c/cygwin/srilm/sbin/generate-program-dependencies ../bin/cygwin
../obj/cygwin ".exe" testLattice lattice-tool |
sed -e "s&\.o&.o&g" >> Dependencies.cygwin
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/lattice/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/utils/src'
rm -f Dependencies.cygwin
/cygdrive/c/cygwin/srilm/sbin/generate-program-dependencies ../bin/cygwin
../obj/cygwin ".exe" |
sed -e "s&\.o&.o&g" >> Dependencies.cygwin
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/utils/src'
make[1]: Leaving directory `/cygdrive/c/cygwin/srilm'
make release-libraries
make[1]: Entering directory `/cygdrive/c/cygwin/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
    (cd $subdir/src; make SRILM=/cygdrive/c/cygwin/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= release-libraries) || exit 1; \
done
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Nothing to be done for `release-libraries'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/dstruct/src'
make[2]: Nothing to be done for `release-libraries'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/dstruct/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/lm/src'
make[2]: Nothing to be done for `release-libraries'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/lm/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/flm/src'
make[2]: Nothing to be done for `release-libraries'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/flm/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/lattice/src'
make[2]: Nothing to be done for `release-libraries'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/lattice/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/utils/src'
make[2]: Nothing to be done for `release-libraries'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/utils/src'
make[1]: Leaving directory `/cygdrive/c/cygwin/srilm'
make release-programs
make[1]: Entering directory `/cygdrive/c/cygwin/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
    (cd $subdir/src; make SRILM=/cygdrive/c/cygwin/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= release-programs) || exit 1; \
done
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Nothing to be done for `release-programs'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/dstruct/src'
C:/cygwin/bin -I. -I../../include -L../../lib/cygwin -g -O2 -o
../bin/cygwin/maxalloc.exe ../obj/cygwin/maxalloc.o
../obj/cygwin/libdstruct.a -lm ../../lib/cygwin/libmisc.a -lm C:/cygwin/bin
make[2]: C:/cygwin/bin: Command not found
/cygdrive/c/cygwin/srilm/common/Makefile.common.targets:108: recipe for
target `../bin/cygwin/maxalloc.exe' failed
make[2]: *** [../bin/cygwin/maxalloc.exe] Error 127
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/dstruct/src'
Makefile:105: recipe for target `release-programs' failed
make[1]: *** [release-programs] Error 1
make[1]: Leaving directory `/cygdrive/c/cygwin/srilm'
Makefile:54: recipe for target `World' failed
make: *** [World] Error 2
Any ideas?
Regards,
Lisepul
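[For what it's worth, the log itself hints at the cause: every failing line starts with `C:/cygwin/bin', i.e. a directory is being executed where gcc/g++ should appear, which suggests the compiler variables (CC/CXX) ended up misdefined. A hedged checklist, not a guaranteed fix:]

```shell
# Check that Cygwin's compilers are actually installed and on PATH:
which gcc g++
gcc --version
# Look for stray CC/CXX settings in the environment that could replace
# the compiler with a bare directory name:
env | grep -E '^(CC|CXX)='
# Then rebuild from a clean tree:
make cleanest
make World
```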
From jmelvinjose73 at yahoo.com Sun Oct 28 17:24:08 2012
From: jmelvinjose73 at yahoo.com (Melvin Jose)
Date: Sun, 28 Oct 2012 17:24:08 -0700 (PDT)
Subject: [SRILM User List] FLM Training takes too long!
Message-ID: <1351470248.78427.YahooMailNeo@web125501.mail.ne1.yahoo.com>
Hey,
I am presently working with Tamil, a morphologically rich language. I am trying to build an FLM from approximately 3 million entries, but it has been running for more than a day and a half now. The FLM specification is
W : W(-1) W(-2) B(-1) S(-1) using generalized backoff, where B is the word base and S is the suffix.
Below is the output of -debug 2
warning: distributing 0.0989813 left-over probability mass over all 577519 words
discarded 1 0x4-gram probs predicting pseudo-events
discarded 1587186 0x4-gram probs discounted to zero
discarded 1 0x8-gram probs predicting pseudo-events
discarded 1 0xc-gram probs predicting pseudo-events
discarded 4721615 0xc-gram probs discounted to zero
Starting estimation of general graph-backoff node: LM 0 Node 0xC, children: 0x8 0x4
Finished estimation of multi-child graph-backoff node: LM 0 Node 0xC
This was the last message I received, a day and a half ago. Is it normal for it to take so long? I read that Katrin had no problem training on 5 million entries. Did it take this long? I am using a cluster in my lab for the computation, so there shouldn't be a problem with memory or computational power.
Is there any way I can tell fngram-count to use as much memory as it wants, or to parallelize the computation?
Thanks,
Melvin
From stolcke at icsi.berkeley.edu Sun Oct 28 23:16:35 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sun, 28 Oct 2012 23:16:35 -0700
Subject: [SRILM User List] FLM Training takes too long!
In-Reply-To: <1351470248.78427.YahooMailNeo@web125501.mail.ne1.yahoo.com>
References: <1351470248.78427.YahooMailNeo@web125501.mail.ne1.yahoo.com>
Message-ID: <508E1F43.2050405@icsi.berkeley.edu>
On 10/28/2012 5:24 PM, Melvin Jose wrote:
>
>
> Hey,
>
> I am presently working with Tamil, a morphologically rich language. I
> am trying to build an FLM from approximately 3 million entries, but it
> has been running for more than a day and a half now. The FLM
> specification is
>
> W : W(-1) W(-2) B(-1) S(-1) using generalized backoff, where B is the
> word base and S is the suffix.
>
> Below is the output of -debug 2
>
> warning: distributing 0.0989813 left-over probability mass over all
> 577519 words
> discarded 1 0x4-gram probs predicting pseudo-events
> discarded 1587186 0x4-gram probs discounted to zero
> discarded 1 0x8-gram probs predicting pseudo-events
> discarded 1 0xc-gram probs predicting pseudo-events
> discarded 4721615 0xc-gram probs discounted to zero
> Starting estimation of general graph-backoff node: LM 0 Node 0xC,
> children: 0x8 0x4
> Finished estimation of multi-child graph-backoff node: LM 0 Node 0xC
>
> This was the last message I received, a day and a half ago. Is it
> normal for it to take so long? I read that Katrin had no problem
> training on 5 million entries. Did it take this long? I am using a
> cluster in my lab for the computation, so there shouldn't be a
> problem with memory or computational power.
I have no experience myself to tell you how long it should take.
However, in cases like this I would run some experiments increasing the
amount of data from, say, 10k to 100k entries to see how the runtime grows as
a function of input size. Then you can extrapolate to the full data set
instead of just waiting.
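The extrapolation suggested above can be sketched as follows (a hypothetical Python illustration, not part of SRILM): time the tool on increasing subsets, fit a power law in log-log space, and project to the full corpus. The sizes and times in the example are made up.

```python
import math

def extrapolate_runtime(sizes, times, target_size):
    """Fit t = a * n^b by least squares in log-log space and project.

    sizes: input sizes (e.g. entry counts) of the timing runs
    times: measured wall-clock times for those runs
    target_size: full data set size to extrapolate to
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    k = len(xs)
    mx = sum(xs) / k
    my = sum(ys) / k
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a * target_size ** b

# Hypothetical example: runs at 10k, 50k, and 100k entries took
# 2, 12, and 26 minutes; project to 3 million entries.
est_minutes = extrapolate_runtime([10_000, 50_000, 100_000],
                                  [2.0, 12.0, 26.0], 3_000_000)
```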
>
> Is there any way by which I can tell the fngram-count to utilize as
> much memory as it wants or parallelize the computation?
It will take as much memory as it needs to, and there is no easy way to
parallelize.
Andreas
From chenmengdx at gmail.com Mon Oct 29 03:09:55 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Mon, 29 Oct 2012 18:09:55 +0800
Subject: [SRILM User List] About the -prune option
Message-ID:
Hi, I need to obtain a small LM for ASR decoding by pruning from a large
LM. The original large LM contains about 1.6 billion n-grams, and the small
one should contain about 30 million n-grams. The -prune option in SRILM
can do this. However, I want to ask whether pruning in one pass is the same
as pruning in several passes. For example, there are two approaches to this
pruning task.
1) Set a proper value and prune in one pass to get the target LM:
ngram -lm LM_Large -prune 1e-9 -order 5 -write-lm LM_Small
2) Set several values and prune gradually to get the target LM:
ngram -lm LM_Large -prune 1e-10 -order 5 -write-lm LM_Small1
... ...
ngram -lm LM_Small1 -prune 1e-9 -order 5 -write-lm LM_Small
Are there any differences between the above two approaches? Does the pruned LM
have a lower perplexity with the second method?
Thanks!
Meng CHEN
From tsuki_stefy at yahoo.com Mon Oct 29 09:15:25 2012
From: tsuki_stefy at yahoo.com (Stefy D.)
Date: Mon, 29 Oct 2012 09:15:25 -0700 (PDT)
Subject: [SRILM User List] lm interpolation
Message-ID: <1351527325.44053.YahooMailNeo@web112503.mail.gq1.yahoo.com>
Hello everyone,
I am trying to interpolate 2 language models because I want to do an experiment in domain adaptation. Below are the commands that I used. When I try to compute lambda, I get the error "mismatch in number of samples (60001 != 67708)". I don't know what to fix... please help me.
~/local/tools/srilm/bin/i686/ngram -order 3 -unk -lm
~/local/test1/lm/lm1.lm -ppl ~/local/test1/lm/de-en_corpus1.lowercased.en -debug 2 > ppl1.ppl
~/local/tools/srilm/bin/i686/ngram -order 3 -unk -lm ~/local/test2/lm/lm2.lm -ppl ~/local/test2/lm/de-en_corpus2.lowercased.en -debug 2 > ppl2.ppl
~/local/tools/srilm/bin/i686/compute-best-mix ~/local/test1/ppl1.ppl ~/local/test2/ppl2.ppl
The ppl1.ppl file contains: "2082 sentences, 57919 words, 0 OOVs
0 zeroprobs, logprob= -100036 ppl= 46.4762 ppl1= 53.3534" and
the ppl2.ppl file contains: "2091 sentences, 65617 words, 0 OOVs
0 zeroprobs, logprob= -89850.8 ppl= 21.2341 ppl1= 23.4057"
I apologise for asking such a basic question...I have just started reading about machine translation.
Thank you very much for your time!
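Incidentally, the numbers in that error message are consistent with the ppl files quoted above: 60001 = 57919 words + 2082 sentences from ppl1.ppl, and 67708 = 65617 + 2091 from ppl2.ppl, since compute-best-mix expects one probability per token from each file. The ppl figures themselves follow from the logprob; a quick Python check using the figures quoted above (which are printed rounded, so the match is approximate):

```python
# Figures from ppl1.ppl above: 2082 sentences, 57919 words,
# 0 OOVs, 0 zeroprobs, logprob = -100036
sentences, words, logprob = 2082, 57919, -100036.0

# SRILM: ppl counts end-of-sentence events, ppl1 excludes them
# (denominator is words - OOVs - zeroprobs [+ sentences for ppl]).
ppl = 10 ** (-logprob / (words + sentences))
ppl1 = 10 ** (-logprob / words)
# These agree with the quoted 46.4762 and 53.3534 to within ~0.01,
# the residue of the rounded logprob.
```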
From stolcke at icsi.berkeley.edu Mon Oct 29 12:44:11 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 29 Oct 2012 12:44:11 -0700
Subject: [SRILM User List] About the -prune option
In-Reply-To:
References:
Message-ID: <508EDC8B.90903@icsi.berkeley.edu>
On 10/29/2012 3:09 AM, Meng Chen wrote:
> Hi, I need to obtain a small LM for ASR decoding by pruning from a
> large LM. The original large LM contains about 1.6 billion n-grams,
> and the small one should contain about 30 million n-grams. The -prune
> option in SRILM can do this. However, I want to ask whether pruning in
> one pass is the same as pruning in several passes. For example, there
> are two approaches to this pruning task.
>
> 1) Set a proper value and prune in one pass to get the target LM:
> ngram -lm LM_Large -prune 1e-9 -order 5 -write-lm LM_Small
>
> 2) Set several values and prune gradually to get the target LM:
> ngram -lm LM_Large -prune 1e-10 -order 5 -write-lm LM_Small1
> ... ...
> ngram -lm LM_Small1 -prune 1e-9 -order 5 -write-lm LM_Small
>
> Are there any differences between the above two approaches? Does the
> pruned LM have a lower perplexity with the second method?
Pruning tries to minimize the cross-entropy between the original and the
pruned model. Therefore, you can expect the best results if you do the
pruning in one step (approach 1), since then you have the original model
to compare against for all pruning decisions (at the n-gram level). I
have not investigated how much worse approach 2 would do, so it might be
just fine in practice.
Andreas
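The idea behind the per-n-gram pruning decision can be sketched roughly as follows. This is a deliberately simplified Python illustration, not SRILM's implementation: it scores each n-gram by the history-weighted log-probability change of backing it off, and omits the backoff-weight adjustment term that the real entropy-based criterion also includes.

```python
import math

def prune_scores(p_hist, p_cond, p_backoff):
    """Simplified relative-entropy score for dropping each n-gram.

    p_hist:    probability of each history h
    p_cond:    explicit probabilities p(w|h), keyed by (h, w)
    p_backoff: lower-order probabilities p(w)
    """
    return {(h, w): p_hist[h] * p * abs(math.log(p) - math.log(p_backoff[w]))
            for (h, w), p in p_cond.items()}

def prune(p_hist, p_cond, p_backoff, threshold):
    """Keep only the n-grams whose removal would cost at least threshold."""
    s = prune_scores(p_hist, p_cond, p_backoff)
    return {hw: p for hw, p in p_cond.items() if s[hw] >= threshold}
```

Under approach 2, the second pass computes such scores against the intermediate pruned model rather than the original, which is why one-step pruning is expected to track the original model more closely.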
From stolcke at icsi.berkeley.edu Mon Oct 29 12:46:33 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 29 Oct 2012 12:46:33 -0700
Subject: [SRILM User List] lm interpolation
In-Reply-To: <1351527325.44053.YahooMailNeo@web112503.mail.gq1.yahoo.com>
References: <1351527325.44053.YahooMailNeo@web112503.mail.gq1.yahoo.com>
Message-ID: <508EDD19.9060901@icsi.berkeley.edu>
On 10/29/2012 9:15 AM, Stefy D. wrote:
> Hello everyone,
>
> I am trying to interpolate 2 language models because I want to do an
> experiment in domain adaptation. Below are the commands that I used.
> When I try to compute lambda, I get the error "mismatch in number of
> samples (60001 != 67708)". I don't know what to fix... please help me.
>
> ~/local/tools/srilm/bin/i686/ngram -order 3 -unk -lm
> ~/local/test1/lm/lm1.lm -ppl
> ~/local/test1/lm/de-en_corpus1.lowercased.en -debug 2 > ppl1.ppl
> ~/local/tools/srilm/bin/i686/ngram -order 3 -unk -lm
> ~/local/test2/lm/lm2.lm -ppl
> ~/local/test2/lm/de-en_corpus2.lowercased.en -debug 2 > ppl2.ppl
> ~/local/tools/srilm/bin/i686/compute-best-mix ~/local/test1/ppl1.ppl
> ~/local/test2/ppl2.ppl
You need to collect ppl1.ppl and ppl2.ppl on the SAME EXACT DATA. Same
data, different models. compute-best-mix will then find the interpolation
weights that minimize the perplexity of the combined model on that data.
Andreas
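compute-best-mix works from the aligned per-word probabilities in the two ppl files, which is why the sample counts must match. A minimal sketch of that weight estimation via EM (hypothetical Python, not SRILM's actual code):

```python
def best_mix(probs1, probs2, iters=200):
    """EM for the weight lam maximizing sum(log(lam*p1 + (1-lam)*p2)).

    probs1, probs2: per-word probabilities from the two models,
    computed on the SAME data so the sequences align word for word.
    """
    lam = 0.5
    for _ in range(iters):
        # E-step: posterior probability that model 1 generated each word
        post = [lam * p1 / (lam * p1 + (1 - lam) * p2)
                for p1, p2 in zip(probs1, probs2)]
        # M-step: the new weight is the average posterior
        lam = sum(post) / len(post)
    return lam
```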
From akmalcuet00 at yahoo.com Sat Nov 3 11:10:01 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Sat, 3 Nov 2012 11:10:01 -0700 (PDT)
Subject: [SRILM User List] Fw: nbest-error option
In-Reply-To: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
Message-ID: <1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
Hi,
I have found an error when I pass nbest-lattice options to nbest-error in the nbest scripts.
I use the following command:
/srilm/bin/./nbest-error nbestfilelist refs [-wer]
It gives the error: line 44: nbest-lattice: command not found
gawk: cmd. line 10: fatal: division by zero attempted.
Could anyone please tell me where the fault is?
By the way, I want to compute the WER of a set of N-best lists.
Thanks
Best Regards
Akmal
From venkataraman.anand at gmail.com Sat Nov 3 11:30:50 2012
From: venkataraman.anand at gmail.com (Anand Venkataraman)
Date: Sat, 3 Nov 2012 11:30:50 -0700
Subject: [SRILM User List] Fw: nbest-error option
In-Reply-To: <1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
Message-ID:
Because you invoke nbest-error using the full path, I suspect that it's not
in your $PATH environment variable. It needs to be because the script
nbest-error invokes nbest-lattice directly. Try (bash):
export PATH=/srilm/bin:$PATH; nbest-error nbestfilelist refs [-wer]
HTH
&
On Sat, Nov 3, 2012 at 11:10 AM, Md. Akmal Haidar wrote:
>
> Hi,
>
> I have found an error when I pass nbest-lattice options to the nbest-error
> in the nbest-scripts.
>
> I use the following command:
> /srilm/bin/./nbest-error nbestfilelist refs [-wer]
>
> It gives the error line 44: nbest-lattice command not found
> gawk: cmd. line 10: fatal: division by zero attempted.
>
> Could anyone please tell where is the fault?
> By the way, I want to compute the WER of a set of N-best list.
>
> Thanks
> Best Regards
> Akmal
>
>
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
>
From akmalcuet00 at yahoo.com Sun Nov 4 12:52:05 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Sun, 4 Nov 2012 12:52:05 -0800 (PST)
Subject: [SRILM User List] Fw: nbest-error option
In-Reply-To:
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
Message-ID: <1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
Thanks Anand,
It works now.
But I found the same WER for the original n-best list and the rescored nbest list.
For rescoring, I use the following command:
rescore-decipher nbestfilelist new_rescored_nbestlist_dir -lm updated_lm
Is there anything wrong in the above command?
Thanks
Best Regards
Akmal
________________________________
From: Anand Venkataraman
To: Md. Akmal Haidar
Cc: "srilm-user at speech.sri.com"
Sent: Saturday, November 3, 2012 2:30:50 PM
Subject: Re: [SRILM User List] Fw: nbest-error option
Because you invoke nbest-error using the full path, I suspect that it's not in your $PATH environment variable. It needs to be because the script nbest-error invokes nbest-lattice directly. Try (bash):
export PATH=/srilm/bin:$PATH; nbest-error nbestfilelist refs [-wer]
HTH
&
On Sat, Nov 3, 2012 at 11:10 AM, Md. Akmal Haidar wrote:
>
>Hi,
>
>
>I have found an error when I pass nbest-lattice options to the nbest-error in the nbest-scripts.
>
>
>I use the following command:
>
>/srilm/bin/./nbest-error nbestfilelist refs [-wer]
>
>
>It gives the error line 44: nbest-lattice command not found
>gawk: cmd. line 10: fatal: division by zero attempted.
>
>
>
>Could anyone please tell where is the fault?
>
>By the way, I want to compute the WER of a set of N-best list.
>
>
>Thanks
>Best Regards
>Akmal
>
>
>
From akmalcuet00 at yahoo.com Sun Nov 4 13:40:20 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Sun, 4 Nov 2012 13:40:20 -0800 (PST)
Subject: [SRILM User List] rescore-decipher option
In-Reply-To: <1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
Message-ID: <1352065220.65029.YahooMailNeo@web161004.mail.bf1.yahoo.com>
Hi,
I have found the same WER for the original n-best list and the rescored nbest list.
For rescoring, I use the following command:
rescore-decipher nbestfilelist new_rescored_nbestlist_dir -lm updated_lm
Is there anything wrong in the above command?
Thanks
Best Regards
Akmal
________________________________
From: Anand Venkataraman
To: Md. Akmal Haidar
Cc: "srilm-user at speech.sri.com"
Sent: Saturday, November 3, 2012 2:30:50 PM
Subject: Re: [SRILM User List] Fw: nbest-error option
Because you invoke nbest-error using the full path, I suspect that it's not in your $PATH environment variable. It needs to be because the script nbest-error invokes nbest-lattice directly. Try (bash):
export PATH=/srilm/bin:$PATH; nbest-error nbestfilelist refs [-wer]
HTH
&
On Sat, Nov 3, 2012 at 11:10 AM, Md. Akmal Haidar wrote:
>
>Hi,
>
>
>I have found an error when I pass nbest-lattice options to the nbest-error in the nbest-scripts.
>
>
>I use the following command:
>
>/srilm/bin/./nbest-error nbestfilelist refs [-wer]
>
>
>It gives the error line 44: nbest-lattice command not found
>gawk: cmd. line 10: fatal: division by zero attempted.
>
>
>
>Could anyone please tell where is the fault?
>
>By the way, I want to compute the WER of a set of N-best list.
>
>
>Thanks
>Best Regards
>Akmal
>
>
>
From akmalcuet00 at yahoo.com Sun Nov 4 15:36:28 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Sun, 4 Nov 2012 15:36:28 -0800 (PST)
Subject: [SRILM User List] nbest-lattice
In-Reply-To: <1352065220.65029.YahooMailNeo@web161004.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<1352065220.65029.YahooMailNeo@web161004.mail.bf1.yahoo.com>
Message-ID: <1352072188.43964.YahooMailNeo@web161004.mail.bf1.yahoo.com>
Hi,
Is there any way to find the overall WER using nbest-lattice?
Akmal
From stolcke at icsi.berkeley.edu Sun Nov 4 16:02:38 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sun, 04 Nov 2012 16:02:38 -0800
Subject: [SRILM User List] Fw: nbest-error option
In-Reply-To: <1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
Message-ID: <5097021E.5090402@icsi.berkeley.edu>
On 11/4/2012 12:52 PM, Md. Akmal Haidar wrote:
> Thanks Anand,
> It works now.
> But I found the same WER for the original n-best list and the rescored
> nbest list.
>
> For rescoring, I use the following command:
> rescore-decipher nbestfilelist new_rescored_nbestlist_dir -lm updated_lm
nbest-error computes the "N-best error rate", meaning the best possible
error rate that can be achieved by picking a hypothesis from anywhere
among the N best. This is sometimes called the "oracle" error. It
doesn't change as a result of rescoring.
What you probably want is the error rate of the highest-scoring
hypothesis. For this, you first extract the highest-scoring hyps using
the "rescore-reweight" command (see the nbest-scripts(1) man page), then
score them using your favorite WER scoring program. If you have NIST
sclite installed, you could use the compute-sclite wrapper, which takes
care of format differences.
Andreas
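The distinction between the two error rates can be made concrete with a small sketch (hypothetical Python; the function names are illustrative): a word-level edit distance, the oracle error that nbest-error reports, and the 1-best error that rescoring can actually change.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (sub/ins/del all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def oracle_errors(ref, nbest):
    """Best achievable errors picking from anywhere in the N-best
    (the 'oracle' error); unchanged by rescoring."""
    return min(edit_distance(ref, hyp) for hyp in nbest)

def one_best_errors(ref, nbest, scores):
    """Errors of the highest-scoring hypothesis; this is what
    rescoring with a new LM can improve."""
    best_hyp = max(zip(scores, nbest))[1]
    return edit_distance(ref, best_hyp)
```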
From akmalcuet00 at yahoo.com Mon Nov 5 07:53:07 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Mon, 5 Nov 2012 07:53:07 -0800 (PST)
Subject: [SRILM User List] Fw: nbest-error option
In-Reply-To: <5097021E.5090402@icsi.berkeley.edu>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<5097021E.5090402@icsi.berkeley.edu>
Message-ID: <1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
Hi ,
Thanks.
I used the NIST sclite command:
./sclite -r refs -i wsj -h hyps -o dtl
but I got the same scoring result for the baseline 1-best hypothesis and the updated 1-best hypothesis obtained with the updated LM using rescore-decipher and rescore-reweight.
I tried compute-sclite of SRILM with the command: compute-sclite -r refs -h hyps -i wsj -o dtl
but it shows: line 247: sclite: command not found.
Can anyone tell me where the problem is?
Thanks
Best Regards
Akmal
________________________________
From: Andreas Stolcke
To: Md. Akmal Haidar
Cc: Anand Venkataraman ; "srilm-user at speech.sri.com"
Sent: Sunday, November 4, 2012 7:02:38 PM
Subject: Re: [SRILM User List] Fw: nbest-error option
On 11/4/2012 12:52 PM, Md. Akmal Haidar wrote:
Thanks Anand,
>It works now.
>
>But I found the same WER for the original n-best list and the rescored nbest list.
>
>
>For rescoring, I use the following command:
>rescore-decipher nbestfilelist new_rescored_nbestlist_dir -lm updated_lm
nbest-error computes the "N-best error rate", meaning the best
possible error rate that can be achieved by picking a hypothesis
from anywhere among the N best. This is sometimes called the
"oracle" error. It doesn't change as a result of rescoring.
What you probably want is the error rate of the highest-scoring
hypothesis. For this, you first extract the highest-scoring hyps
using the "rescore-reweight" command (see the nbest-scripts(1) man
page), then score them using your favorite WER scoring program. If you
have NIST sclite installed, you could use the compute-sclite
wrapper, which takes care of format differences.
Andreas
From akmalcuet00 at yahoo.com Mon Nov 5 08:40:02 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Mon, 5 Nov 2012 08:40:02 -0800 (PST)
Subject: [SRILM User List] compute-sclite option
In-Reply-To: <1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<5097021E.5090402@icsi.berkeley.edu>
<1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
Message-ID: <1352133602.34840.YahooMailNeo@web161002.mail.bf1.yahoo.com>
Hi ,
Thanks.
I used the NIST sclite command for rescoring:
./sclite -r refs -i wsj -h hyps -o dtl
but I got the same scoring result for the baseline 1-best hypothesis and the updated 1-best hypothesis obtained with the updated LM using rescore-decipher and rescore-reweight.
I tried compute-sclite of SRILM with the command: compute-sclite -r refs -h hyps -i wsj -o dtl
but it shows: line 247: sclite: command not found.
Can anyone tell me where the problem is?
Thanks
Best Regards
Akmal
________________________________
From: Andreas Stolcke
To: Md. Akmal Haidar
Cc: Anand Venkataraman ; "srilm-user at speech.sri.com"
Sent: Sunday, November 4, 2012 7:02:38 PM
Subject: Re: [SRILM User List] Fw: nbest-error option
On 11/4/2012 12:52 PM, Md. Akmal Haidar wrote:
Thanks Anand,
>It works now.
>
>But I found the same WER for the original n-best list and the rescored nbest list.
>
>
>For rescoring, I use the following command:
>rescore-decipher nbestfilelist new_rescored_nbestlist_dir -lm updated_lm
nbest-error computes the "N-best error rate", meaning the best
possible error rate that can be achieved by picking a hypothesis
from anywhere among the N best. This is sometimes called the
"oracle" error. It doesn't change as a result of rescoring.
What you probably want is the error rate of the highest-scoring
hypothesis. For this, you first extract the highest-scoring hyps
using the "rescore-reweight" command (see the nbest-scripts(1) man
page), then score them using your favorite WER scoring program. If you
have NIST sclite installed, you could use the compute-sclite
wrapper, which takes care of format differences.
Andreas
From stolcke at icsi.berkeley.edu Mon Nov 5 10:22:42 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 05 Nov 2012 10:22:42 -0800
Subject: [SRILM User List] Fw: nbest-error option
In-Reply-To: <1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<5097021E.5090402@icsi.berkeley.edu>
<1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
Message-ID: <509803F2.9060106@icsi.berkeley.edu>
On 11/5/2012 7:53 AM, Md. Akmal Haidar wrote:
> Hi ,
>
> Thanks.
>
> I used NIST sclite command:
> ./sclite -r refs -i wsj -h hyps -o dtl
> but I got the same scoring result for the baseline 1-best hypothesis
> and updated 1-best hypothesis obtained by updated LM using
> rescore-decipher and rescore-reweight.
Compare the rescored and the original nbest lists. Are they different?
Make sure you specify the NEW nbest directory created by
rescore-decipher as the input to rescore-reweight, not the original one.
>
> I tried with compute-sclite of SRILM with command: compute-sclite -r
> refs -h hyps -i wsj -o dtl
> but it shows line 247: sclite: command not found.
You need to have the sclite binary in your executable search path.
Modify the PATH environment variable so you can just type "sclite" and
find the executable.
Andreas
From akmalcuet00 at yahoo.com Mon Nov 5 16:17:09 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Mon, 5 Nov 2012 16:17:09 -0800 (PST)
Subject: [SRILM User List] Fw: nbest-error option
In-Reply-To: <509803F2.9060106@icsi.berkeley.edu>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<5097021E.5090402@icsi.berkeley.edu>
<1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<509803F2.9060106@icsi.berkeley.edu>
Message-ID: <1352161029.49955.YahooMailNeo@web161001.mail.bf1.yahoo.com>
Hi,
Yes, the n-best lists are different.
Is an LM weight (lmw) required for rescore-reweight? The original n-best lists were generated from HTK lattices, which were generated with a language model scale factor of 15. Should I use this in rescore-reweight?
Is it possible to compute the WER using the nbest-lattice -wer option? The total error (sum of substitutions, insertions, and deletions) for the original n-best lists is greater than the total error for the n-best lists obtained with the updated LM.
Thanks
Best Regards
Akmal
________________________________
From: Andreas Stolcke
To: Md. Akmal Haidar
Cc: "srilm-user at speech.sri.com" ; "venkataraman.anand at gmail.com"
Sent: Monday, November 5, 2012 1:22:42 PM
Subject: Re: [SRILM User List] Fw: nbest-error option
On 11/5/2012 7:53 AM, Md. Akmal Haidar wrote:
Hi ,
>
>
>Thanks.
>
>
>I used the NIST sclite command:
>./sclite -r refs -i wsj -h hyps -o dtl
>but I got the same scoring result for the baseline 1-best hypothesis and the updated 1-best hypothesis obtained with the updated LM using rescore-decipher and rescore-reweight.
Compare the rescored and the original nbest lists. Are they different? Make sure you specify the NEW nbest directory created by rescore-decipher as the input to rescore-reweight, not the original one.
>
>I tried compute-sclite of SRILM with the command: compute-sclite -r refs -h hyps -i wsj -o dtl
>but it shows: line 247: sclite: command not found.
You need to have the sclite binary in your executable search path. Modify the PATH environment variable so you can just type "sclite" and find the executable.
Andreas
From chenmengdx at gmail.com Mon Nov 5 22:37:03 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Tue, 6 Nov 2012 14:37:03 +0800
Subject: [SRILM User List] How to compare LMs training with different
vocabularies?
Message-ID:
Hi, I'm training LMs for a Mandarin Chinese ASR task with two different
vocabularies, vocab1 (100635 entries) and vocab2 (102541 entries).
In order to compare the performance of the two vocabularies, the training
corpus is the same, the test corpus is the same, and the word segmentation
method is also the same, namely Forward Maximum Match. The only
difference is the segmentation vocabulary and LM training vocabulary. I
trained LM1 and LM2 with vocab1 and vocab2 and evaluated them on the test
set. The result is as follows:
LM1: logprob = -84069.7, PPL = 416.452
LM2: logprob = -82921.7, PPL = 189.564
It seems LM2 is much better than LM1, by either logprob or PPL.
However, when I do decoding with the corresponding acoustic model,
the WER of LM2 is higher than that of LM1. So I'm really confused. What's
the relationship between PPL and WER? How can one compare LMs with
different vocabularies? Can you give me some suggestions or references?
Thanks!
Meng CHEN
From chenmengdx at gmail.com Mon Nov 5 22:43:51 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Tue, 6 Nov 2012 14:43:51 +0800
Subject: [SRILM User List] How to compare LMs training with different
vocabularies?
Message-ID:
Hi, I'm training LMs for a Mandarin Chinese ASR task with two different
vocabularies, vocab1 (100635 entries) and vocab2 (102541 entries).
In order to compare the performance of the two vocabularies, the training
corpus is the same, the test corpus is the same, and the word segmentation
method is also the same, namely Forward Maximum Match. The only
difference is the segmentation vocabulary and LM training vocabulary. I
trained LM1 and LM2 with vocab1 and vocab2 and evaluated them on the test
set. The result is as follows:
LM1: logprob = -84069.7, PPL = 416.452
LM2: logprob = -82921.7, PPL = 189.564
It seems LM2 is much better than LM1, by either logprob or PPL.
However, when I do decoding with the corresponding acoustic model,
the CER (Character Error Rate) of LM2 is higher than that of LM1. So I'm
really confused. What's the relationship between PPL and CER? How can one
compare LMs with different vocabularies? Can you give me some suggestions
or references?
ps: There was a mistake in the last mail, so I sent it again.
Thanks!
Meng CHEN
From akmalcuet00 at yahoo.com Tue Nov 6 06:23:54 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Tue, 6 Nov 2012 06:23:54 -0800 (PST)
Subject: [SRILM User List] Fw: Fw: nbest-error option
In-Reply-To: <1352161029.49955.YahooMailNeo@web161001.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<5097021E.5090402@icsi.berkeley.edu>
<1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<509803F2.9060106@icsi.berkeley.edu>
<1352161029.49955.YahooMailNeo@web161001.mail.bf1.yahoo.com>
Message-ID: <1352211834.76589.YahooMailNeo@web161006.mail.bf1.yahoo.com>
Hi,
I found the problem and it works now.
Thanks.
Best Regards
Akmal
----- Forwarded Message -----
From: Md. Akmal Haidar
To: Andreas Stolcke
Cc: "srilm-user at speech.sri.com"
Sent: Monday, November 5, 2012 7:17:09 PM
Subject: Re: [SRILM User List] Fw: nbest-error option
Hi,
Yes. The nbest lists are different.
Is an LM weight (lmw) required for rescore-reweight? The original n-best lists were generated from HTK lattices, which were generated with a language model scale factor of 15. Should I use this in rescore-reweight?
Is it possible to compute the WER using the nbest-lattice -wer option? The total error (sum of substitutions, insertions, and deletions) for the original n-best lists is greater than the total error for the n-best lists obtained with the updated LM.
Thanks
Best Regards
Akmal
________________________________
From: Andreas Stolcke
To: Md. Akmal Haidar
Cc: "srilm-user at speech.sri.com" ; "venkataraman.anand at gmail.com"
Sent: Monday, November 5, 2012 1:22:42 PM
Subject: Re: [SRILM User List] Fw: nbest-error option
On 11/5/2012 7:53 AM, Md. Akmal Haidar wrote:
Hi ,
>
>
>Thanks.
>
>
>I used the NIST sclite command:
>./sclite -r refs -i wsj -h hyps -o dtl
>but I got the same scoring result for the baseline 1-best hypothesis and the updated 1-best hypothesis obtained with the updated LM using rescore-decipher and rescore-reweight.
Compare the rescored and the original nbest lists. Are they different? Make sure you specify the NEW nbest directory created by rescore-decipher as the input to rescore-reweight, not the original one.
>
>I tried compute-sclite of SRILM with the command: compute-sclite -r refs -h hyps -i wsj -o dtl
>but it shows: line 247: sclite: command not found.
You need to have the sclite binary in your executable search path. Modify the PATH environment variable so you can just type "sclite" and find the executable.
Andreas
From akmalcuet00 at yahoo.com Tue Nov 6 06:50:57 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Tue, 6 Nov 2012 06:50:57 -0800 (PST)
Subject: [SRILM User List] nbest rescoring for LM with different smoothing
In-Reply-To: <1352211834.76589.YahooMailNeo@web161006.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<5097021E.5090402@icsi.berkeley.edu>
<1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<509803F2.9060106@icsi.berkeley.edu>
<1352161029.49955.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<1352211834.76589.YahooMailNeo@web161006.mail.bf1.yahoo.com>
Message-ID: <1352213457.66682.YahooMailNeo@web161006.mail.bf1.yahoo.com>
Hi,
I have found the same WER scoring result using LMs with two different smoothing methods (additive/Witten-Bell).
First I created the HTK lattice using the LM. Then I used lattice-tool to find the n-best list.
How can two LMs trained on the same text with different smoothing give the same WER result?
Thanks
Best Regards
Akmal
From stolcke at icsi.berkeley.edu Tue Nov 6 09:54:38 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 06 Nov 2012 09:54:38 -0800
Subject: [SRILM User List] nbest rescoring for LM with different
smoothing
In-Reply-To: <1352213457.66682.YahooMailNeo@web161006.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<5097021E.5090402@icsi.berkeley.edu>
<1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<509803F2.9060106@icsi.berkeley.edu>
<1352161029.49955.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<1352211834.76589.YahooMailNeo@web161006.mail.bf1.yahoo.com>
<1352213457.66682.YahooMailNeo@web161006.mail.bf1.yahoo.com>
Message-ID: <50994EDE.5030502@icsi.berkeley.edu>
On 11/6/2012 6:50 AM, Md. Akmal Haidar wrote:
> Hi,
>
> I have found the same WER scoring result using LMs with two
> different smoothing methods (additive/Witten-Bell).
> First I created the HTK lattice using the LM. Then I used
> lattice-tool to find the n-best list.
>
> How can two LMs trained on the same text with different smoothing
> give the same WER result?
>
> Thanks
> Best Regards
> Akmal
Do the LM probabilities differ in the details? (Compare the rescored
nbest lists.)
If so then it could just be that your data is such that the smoothing
method by itself does not make enough of a difference to change the top
hypothesis choice.
Andreas
From akmalcuet00 at yahoo.com Tue Nov 6 12:19:13 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Tue, 6 Nov 2012 12:19:13 -0800 (PST)
Subject: [SRILM User List] nbest rescoring for LM with different
smoothing
In-Reply-To: <50994EDE.5030502@icsi.berkeley.edu>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<5097021E.5090402@icsi.berkeley.edu>
<1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<509803F2.9060106@icsi.berkeley.edu>
<1352161029.49955.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<1352211834.76589.YahooMailNeo@web161006.mail.bf1.yahoo.com>
<1352213457.66682.YahooMailNeo@web161006.mail.bf1.yahoo.com>
<50994EDE.5030502@icsi.berkeley.edu>
Message-ID: <1352233153.39948.YahooMailNeo@web161001.mail.bf1.yahoo.com>
Hi,
Using the -no-expansion option in the lattice-tool command, I got a different result.
Thanks
Best Regards
Akmal
________________________________
From: Andreas Stolcke
To: Md. Akmal Haidar
Cc: "srilm-user at speech.sri.com"
Sent: Tuesday, November 6, 2012 12:54:38 PM
Subject: Re: nbest rescoring for LM with different smoothing
On 11/6/2012 6:50 AM, Md. Akmal Haidar wrote:
Hi,
>
>I have found the same WER scoring result using LMs with two
different smoothing methods (additive/Witten-Bell).
>First I created the HTK lattice using the LM. Then I
used lattice-tool to find the n-best list.
>
>How can two LMs trained on the same text with different
smoothing give the same WER result?
>
>Thanks
>Best Regards
>Akmal
>
Do the LM probabilities differ in the details? (Compare the rescored n-best lists.)
If so then it could just be that your data is such that the
smoothing method by itself does not make enough of a difference to
change the top hypothesis choice.
Andreas
From stolcke at icsi.berkeley.edu Wed Nov 14 13:27:23 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 14 Nov 2012 13:27:23 -0800
Subject: [SRILM User List] How to compare LMs training with different
vocabularies?
In-Reply-To:
References:
Message-ID: <50A40CBB.1080609@icsi.berkeley.edu>
On 11/5/2012 10:46 PM, Meng Chen wrote:
> Hi, I'm training LMs for a Mandarin Chinese ASR task with two different
> vocabularies, vocab1 (100,635 words) and vocab2 (102,541 words). In
> order to compare the performance of the two vocabularies, the training
> corpus is the same, the test corpus is the same, and the word
> segmentation method is also the same (Forward Maximum Match). The only
> difference is the segmentation vocabulary and LM training vocabulary. I
> trained LM1 and LM2 with vocab1 and vocab2, and evaluated them on the
> test set. The results are as follows:
>
> LM1: logprobs = -84069.7, PPL = 416.452.
> LM2: logprobs =-82921.7, PPL = 189.564.
>
> It seems LM2 is much better than LM1, whether measured by logprobs or by PPL.
> However, when decoding with the corresponding acoustic model, the
> CER (character error rate) of LM2 is higher than that of LM1. So I'm
> really confused. What's the relationship between PPL and CER? How
> can I compare LMs with different vocabularies? Can you give me some
> suggestions or references?
>
> ps: There was a mistake in my last mail, so I sent it again.
It is hard or impossible to compare two LMs with different vocabularies
even when word segmentation is not an issue.
But you are comparing two LMs using different segmentations (because the
vocabularies differ), so the problem is even harder.
The fact that your log probs differ by only a small amount (relatively)
but the perplexities by a lot means that the segmentations (the
number of tokens in particular) in the two systems must be quite
different. Is that the case? Can you devise an experiment where the
segmentations are kept as similar as possible? For example, you could
apply the same segmenter to both test cases, and then split OOV words
into their single-character components where needed to apply the LM.
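To make the arithmetic concrete: ngram -ppl computes perplexity from the total log probability and the number of scored tokens (a sketch of the standard relation; N counts words plus end-of-sentence tokens, minus OOVs):

```latex
\mathrm{PPL} = 10^{-\log_{10} P(\text{test set}) / N}
```

Back-solving from your reported numbers as a rough check gives N ≈ 84069.7 / log10(416.452) ≈ 32,100 tokens under vocab1 but N ≈ 82921.7 / log10(189.564) ≈ 36,400 under vocab2 — roughly 13% more tokens, which by itself accounts for most of the PPL gap.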
Anecdotally, PPL and WER are not always well correlated, though when
comparing a large range of models the correlation is strong (if not
perfect). See
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=659013 .
I do not recall any systematic studies of the effect of Mandarin word
segmentation on CER but given the amount of work in this area in the
last decade there must be some. Maybe someone else has some pointers ?
Andreas
From chenmengdx at gmail.com Mon Nov 19 18:40:59 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Tue, 20 Nov 2012 10:40:59 +0800
Subject: [SRILM User List] How to compare LMs training with different
vocabularies?
In-Reply-To: <50A40CBB.1080609@icsi.berkeley.edu>
References:
<50A40CBB.1080609@icsi.berkeley.edu>
Message-ID:
Yes, the number of tokens in the training corpus and test set segmented with
vocab2 is larger than with vocab1, which is why the word PPLs differed so
much. I also did an experiment as follows:
I compared each sentence's logprobs in the test set under LM1 and LM2, and
separated the sentences into three sets:
A>B: sentences whose logprob with LM2 is higher than with LM1
A=B: sentences whose logprob with LM2 is equal to that with LM1
A<B: sentences whose logprob with LM2 is lower than with LM1
The CER with LM2 is lower than with LM1 in the A>B set. It seems
sentences with higher logprobs can have lower CER, assuming the acoustic model
is the same under the two vocabs. However, I also found that the CER with LM2
is higher than LM1 in the A=B set. So I was wondering whether the acoustic
model is also influenced by the vocab and segmentation.
Thanks!
Meng CHEN
2012/11/15 Andreas Stolcke
> On 11/5/2012 10:46 PM, Meng Chen wrote:
>
> Hi, I'm training LMs for Mandarin Chinese ASR task with two different
> vocabularies, vocab1(100635 vocabularies) and vocab2(102541
> vocabularies). In order to compare the performance of two vocabularies, the
> training corpus is the same, the test corpus is the same, and the word
> segmentation method is also the same, which is Forward Maximum Match. The
> only difference is the segmentation vocabulary and LM training vocabulary.
> I trained LM1 and LM2 with vocab1 and vocab2, and evaluate them on test
> set. The result is as follows:
>
> LM1: logprobs = -84069.7, PPL = 416.452.
> LM2: logprobs = -82921.7, PPL = 189.564.
>
> It seems LM2 is much better than LM1, either by logprobs or by PPL.
> However, when I am doing decoding with the corresponding Acoustic Model.
> The CER(Character Error Rate) of LM2 is higher than LM1. So I'm really
> confused. What's the relationship between the PPL and CER? How to compare
> LMs with different vocabularies? Can you give me some suggestions or
> references? I'm really confused.
>
> ps: There is a mistake in last mail, so I sent it gain.
>
>
> It is hard or impossible to compare two LMs with different vocabularies
> even when word segmentation is not an issue.
> But you are comparing two LMs using different segmentations (because the
> vocabularies differ), so the problem is even harder.
> The fact that your log probs differ by only a small amount (relatively)
> but the perplexities by a lot means that the segmentations (the
> number of tokens in particular) in the two systems must be quite different.
> Is that the case? Can you devise an experiment where the segmentations are
> kept as similar as possible? For example, you could apply the same
> segmenter to both test cases, and then split OOV words into their
> single-character components where needed to apply the LM.
>
> Anecdotally, PPL and WER are not always well correlated, though when
> comparing a large range of models the correlation is strong (if not
> perfect). See
> http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=659013 .
>
> I do not recall any systematic studies of the effect of Mandarin word
> segmentation on CER but given the amount of work in this area in the last
> decade there must be some. Maybe someone else has some pointers ?
>
> Andreas
>
>
>
From chenmengdx at gmail.com Tue Nov 20 06:05:20 2012
From: chenmengdx at gmail.com (Meng CHEN)
Date: Tue, 20 Nov 2012 22:05:20 +0800
Subject: [SRILM User List] How to use disambig tool to convert Pinyin to
character?
Message-ID:
Hi, I want to use the disambig tool to build a Pinyin-to-character conversion demo. Can you give me an example? Pinyin is the romanized pronunciation of Chinese characters.
Thanks!
Meng CHEN
From dmytro.prylipko at ovgu.de Thu Nov 22 05:06:38 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Thu, 22 Nov 2012 14:06:38 +0100
Subject: [SRILM User List] SRILM trigram worse than HTK bigram?
Message-ID: <50AE235E.7060400@ovgu.de>
Hi,
I found that the accuracy of the recognition results obtained with HVite
is about 5% better than that of the hypotheses obtained after rescoring
the lattices with lattice-tool.
HVite does not really use an N-gram, it uses a word net, but I cannot
figure out why it works so much better than the SRILM models.
I use the following script to generate lattices (60-best):
HVite -A -T 1 \
-C GENLATTICES.conf \
-n 20 60 \
-l outLatDir \
-z lat \
-H hmmDefs \
-S test.list \
-i out.bigram.HLStats.mlf \
-w bigram.HLStats.lat \
-p 0.0 \
-s 8.0 \
lexicon \
hmm.mono.list
Which are then rescored with:
lattice-tool \
-read-htk \
-write-htk \
-htk-lmscale 10.0 \
-htk-words-on-nodes \
-order 3 \
-in-lattice-list srclat.list \
-out-lattice-dir rescoredLatDir \
-lm trigram.SRILM.lm \
-overwrite
find rescoredLatDir -name "*.lat" > rescoredLat.list
lattice-tool \
-read-htk \
-write-htk \
-htk-lmscale 10.0 \
-htk-words-on-nodes \
-order 3 \
-in-lattice-list rescoredLat.list\
-viterbi-decode \
-output-ctm | ctm2mlf_r > out.trigram.SRILM.mlf
Decoded with HVite (92.86%):
LAB: wie sieht es aus mit einem weiteren zweitaegigen mit einer
weiteren zweitaegigen arbeitssitzu
REC: wie sieht es aus mit einem weiteren zweitaegigen in einer
weiteren zweitaegigen arbeitssitzu
... and with lattice-tool (64.29%):
LAB: wie sieht es aus mit einem weiteren zweitaegigen mit einer
weiteren zweitaegigen arbeitssitzu
REC: wie sieht es aus mit einen weiteren zweitaegigen dann bei
einem zweitaegigen arbeitssitzung
Corresponding word nets and LMs have been built using the same
vocabulary and training data. I should say that for some sentences SRILM
outperforms HTK, but in general it is roughly 5-7% behind.
Could you please suggest why this is so? Maybe some parameter values are
wrong?
Or is this to be expected?
I would greatly appreciate any help.
Yours,
Dmytro Prylipko.
From stolcke at icsi.berkeley.edu Fri Nov 23 10:12:50 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 23 Nov 2012 10:12:50 -0800
Subject: [SRILM User List] SRILM trigram worse than HTK bigram?
In-Reply-To: <50AE235E.7060400@ovgu.de>
References: <50AE235E.7060400@ovgu.de>
Message-ID: <50AFBCA2.7030706@icsi.berkeley.edu>
You need to run a few sanity checks to make sure things are working as
you expect them to.
1. Decode 1-best from the HTK lattice WITHOUT rescoring. The results
should be the same as from the HTK decoder. If not there might be a
difference in the LM scaling factor, and you may have to adjust it via
the command line option. There might also be issues with the CTM output
and conversion back to MLF.
2. Rescore the lattices with the same LM that is used in the HTK
decoder. Again, the results should be essentially identical.
I'm not familiar with the bigram format used by HTK, but you may have to
convert it to ARPA format.
3. Then try rescoring with a trigram.
Approaching your goal in steps hopefully will help you pinpoint the
problem(s).
Andreas
On 11/22/2012 5:06 AM, Dmytro Prylipko wrote:
> Hi,
>
> I found that the accuracy of the recognition results obtained with
> HVite is about 5% better with comparison to the hypothesis got after
> rescoring the lattices with lattice-tool.
>
> HVite do not really use an N-gram, it is a word net, but I cannot
> really figure out why does it work so much better than SRILM models.
>
> I use the following script to generate lattices (60-best):
>
> HVite -A -T 1 \
> -C GENLATTICES.conf \
> -n 20 60 \
> -l outLatDir \
> -z lat \
> -H hmmDefs \
> -S test.list \
> -i out.bigram.HLStats.mlf \
> -w bigram.HLStats.lat \
> -p 0.0 \
> -s 8.0 \
> lexicon \
> hmm.mono.list
>
> Which are then rescored with:
>
> lattice-tool \
> -read-htk \
> -write-htk \
> -htk-lmscale 10.0 \
> -htk-words-on-nodes \
> -order 3 \
> -in-lattice-list srclat.list \
> -out-lattice-dir rescoredLatDir \
> -lm trigram.SRILM.lm \
> -overwrite
>
> find rescoredLatDir -name "*.lat" > rescoredLat.list
>
> lattice-tool \
> -read-htk \
> -write-htk \
> -htk-lmscale 10.0 \
> -htk-words-on-nodes \
> -order 3 \
> -in-lattice-list rescoredLat.list\
> -viterbi-decode \
> -output-ctm | ctm2mlf_r > out.trigram.SRILM.mlf
>
> Decoded with HVite (92.86%):
>
> LAB: wie sieht es aus mit einem weiteren zweitaegigen mit einer
> weiteren zweitaegigen arbeitssitzu
> REC: wie sieht es aus mit einem weiteren zweitaegigen in einer
> weiteren zweitaegigen arbeitssitzu
>
> ... and with lattice-tool (64.29%):
>
> LAB: wie sieht es aus mit einem weiteren zweitaegigen mit einer
> weiteren zweitaegigen arbeitssitzu
> REC: wie sieht es aus mit einen weiteren zweitaegigen dann bei
> einem zweitaegigen arbeitssitzung
>
> Corresponding word nets and LMs have been built using the same
> vocabulary and training data. I should say that for some sentences
> SRILM outperforms HTK, but in general it is roughly 5-7% behind.
> Could you please suggest why is it so? Maybe some parameter values are
> wrong?
> Or should it be like this?
>
> I would be greatly appreciated for help.
>
> Yours,
> Dmytro Prylipko.
>
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
From dmytro.prylipko at ovgu.de Sun Nov 25 08:51:03 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Sun, 25 Nov 2012 17:51:03 +0100
Subject: [SRILM User List] SRILM trigram worse than HTK bigram?
In-Reply-To: <50AFBCA2.7030706@icsi.berkeley.edu>
References: <50AE235E.7060400@ovgu.de> <50AFBCA2.7030706@icsi.berkeley.edu>
Message-ID: <50B24C77.6040007@ovgu.de>
1. Output is identical. Thus, LM scale factor does not play a decisive
role. Conversion to MLF from CTM is fine too.
2. I built a bigram in ARPA format with HTK (using HLStats). Here, after
rescoring and decoding, I got the same recognition result as for the LM
built with SRILM. I tried to change the LM scale factor from 10 to 8
(the lattice was obtained with LM scale factor 8), but it made no
difference.
Thus, the changes are introduced when rescoring.
I suspected the reason is the difference between the start/end sentence
markers. For HTK they are !ENTER and !EXIT respectively, and for SRILM
<s> and </s>. I do take this into account: I replace !ENTER and !EXIT
with <s> and </s> in the lattice file.
The SRILM models are trained on data where <s> and </s> denote the
boundaries.
Also, I replaced these markers in the language model built with HTK in
order to let it process the existing lattice correctly.
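That substitution can be sketched as follows (assuming HTK SLF lattices with one W= field per node line; shown on inline sample lines rather than a real file, whose name would be hypothetical anyway):

```shell
# Replace HTK sentence markers with SRILM ones in lattice node lines.
# '|' is used as the sed delimiter so '</s>' needs no escaping.
printf 'I=0 W=!ENTER\nI=5 W=!EXIT\n' \
  | sed -e 's/!ENTER/<s>/g' -e 's|!EXIT|</s>|g'
# prints:
# I=0 W=<s>
# I=5 W=</s>
```

For a real lattice, the same sed command reads the lattice file and writes the converted copy.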
However, when I tried to play around with those markers, it made no
difference.
Namely, I tried to use the HTK format only: the generated lattice and
the language model both use !ENTER and !EXIT. Unfortunately, the output
was the same.
Do you have any further suggestions?
Yours,
Dmytro.
On Fri 23 Nov 2012 07:12:50 PM CET, Andreas Stolcke wrote:
> You need to run a few sanity checks to make sure things are working as
> you expect them to.
>
> 1. Decode 1-best from the HTK lattice WITHOUT rescoring. The results
> should be the same as from the HTK decoder. If not there might be a
> difference in the LM scaling factor, and you may have to adjust is via
> the command line option. There might also be issues with the CTM
> output and conversion back to MLF.
>
> 2. Rescore the lattices with the same LM that is used in the HTK
> decoder. Again, the results should be essentially identical.
> I'm not familiar with the bigram format used by HTK, but you may have
> to convert it to ARPA format.
>
> 3. Then try rescoring with a trigram.
>
> Approaching your goal in steps hopefully will help you pinpoint the
> problem(s).
>
> Andreas
>
> On 11/22/2012 5:06 AM, Dmytro Prylipko wrote:
>> Hi,
>>
>> I found that the accuracy of the recognition results obtained with
>> HVite is about 5% better with comparison to the hypothesis got after
>> rescoring the lattices with lattice-tool.
>>
>> HVite do not really use an N-gram, it is a word net, but I cannot
>> really figure out why does it work so much better than SRILM models.
>>
>> I use the following script to generate lattices (60-best):
>>
>> HVite -A -T 1 \
>> -C GENLATTICES.conf \
>> -n 20 60 \
>> -l outLatDir \
>> -z lat \
>> -H hmmDefs \
>> -S test.list \
>> -i out.bigram.HLStats.mlf \
>> -w bigram.HLStats.lat \
>> -p 0.0 \
>> -s 8.0 \
>> lexicon \
>> hmm.mono.list
>>
>> Which are then rescored with:
>>
>> lattice-tool \
>> -read-htk \
>> -write-htk \
>> -htk-lmscale 10.0 \
>> -htk-words-on-nodes \
>> -order 3 \
>> -in-lattice-list srclat.list \
>> -out-lattice-dir rescoredLatDir \
>> -lm trigram.SRILM.lm \
>> -overwrite
>>
>> find rescoredLatDir -name "*.lat" > rescoredLat.list
>>
>> lattice-tool \
>> -read-htk \
>> -write-htk \
>> -htk-lmscale 10.0 \
>> -htk-words-on-nodes \
>> -order 3 \
>> -in-lattice-list rescoredLat.list\
>> -viterbi-decode \
>> -output-ctm | ctm2mlf_r > out.trigram.SRILM.mlf
>>
>> Decoded with HVite (92.86%):
>>
>> LAB: wie sieht es aus mit einem weiteren zweitaegigen mit einer
>> weiteren zweitaegigen arbeitssitzu
>> REC: wie sieht es aus mit einem weiteren zweitaegigen in einer
>> weiteren zweitaegigen arbeitssitzu
>>
>> ... and with lattice-tool (64.29%):
>>
>> LAB: wie sieht es aus mit einem weiteren zweitaegigen mit einer
>> weiteren zweitaegigen arbeitssitzu
>> REC: wie sieht es aus mit einen weiteren zweitaegigen dann bei
>> einem zweitaegigen arbeitssitzung
>>
>> Corresponding word nets and LMs have been built using the same
>> vocabulary and training data. I should say that for some sentences
>> SRILM outperforms HTK, but in general it is roughly 5-7% behind.
>> Could you please suggest why is it so? Maybe some parameter values
>> are wrong?
>> Or should it be like this?
>>
>> I would be greatly appreciated for help.
>>
>> Yours,
>> Dmytro Prylipko.
>>
>>
>> _______________________________________________
>> SRILM-User site list
>> SRILM-User at speech.sri.com
>> http://www.speech.sri.com/mailman/listinfo/srilm-user
>
From s.bakhshaei at yahoo.com Sun Nov 25 22:37:55 2012
From: s.bakhshaei at yahoo.com (Somayeh Bakhshaei)
Date: Sun, 25 Nov 2012 22:37:55 -0800 (PST)
Subject: [SRILM User List] ngram option
Message-ID: <1353911875.88765.YahooMailNeo@web111717.mail.gq1.yahoo.com>
Hello All,
I want to know if it is possible to pass a variable to ngram for perplexity
computation?
sen="this is my sentence."
ngram -ppl $sen
------------------
Best Regards,
S.Bakhshaei
From stolcke at icsi.berkeley.edu Sun Nov 25 22:49:16 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sun, 25 Nov 2012 22:49:16 -0800
Subject: [SRILM User List] ngram option
In-Reply-To: <1353911875.88765.YahooMailNeo@web111717.mail.gq1.yahoo.com>
References: <1353911875.88765.YahooMailNeo@web111717.mail.gq1.yahoo.com>
Message-ID: <50B310EC.4020905@icsi.berkeley.edu>
On 11/25/2012 10:37 PM, Somayeh Bakhshaei wrote:
> Hello All,
>
> I want to know if it is possible to pass a variable to ngram for ppl
> counting?
>
> 4sen="this is my sentence."
> ngram -ppl $sen
You could use
echo "this is my sentence" | ngram -ppl -
The input data cannot be passed via command line options, but it can be
read from stdin.
Andreas
From dmytro.prylipko at ovgu.de Tue Nov 27 09:49:30 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Tue, 27 Nov 2012 18:49:30 +0100
Subject: [SRILM User List] SRILM trigram worse than HTK bigram?
In-Reply-To: <50B4BC60.5020802@ovgu.de>
References: <50B4BC60.5020802@ovgu.de>
Message-ID: <50B4FD2A.90308@ovgu.de>
Dear Andreas,
I checked everything one more time, under 'clean' test conditions.
Under these conditions, the results are predictable:
- Output from HTK recognizer - 73.71%
- Just decoding of the generated lattices with lattice-tool - 73.71%
- Rescoring with HTK bigram and decoding - 73.78%
- Rescoring with SRILM trigram and decoding - 75.72%
I guess my previous results were so contradictory due to specific test
conditions: I was playing with OOVs, which had a particular influence
on the construction of the word list.
Thank you for the help and sorry for the inconvenience.
From chenmengdx at gmail.com Sat Dec 1 07:37:32 2012
From: chenmengdx at gmail.com (Meng CHEN)
Date: Sat, 01 Dec 2012 23:37:32 +0800
Subject: [SRILM User List] Why there are "_meta_1" in LM?
Message-ID:
Hi, I trained LMs with the -write-binary-lm option; however, when I converted the LM from binary format to ARPA format, I found there were 4 extra 1-grams in the ARPA LM, as follows:
-8.988857 _meta_1
-8.988857 _meta_2
-9.201852 _meta_3
-9.201852 _meta_4
In fact, these four words do not exist in my vocab. So where do they come from? What should I do to remove them?
Thanks!
Meng CHEN
From stolcke at icsi.berkeley.edu Sat Dec 1 09:08:50 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sat, 01 Dec 2012 09:08:50 -0800
Subject: [SRILM User List] Why there are "_meta_1" in LM?
In-Reply-To:
References:
Message-ID: <50BA39A2.9050602@icsi.berkeley.edu>
On 12/1/2012 7:37 AM, Meng CHEN wrote:
> Hi, I trained LMs with the write-binary-lm option, however, when I converted the LM of bin format into arpa format, I found there were 4 more 1-grams in the arpa LM as follows:
> -8.988857 _meta_1
> -8.988857 _meta_2
> -9.201852 _meta_3
> -9.201852 _meta_4
> In fact, these four words do not exist in my vocab. So where do they come from? What should I do to remove them?
> Thanks!
Counts for _META_1 etc. (note the uppercase) are used by ngram-count to
keep track of counts-of-counts required for smoothing. They should
never appear in the LM.
I suspect you lowercased the strings in the counts file somewhere in
your processing, causing these special tokens to no longer be recognized.
Andreas
From chenmengdx at gmail.com Sun Dec 2 20:06:54 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Mon, 3 Dec 2012 12:06:54 +0800
Subject: [SRILM User List] Why there are "_meta_1" in LM?
In-Reply-To: <50BA39A2.9050602@icsi.berkeley.edu>
References:
<50BA39A2.9050602@icsi.berkeley.edu>
Message-ID:
I have checked the make-big-lm shell script and found that the "_meta_"
tag should be lowercase.
Line 56 of the make-big-lm script says:
metatag=__meta__ #lowercase so it works with ngram-count -tolower
In fact, when I used make-big-lm to train the LM, there were no "__meta__1"
entries in the final ARPA LM without -write-binary-lm. So I guess it's
possibly related to the binary format.
2012/12/2 Andreas Stolcke
> On 12/1/2012 7:37 AM, Meng CHEN wrote:
>
>> Hi, I trained LMs with the write-binary-lm option, however, when I
>> converted the LM of bin format into arpa format, I found there were 4 more
>> 1-grams in the arpa LM as follows:
>> -8.988857 _meta_1
>> -8.988857 _meta_2
>> -9.201852 _meta_3
>> -9.201852 _meta_4
>> In facter, these four words do not exisit in my vocab. So where are they
>> come from? What should I do to remove them ?
>> Thanks!
>>
>
> Counts for _META_1 etc. (note the uppercase) are used by ngram-count to
> keep track of counts-of-counts required for smoothing. They should never
> appear in the LM.
>
> I suspect you lowercased the strings in the counts file somewhere in your
> processing, causing these special tokens to no longer be recognized.
>
> Andreas
>
>
From stolcke at icsi.berkeley.edu Thu Dec 6 09:55:42 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 06 Dec 2012 09:55:42 -0800
Subject: [SRILM User List] Why there are "_meta_1" in LM?
In-Reply-To:
References:
<50BA39A2.9050602@icsi.berkeley.edu>
Message-ID: <50C0DC1E.3020503@icsi.berkeley.edu>
This happened because the binary LM file contains a record of the full
vocabulary at the time the LM was created, not just the words that
appear as unigrams (as in the ARPA format). You must have done ngram
-renorm or something similar later, which causes unigrams to be created
for all words in the vocabulary.
Attached is a patch that prevents the _meta_ tokens from being included
in that vocabulary. Check that it fixes your problem.
(You can also grab the beta version off the web site.)
Andreas
On 12/2/2012 8:06 PM, Meng Chen wrote:
> I have checked the make-big-lm shell script and found that the
> "_meta_" should be lowercase.
> In line 56 of make-big-lm script. It says:
> metatag=__meta__ #lowercase so it works with ngram-count -tolower
>
> In fact, when I used make-big-lm to train LM, there are not
> "__meta__1" in final arpa LM without the write-binary-lm. So I guess
> it's possible related to the binary format.
>
>
2012/12/2 Andreas Stolcke
>
> On 12/1/2012 7:37 AM, Meng CHEN wrote:
>
> Hi, I trained LMs with the write-binary-lm option, however,
> when I converted the LM of bin format into arpa format, I
> found there were 4 more 1-grams in the arpa LM as follows:
> -8.988857 _meta_1
> -8.988857 _meta_2
> -9.201852 _meta_3
> -9.201852 _meta_4
> In facter, these four words do not exisit in my vocab. So
> where are they come from? What should I do to remove them ?
> Thanks!
>
>
> Counts for _META_1 etc. (note the uppercase) are used by
> ngram-count to keep track of counts-of-counts required for
> smoothing. They should never appear in the LM.
>
> I suspect you lowercased the strings in the counts file somewhere
> in your processing, causing these special tokens to no longer be
> recognized.
>
> Andreas
>
>
-------------- next part --------------
*** lm/src/NgramLM.cc.dist 2012-10-18 20:31:21.198065100 -0400
--- lm/src/NgramLM.cc 2012-12-05 18:08:22.701858000 -0500
***************
*** 875,881 ****
/*
* Vocabulary index
*/
! vocab.writeIndexMap(file);
long long offset = ftello(file);
--- 875,881 ----
/*
* Vocabulary index
*/
! vocab.writeIndexMap(file, true);
long long offset = ftello(file);
***************
*** 1051,1057 ****
fprintf(file, "data: %s\n", dataFile);
}
! vocab.writeIndexMap(file);
long long offset = ftello(dat);
--- 1051,1057 ----
fprintf(file, "data: %s\n", dataFile);
}
! vocab.writeIndexMap(file, true);
long long offset = ftello(dat);
*** lm/src/Vocab.cc.dist 2012-10-29 17:44:22.423039800 -0400
--- lm/src/Vocab.cc 2012-12-05 18:11:11.745755000 -0500
***************
*** 841,855 ****
* The format is ascii with one word per line:
* index string
* The mapping is terminated by EOF or a line consisting only of ".".
*/
void
! Vocab::writeIndexMap(File &file)
{
// Output index map in order of internal vocab indices.
// This ensures that vocab strings are assigned indices in the same order
// on reading, and ensures faster insertions into SArray-based tries.
for (unsigned i = byIndex.base(); i < nextIndex; i ++) {
! if (byIndex[i]) {
fprintf(file, "%u %s\n", i, byIndex[i]);
}
}
--- 841,856 ----
* The format is ascii with one word per line:
* index string
* The mapping is terminated by EOF or a line consisting only of ".".
+ * If writingLM is true, omit words that should not appear in LMs.
*/
void
! Vocab::writeIndexMap(File &file, Boolean writingLM)
{
// Output index map in order of internal vocab indices.
// This ensures that vocab strings are assigned indices in the same order
// on reading, and ensures faster insertions into SArray-based tries.
for (unsigned i = byIndex.base(); i < nextIndex; i ++) {
! if (byIndex[i] && !(writingLM && isMetaTag(i))) {
fprintf(file, "%u %s\n", i, byIndex[i]);
}
}
From chenmengdx at gmail.com Tue Dec 11 20:31:26 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Wed, 12 Dec 2012 12:31:26 +0800
Subject: [SRILM User List] Are there any trigger-based language modeling
open-source tools?
Message-ID:
Hi, I need to train a trigger-based language model, and it seems SRILM
doesn't support this task. Are there any open-source tools which can do
this job? Please give me some suggestions.
Thanks!
Meng CHEN
From yhifny at yahoo.com Fri Dec 14 12:51:36 2012
From: yhifny at yahoo.com (yasser hifny)
Date: Fri, 14 Dec 2012 12:51:36 -0800 (PST)
Subject: [SRILM User List] how do the script compute-best-mix work?
Message-ID: <1355518296.89321.YahooMailNeo@web125801.mail.ne1.yahoo.com>
Hello,
I would like to ask how the compute-best-mix script works? I mean, what is the idea behind it. Can you refer me to any paper explaining how it works?
Thanks in advance,
Yasser
From mohammed.mediani at kit.edu Fri Dec 14 13:21:59 2012
From: mohammed.mediani at kit.edu (Mohammed Mediani)
Date: Fri, 14 Dec 2012 22:21:59 +0100
Subject: [SRILM User List] gtmin and kndiscount
In-Reply-To:
References:
Message-ID: <20121214222159.c7gbsnki880oc8kk@webmail.ira.uni-karlsruhe.de>
Could anybody please tell me how the discounting parameters for
modified Kneser-Ney smoothing (D1, D2, D3+) are computed when the
gtmin parameter is greater than 1?
In that case, the corresponding n_i would be zero, and we eventually
have to divide by this n_i to get one of the D_i's.
Many thanks,
From stolcke at icsi.berkeley.edu Fri Dec 14 13:22:05 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 14 Dec 2012 13:22:05 -0800
Subject: [SRILM User List] how do the script compute-best-mix work?
In-Reply-To: <1355518296.89321.YahooMailNeo@web125801.mail.ne1.yahoo.com>
References: <1355518296.89321.YahooMailNeo@web125801.mail.ne1.yahoo.com>
Message-ID: <50CB987D.8020101@icsi.berkeley.edu>
On 12/14/2012 12:51 PM, yasser hifny wrote:
> Hello,
> I would like to ask how the script compute-best-mix works, i.e., what
> is the idea behind it. Can you refer me to a paper explaining how it
> works?
It's an EM algorithm, where the underlying mixture distributions (the
individual LMs) are held fixed.
You can find the general theory at
https://en.wikipedia.org/wiki/Mixture_model .
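[A minimal sketch of the EM update described above, holding the component LMs fixed and re-estimating only the interpolation weights. Function and variable names here are illustrative, not SRILM's internal API; in practice the per-token probabilities would come from each model's ngram -debug 2 -ppl output.]

```python
def best_mix(probs, n_models, iters=50):
    """EM re-estimation of mixture weights for fixed component LMs.

    probs: list of tuples, probs[i][m] = probability that model m
           assigns to token i of the held-out text.
    Returns the interpolation weights (they always sum to 1).
    """
    lam = [1.0 / n_models] * n_models  # start from uniform weights
    for _ in range(iters):
        # E-step: posterior responsibility of each model for each token
        post_sums = [0.0] * n_models
        for p in probs:
            mix = sum(l * pm for l, pm in zip(lam, p))
            for m in range(n_models):
                post_sums[m] += lam[m] * p[m] / mix
        # M-step: new weight = average responsibility over all tokens
        lam = [s / len(probs) for s in post_sums]
    return lam
```

Because each token's posteriors sum to 1, the weights stay normalized at every iteration, and the held-out likelihood never decreases.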
Andreas
From stolcke at icsi.berkeley.edu Fri Dec 14 13:43:51 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 14 Dec 2012 13:43:51 -0800
Subject: [SRILM User List] gtmin and kndiscount
In-Reply-To: <20121214222159.c7gbsnki880oc8kk@webmail.ira.uni-karlsruhe.de>
References:
<20121214222159.c7gbsnki880oc8kk@webmail.ira.uni-karlsruhe.de>
Message-ID: <50CB9D97.4090000@icsi.berkeley.edu>
On 12/14/2012 1:21 PM, Mohammed Mediani wrote:
> Could anybody please tell me how the discounting parameters for
> modified Kneser-Ney smoothing (D1, D2, D3+) are computed in case the
> gtmin parameter is greater than 1?
> In such a case, the corresponding ni would be zero, and we eventually
> have to divide by this ni to get one of the Di's.
> Many thanks,
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
The gtmin parameter is applied (i.e., the ngrams with frequency below
the threshold are omitted from the model) AFTER the discounting
constants are computed, so the gtmin options don't affect the D1,D2,D3
computation.
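[For concreteness, the modified-KN discounts are functions of the counts-of-counts n1..n4 taken from the full (uncut) count data, which is why gtmin cannot affect them. A sketch following the standard Chen & Goodman formulas; the function name is illustrative:]

```python
def kn_discounts(n1, n2, n3, n4):
    """Modified Kneser-Ney discount constants D1, D2, D3+ from the
    counts-of-counts n_i = number of ngrams occurring exactly i times.
    These are computed BEFORE any gtmin cutoff is applied."""
    y = n1 / (n1 + 2.0 * n2)
    d1 = 1.0 - 2.0 * y * n2 / n1
    d2 = 2.0 - 3.0 * y * n3 / n2
    d3plus = 3.0 - 4.0 * y * n4 / n3
    return d1, d2, d3plus
```

If the input counts were pre-cutoff (as with the Google data), n1 or n2 is zero and these formulas break, which is what the make-big-lm extrapolation works around.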
You have a problem when frequency cutoffs have been applied to the Ngram
data BEFORE SRILM gets to see it. This is the case, e.g., with the
Google N-gram data. In that case, if you use the make-big-lm wrapper
script, an attempt will be made to extrapolate the low count-of-counts
from the higher ones, according to an empirical law that is described in
Figure 1 / Equation 1 of this paper.
Andreas
From medmediani at gmail.com Sat Dec 15 06:48:21 2012
From: medmediani at gmail.com (Mohammed Mediani)
Date: Sat, 15 Dec 2012 15:48:21 +0100
Subject: [SRILM User List] Interpolation of Unigrams
Message-ID:
Hi,
Are the unigrams always interpolated with 0-gram (probability of any word
from the vocab)?
I got the same probabilities for unigrams with and without -interpolate
(both with -kndiscount). Is it meant to be this way?
Many thanks for your help.
Mohammed
From stolcke at icsi.berkeley.edu Sat Dec 15 23:34:37 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sat, 15 Dec 2012 23:34:37 -0800
Subject: [SRILM User List] Interpolation of Unigrams
In-Reply-To:
References:
Message-ID: <50CD798D.30109@icsi.berkeley.edu>
On 12/15/2012 6:48 AM, Mohammed Mediani wrote:
> Hi,
> Are the unigrams always interpolated with 0-gram (probability of any
> word from the vocab)?
> I got the same probabilities for unigrams with and without
> -interpolate (both with -kndiscount). Is it meant to be this way?
> Many thanks for your help.
> Mohammed
The KN discounting strategy for unigrams only interpolates with the
zero-gram (uniform) estimate if the -interpolate flag is given.
This is just a special case of the interpolation happening at all
N-gram levels.
However, there is an independent step whereby unallocated unigram
probability mass is filled in by adding a uniform probability increment
to all words in the vocabulary. When this happens you see a message like
warning: distributing 0.0659302 left-over probability mass over all
26573 words
This happens for unigrams only, and regardless of what discounting
method is in effect, because otherwise that probability mass would be
"lost" and the model would be deficient.
It so happens that the effect of both strategies is the same when it
comes to unigrams, and that explains your observation.
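[The fill-in step can be sketched as follows; the function and argument names are illustrative, not SRILM's internal code:]

```python
def distribute_leftover(unigram_probs, vocab):
    """Add a uniform increment to every vocabulary word so the unigram
    distribution sums to 1, mirroring SRILM's 'distributing ...
    left-over probability mass over all N words' warning."""
    leftover = 1.0 - sum(unigram_probs.values())
    increment = leftover / len(vocab)
    return {w: unigram_probs.get(w, 0.0) + increment for w in vocab}
```

Words unseen in training end up with exactly the uniform increment, which is why this coincides with zero-gram interpolation at the unigram level.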
Andreas
From medmediani at gmail.com Mon Dec 17 01:41:04 2012
From: medmediani at gmail.com (Mohammed Mediani)
Date: Mon, 17 Dec 2012 10:41:04 +0100
Subject: [SRILM User List] Cutoff, probabilities, and backoffs
Message-ID:
Could anybody please tell me how the probabilities and the backoff weights
are computed in case we use -gtmin (with -kndiscount). Following Chen's
paper and the ngram-count man pages, I was unable to reproduce the same
results as ngram-count.
Many thanks,
Mohammed
From stolcke at icsi.berkeley.edu Mon Dec 17 12:46:15 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 17 Dec 2012 12:46:15 -0800
Subject: [SRILM User List] Cutoff, probabilities, and backoffs
In-Reply-To:
References:
Message-ID: <50CF8497.6090600@icsi.berkeley.edu>
On 12/17/2012 1:41 AM, Mohammed Mediani wrote:
> Could anybody please tell me how the probabilities and the backoff
> weights are computed in case we use -gtmin (with -kndiscount).
> Following Chen's paper and the ngram-count man pages, I was unable to
> reproduce the same results as ngram-count.
As I explained in a previous email, the -gtmin parameter doesn't change
the way discounting is computed. It just eliminates ngrams from the
model AFTER you compute their probabilities. Of course this frees up
probability mass, which is then reallocated using the backoff mechanism
(that is, the backoff weights change as a result). You can think of the
process in three steps, plus the 0th step that is particular to KN methods:
0. Replace the lower-order counts based on the ngram type frequencies
(if you use the -write option you can save these modified counts to a
file to see what the effect is).
1. compute discounts for each ngram, and then their probabilities (use
ngram-count -debug 4 to get a detailed record of the quantities involved
in this step)
2. remove ngrams due to the -gtmin (or entropy pruning criterion, if
specified)
3. compute backoff weights (to normalize the model).
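[Step 3 can be sketched as follows for a single backoff context; this is an illustrative reconstruction of the normalization, not SRILM's internal code:]

```python
def backoff_weight(retained_probs, lower_probs):
    """Backoff weight for one context, computed AFTER pruning.

    retained_probs: p(w | context) for ngrams kept in the model
                    (survivors of the gtmin / pruning step).
    lower_probs:    p(w | shorter context) for all words.
    """
    # Probability mass freed up by ngrams removed or never included
    num = 1.0 - sum(retained_probs.values())
    # Lower-order mass not already covered by the retained ngrams
    den = 1.0 - sum(lower_probs[w] for w in retained_probs)
    return num / den
```

Removing ngrams in step 2 shrinks the retained set, which changes both sums, so the backoff weights necessarily differ from the uncut model even though the surviving explicit probabilities do not.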
Andreas
From medmediani at gmail.com Mon Dec 17 13:52:18 2012
From: medmediani at gmail.com (Mohammed Mediani)
Date: Mon, 17 Dec 2012 22:52:18 +0100
Subject: [SRILM User List] Cutoff, probabilities, and backoffs
In-Reply-To: <50CF8497.6090600@icsi.berkeley.edu>
References:
<50CF8497.6090600@icsi.berkeley.edu>
Message-ID:
Thank you very much Andreas,
In fact, I have done everything you have just suggested:
- Modify the counts
- Compute smoothing parameters (discount constants)
- Compute the probabilities
- Remove the rare ngrams according to gtmin
- Compute the backoffs.
I get the exact same numbers for both probabilities and backoffs if no
gtmin is specified. But in the presence of cutoffs, I get slightly
different numbers (e.g. if gt3min=2 I get slightly different backoffs for
2-grams). I thought I had done something wrong, since I still can't get
the backoffs right. If there is no special attention to be paid to
different cases, then I just need to look into it further.
Once again, many many thanks for your kind help and great cooperation.
Mohammed
On Mon, Dec 17, 2012 at 9:46 PM, Andreas Stolcke
wrote:
> On 12/17/2012 1:41 AM, Mohammed Mediani wrote:
>
>> Could anybody please tell me how the probabilities and the backoff
>> weights are computed in case we use -gtmin (with -kndiscount). Following
>> Chen's paper and the ngram-count man pages, I was unable to reproduce the
>> same results as ngram-count.
>>
>
> As I explained in a previous email, the -gtmin parameter doesn't change
> the way discounting is computed. It just eliminates ngrams from the model
> AFTER you compute their probabilities. Of course this frees up probability
> mass, which is then reallocated using the backoff mechanism (that is, the
> backoff weights change as a result). You can think of the process in three
> steps, plus the 0th step that is particular to KN methods:
>
> 0. Replace the lower-order counts based on the ngram type frequencies (if
> you use the -write option you can save these modified counts to a file to
> see what the effect is).
> 1. compute discounts for each ngram, and then their probabilities (use
> ngram-count -debug 4 to get a detailed record of the quantities involved in
> this step)
> 2. remove ngrams due to the -gtmin (or entropy pruning criterion, if
> specified)
> 3. compute backoff weights (to normalize the model).
>
> Andreas
>
>
From stolcke at icsi.berkeley.edu Mon Dec 17 14:05:45 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 17 Dec 2012 14:05:45 -0800
Subject: [SRILM User List] Cutoff, probabilities, and backoffs
In-Reply-To:
References:
<50CF8497.6090600@icsi.berkeley.edu>
Message-ID: <50CF9739.2080900@icsi.berkeley.edu>
On 12/17/2012 1:52 PM, Mohammed Mediani wrote:
> Thank you very much Andreas,
> In fact, I have done all what you have just suggested.
> - Modify the counts
> - Compute smoothing parameters (discount constants)
> - Compute the probabilities
> - Remove the rare ngrams according to gtmin
> - Compute the backoffs.
>
> I get the exact same numbers for both probabilities and backoffs if no
> gtmin is specified. But in the presence of cutoffs, I get slightly
> different numbers (e.g. if gt3min=2 I get slightly different backoffs for
> 2-grams). I thought I had done something wrong, since I still can't get
> the backoffs right. If there is no special attention to be paid to
> different cases, then I just need to look into it further.
The ngram probabilities should be the same. The backoff weights MUST be
different, since you are backing off for more of the ngrams when choosing
a higher gtmin threshold.
Andreas
From is13-noreply at inria.fr Sun Dec 23 14:23:55 2012
From: is13-noreply at inria.fr (Interspeech 2013 - First announcement)
Date: Sun, 23 Dec 2012 23:23:55 +0100
Subject: [SRILM User List] Interspeech 2013 - First Announcement
Message-ID: <50D7847B.8080101@inria.fr>
First Announcement
http://www.interspeech2013.org/calls/
INTERSPEECH is the world's largest and most comprehensive conference on
challenges surrounding the science and technology of Spoken Language
Processing (SLP) both in humans and machines. It is our great pleasure
to announce that Interspeech 2013 will be hosted by the Center of
Congress of Lyon (France), under the sponsorship of the International
Speech Communication Association (ISCA).
Interspeech 2013 will be the 14th conference in the annual series of
Interspeech events and will be held in Lyon, France, 25-29 August 2013.
The theme of Interspeech 2013 is "Speech in Life Sciences and Human
Societies". Life sciences cover a large set of disciplines, such as
biology, medicine, anthropology, or ecology, and deal with living
organisms and their organization, life processes, and relationships to
each other and their environment. In that respect, speech appears as a
key aspect of human activity that runs across these multiple sides of
life sciences. Besides the conventional acoustic and linguistic
viewpoints of speech communication, biology, psychology and sociology
provide additional angles to approach speech as the keystone means of
communication, cognition and interaction in human societies. Under
this theme the conference will emphasize an interdisciplinary approach
covering all aspects of speech science and technology spanning the basic
theories to applications.
Besides regular oral and poster sessions, plenary talks by
internationally renowned experts, tutorials, exhibits, and special
sessions are planned. For the complete Call for Papers and other Calls
please visit our website at: www.interspeech2013.org
We look forward to welcoming you to INTERSPEECH 2013 in Lyon!
Sincerely,
-----------------------------------------------------------------------------------------------------------
CALL FOR SATELLITE WORKSHOPS (Workshops at Interspeech2013.org)
The call for Satellite workshops is now closed
Notification of acceptance and ISCA approval / sponsorship will be
launched on October 30, 2012
CALL FOR TUTORIALS
Submission Deadline: December 15, 2012
Notification of acceptance: January 15, 2013
Tutorial proposals covering interdisciplinary topics and/or important
new emerging areas of interest related to the main conference topics are
encouraged. Visit the Tutorial Page of the conference website for more
information and to submit a tutorial proposal.
CALL FOR SHOW AND TELL AND OTHER SPECIAL EVENTS
Submission Deadline: January 4, 2013
Notification of acceptance: January 11, 2013
The Show & Tell sessions gather researchers, engineering groups, and
practitioners from academia, industry, and governmental institutes in a
unique opportunity to demonstrate their most advanced research systems
and interact with the conference attendees in an informal way.
Other special events: less formal events about "SPEECH SCIENCES AND
INNOVATIONS IN OUR FUTURE SOCIETY" are encouraged.
CALL FOR SPECIAL SESSIONS
Submission Deadline: January 15, 2013
Notification of pre-acceptance: February 15, 2013
Special sessions can be an opportunity to bring researchers in relevant
fields of interest outside the traditional speech and language fields,
together with the Interspeech community. Click here to learn more about
the updated 2013 special session submission process.
Special Session Topics will be defined after May 13, 2013
CALL FOR PAPERS
Submission Deadline: March 10, 2013
Notification of acceptance: May 13, 2013
We invite you to submit original papers in any related area, including -
but not limited to:
* Speech Perception and Production
* Phonology, Phonetics
* Para-/Non- linguistic Information
* Language Processing
* Analysis, Enhancement and Coding of Speech and Audio Signals
* Speaker and Language Identification
* Speech & Spoken Language Generation, Speech Synthesis
* Automatic Speech Recognition (ASR)
* Technologies and Systems for New Applications
* Spoken Dialogue System, Spoken Language Understanding, Speech Translation,
* Information Retrieval
* Application, Evaluation, Standardization, Spoken Language Resource
* Prostheses and aids for speech communication disorders
* Diagnosis and tools for speech therapy
* Emotional speech and affective computing
* Cognitive models of speech production and perception
* Natural and artifactual speech interaction
* Models for the origin and development of speech
* Speech communication diversity among people, languages and dialects
Paper Submission Procedure
The working language of the conference is English. Papers for the
INTERSPEECH 2013 proceedings should be up to 4 pages of content plus one
page of references in length and conform to the format given in the
paper preparation guidelines and author kits which will be available on
the INTERSPEECH 2013 website along with the Final Call for Papers.
Optionally, authors may submit additional files, such as multimedia
files, to be included on the Proceedings USB key. Authors shall also
declare that their contributions are original and not being submitted
for publication elsewhere (e.g. another conference, workshop, or
journal). Papers must be submitted via the on-line paper submission
system, which will open on February 24, 2013. The deadline for
submitting a paper is March 10, 2013. This date will not be extended.
2013 CONFERENCE VENUE
The Conference will be held at the Cité Centre des Congrès, Lyon,
France. Click here (http://www.ccc-lyon.com/) to learn more about the CCC.
PAPER SUBMISSIONS AND REGISTRATION INFORMATION
Conference registration and the call-for-papers submission will open on
February 24, 2013. Visit the Conference website to keep abreast of
program developments.
SPONSORSHIP AND EXHIBIT OPPORTUNITIES
Want to increase your visibility and access a market of over 800
conference attendees? Apply online to become a sponsor or exhibitor of
the 2013 Conference. Benefits are level based and are on a first come,
first served basis.