From svmats at yahoo.com Sat Jul 7 09:06:51 2007
From: svmats at yahoo.com (Mats Svenson)
Date: Sat, 7 Jul 2007 09:06:51 -0700 (PDT)
Subject: Limited vocabulary causing "no-singletons" problem
Message-ID: <499960.43881.qm@web31612.mail.mud.yahoo.com>

Hi SRILM users,

I have the following problem. I want to train an LM for a low-resource speech recognizer. Since the recognizer can only handle vocabularies of a limited size (N), I must first restrict my vocabulary to the N most frequently occurring words in the training text. However, since all such words occur more than once in the training corpus, this seems to prevent me from using the discounting schemes that rely on singleton counts.

For GT discounting, ngram-count gives a warning about no singletons in the training data. For KN no warning was printed, but I guess KN discounting is affected by the missing singletons as well. ngram-count also has an option "-knn knfile" to calculate smoothing parameters in advance using an unlimited vocabulary; however, I guess this does not entirely solve the problem. Is that true?

Is there a way to bypass this problem using SRILM, or do I have to use another (generally inferior) discounting scheme such as Witten-Bell (at least for counts of order 1)?

Thanks for help,
Mats

From stolcke at speech.sri.com Sat Jul 7 09:48:35 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 07 Jul 2007 09:48:35 PDT
Subject: Limited vocabulary causing "no-singletons" problem
In-Reply-To: Your message of Sat, 07 Jul 2007 09:06:51 -0700. <499960.43881.qm@web31612.mail.mud.yahoo.com>
Message-ID: <200707071648.l67GmaZ12653@huge>

Use the make-big-lm script for training your LM. (Despite the name, it works for small LMs as well.)
It will compute the GT or KN count-of-count statistics using the unlimited vocabulary, and then apply your vocabulary in building the LM.

--Andreas

In message <499960.43881.qm at web31612.mail.mud.yahoo.com> you wrote:
> Hi SRILM users,
> I have the following problem. I want to train a LM
> for a low-resource speech recognizer. Since the
> recognizer can only handle vocabularies with a limited
> size (N), I first must fix my vocabulary to only
> contain N most frequently occurring words from the
> training text. However, since all such words occur
> more than once in the training corpus, it seems to
> prevent me from using the discounting schemes which
> rely on singleton counts.
>
> For GT discounting, ngram-count gives a warning on
> no-singletons in the training data, for KN no warning
> was printed, however, I guess the KN discounting is
> affected by the no-singletons as well. Ngram-count
> also has an option "-knn knfile" to calculate
> smoothing parameters using an unlimited vocabulary in
> advance, however, I guess this does not entirely solve
> this problem... Is it true?
>
> Is there a way how to bypass this problem using SRILM
> or do I have to use another (generally inferior)
> discounting scheme such as Witten-Bell (at least for
> counts of order 1)?
> Thanks for help,
> Mats

From save.climate at gmail.com Tue Jul 10 07:47:59 2007
From: save.climate at gmail.com (Kamadev Bhanuprasad)
Date: Tue, 10 Jul 2007 16:47:59 +0200
Subject: Estimating mixture LM weights using SRILM
Message-ID: <244d59a50707100747w39e0bcfbt1de04db2dc851f82@mail.gmail.com>

Hi,
is there a SRILM tool to estimate weights of a mixture LM (from several separate LMs in ARPA format) using held-out data?
I guess that similar algorithms (EM, Powell search) are implemented in several places in SRILM, but I haven't found an SRILM tool that implements this particular, very common task.

Thanks
Kama

From shuet at irisa.fr Tue Jul 10 08:12:40 2007
From: shuet at irisa.fr (Stephane Huet)
Date: Tue, 10 Jul 2007 17:12:40 +0200
Subject: lattice-tool to rescore a lattice
Message-ID: <4693A1E8.2020502@irisa.fr>

Hi,

When rescoring lattices in HTK format with a 4-gram LM using the following options:

lattice-tool -in-lattice -read-htk -out-lattice -write-htk -order 4 -lm ,

I noticed that the LM scores written in the output lattice were sometimes different from what I expected.

In the lattice-tool man page (I use SRILM 1.5.0), I read that the default algorithm for lattice expansion is "General LM expansion", which expands the lattice "without use of backoff transitions". Does this mean that during LM expansion, no backoff is taken into account in the LM probabilities? However, looking at the source code, I noticed that the following line of the Lattice::expandAddTransition function in LatticeExpand.cc:

transProb += lm.contextBOW(usedContext, usedLength);

includes a backoff transition.

To get the expected linguistic scores when expanding the lattice with the LM, I commented out that line and took the LM backoff into account by modifying the lm.contextID(nextWord, usedContext, usedLength2) call in the Lattice::expandNodeToLM function of LatticeExpand.cc. Indeed, I noticed that when lm.contextID returns the LM order instead of what it originally returned, the context of the conditional probability is no longer truncated, and the LM scores of the output lattices are consistent with what I expected.

There may be options for rescoring lattices with an LM that I didn't understand, but I find the LM scores produced by lattice-tool strange. I can send the files I modified if you want.
Regards,
Stéphane

From stolcke at speech.sri.com Tue Jul 10 08:43:52 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 10 Jul 2007 08:43:52 PDT
Subject: Estimating mixture LM weights using SRILM
In-Reply-To: Your message of Tue, 10 Jul 2007 16:47:59 +0200. <244d59a50707100747w39e0bcfbt1de04db2dc851f82@mail.gmail.com>
Message-ID: <200707101543.l6AFhq418932@huge>

Check the ppl-scripts(1) man page, and specifically the "compute-best-mix" command.

--Andreas

In message <244d59a50707100747w39e0bcfbt1de04db2dc851f82 at mail.gmail.com> you wrote:
>
> Hi,
> is there a SRILM tool to estimate weights of a mixture LM (from several
> separate LMs in ARPA format) using held-out data? I guess that similar
> algorithms (EM, Powell search) are implemented in SRILM many times but I
> haven't found in any SRILM tool implementation of this very common
> particular task.
>
> Thanks
> Kama
>

From shuet at irisa.fr Wed Jul 11 08:37:01 2007
From: shuet at irisa.fr (Stephane Huet)
Date: Wed, 11 Jul 2007 17:37:01 +0200
Subject: lattice-tool to rescore a lattice
In-Reply-To: <200707102202.l6AM2LU27317@speech.sri.com>
References: <200707102202.l6AM2LU27317@speech.sri.com>
Message-ID: <4694F91D.8070603@irisa.fr>

>
>
>What you can verify is that the lattice as a whole assigns the correct
>log probability to a complete path through the lattice.
>For this purpose, the lattice-tool -ppl option allows you to treat the
>lattice as a language model, and you can feed it sentences.
>The -debug 2 option displays scores at the word level.
>
>
>

As you suggested, I compared the result of lattice-tool -ppl for a lattice with the result of ngram -ppl. In both cases, I obtained the same logprob for the complete sentence. However, the logprobs at the word level are different, as I had already noticed in the linguistic scores of the HTK lattices.
Here are the results I obtained:

> ngram -lm -order 4 -ppl test.ppl -debug 2
appeler les opérateurs marocains
    p( appeler | ) = [2gram] 6.15184e-06 [ -5.211 ]
    p( les | appeler ...) = [2gram] 0.0759806 [ -1.1193 ]
    p( opérateurs | les ...) = [2gram] 0.000738462 [ -3.13167 ]
    p( marocains | opérateurs ...) = [2gram] 0.00450344 [ -2.34646 ]
    p( | marocains ...) = [2gram] 0.186189 [ -0.730047 ]
1 sentences, 4 words, 0 OOVs
0 zeroprobs, logprob= -12.5385 ppl= 321.879 ppl1= 1363.38

file test.ppl: 1 sentences, 4 words, 0 OOVs
0 zeroprobs, logprob= -12.5385 ppl= 321.879 ppl1= 1363.38

> lattice-tool -ppl test.ppl -in-lattice -read-htk -debug 2 -order 4
appeler les opérateurs marocains
    p( appeler | ) = [400][405][385][386][390] 5.1187e-06 [ -5.29084 ]
    p( les | appeler ...) = [512][519][531][532][539] 0.0623089 [ -1.20545 ]
    p( opérateurs | les ...) = [1120][1121][1122][1067][1068][1069][977][978][979][965][966][967][879][880][881] 0.00107221 [ -2.96972 ]
    p( marocains | opérateurs ...) = [1123][1124][1125][1126][1127][1128][1129][1130][1131][980][981][982][983][984][985][986][987][988][882][883][884][885][886][887][888][889][890][1070][1071][1072][1073][1074][1075][1076][1077][1078][968][969][970][971][972][973][974][975][976] 0.00454559 [ -2.34241 ]
    p( | marocains ...) = [1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1] 0.186188 [ -0.730047 ]
Lattice states: 0 386 532 966 973 1
1 sentences, 4 words, 0 OOVs
0 zeroprobs, logprob= -12.5385 ppl= 321.88 ppl1= 1363.38

The differences might be linked to the way backoffs are taken into account in the linguistic scores in the lattice. With the few changes I previously made to the source code, the logprobs look more correct at the word level:

> lattice-tool -ppl test.ppl -in-lattice -read-htk -debug 2 -order 4
appeler les opérateurs marocains
    p( appeler | ) = [474][479][459][460][464] 6.15177e-06 [ -5.211 ]
    p( les | appeler ...) = [586][593][605][606][613] 0.0759801 [ -1.1193 ]
    p( opérateurs | les ...) = [966][967][968][1243][1244][1245][1164][1165][1166][1064][1065][1066][1052][1053][1054] 0.000738466 [ -3.13167 ]
    p( marocains | opérateurs ...) = [1055][1056][1057][1058][1059][1060][1061][1062][1063][1167][1168][1169][1170][1171][1172][1173][1174][1175][1246][1247][1248][1249][1250][1251][1252][1253][1254][1067][1068][1069][1070][1071][1072][1073][1074][1075][969][970][971][975][976][977][972][973][974] 0.00450339 [ -2.34646 ]
    p( | marocains ...) = [1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1] 0.186188 [ -0.730047 ]
Lattice states: 0 464 613 967 973 1
1 sentences, 4 words, 0 OOVs
0 zeroprobs, logprob= -12.5385 ppl= 321.881 ppl1= 1363.39

Anyway, what I need are the scores provided by lattice-tool at the sentence level, and they are correct.

Thanks for your answer.

Stéphane

From brodbd at u.washington.edu Fri Jul 13 15:26:57 2007
From: brodbd at u.washington.edu (David Brodbeck)
Date: Fri, 13 Jul 2007 15:26:57 -0700
Subject: Test failures on RHEL 5
Message-ID:

I'm trying to build SRILM 1.5.2 on Redhat Enterprise Linux Server 5. The machine type is i686_m64. Everything builds all right, but the tests fail for make-ngram-pfsg, ngram-class, and ngram-count-lm-limit-vocab.

make-ngram-pfsg is the most obvious one, so I'll tackle that one first. I get the following in the stderr file:

gawk: /opt/srilm/bin/i686-m64/add-pauses-to-pfsg:22: fatal: Invalid collation character: /[[:lower:]-?]/

Has anyone else run into this? I'm using GNU Awk 3.1.5, and the locale is set to en_US.UTF-8.

David Brodbeck
Information Technology Specialist 3
Computational Linguistics
University of Washington

-------------- next part --------------
An HTML attachment was scrubbed...
From stolcke at speech.sri.com Wed Jul 18 13:53:03 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 18 Jul 2007 13:53:03 -0700
Subject: paper on lattice tool
In-Reply-To: Your message of Wed, 18 Jul 2007 21:40:47 +0100.
Message-ID: <200707182053.l6IKr3S18445@speech.sri.com>

In message you wrote:
> I am talking about the option compute posteriors. E.g. my options are
>
> $lattice_tool -compute-posteriors \
>   -read-htk -in-lattice $path \
>   -write-htk -out-lattice ${tmp_lattice_dir}/${file} \
>   -htk-lmscale 8.3 \
>   -htk-acscale 1.0 \
>   -htk-wdpenalty -4.3429 \
>   -htk-logbase 2.718 \
>   -posterior-scale 8.3 \

lattice-tool -compute-posteriors computes NODE posterior probabilities, not WORD posterior probabilities. In other words, posteriors of the same word appearing on different nodes are not summed. The algorithm is therefore the basic forward-backward (FB) procedure known from HMMs, which is explained in many textbooks, as well as in the popular tutorial

L. R. Rabiner and B. H. Juang, An Introduction to Hidden Markov Models, IEEE Signal Processing Magazine, 3(1), 4-16, Jan. 1986.

The Wessel et al. paper talks about how to sum over multiple hypotheses containing the same word in the same "position". That is something lattice-tool can also do, using word confusion networks (see the -write-mesh option).

Andreas

>
>
> Thanks,
> Joel.
>
> On 7/18/2007, "Andreas Stolcke" wrote:
>
> >jpinto at idiap.ch wrote:
> >> Dear SRILM user,
> >>
> >> Is there any publication or write up on how exactly the forward-backward
> >> algorithm (for estimation of word posterior probability from a word
> >> lattice) is implemented in lattice-tool ?
> >>
> >What lattice-tool options specifically are you talking about ?
> >
> >Andreas
> >
> >> How similar or different is it from the algorithm described in
> >> "Confidence Measures for Large Vocabulary Continuous Speech
> >> Recognition" by Frank Wessel et. al.
> >>
> >> Many thanks,
> >> Joel.
> >> > >> > > > From stolcke at speech.sri.com Thu Jul 19 11:36:47 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Thu, 19 Jul 2007 11:36:47 -0700 Subject: Test failures on RHEL 5 In-Reply-To: References: Message-ID: <469FAF3F.4070608@speech.sri.com> David Brodbeck wrote: > I'm trying to build SRILM 1.5.2 on Redhat Enterprise Linux Server 5. > The machine type is i686_m64. Everything builds all right, but the > tests fail for make-ngram-pfsg, ngram-class, and > ngram-count-lm-limit-vocab. > > make-ngram-pfsg is the most obvious one, so I'll tackle that one > first. I get the following in the stderr file: > gawk: /opt/srilm/bin/i686-m64/add-pauses-to-pfsg:22: fatal: Invalid > collation character: /[[:lower:]-?]/ > > Has anyone else run into this? I'm using GNU Awk 3.1.5, and the > locale is set to en_US.UTF-8. This is odd since we're also using gawk 3.1.5 and I cannot replicate the problem even when setting LANG to en_US.UTF-8. It seems that the interpretation of gawk regular expressions should not depend on the OS release version, but of course there may always be bugs. ngram-class is very fickle. Small changes in the implementation of math library functions or machine arithmetic can cause small numerical differences and then different clustering decisions as a result. In fact, I get different results with 32bit and 64bit Linux binaries, so don't worry about that one. ngram-count-lm-limit-vocab should work. You can send me more details on how the output differs. Andreas From stolcke at speech.sri.com Mon Jul 23 12:55:18 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Mon, 23 Jul 2007 12:55:18 -0700 Subject: [Fwd: Re: Test failures on RHEL 5 -- fixed!] Message-ID: <46A507A6.30508@speech.sri.com> This resolves an issue brought up on this list previously. Andreas -------------- next part -------------- An embedded message was scrubbed... From: David Brodbeck Subject: Re: Test failures on RHEL 5 -- fixed! 
Date: Mon, 23 Jul 2007 11:47:32 -0700
Size: 3185

From hanisaf at gmail.com Mon Jul 23 13:35:03 2007
From: hanisaf at gmail.com (Hani Safadi)
Date: Mon, 23 Jul 2007 16:35:03 -0400
Subject: unsubscribe
Message-ID: <990817d50707231335q3f27479ckdfa0e7844de71905@mail.gmail.com>

unsubscribe

From desaikey at egr.msu.edu Mon Jul 23 15:08:27 2007
From: desaikey at egr.msu.edu (Keyur Desai)
Date: Mon, 23 Jul 2007 18:08:27 -0400
Subject: unsubscribe
Message-ID: <46A526DB.9040704@egr.msu.edu>

unsubscribe

From svmats at yahoo.com Tue Jul 31 14:43:13 2007
From: svmats at yahoo.com (Mats Svenson)
Date: Tue, 31 Jul 2007 14:43:13 -0700 (PDT)
Subject: tcl and gawk problems when compiling SRIL 1.5.3
Message-ID: <666454.35086.qm@web31602.mail.mud.yahoo.com>

Dear SRILM users,

I have just tried to compile the current SRILM version, and several problems surprised me.

1) As to Tcl, the INSTALL file reads:

"TCL_INCLUDE, TCL_LIBRARY: to whatever is needed to find the Tcl header files and library. If Tcl is not available, set NO_TCL=X and leave the above variables empty."

I have an openSUSE system (10.0) with tcl installed, but there is no "tcl.h" file present and I didn't find an appropriate rpm to obtain it. If I use "NO_TCL=X", will it affect SRILM's functionality? What is Tcl used for there?

2) As to gawk, the INSTALL file reads:

"Recent versions of gawk may not perform correct floating-point arithmetic unless either LC_NUMERIC=C or LC_ALL=C is set in the environment. This affects many of the scripts in utils/."

Does this mean that with non-standard locales, SRILM does not work correctly? Does it affect model parameter estimation?

Thanks,
Mats

From stolcke at speech.sri.com Tue Jul 31 16:15:56 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 31 Jul 2007 16:15:56 -0700
Subject: tcl and gawk problems when compiling SRIL 1.5.3
In-Reply-To: <666454.35086.qm@web31602.mail.mud.yahoo.com>
References: <666454.35086.qm@web31602.mail.mud.yahoo.com>
Message-ID: <46AFC2AC.4020303@speech.sri.com>

Mats Svenson wrote:
> Dear SRILM users,
> I have just tried to compile the current SRILM
> version and several problems surprised me.
>
> 1) As to the tcl, the INSTALL file reads that:
> "TCL_INCLUDE, TCL_LIBRARY: to whatever is needed to
> find the Tcl header files and library.
> If Tcl is not available, set NO_TCL=X and leave
> the above variables empty."
>
> I have an openSUSE system (10.0) with tcl installed,
> but there's no "tcl.h" file present and I didn't find
> appropriate rpm to obtain it. If I use "NO_TCL=X",
> will it affect SRILM's functionality? What is tcl good
> for there?
>

It is only used in some of the development test programs (not needed for a regular build and install). There is no harm in not using it.

> 2) As to gawk, the INSTALL file reads that:
> Recent versions of gawk may not perform correct
> floating-point arithmetic unless either LC_NUMERIC=C
> or LC_ALL=C is set in the environment. This affects
> many of the scripts in utils/.
>
> Does it mean that with non-standard locales, SRILM
> does not work correctly? Does it affect model
> parameters estimation?
>

One user sent me this observation, so I have no idea how widespread a problem it is. SRILM should work fine with exotic locales for almost everything. The only issue is in the gawk scripts that do arithmetic. You should run the test suite to see if it is an issue for you.

Andreas

From stolcke at speech.sri.com Mon Aug 6 10:55:25 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 06 Aug 2007 10:55:25 -0700
Subject: Class-based LM using the SRILM toolkit?
In-Reply-To:
References: <200705300356.l4U3u3R26372@huge>
Message-ID: <46B7608D.6080002@speech.sri.com>

Madhav Krishna wrote:
> Dear Dr. Stolcke,
>
> Thank you for your email. However, we require a little more help. We
> have completed our experiments but have obtained surprising results.
>
> We trained and tested a class-based language model as per your
> instructions. We trained it on 5 training sets drawn from the same
> corpus. These sets were of sizes 300,000 words to 15,000,000 words -
> increasing in steps of 300,000 words. The testing data size was held
> constant at 400,000 sentences. When testing the 5 LMs obtained from
> the training data sets, we observed that the resulting perplexity
> values increased with increase in the size of training data. This is
> indeed contrary to popular findings. In fact, the perplexity values
> obtained were 710, 890, 1150, 1200, 1280.
>
> Could these values have occurred due to my not specifying a vocabulary
> explicitly while training the LMs? I believe that the toolkit adds all
> the words in the training data to the vocabulary by default. But then,
> how does it treat OOVs in the testing set? Also, how does the choice
> of vocabulary affect perplexity?
>

Indeed, you cannot compare perplexities unless the LM vocabulary is constant across models. That's because a larger vocabulary leads to higher inherent uncertainty about the next word. OOVs and words with zero probability are excluded from the perplexity computation, so by fixing the vocabulary you are also fixing the set of excluded words, which again makes the comparison valid.

So, extract your vocabulary from the smallest or the largest of your training sets, and then train all models with -vocab VOCAB. To handle words properly in the class-based LM you might want to stick all unseen words in a special class (which you have to construct separately from ngram-class and add to the class definition file).

Andreas

> I would appreciate your help.
> > Sincerely, > Madhav Krishna > > On 5/30/07, Andreas Stolcke wrote: > >>> Dear Dr. Stolcke, >>> >>> Thank you once again for your invaluable help. >>> >>> I have now developed two LMs using your toolkit - a trigram word-based model >>> and a class-based model (static models). I now want to interpolate them and >>> then apply some form of smoothing on the resultant LM. The ngram program in >>> the toolkit has a -mix-lm option which allows linear interpolation; the >>> manpages for that option mention: >>> >>> "*NOTE: *Unless *-bayes *(see below) is specified, *-mix-lm *triggers a >>> static interpolation of the models in memory. In most cases a more >>> efficient, dynamic interpolation is sufficient, requested by *-bayes >>> 0*.**Also, mixing models of different type ( >>> e.g., word-based and class-based) will *only *work correctly with dynamic >>> interpolation." >>> >>> What is dynamic interpolation? Is it applicable in my case? Can >>> >> Dynamic interpolation means that the probabilities of the interpolated model >> are computed on-the-fly, at test time. >> Static interpolation, by contrast, means that a single model is created >> ahead of testing, containing the interpolated probabilities in the >> usual backoff format. This is only possible for models of the same type, >> as explained in the note above. >> >> >>> mixing/interpolation of these models be perfomed only with the -dynamic >>> option? In that case, how? >>> >> The -dynamic option has nothing to do with dynamic interpolation of the >> kind we are discussing here. >> Dynamic interpolation is enabled by the -bayes option. >> >> >>> Also, what is the -bayes interpolation method about? The manpages say for >>> the -bayes option: >>> "Interpolate the second and the main model using posterior probabilities for >>> local N-gram-contexts of length *length*." >>> What are you referring to by "N-gram contexts"? Are only the posterior >>> probabilities interpolated here? 
If possible, please provide me with a link
>>> to a reference text etc. where I can learn more about this.
>>>
>> For an explanation of Bayesian interpolation please consult the technical
>> report cited at the bottom of the ngram(1) man page. You can get it at
>> http://www.speech.sri.com/cgi-bin/run-distill?papers/lm95-report.ps.gz
>> then check Section 2.3.
>>
>> Andreas
>>
>>
>>
>
>
>

From janeklwb at yahoo.com.cn Wed Aug 8 17:47:35 2007
From: janeklwb at yahoo.com.cn (jane)
Date: Thu, 9 Aug 2007 08:47:35 +0800 (CST)
Subject: question about lattice tool
Message-ID: <808074.51076.qm@web15703.mail.cnb.yahoo.com>

Hi,

I am trying to use lattice-tool.exe to construct a confusion network, but I don't know how to use the option "-init-mesh file". Could you please give me some pointers on this?

Thanks in advance!

Jane
2007-8-9

From stolcke at speech.sri.com Wed Aug 8 20:57:44 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 08 Aug 2007 20:57:44 -0700
Subject: question about lattice tool
In-Reply-To: <808074.51076.qm@web15703.mail.cnb.yahoo.com>
References: <808074.51076.qm@web15703.mail.cnb.yahoo.com>
Message-ID: <46BA90B8.2010208@speech.sri.com>

jane wrote:
> hi,
>
> I try to use lattice-tool.exe construct a confusion
> network, but I don't know how to use the option
> "-init-mesh file", Would you please possibly give me
> some clue about the problem?
>

You don't need to use -init-mesh at all. It is used to align a lattice to a preexisting confusion network. But usually you just build a CN from scratch using only the lattice:

lattice-tool -in-lattice INPUT -write-mesh OUTPUT (other options)

Andreas

> Thanks in advance!
>
> Jane
> 2007-8-9
>

From bond at fgan.de Fri Aug 24 06:47:38 2007
From: bond at fgan.de (Christine de Bond)
Date: Fri, 24 Aug 2007 15:47:38 +0200
Subject: Problem installing SRILM
Message-ID: <46CEE17A.602@fgan.de>

Hello,

I am trying to install SRILM on a Suse Linux 10.1 system. Whenever I type "make World" I get error messages (see below). It seems that libmisc.a is not created, and therefore ngram, ngram-count, disambig, fngram-count, fngram, hidden-ngram, multi-ngram, nbest-lattice, nbest-optimize, ngram-class, anti-ngram, nbest-mix, nbest-pron-score, ngram-merge, segment, segment-nbest are not being installed.

I am new to Linux, and new to MT. I have been trying to install SRILM for weeks, and I don't know what is going wrong. (I did change the environment variables and adjusted the gcc, g++, perl, tcl paths, and I followed the install instructions.) Does anyone have a clue and can give me a hint? Does someone know where the libmisc.a file should come from?

With kind regards,
Christine de Bond

----------------------------------------------------------------------------------------
...
make[2]: Entering directory `/home/bond/SMTSystem/srilm/lm/src'
/usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I/home/bond/ActiveTcl8.4.15.0. -I. -I../../include -u matherr -L../../lib/i686 -g -O3 -o ../bin/i686/ngram ../obj/i686/ngram.o ../obj/i686/liboolm.a -lm -ldl ../../lib/i686/libflm.a ../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a /home/bond/ActiveTcl8.4.15.0./lib/libtcl8.4.so -lm 2>&1 | c++filt

g++: ../../lib/i686/libmisc.a: No such file or directory

/home/bond/SMTSystem/srilm/sbin/decipher-install 0555 ../bin/i686/ngram ../../bin/i686
ERROR: File to be installed (../bin/i686/ngram) does not exist.
ERROR: File to be installed (../bin/i686/ngram) is not a plain file.
Usage: decipher-install ...
mode: file permission mode, in octal
file1 ...
fileN: files to be installed
directory: where the files should be installed

files = ../bin/i686/ngram
directory = ../../bin/i686
mode = 0555

make[2]: [../../bin/i686/ngram] Error 1 (ignored)
touch ../../bin/i686/ngram
...
----------------------------------------------------------------------------------------

From stolcke at speech.sri.com Fri Aug 24 11:10:01 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 24 Aug 2007 11:10:01 -0700
Subject: Problem installing SRILM
In-Reply-To: <46CEE17A.602@fgan.de>
References: <46CEE17A.602@fgan.de>
Message-ID: <46CF1EF9.5040600@speech.sri.com>

Christine de Bond wrote:
> Hello,
>
> I am trying to install SRILM on a Suse Linux 10.1 system.
> Whenever I type "make World" I get error prompts. (see below)
> It seems the libmisc.a is not created, and therefore
> ngram, ngram-count, disambig, fngram-count, fngram, hidden-ngram,
> multi-ngram, nbest-lattice, nbest-optimize, ngram-class, anti-ngram,
> nbest-mix, nbest-pron-score, ngram-merge, segment, segment-nbest
> are not being installed.
>
> I am new to Linux, and new to MT. I tried installing SRILM for weeks,
> and I don't know what goes wrong.
> (I did change the environment variables and adjusted the gcc, g++, perl,
> tcl paths, and I followed the install instructions.)
> Does anyone have a clue and can give me a hint? Does someone know where
> the libmisc.a file should come from?
>
> With kind regards,
> Christine de Bond
>

My guess is you are having trouble with the Tcl library. Please rebuild everything after editing the common/Makefile.machine.i686 file to contain:

NO_TCL = X
TCL_INCLUDE =
TCL_LIBRARY =

Andreas

> ----------------------------------------------------------------------------------------
> ...
> make[2]: Entering directory `/home/bond/SMTSystem/srilm/lm/src'
> /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit
> -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64
> -I/home/bond/ActiveTcl8.4.15.0. -I. -I../../include -u matherr
> -L../../lib/i686 -g -O3 -o ../bin/i686/ngram ../obj/i686/ngram.o
> ../obj/i686/liboolm.a -lm -ldl ../../lib/i686/libflm.a
> ../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a
> /home/bond/ActiveTcl8.4.15.0./lib/libtcl8.4.so -lm 2>&1 | c++filt
>
> g++: ../../lib/i686/libmisc.a: No such file or directory
>
> /home/bond/SMTSystem/srilm/sbin/decipher-install 0555 ../bin/i686/ngram
> ../../bin/i686
> ERROR: File to be installed (../bin/i686/ngram) does not exist.
> ERROR: File to be installed (../bin/i686/ngram) is not a plain file.
> Usage: decipher-install ...
> mode: file permission mode, in octal
> file1 ... fileN: files to be installed
> directory: where the files should be installed
>
> files = ../bin/i686/ngram
> directory = ../../bin/i686
> mode = 0555
>
> make[2]: [../../bin/i686/ngram] Error 1 (ignored)
> touch ../../bin/i686/ngram
> ...
> ----------------------------------------------------------------------------------------
>
>

From gelbart at icsi.berkeley.edu Mon Sep 10 14:11:44 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Mon, 10 Sep 2007 14:11:44 -0700 (PDT)
Subject: SRI LM archives
Message-ID:

Hi Andreas,

As I recall, you once expressed dismay that the only way to get at the srilm-user archives is to request them by email from majordomo. Have you seen this script?

http://www.lunamorena.net/perl/archives.html

It is a Perl CGI script that renders majordomo archive files as web pages. Maybe that would do the trick?

If you don't want to run a CGI script, I guess it would be easy enough to modify this script into a command-line tool that creates a static web page for each archive file. An index page with a link to each archive file could easily be generated from the list of archive filenames, since the filenames encode the month and year. Let me know if you'd like me to take a look at this, since I may have the time to do it.
Similarly, if you are uneasy about exposing list participants' email addresses on the public web, I guess it would also be easy enough to modify the script to strip out the domain names from email addresses. Again, I might have the time to do it.

Regards,
David

From deliverable at gmail.com Tue Sep 11 08:11:30 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Tue, 11 Sep 2007 17:11:30 +0200
Subject: memory-resident LMs for ngram?
Message-ID: <3E6206DB-6143-4320-8EDE-87F48C790426@gmail.com>

Is there an easy way to make ngram load an LM into memory and become a server of perplexity scores for sentences?

Cheers,
Alexy

From stolcke at speech.sri.com Tue Sep 11 14:12:16 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 11 Sep 2007 14:12:16 PDT
Subject: memory-resident LMs for ngram?
In-Reply-To: Your message of Tue, 11 Sep 2007 17:11:30 +0200. <3E6206DB-6143-4320-8EDE-87F48C790426@gmail.com>
Message-ID: <200709112112.l8BLCGB03079@huge>

In message <3E6206DB-6143-4320-8EDE-87F48C790426 at gmail.com> you wrote:
> Is there an easy way to make ngram load an LM into memory and become
> a server of perplexity scores for sentences?

It should be easy using the existing functionality. Write a wrapper script (shell, perl, whatever) that

- invokes ngram -lm LM ... -ppl - -debug 2
- reads input sentences from some defined place and writes them to the std input of ngram (above)
- reads the std output of ngram and reformats it into whatever format is suitable

Using this approach, ngram is invoked only once and the LM is read only once. It will terminate after its std input is closed or it sees end-of-file.
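
The three wrapper steps listed above could be sketched roughly as follows, in Python rather than shell or Perl. This is only an illustration under stated assumptions: the function names (parse_summaries, score_sentences) and the ngram_cmd parameter are made up for this sketch, and the parser relies only on the "logprob= ... ppl= ... ppl1= ..." summary format shown in the ngram -debug 2 output quoted earlier in this archive. It also assumes one summary per input sentence followed by one summary for the whole input, which may not hold for blank lines.

```python
import re
import subprocess

# Matches the summary printed by "ngram -ppl", e.g. (from earlier in this archive):
#   0 zeroprobs, logprob= -12.5385 ppl= 321.879 ppl1= 1363.38
SUMMARY_RE = re.compile(
    r"logprob=\s*(?P<logprob>\S+)\s+ppl=\s*(?P<ppl>\S+)\s+ppl1=\s*(?P<ppl1>\S+)"
)


def _to_float(text):
    """Convert a field to float; return None for non-numeric values such as 'undefined'."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_summaries(output):
    """Return one dict per 'logprob= ... ppl= ... ppl1= ...' summary found in output."""
    return [
        {key: _to_float(value) for key, value in match.groupdict().items()}
        for match in SUMMARY_RE.finditer(output)
    ]


def score_sentences(sentences, lm_path, ngram_cmd="ngram"):
    """Invoke ngram once, feed all sentences on its stdin, and parse its stdout.

    ngram_cmd is a hypothetical parameter naming the ngram executable.
    """
    cmd = [ngram_cmd, "-lm", lm_path, "-ppl", "-", "-debug", "2"]
    text = "".join(sentence.strip() + "\n" for sentence in sentences)
    # Closing stdin (done by subprocess.run) signals end-of-file, so ngram exits.
    result = subprocess.run(cmd, input=text, capture_output=True, text=True, check=True)
    summaries = parse_summaries(result.stdout)
    # Assumption: the last summary covers the whole input; drop it and keep
    # the per-sentence ones.
    return summaries[: len(sentences)]
```

This batch form reads the LM once per call. A long-running server would instead keep a subprocess.Popen object open and write and read incrementally; whether ngram flushes its output after every sentence when writing to a pipe is not something this sketch verifies.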
Andreas

From gelbart at icsi.berkeley.edu Tue Sep 11 18:22:29 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Tue, 11 Sep 2007 18:22:29 -0700 (PDT)
Subject: SRI LM archives
In-Reply-To:
References:
Message-ID:

Dear srilm-user,

I have placed a copy of the srilm-user archives online at

http://www.icsi.berkeley.edu/~gelbart/tmp/srilm-user-www/

Please let me know if you notice any problems with the online archives (the code I used to convert was mostly written today for this purpose). I have checked several months of messages and I haven't noticed any problems so far.

The above URL is just a temporary location until Andreas sets up the archives at SRI. So I don't plan to keep the archives at that URL up-to-date in the future.

Regards,
David

From deliverable at gmail.com Wed Sep 12 08:50:50 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Wed, 12 Sep 2007 17:50:50 +0200
Subject: unicode & many files
Message-ID: <5591C830-E686-4E13-81AB-925A9E8B6140@gmail.com>

How good is the unicode support -- e.g. for utf8? I've fed it some utf8 Cyrillics and it did fine. How does it know we're using multibyte or single byte characters?

Another question -- how do I feed many text files from a directory, should I do multiple -text options after cooking them somehow, or use -read on an accumulating count file?

Cheers,
Alexy

From stolcke at speech.sri.com Wed Sep 12 10:07:06 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 12 Sep 2007 10:07:06 -0700
Subject: unicode & many files
In-Reply-To: <5591C830-E686-4E13-81AB-925A9E8B6140@gmail.com>
References: <5591C830-E686-4E13-81AB-925A9E8B6140@gmail.com>
Message-ID: <46E81CBA.8050205@speech.sri.com>

Alexy Khrabrov wrote:
> How good is the unicode support -- e.g. for utf8? I've fed it some
> utf8 Cyrillics and it did fine. How does it know we're using
> multibyte or single byte characters?

SRILM is oblivious to character sets. It uses whitespace to delimit words, but doesn't analyze them further.
As long as words are separated by ASCII whitespace most functions will work with any character set. An exception to the above is the lower-case mapping enabled by the -tolower option of various tools. This requires that your operating system knows how to map characters to lowercase via the tolower() library function. This will interact with the locale setting which is typically controlled by environment variables. But again, this is all outside SRILM, it's implemented by the OS and C library functions. > > Another question -- how do I feed many text files from a directory, > should I do multiple -text options after cooking them somehow, or use > -read on an accumulating count file? You use Unix tools: cat foo/file.* | ngram-count -text - ... or find directory -type f (other options to select the right files) | xargs cat | ngram-count -text - .... Creating separate count files and then cat-ing them together is also an option. Andreas From stolcke at speech.sri.com Wed Sep 12 10:01:55 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Wed, 12 Sep 2007 10:01:55 -0700 Subject: SRI LM archives In-Reply-To: References: Message-ID: <46E81B83.7050603@speech.sri.com> David Gelbart wrote: > Dear srilm-user, > > I have placed a copy of the srilm-user archives online at > > http://www.icsi.berkeley.edu/~gelbart/tmp/srilm-user-www/ > > Please let me know if you notice any problems with the online archives > (the code I used to convert was mostly written today for this > purpose). I have checked several months of messages and I haven't > noticed any problems so far. > > The above URL is just a temporary location until Andreas sets up the > archives at SRI. So I don't plan to keep the archives at that URL > up-to-date in the future. Thanks very much for doing this, David! The srilm-user archive is now hosted at SRI in http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/ with a link from the SRILM home page. I also added a search function. 
Note this also makes it possible (which it wasn't before) for people who are not subscribed to srilm-user to access the contributions of the list.

Andreas

From gelbart at icsi.berkeley.edu Fri Sep 14 15:50:55 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Fri, 14 Sep 2007 15:50:55 -0700 (PDT)
Subject: nbest-rover-acoustic test failing
In-Reply-To: <46E81B83.7050603@speech.sri.com>
References: <46E81B83.7050603@speech.sri.com>
Message-ID:

Hi,

I have built SRILM 1.5.3 under Fedora Core 3 and Ubuntu 6.06. The nbest-rover-acoustic test fails for me because stdout differs from the reference output.

Below, I have included the beginning of my output and the reference output. On line 10, puh_f and pum_f in the reference output are replaced with puh and pum in my output. On line 16, 0.958381 in the reference output is replaced with 0.958202 in my output, and similarly for several of the other numbers. The same kinds of differences (sometimes a missing _f after a phone name, and sometimes slightly different numbers) continue later on in my output, and there are also cases where different words are recognized. I have placed the full outputs at http://www.icsi.berkeley.edu/~gelbart/sriTest.tar

Does anyone have suggestions about what might be causing this? I have set LANG=C, LC_NUMERIC=C, and LC_ALL=C.
The beginning of my output, with line numbers: [root at localhost test]# head -16 output/nbest-rover-acoustic.i686.stdout | cat -n 1 name sw_40008_A_0015814_0016128 2 numaligns 16 3 posterior 1 4 align 0 *DELETE* 0.999999 uhhuh 1.28131e-06 5 reference 0 *DELETE* 6 info 0 uhhuh 0.55 0.54 -401.471 -14.9418 m:hh:pum 3:43:8 7 align 1 uhhuh 0.998465 um 0.00128149 uh 0.000207527 huh 4.63062e-05 *DELETE* 0 8 reference 1 *DELETE* 9 info 1 uhhuh 0.55 0.54 -401.471 -14.9418 m:hh:pum 3:43:8 10 info 1 um 0.58 0.21 -154.354 -13.163 puh:pum 18:3 11 info 1 uh 0.58 0.43 -308.707 -17.7878 puh 43 12 info 1 huh 0.57 0.22 -170.807 -17.7878 hh:pum 12:10 13 align 2 *DELETE* 1 [laugh] 9.69422e-10 14 reference 2 *DELETE* 15 info 2 [laugh] 1.08 0.17 -160.491 -18.1436 lau:lau 14:3 16 align 3 *DELETE* 0.958202 [mouth] 0.0325037 uhhuh 0.00557467 [laugh] 0.00342208 [noise] 0.000275018 yeah 1.44687e-05 is 4.34731e-06 oh 3.83087e-06 huh 1.56753e-07 @reject@ 1.10134e-12 it 9.69716e-13 The beginning of the reference output, with line numbers: [root at localhost test]# head -16 reference/nbest-rover-acoustic.stdout | cat -n 1 name sw_40008_A_0015814_0016128 2 numaligns 16 3 posterior 1 4 align 0 *DELETE* 0.999999 uhhuh 1.28131e-06 5 reference 0 *DELETE* 6 info 0 uhhuh 0.55 0.54 -401.471 -14.9418 m:hh:pum 3:43:8 7 align 1 uhhuh 0.998465 um 0.00128149 uh 0.000207527 huh 4.63062e-05 *DELETE* 0 8 reference 1 *DELETE* 9 info 1 uhhuh 0.55 0.54 -401.471 -14.9418 m:hh:pum 3:43:8 10 info 1 um 0.58 0.21 -154.354 -13.163 puh_f:pum_f 18:3 11 info 1 uh 0.58 0.43 -308.707 -17.7878 puh 43 12 info 1 huh 0.57 0.22 -170.807 -17.7878 hh:pum 12:10 13 align 2 *DELETE* 1 [laugh] 9.69422e-10 14 reference 2 *DELETE* 15 info 2 [laugh] 1.08 0.17 -160.491 -18.1436 lau:lau 14:3 16 align 3 *DELETE* 0.958381 [mouth] 0.0323282 uhhuh 0.00557468 [laugh] 0.00341845 [noise] 0.000274616 yeah 1.44687e-05 is 4.34731e-06 oh 3.83087e-06 huh 1.56753e-07 @reject@ 1.10134e-12 it 9.69716e-13 Thanks, David From gelbart at icsi.berkeley.edu 
Fri Sep 14 15:59:38 2007 From: gelbart at icsi.berkeley.edu (David Gelbart) Date: Fri, 14 Sep 2007 15:59:38 -0700 (PDT) Subject: nbest-rover-acoustic test failing In-Reply-To: References: <46E81B83.7050603@speech.sri.com> Message-ID: On Fri, 14 Sep 2007, David Gelbart wrote: > I have built SRILM 1.5.3 under Fedora Core 3 and Ubuntu 6.06. The > nbest-rover-acoustic test fails for me because stdout differs from the > reference output. Oops, I should give some more detail. My CPU is a Pentium 4. I am running these operating systems under VMware, with Windows XP as the host. My gawk is GNU Awk 3.1.5. Regards, David From stolcke at speech.sri.com Fri Sep 14 20:29:14 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Fri, 14 Sep 2007 20:29:14 -0700 Subject: nbest-rover-acoustic test failing In-Reply-To: References: <46E81B83.7050603@speech.sri.com> Message-ID: <46EB518A.5030405@speech.sri.com> David Gelbart wrote: > Hi, > > I have built SRILM 1.5.3 under Fedora Core 3 and Ubuntu 6.06. The > nbest-rover-acoustic test fails for me because stdout differs from the > reference output. > > Below, I have included the beginning of my output and the reference > output. On line 10, puh_f and pum_f in the reference output are > replaced with puh and pum in my output. On line 16, 0.958381 in the > reference output is replaced with 0.958202 in my output, and similarly > for several of the other numbers. The same kind of differences > (sometimes missing _f after phone name, and sometimes slightly > different numbers) continue later on in my output, and there are also > cases where different words are recognized. I have placed the full > outputs at http://www.icsi.berkeley.edu/~gelbart/sriTest.tar > > Does anyone have suggestions about what might be causing this? I have > set LANG=C, LC_NUMERIC=C, and LC_ALL=C. It's a bug in the reference output. 
There was an update to the handling of phone labels with diacritics ("_f") in nbest-rover-acoustic, in release 1.5.3, but I never regenerated the reference output for this test. Your output is in fact correct. If you want you can download 1.5.4-beta and grab the reference output in it.

Andreas

From stolcke at speech.sri.com Tue Sep 18 09:26:57 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 18 Sep 2007 09:26:57 PDT
Subject: x86_m64
In-Reply-To: Your message of Tue, 18 Sep 2007 17:13:07 +0100. <200709181713.07957.sadafre@computing.dcu.ie>
Message-ID: <200709181626.l8IGQvh25371@claudio>

In message <200709181713.07957.sadafre at computing.dcu.ie> you wrote:
> Hi,
>
> I wanted to install SRILM on x86_m64 Suse linux machine. Which of the make
> files are appropriate or close to this platform? I do not see a makefile
> which corresponds to this machine type.

You can use MACHINE_TYPE=i686-m64 (assuming you have 64-bit gcc installed).

Andreas

From stolcke at speech.sri.com Thu Sep 20 11:16:21 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 20 Sep 2007 11:16:21 -0700
Subject: srilm toolkit
In-Reply-To: <8CE2D832-8777-4256-BE5D-90A098D10043@ehu.es>
References: <8CE2D832-8777-4256-BE5D-90A098D10043@ehu.es>
Message-ID: <46F2B8F5.4030904@speech.sri.com>

Raquel Justo wrote:
> Dear Dr. Stolcke,
> I have seen in "SRILM - AN EXTENSIBLE LANGUAGE MODELING TOOLKIT"
> article that the srilm toolkit deals with class N-gram LMs and that it
> allows class members to be multiword strings.
> Although I have read the manual pages and seen that the "n-gram"
> command has several options as "-expand-classes k" and "-expand-exact
> k" for class expansion, I do not really understand how it works. Would
> you mind telling me where I could find further information related to
> this issue?
>
> I am working with class-based LMs and I propose the use of class
> n-gram LMs (where classes are made up of "multiword" strings or
> "subsequences of words") in two different ways:
> - In a first approach a multiword string is considered as a new
> lexical unit generated by joining the words and it is treated as a
> unique token. (e.g. "san_francisco", P(C_CITY_NAME)*P("san_francisco"|
> C_CITY_NAME))
> - Instead, in a second approach, the words (taking part in the
> multiword string) are separately studied and the conditioned
> probabilities are calculated. Thus, a class n-gram LM is generated on
> the one hand, and on the other hand a word n-gram LM is generated
> within each class. (e.g. "san francisco",
> P(C_CITY_NAME)*P(san|C_CITY_NAME)*P(francisco|san, C_CITY_NAME)).

It looks to me like your second approach is equivalent to the first, modulo smoothing effects achieved by the different backing off distributions you might use in estimating the component probabilities.

>
> I send in an attached file a paper published in the "IEEE workshop on
> machine learning and signal processing" explaining better the two
> approaches.
>
> Does the -expand-classes or the -expand-exact option do something
> similar to what the aforementioned approaches do? or does it adapt the
> class n-gram LM to a word n-gram LM considering that the words take
> into account the information related to the classes (e.g.
> P(san#C_CITY_NAME)*P(francisco#C_CITY_NAME|san#C_CITY_NAME))?

Here is a high-level description of what -expand-classes does:

1) generate a list of all word ngrams obtained by replacing the class tokens in the given LM.
2) for each word ngram thus obtained:
   a) compute the joint probability p of the entire word ngram, according to the original class LM
   b) compute the joint probability q of the prefix (excluding the last word) of the ngram
   c) compute the conditional ngram probability as p/q.
3) insert the newly generated word ngrams into the original LM, remove the class-based ngrams
4) recompute backoff weights (renormalize the model)

Andreas

From stolcke at speech.sri.com Thu Sep 20 12:22:42 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 20 Sep 2007 12:22:42 PDT
Subject: problem with multiwords splitting and meshes
In-Reply-To: Your message of Thu, 20 Sep 2007 15:06:56 +0200.
Message-ID: <200709201922.l8KJMgc01722@huge>

>
> Dear Andreas,
> I am working with word-lattices containing multi-words.
>
> I need to extract meshes from them,
> but I noticed a wrong behavior by just using
> the parameters "-split-multiwords"
>
> This is due to the fact, I think, that the additional nodes are set
> with "wrong" timestamps
> (equal to the timestamp of the original endnode) as I can see when
> saving in htk format instead.

Yes, that is expected. If you have no sub-word (phone) alignment information, there is no way to assign time stamps to the components of a multiword.

>
> This fact should be solved by version 1.5.3 by means of parameter
> "-multiword-dictionary".

That's what it was made for.

> Unfortunately I am not able to use it correctly.
>
> I run the following command
>
> cat example.lat | lattice-tool -htk-acscale 1 -htk-lmscale 14.766 -htk-wdpenalty -3 -in-lattice - -read-htk -out-lattice - -write-htk -split-multiwords -multiword-dictionary multiword.lexicon
>
> and I got the following error message
>
> Lattice::splitHTKMultiwordNodes: no pronunciation on multiword node we_will
>
> I attached a very small (artificial) lattice "example.lat" and a real
> lattice "example2.lat".
>
> The file multiword.lexicon contains lines like the following
> we_will w iy | w el
>
>
> So I would ask you if you can please help me.
>
> Specifically, I have some specific questions
>
> - Is the format of the file with the multiword lexicon correct

Yes.

> - Do I need also the lexicon dictionary? Something like the following?
> we w iy
> will w el

No.

> - Do I miss anything else?

Yes. Look at the error message: "no pronunciation on multiword node". If you have no pronunciation information in the original lattice you cannot infer the alignment of the split multiword. The pronunciation and phone alignment format for HTK lattices may not be well documented. It consists of a string of phone labels and durations separated by commas and colons. In your case, the node for we_will would need to look like this:

J=1 S=0 E=1 W=we_will v=3 a=-200 l=-4 d=:w,0.1:iy,0.2:w,0.1:el,0.2:

AND the phone string needs to correspond exactly to an entry in your multiword dictionary with boundary marker (as it does in this case).

I have no idea how you would get your decoder to output this information. You might be able to "fake it" by (1) looking up the pronunciation variant (3 in this case) in your decoding dictionary, and (2) making assumptions about the relative durations of the phones (you can get the total word duration from the lattice node times). You would then have to insert properly formatted "d=" fields into the lattices before sending the lattice to lattice-tool.

> - What happens to the scores of the edge corresponding to the multiword?

All the scores are retained on the first multiword component; the remaining components get 0 scores (so the total score along the path is unchanged).

> In other words, how can I generate a new lattice with multiwords
> split over several edges,
> containing "correct" scores and times, somehow proportional to the
> "length" of each component word?

If you want to split multiword nodes using a different strategy from what is described above you can implement it yourself, either as a preprocessing step or by modifying the function Lattice::splitHTKMultiwordNodes() in lattice/src/HTKLattice.cc.
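The "fake it" workaround suggested in this reply (building a "d=" field from a dictionary pronunciation plus assumed phone durations) might be sketched as below. This is a rough illustration, not part of SRILM: the helper name make_d_field is hypothetical, each phone is assumed to be written as label,duration between colons, and the total word duration is simply divided evenly among the phones, which is only one possible assumption about relative durations.

```python
def make_d_field(phones, total_duration):
    """Format phones as a 'd=' field of colon-separated label,duration
    pairs, giving every phone an equal share of the total duration."""
    if not phones:
        raise ValueError("need at least one phone")
    per_phone = total_duration / len(phones)
    parts = "".join(f":{phone},{per_phone:.2f}" for phone in phones)
    return "d=" + parts + ":"


# Pronunciation taken from the multiword lexicon line "we_will w iy | w el";
# the "|" boundary marker is dropped because it is not a phone.
entry = "w iy | w el"
phones = [p for p in entry.split() if p != "|"]
print(make_d_field(phones, 0.6))  # d=:w,0.15:iy,0.15:w,0.15:el,0.15:
```

The total duration (0.6 here, matching the sum of the durations in the example node) would in practice come from the start and end times of the lattice node, and the resulting field would be added to that node before the lattice is passed to lattice-tool.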
Andreas

From stolcke at speech.sri.com Fri Sep 21 12:53:39 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 21 Sep 2007 12:53:39 PDT
Subject: srilm toolkit
In-Reply-To: Your message of Fri, 21 Sep 2007 13:16:58 +0200. <9F79B713-3894-4455-9179-E29269B4A5EE@ehu.es>
Message-ID: <200709211953.l8LJrdn11743@huge>

>
> On 20/09/2007, at 20:16, Andreas Stolcke wrote:
>
> > Raquel Justo wrote:
> >>
> >> I am working with class-based LMs and I propose the use of class
> >> n-gram LMs (where classes are made up of "multiword" strings or
> >> "subsequences of words") in two different ways:
> >> - In a first approach a multiword string is considered as a new
> >> lexical unit generated by joining the words and it is treated as a
> >> unique token. (e.g. "san_francisco",
> >> P(C_CITY_NAME)*P("san_francisco"| C_CITY_NAME))
> >> - Instead, in a second approach, the words (taking part in the
> >> multiword string) are separately studied and the conditioned
> >> probabilities are calculated. Thus, a class n-gram LM is generated
> >> on the one hand, and on the other hand a word n-gram LM is
> >> generated within each class. (e.g. "san francisco",
> >> P(C_CITY_NAME)*P(san|C_CITY_NAME)*P(francisco|san, C_CITY_NAME)).
> > It looks to me like your second approach is equivalent to the
> > first, modulo smoothing effects achieved by the different backing
> > off distributions you might use in estimating the component
> > probabilities.
>
> I don't know if I have understood very well what you want to say but
> I think that using backing off smoothing the first approach is
> different from the second one because different combinations of all
> the words belonging to a class are allowed and in the second approach
> instead, only the considered subsequences of words are allowed
> because they are treated as unigrams inside each class.
> I think that even when no smoothing is considered the first approach can
> generalize better due to the fact that n-gram models themselves
> generalize on the training data.

You are right. That's actually what I meant by "different backing off".

> >>
> >> I send in an attached file a paper published in the "IEEE workshop
> >> on machine learning and signal processing" explaining better the
> >> two approaches.
> >>
> >> Does the -expand-classes or the -expand-exact option do something
> >> similar to what the aforementioned approaches do? or does it adapt the
> >> class n-gram LM to a word n-gram LM considering that the words
> >> take into account the information related to the classes (e.g.
> >> P(san#C_CITY_NAME)*P(francisco#C_CITY_NAME|san#C_CITY_NAME))?
> > Here is a high-level description of what -expand-classes does:
> >
> > 1) generate a list of all word ngrams obtained by replacing the
> > class tokens in the given LM.
> > 2) for each word ngram thus obtained:
> > a) compute the joint probability p of the entire word
> > ngram, according to the original class LM
>
> Would you mind telling me how you compute this probability when
> multiwords are considered?
> do you consider the multiword as a unique token or do you estimate
> the conditional probabilities between the words that make up the
> multiword?

Are you talking about multiwords that are joined by underscores (as handled by the -multiwords option)? In that case there is no special processing for them in ngram -expand-classes. The class mechanism treats multiwords as regular word tokens.

If you are asking about class expansions that contain multiple words separated by spaces (e.g. CITY -> San Francisco) then the answer is that the expansion algorithm deals with them just fine. The algorithm I outlined above handles this case quite naturally.
I forgot to mention one feature of the expansion algorithm: If the same word ngram can be generated by expanding different class ngrams then the corresponding joint probabilities are added, as they should be.

Andreas

From oatgnaw at gmail.com Wed Sep 26 05:20:09 2007
From: oatgnaw at gmail.com (=?GB2312?B?zfXMzg==?=)
Date: Wed, 26 Sep 2007 20:20:09 +0800
Subject: some questions about confusion network
Message-ID: <46fa4e78.07ec720a.1112.415a@mx.google.com>

Hi,

I've tried to use SRILM to construct a confusion network, but I ran into some problems.

1 command: lattice-tool -in-lattice INPUT -write-mesh OUTPUT
I used this command to construct a confusion network, and the input lattice was in PFSG format. However, I didn't get the right answer. I wonder whether the problem lies in the probability of each transition. What is the meaning of that probability, and how is it calculated?

2 I can't use SRILM to directly convert a lattice in wlat format to a confusion network. Only lattices in PFSG format can be converted to a confusion network. Right?

3 In the conversion, no time information is used. Right?

4 How do I combine two confusion networks into a big one?
nbest-lattice -use-mesh -lattice-files mesh.filelist -write mesh.output

Thank you very much.