From nayeem at cs.iitm.ernet.in Tue Jan 4 05:07:12 2005
From: nayeem at cs.iitm.ernet.in (Nayeem)
Date: Tue, 4 Jan 2005 18:37:12 +0530 (IST)
Subject: Rescoring query
In-Reply-To: <200412312046.MAA06884@tonga>
Message-ID:

Sir,

I have a language model (long span LM) that gives a sequence of
probabilities for each word in a test sentence. I cannot write this
language model in ARPA format. The query I have is: can I integrate these
probabilities into a lattice or N-best lists for rescoring?

In detail, the problem is this: for a test sentence/utterance I can get a
lattice or N-best list generated using HTK. For the same sentence
(assuming I have the transcription) I can get the probabilities for each
word in the sentence using my long span LM. How can I integrate/rescore
the lattice or N-best list using the tools in the SRI-LM toolkit? What
tools and options should I use?

Kindly help me in this regard. Any suggestions are welcome.

Any pointers to important papers on rescoring are also requested.

A. Nayeemulla Khan
Research scholar
IIT Madras
India

From solen.quiniou at irisa.fr Mon Jan 31 09:46:43 2005
From: solen.quiniou at irisa.fr (Solen Quiniou)
Date: Mon, 31 Jan 2005 18:46:43 +0100
Subject: once occuring trigram discarded
Message-ID: <41FE6F03.5040103@irisa.fr>

Hi,

I made a trigram model using modified Kneser-Ney smoothing and
interpolation, and I don't understand why there are only 5828 trigrams in
the model whereas there are 102520 trigrams in the corpus. I think that
the discarded trigrams are the ones that occur just once, because there
are 96692 trigrams occurring once, which is exactly the difference between
the number of trigrams in the corpus and the number of trigrams in the
model. I tried other smoothing methods, and even no smoothing, but the
trigrams are discarded every time.

I don't understand why, since the bigrams occurring once (there are 58764
of them) are not discarded in the bigram model I built using modified
Kneser-Ney smoothing and interpolation.

Thanks a lot for your answer.

Solen.
--
Solen Quiniou (Solen.Quiniou at irisa.fr)
Doctorante, Équipe IMADOC - bureau C303
IRISA-INRIA, Campus de Beaulieu
35042 Rennes cedex, France
Tél: +33 (0) 2 99 84 22 35
Fax: +33 (0) 2 99 84 71 71

From stolcke Mon Jan 31 10:01:16 2005
From: stolcke (Andreas Stolcke)
Date: Mon, 31 Jan 2005 10:01:16 PST
Subject: once occuring trigram discarded
Message-ID: <200501311801.j0VI1GH22094@speech.sri.com>

In message <41FE6F03.5040103 at irisa.fr> you wrote:
> Hi,
> I made a trigram model using Kneser-Ney modified smoothing and
> interpolation and I don't understand why there are only 5828 trigrams in
> the model whereas there are 102520 trigrams in the corpus. I think that
> the trigrams discarded occur just once because there are 96692 trigrams
> occurring once which is the difference between the trigrams in the corpus
> and the trigrams in the model. I tried to use other smoothing and even no
> smoothing but every time the trigrams are discarded.
> I don't understand why since the bigrams occurring once (there are 58764
> of such bigrams) are not discarded in the bigram model I built using
> Kneser-Ney modified smoothing and interpolation.

The default minimum count for trigrams (and higher orders) is 2, so
trigrams that occur only once are dropped.
The default minimum count for unigrams and bigrams is 1.
Use

    ngram-count -gt3min 1

to include all trigrams.

    ngram-count -help

displays the default values for all the options.

--Andreas

From stolcke at speech.sri.com Thu Jan 6 23:25:16 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 06 Jan 2005 23:25:16 PST
Subject: Rescoring query
In-Reply-To: Your message of Tue, 04 Jan 2005 18:37:12 +0530.
Message-ID: <200501070725.XAA09190@huge>

In message you wrote:
> Sir,
>
> I have a language model (long span LM) that gives a sequence of
> probabilities for each word in a test sentence. I cannot write this
> language model in arpa format. The query I have is, can I integrate these
> probabilities into a lattice or N-best lists for rescoring.
>
> In detail the problem is, for a test sentence/utterance I can get a
> lattice or N-best list generated using HTK. For the same sentence
> (assuming I have the transcription) I can get the probabilities for each
> word in the sentence using my long span LM. How can I integrate/rescore
> the lattice or N-best list using the tools in the SRI-LM toolkit. What
> tools and options should I use.

The following approaches use the SRILM tools at a high level to integrate
LM scores that you compute externally.

For N-best lists:

1. Generate N-best lists using HTK, then convert them into the 3rd format
   described in nbest-format(5).

2. Generate your own LM scores by whatever means, and store them in a
   separate directory. For example, if one of your waveform names is
   abcde.wav, then format the corresponding N-best LM scores into a single
   column of numbers and put them in a file called DIR/abcde or
   DIR/abcde.gz (compressed).

3. Use the rescore-reweight script (see the nbest-scripts(1) man page) to
   combine the standard scores with your own and extract the 1-best
   hypotheses.

4. You can use the nbest-optimize(1) tool to tune the score combination
   weights on a held-out set.

For lattices:

1. Generate lattices using HTK.

2. Generate your own LM scores by whatever means and insert them into the
   lattices. You can either replace the original LM scores (by modifying
   the "l=" fields), or add them as a separate set of scores. In SRILM you
   can use the "x1=", "x2=", etc. fields to add arbitrary additional
   scores to lattice nodes or links.
   Note this assumes you can somehow compute LM scores on a word-by-word
   basis. This might not be simple, especially if your LM is "long-span",
   and might require expanding the lattice etc.

3. Use lattice-tool(1) to combine the old and new scores in a weighted
   fashion and extract the 1-best hypotheses.

The other possibility is to implement your LM in C++ as an LM class in the
SRILM framework.
This is a fair amount of work and would require some study of the
existing code, but it would ultimately allow you to use your LM seamlessly
in all the SRILM tools, for perplexity computation, nbest rescoring,
lattice expansion, etc. (I'm assuming you probably do NOT want to attempt
this for now.)

> Kindly help me in this regard. Any suggestions are welcome.
>
> Any pointers to important papers on rescoring are also requested.

The classic paper on rescoring is

    M. Ostendorf, A. Kannan, S. Austin, O. Kimball, R. Schwartz,
    J. R. Rohlicek, Integration of diverse recognition methodologies
    through reevaluation of N-best sentence hypotheses, Proceedings
    DARPA workshop on Speech and natural language, Pacific Grove,
    California, pp. 83-87, Morgan Kaufmann Publishers Inc.,
    San Francisco, CA, 1991.

The framework has been extremely popular since, and there have probably
been hundreds of other papers on it.

--Andreas

From dtwitchell at cmi.arizona.edu Wed Jan 26 09:33:10 2005
From: dtwitchell at cmi.arizona.edu (Twitchell, Doug)
Date: Wed, 26 Jan 2005 10:33:10 -0700
Subject: problems compiling on alpha
Message-ID: <270593C43CEE6E42A84D7F860469E9120290C7@grande.CMI.arizona.edu>

I am attempting to compile srilm on the following machine:

Machine: HP/Compaq Alpha GS1280
OS: Tru64 Unix
Compiler: gcc 3.4.3
Make: GNU make 3.80

Everything compiles cleanly except the "ngram" executable (which, of
course, is one that I need to use).

This is the error it returns during the make:

g++ -mieee-with-inexact -Wreturn-type -Wimplicit -DINSTANTIATE_TEMPLATES
-I~/tcl/generic -I.
-I/home4/u12/dtwitche/srilm/include -u matherr
-L/home4/u12/dtwitche/srilm/lib/alpha -g3 -O2 -o ../bin/alpha/ngram
../obj/alpha/ngram.o -L/home4/u12/dtwitche/srilm/lib/alpha
../obj/alpha/liboolm.a -lm -lflm -ldstruct -lmisc -L~/tcl/unix -ltcl -lm
2>&1 | c++filt
/usr/bin/ld:
../obj/alpha/liboolm.a(CacheLM.o): LHash<unsigned int, double>::removedData: multiply defined
../obj/alpha/liboolm.a(CacheLM.o): global constructors keyed to
_ZN5LHashIjdE11removedDataE: multiply defined
../obj/alpha/liboolm.a(CacheLM.o): global destructors keyed to
_ZN5LHashIjdE11removedDataE: multiply defined
../obj/alpha/liboolm.a(CacheLM.o):
_GLOBAL__F__ZN5LHashIjdE11removedDataE: multiply defined
collect2: ld returned 1 exit status

Any ideas on how to resolve this?

Thanks,

Doug

From stolcke at speech.sri.com Fri Feb 4 11:59:33 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 04 Feb 2005 11:59:33 PST
Subject: problems compiling on alpha
In-Reply-To: Your message of Wed, 26 Jan 2005 10:33:10 -0700. <270593C43CEE6E42A84D7F860469E9120290C7@grande.CMI.arizona.edu>
Message-ID: <200502041959.j14JxXC19256@huge>

I don't have access to an Alpha system anymore.
Your linker might require a flag that tells it to merge multiple
definitions of the same symbol.
On Solaris that is ld -z muldefs, and you would invoke the compiler with

    -Wl,-z,muldefs

Check your ld man page to find something similar.

--Andreas

In message <270593C43CEE6E42A84D7F860469E9120290C7 at grande.CMI.arizona.edu> you wrote:
> I am attempting to compile srilm on the following machine:
>
> Machine: HP/Compaq Alpha GS1280
> OS: Tru64 Unix
> Compiler: gcc 3.4.3
> Make: GNU make 3.80
>
> Everything compiles cleanly except the "ngram" executable (which, of
> course, is one that I need to use).
>
> This is the error it returns during the make:
>
> g++ -mieee-with-inexact -Wreturn-type -Wimplicit -DINSTANTIATE_TEMPLATES
> -I~/tcl/generic -I.
> -I/home4/u12/dtwitche/srilm/include -u matherr
> -L/home4/u12/dtwitche/srilm/lib/alpha -g3 -O2 -o ../bin/alpha/ngram
> ../obj/alpha/ngram.o -L/home4/u12/dtwitche/srilm/lib/alpha
> ../obj/alpha/liboolm.a -lm -lflm -ldstruct -lmisc -L~/tcl/unix -ltcl -lm
> 2>&1 | c++filt
> /usr/bin/ld:
> ../obj/alpha/liboolm.a(CacheLM.o): LHash<unsigned int, double>::removedData: multiply defined
> ../obj/alpha/liboolm.a(CacheLM.o): global constructors keyed to
> _ZN5LHashIjdE11removedDataE: multiply defined
> ../obj/alpha/liboolm.a(CacheLM.o): global destructors keyed to
> _ZN5LHashIjdE11removedDataE: multiply defined
> ../obj/alpha/liboolm.a(CacheLM.o):
> _GLOBAL__F__ZN5LHashIjdE11removedDataE: multiply defined
> collect2: ld returned 1 exit status
>
> Any ideas on how to resolve this?
>
> Thanks,
>
> Doug
>

From stolcke at speech.sri.com Fri Jan 28 20:12:59 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 28 Jan 2005 20:12:59 PST
Subject: SRILM windows binaries sought
Message-ID: <200501290412.UAA16574@huge>

If someone on this list can kindly help Fatma, please email him directly.

thanks

--Andreas

------- Forwarded Message

Subject: Your help is appreciated!
Date: Sat, 29 Jan 2005 08:06:28 +0400
Message-ID: <32FAD195D3CB674DA423D79021F1C82BB34309 at exbe1.sharjah.uos.edu>
From: "Fatima Al Shamsi"
To:

Dear Dr. Stolcke

I'm at the final stages of my thesis in Arabic information extraction. I
faced a lot of errors while building the SRILM source files (I'm working
on the Windows platform and using the Cygwin package). On the other hand,
I was able to build a lot of other files using the Cygwin package without
any problems.
Could you please send me the executable files only, so that I can run
them to create and train a trigram model?

I appreciate your help

Thank you

Yours
Fatma al shamsi
University of Sharjah / UAE
fshamsi at sharjah.ac.ae

From mlebeau at stanford.edu Tue Feb 22 18:15:45 2005
From: mlebeau at stanford.edu (Mike LeBeau)
Date: Tue, 22 Feb 2005 18:15:45 -0800
Subject: "format error in lattice file"?
Message-ID:

Hi folks,

I've used lattice-tool to take a lattice file in HTK SLF format and
convert it to a file in PFSG format. This seems to have worked okay;
looking at the newly created file, it seems to correspond to the format
described in the pfsg-format man page. I used the following syntax to
create the file:

lattice-tool -read-htk -in-lattice htklattice.lat -out-lattice
pfsglattice.lat

However, now I'm trying to use this new lattice file as the input to
nbest-lattice, to create an n-best list. Here's the syntax I'm trying
to use:

nbest-lattice -read pfsglattice.lat -write-nbest nbest.txt

This gives me the error:

pfsglattice.lat: line 2: unknown keyword
format error in lattice file

Looking over the file, it seems fine to me, and since the LM tools
themselves created this file, I assumed it would work. So what am I
missing?

Thanks for any help you can provide.

-mike

From stolcke at speech.sri.com Tue Feb 22 18:53:11 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 22 Feb 2005 18:53:11 PST
Subject: "format error in lattice file"?
In-Reply-To: Your message of Tue, 22 Feb 2005 18:15:45 -0800.
Message-ID: <200502230253.j1N2rB705117@huge>

In message you wrote:
> Hi folks,
>
> I've used lattice-tool to take a lattice file in HTK SLF format and
> convert it to a file in PFSG format. This seems to have worked okay,
> looking at the newly created file, it seems to correspond to the format
> described in the pfsg-format man page.
> I used the following syntax to
> create the file:
>
> lattice-tool -read-htk -in-lattice htklattice.lat -out-lattice
> pfsglattice.lat
>
> However now I'm trying to use this new lattice file as the input to
> nbest-lattice, to create an n-best list. Here's the syntax I'm trying
> to use:
>
> nbest-lattice -read pfsglattice.lat -write-nbest nbest.txt

nbest-lattice is the wrong tool. It makes lattices from nbest lists, not
the other way around ;-)
Also, it deals with yet another lattice format, which is described in
wlat-format(5).

What you want is the lattice-tool -nbest-decode function. This generates
nbest lists from lattices, and you don't even have to convert them first.
In fact, it is preferable to generate nbest lists directly from the HTK
lattices.
Make sure you have SRILM 1.4.3.

--Andreas

From tanel.alumae at aqris.com Wed Feb 23 00:11:00 2005
From: tanel.alumae at aqris.com (Tanel =?ISO-8859-1?Q?Alum=E4e?=)
Date: Wed, 23 Feb 2005 10:11:00 +0200
Subject: Interpolating with -lambda 1.0
Message-ID: <1109146260.28101.1.camel@markov>

Hello,

I'm a bit confused about interpolation.
I want to calculate a test text's perplexity using different interpolation
weights (lambdas). Everything is OK until I set lambda to 1.0. Shouldn't
I then get the same perplexity as when using only the base language model?
This doesn't seem to be the case:

$ ngram -lm trigram.arpa -ppl
file : 2394 sentences, 29475 words, 1224 OOVs
0 zeroprobs, logprob= -86274.9 ppl= 653.583 ppl1= 1132.06

$ ngram -lm trigram.arpa -ppl -classes -mix-lm
class-trigram.arpa -lambda 1.0
file : 2394 sentences, 29475 words, 1224 OOVs
0 zeroprobs, logprob= -85554.4 ppl= 619.144 ppl1= 1067.5

As shown, the perplexity is 653.583 when using the standalone trigram, and
619.144 when interpolating the trigram with the class trigram using
lambda 1.0. Why are they not equal?

Both the word trigram and the class trigram are closed-vocabulary LMs, if
it matters.

Regards,

Tanel A.
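For readers who want to check figures like the ones above by hand, here is a minimal Python sketch of how ppl and ppl1 follow from the logprob and counts that ngram -ppl prints, together with the linear interpolation that -lambda controls. It assumes the usual SRILM conventions: logprob is base 10, ppl normalizes by words minus OOVs minus zeroprobs plus one end-of-sentence token per sentence, ppl1 leaves out the end-of-sentence tokens, and lambda is the weight of the main (-lm) model. The function names are made up for illustration and are not part of SRILM.

```python
def srilm_perplexities(logprob, sentences, words, oovs, zeroprobs=0):
    """Recompute (ppl, ppl1) from the summary printed by `ngram -ppl`.

    logprob is assumed to be a base-10 log probability.
    """
    # Only words that actually received a probability are counted.
    scored_words = words - oovs - zeroprobs
    # ppl also counts one end-of-sentence token per sentence.
    ppl = 10 ** (-logprob / (scored_words + sentences))
    # ppl1 excludes the end-of-sentence tokens.
    ppl1 = 10 ** (-logprob / scored_words)
    return ppl, ppl1


def interpolate(p_main, p_mix, lam):
    """Linear interpolation of two word probabilities; lam weights the main model."""
    return lam * p_main + (1.0 - lam) * p_mix


# Figures from the standalone trigram run quoted above.
ppl, ppl1 = srilm_perplexities(logprob=-86274.9, sentences=2394,
                               words=29475, oovs=1224)
print(f"ppl= {ppl:.3f} ppl1= {ppl1:.2f}")

# With lambda = 1.0 the mixture reduces to the main model's probability,
# which is why the two perplexities are expected to match.
print(interpolate(0.2, 0.05, 1.0))
```

Since the printed logprob is rounded, the recomputed values can differ from the reported ones in the last digit.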
From stolcke at speech.sri.com Wed Feb 23 10:23:55 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 23 Feb 2005 10:23:55 PST
Subject: Interpolating with -lambda 1.0
In-Reply-To: Your message of Wed, 23 Feb 2005 10:11:00 +0200. <1109146260.28101.1.camel@markov>
Message-ID: <200502231823.KAA12013@tonga>

Try using -bayes 0 when running the interpolated model.
Without it, ngram will construct a merged ngram model in memory,
which does not work well when combining word and class-based models.

--Andreas

In message <1109146260.28101.1.camel at markov> you wrote:
> Hello,
>
> I'm a bit confused with interpolation.
> I want to calculate test text's perplexity using different interpolation
> weights (lambdas). Everything is OK until I set lambda to 1.0. Shouldn't
> I then get the same perplexity as using only the base language model?
> This doesn't seem to be the case:
>
> $ ngram -lm trigram.arpa -ppl
> file : 2394 sentences, 29475 words, 1224 OOVs
> 0 zeroprobs, logprob= -86274.9 ppl= 653.583 ppl1= 1132.06
>
> $ ngram -lm trigram.arpa -ppl -classes -mix-lm
> class-trigram.arpa -lambda 1.0
> file : 2394 sentences, 29475 words, 1224 OOVs
> 0 zeroprobs, logprob= -85554.4 ppl= 619.144 ppl1= 1067.5
>
> As shown, the perplexity is 653.583 when using standalone trigram, and
> 619.144 when interpolating the trigram with the class-trigram, using
> lambda 1.0. Why are they not equal?
>
> Both word trigram and class trigram are closed-vocabulary LMs, if it
> matters.
>
> Regards,
>
> Tanel A.
>
>
>

From stolcke at speech.sri.com Wed Feb 23 16:34:29 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 23 Feb 2005 16:34:29 PST
Subject: "format error in lattice file"?
In-Reply-To: Your message of Wed, 23 Feb 2005 15:36:36 -0800.
Message-ID: <200502240034.j1O0YTn18520@huge>

In message you wrote:
> Hi Andreas,
>
> If I have lattices, which I have converted into n-best lists using
> lattice-tool -nbest-decode, and I want to then compare the scoring
> results of the original nbest from the lattice with an nbest that has
> been rescored in a particular way, how would you recommend I go about
> inputting each of them into compute-sclite?
>
> Since the nbest list is not in one of the input forms for
> compute-sclite, I'm not sure how to do this, yet I need the lattices in
> n-best form in order to be able to rescore them in the manner I'm
> planning. Does that make any sense?

You don't score the entire nbest lists. You extract the 1best according
to some linear weighting of the different model scores, then score the
1best hypotheses you get.
Check the nbest-scripts(1) man page for a description of
"rescore-reweight".

There is also a tool to optimize the weights on a held-out set. Check the
nbest-optimize(1) command.

--Andreas

From tanel.alumae at aqris.com Mon Feb 28 02:13:57 2005
From: tanel.alumae at aqris.com (Tanel =?ISO-8859-1?Q?Alum=E4e?=)
Date: Mon, 28 Feb 2005 12:13:57 +0200
Subject: Interpolating with -lambda 1.0
In-Reply-To: <200502231823.KAA12013@tonga>
References: <200502231823.KAA12013@tonga>
Message-ID: <1109585637.7158.7.camel@localhost.localdomain>

Hello,

The -bayes 0 switch didn't help, although it did change the calculated
perplexity values for lambda < 1.0.

However, I discovered that I had the following in my classdefs:

 1
 1

After removing those lines, the interpolated perplexity with -lambda 1.0
is equal to the perplexity of the pure word trigram, as expected.

Regards,

Tanel A.

On Wed, 2005-02-23 at 10:23 -0800, Andreas Stolcke wrote:
> Try using -bayes 0 when running the interpolated model.
> Without it, ngram will construct a merged ngram model in memory,
> which does not work well when combining word and class-based models.
>
> --Andreas
>
> In message <1109146260.28101.1.camel at markov> you wrote:
> > Hello,
> >
> > I'm a bit confused with interpolation.
> > I want to calculate test text's perplexity using different interpolation
> > weights (lambdas). Everything is OK until I set lambda to 1.0. Shouldn't
> > I then get the same perplexity as using only the base language model?
> > This doesn't seem to be the case:
> >
> > $ ngram -lm trigram.arpa -ppl
> > file : 2394 sentences, 29475 words, 1224 OOVs
> > 0 zeroprobs, logprob= -86274.9 ppl= 653.583 ppl1= 1132.06
> >
> > $ ngram -lm trigram.arpa -ppl -classes -mix-lm
> > class-trigram.arpa -lambda 1.0
> > file : 2394 sentences, 29475 words, 1224 OOVs
> > 0 zeroprobs, logprob= -85554.4 ppl= 619.144 ppl1= 1067.5
> >
> > As shown, the perplexity is 653.583 when using standalone trigram, and
> > 619.144 when interpolating the trigram with the class-trigram, using
> > lambda 1.0. Why are they not equal?
> >
> > Both word trigram and class trigram are closed-vocabulary LMs, if it
> > matters.
> >
> > Regards,
> >
> > Tanel A.
> >
> >
> >

From vancrusoe at hotmail.com Tue Mar 29 09:09:27 2005
From: vancrusoe at hotmail.com (zhou hao)
Date: Wed, 30 Mar 2005 01:09:27 +0800
Subject: about the ngram -hmm option
Message-ID:

Hey,

I have a question about the ngram command. It comes with an option -hmm,
which takes an HMM file as input. How can I create this file when I train
the language model? Or should I write some code myself to generate it?

thanks
Crusoe

From stolcke at speech.sri.com Wed Mar 30 00:01:45 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 30 Mar 2005 00:01:45 PST
Subject: about the ngram -hmm option
In-Reply-To: Your message of Wed, 30 Mar 2005 01:09:27 +0800.
Message-ID: <200503300801.j2U81l805125@huge>

In message you wrote:
> Hey,
>
> just got a question in my mind, in the ngram command, it comes with an
> option -hmm, which needs to take a HMM file as input, so how can I create
> this file when I train the language model. or should I write some code
> myself to generate that.

You typically create the file by hand, thus SRILM comes with no special
tools for this. However, if you are building a large HMM structure it is
best done by a program or script.

I hope you don't expect SRILM to include some kind of "mini HMM toolkit".
It doesn't. ngram -hmm is meant for building simple models that switch
ngram distributions as they generate a sentence.

--Andreas
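As an illustration of the "program or script" route mentioned above, here is a minimal Python sketch that writes an HMM description from a table of states. The file layout is an assumption that should be checked against the -hmm entry in the ngram(1) man page before use: one line per state, giving the state name, the N-gram model file, then pairs of follow-state and transition probability, with INITIAL and FINAL as start and end states whose model file is given as ".". The state names, the model file names (news.3gram.lm, chat.3gram.lm) and the helper functions are hypothetical.

```python
def hmm_state_line(state, ngram_file, transitions):
    """Format one state: name, N-gram file, then follow-state/probability pairs.

    The field order is an assumption; verify it against ngram(1).
    """
    fields = [state, ngram_file]
    for follow_state, prob in transitions.items():
        fields += [follow_state, f"{prob:g}"]
    return " ".join(fields)


def build_hmm(states):
    """Return the text of an HMM file, one line per state."""
    lines = []
    for name, (ngram_file, transitions) in states.items():
        total = sum(transitions.values())
        # States with outgoing transitions should form a probability distribution.
        if transitions and abs(total - 1.0) > 1e-9:
            raise ValueError(f"transitions out of {name} sum to {total}, not 1")
        lines.append(hmm_state_line(name, ngram_file, transitions))
    return "\n".join(lines) + "\n"


# Hypothetical two-topic model that can switch topics while generating a sentence.
states = {
    "INITIAL": (".", {"news": 0.5, "chat": 0.5}),
    "news": ("news.3gram.lm", {"news": 0.9, "chat": 0.05, "FINAL": 0.05}),
    "chat": ("chat.3gram.lm", {"chat": 0.9, "news": 0.05, "FINAL": 0.05}),
    "FINAL": (".", {}),
}

hmm_text = build_hmm(states)
print(hmm_text, end="")
```

Writing hmm_text to a file and passing that file to ngram would be the next step; the exact options to use are left to the man page.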