From stolcke at speech.sri.com  Mon Apr  5 12:01:00 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 05 Apr 2004 12:01:00 PDT
Subject: linear interpolation process
In-Reply-To: Your message of Fri, 02 Apr 2004 14:05:01 +0100. <003c01c418b3$1ca4ee00$0800000a@speechasus>
Message-ID: <200404051901.MAA02078@huge>

In message <003c01c418b3$1ca4ee00$0800000a at speechasus> you wrote:
>
> Hi!
>
> I have a question about the linear interpolation process executed by the
> ngram command in SRILM.
> What's the main difference between dynamic interpolation (using -bayes) and
> static interpolation?
> I tried both but I'm getting a big difference in perplexity values: for
> instance, 314 against 246.
> With static interpolation one can use -write-lm to produce a file with
> the interpolated model. With the dynamic process one cannot. Why? Are
> the differences between the two processes so big?
>
> Just an observation: the big differences in perplexity values occur when
> we interpolate word and class models. For interpolation
> of word models the difference is quite insignificant.

That's the problem. You cannot do "static" interpolation of a word-based
and a class-based N-gram LM. This is only supported for two word-based or
two class-based LMs.

--Andreas

From stolcke at speech.sri.com  Mon Apr  5 22:13:26 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 05 Apr 2004 22:13:26 PDT
Subject: positive backoff weight
In-Reply-To: Your message of Thu, 18 Mar 2004 11:52:02 +0100. <40597F52.4050803@irisa.fr>
Message-ID: <200404060513.WAA01373@huge>

In message <40597F52.4050803 at irisa.fr> you wrote:
> Thank you for the past answers to my questions.
>
> I've got another question. Sometimes, when I use Good-Turing
> discounting, some of the backoff weights of the unigrams (I compute a
> bigram model) are positive log-probabilities. How is it possible? Is it

Backoff weights are not probabilities. They are normalizing factors.
Backoff weight for a history h is defined as

    BOW(h) = [ 1 - \sum_(w,h) p(w|h) ] / [ 1 - \sum_(w,h) p'(w|h) ]

where p'(w|h) is the lower-order probability estimate (e.g., a bigram
estimate in a trigram model). So, if the trigram probability estimates
give lower values than the corresponding bigram estimates for a given
history, then BOW(h) will be > 1 and its log positive.

> because Good-Turing discounting is disabled on unigrams since there are
> no unigrams whose frequency is 1? And, more generally, how are
> backoff weights computed for unigrams, in the case of a bigram model?

Backoff weights for unigrams are computed by exactly the same method
(in the formula above, p(w|h) are bigram probabilities and p'(w|h) are
unigram probabilities).

--Andreas

From dpico at dsic.upv.es  Tue Apr  6 01:24:55 2004
From: dpico at dsic.upv.es (David Picó)
Date: Tue, 06 Apr 2004 10:24:55 +0200
Subject: A simple question about SRILM
Message-ID: <40726957.3070101@dsic.upv.es>

Hello,

I also have a little question about SRILM. How can I infer a trigram (or
bigram, or tetragram...) with no smoothing at all? I need to do some
experiments to check the effect of n-gram smoothing in my models and I
need a pure trigram with no probability mass diverted to lower levels. Is
this possible in SRILM? I need to be sure that I really get a trigram
(with the whole trigram probabilities).

Thank you very much in advance for your help and attention!
David

--
David Picó-Vila
Universitat Politècnica de València
Departament de Sistemes Informàtics i Computació
València, Spain

From stolcke at speech.sri.com  Tue Apr  6 09:34:12 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 06 Apr 2004 09:34:12 PDT
Subject: A simple question about SRILM
In-Reply-To: Your message of Tue, 06 Apr 2004 10:24:55 +0200. <40726957.3070101@dsic.upv.es>
Message-ID: <200404061634.JAA01494@huge>

The ngram-count man page says

    -gtnmax count
        where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9.
        Set the maximal count of N-grams of order n that are
        discounted under Good-Turing. All N-grams more frequent
        than that will receive maximum likelihood estimates.
        Discounting can be effectively disabled by setting this to 0.

Therefore, you can disable smoothing with

    ngram-count -gt1max 0 -gt2max 0 -gt3max 0 ...

--Andreas

In message <40726957.3070101 at dsic.upv.es> you wrote:
> Hello,
>
> I also have a little question about SRILM. How can I infer a trigram (or
> bigram, or tetragram...) with no smoothing at all? I need to do some
> experiments to check the effect of n-gram smoothing in my models and I
> need a pure trigram with no probability mass diverted to lower levels. Is
> this possible in SRILM? I need to be sure that I really get a trigram
> (with the whole trigram probabilities).
>
> Thank you very much in advance for your help and attention!
> David
>
> --
> David Picó-Vila
> Universitat Politècnica de València
> Departament de Sistemes Informàtics i Computació
> València, Spain

From julyjune03 at yahoo.com  Sat Apr 10 22:30:12 2004
From: julyjune03 at yahoo.com (June July)
Date: Sat, 10 Apr 2004 22:30:12 -0700 (PDT)
Subject: can a cache LM be loaded in disambig?
Message-ID: <20040411053012.75383.qmail@web41601.mail.yahoo.com>

I'm trying to load a cache LM in the disambig tool by adding several lines
of code modeled on ngram.cc. Everything is fine except the linking, where
I had a problem:

/usr3/Test/sri-1.4/lm/src/CacheLM.cc:50: multiple definition of `LHash<unsigned, double>::removedData'
../obj/i686/liboolm.a(VocabMap.o):/usr3/Test/sri-1.4/lm/src/VocabMap.cc:39: first defined here
collect2: ld returned 1 exit status

Does that mean there are duplicate definitions of "removedData", originally
from LHash.cc? How can I fix it? Or is there a way to load a cache model in
disambig?

From stolcke at speech.sri.com  Sat Apr 10 22:50:15 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 10 Apr 2004 22:50:15 PDT
Subject: can a cache LM be loaded in disambig?
In-Reply-To: Your message of Sat, 10 Apr 2004 22:30:12 -0700. <20040411053012.75383.qmail@web41601.mail.yahoo.com>
Message-ID: <200404110550.WAA12707@tonga>

The linking problem can be solved by removing the instantiation of LHash
in CacheLM.cc.

However, it probably won't work as intended. CacheLM is not Markovian
(it does not use a finite history). This will cause the DP algorithm in
disambig to degenerate into keeping all histories as distinct states,
which is not feasible except for very short sentences.

--Andreas

In message <20040411053012.75383.qmail at web41601.mail.yahoo.com> you wrote:
> I'm trying to load a cache LM in disambig tool by adding several lines of
> code according to ngram.cc. Everything is fine except the linking, where
> I had a problem:
>
> /usr3/Test/sri-1.4/lm/src/CacheLM.cc:50: multiple definition of `LHash<unsigned, double>::removedData'
> ../obj/i686/liboolm.a(VocabMap.o):/usr3/Test/sri-1.4/lm/src/VocabMap.cc:39: first defined here
> collect2: ld returned 1 exit status
>
> Does that mean duplicate definitions of "removedData" originally from
> LHash.cc? How to fix it? Or is there a way to load a cache model in
> disambig?

From cam at crb.ucp.pt  Sun Apr 11 01:57:55 2004
From: cam at crb.ucp.pt (cam at crb.ucp.pt)
Date: Sun, 11 Apr 2004 09:57:55 +0100 (WEST)
Subject: Log-linear interpolation
Message-ID: <32809.213.58.88.69.1081673875.squirrel@mail.crb.ucp.pt>

Hi!

Does anyone know of a program or toolkit that can do log-linear
interpolation of different language models? SRILM only permits linear
interpolation.

Thanks for your help,
Ciro Martins

From nlp at pobox.sk  Mon Apr 19 08:12:06 2004
From: nlp at pobox.sk (Robert Wagner)
Date: Mon, 19 Apr 2004 17:12:06 +0200
Subject: Factored n-grams
Message-ID: <200404191512.i3JFC59N023398@www7.pobox.sk>

Hello Everybody!
Please, does anybody know a good paper on factored n-grams
(new SRILM feature)? I have no idea what they are and would
like to learn more about them :-)
Thanks
Robert

From sarahs at cs.washington.edu  Mon Apr 19 08:37:48 2004
From: sarahs at cs.washington.edu (Sarah E. Schwarm)
Date: Mon, 19 Apr 2004 08:37:48 -0700 (PDT)
Subject: Factored n-grams
In-Reply-To: <200404191512.i3JFC59N023398@www7.pobox.sk>
Message-ID: <20040419083134.N13854-100000@scarpia.cs.washington.edu>

Here's the paper:

J. Bilmes and K.
Kirchhoff, "Factored Language Models and Generalized Parallel Backoff",
Proceedings of HLT/NAACL 2003, Edmonton, Canada, May 2003

A PDF is available on this page:
http://ssli.ee.washington.edu/people/katrin/

There's also quite a bit of information about the factored LM extensions
to SRILM in the final report of the JHU Workshop 2002 group "Novel Speech
Recognition Models for Arabic":
http://www.clsp.jhu.edu/ws2002/groups/arabic/

Hope this helps!
- Sarah

On Mon, 19 Apr 2004, Robert Wagner wrote:

>
> Hello Everybody!
> Please, does anybody know a good paper referring to factored n-grams
> (new SRILM feature)? I absolutely do not know what is it and would
> like to learn more about it:-)
> Thanks
> Robert
>

________________________
Sarah Schwarm
sarahs at cs.washington.edu

From Nicholas.Romanyshyn at colorado.edu  Fri Apr 30 13:28:42 2004
From: Nicholas.Romanyshyn at colorado.edu (Nick Romanyshyn)
Date: Fri, 30 Apr 2004 14:28:42 -0600
Subject: remove
Message-ID: <1083356922.4092b6faf0e67@webmail.colorado.edu>

Hi,

I'm using ngram-count to make a language model, but I don't want <s> or
</s> to be included in the language model. I couldn't find anything in the
documentation about how to keep this from happening. Could somebody point
me to the code where <s> and </s> are inserted?
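
One way to keep the sentence-boundary tags out of a model is to filter
them out of a count file before estimating the LM. The following is only
a rough sketch of that filtering step in Python. It assumes each line of
the count file has the layout "w1 w2 ... wn<TAB>count"; the function name
drop_boundary_ngrams is made up for this illustration and is not part of
SRILM.

```python
# Hypothetical helper, not part of SRILM: drop every count line whose
# n-gram mentions a sentence-boundary tag.
BOUNDARY_TAGS = {"<s>", "</s>"}


def drop_boundary_ngrams(count_lines):
    """Return the count lines whose n-gram contains no boundary tag.

    Assumes each line looks like 'w1 w2 ... wn<TAB>count'.
    """
    kept = []
    for line in count_lines:
        # Everything before the last tab is the n-gram itself.
        ngram = line.rstrip("\n").rpartition("\t")[0]
        if not BOUNDARY_TAGS.intersection(ngram.split()):
            kept.append(line)
    return kept


counts = [
    "<s>\t2\n",
    "<s> hello\t1\n",
    "hello\t2\n",
    "hello world\t2\n",
    "world </s>\t2\n",
]
filtered = drop_boundary_ngrams(counts)
print("".join(filtered), end="")
```

The filtered lines could then be written back to a file and passed to
ngram-count with -read.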

Thanks,
Nick Romanyshyn

From anand at speech.sri.com  Fri Apr 30 13:50:32 2004
From: anand at speech.sri.com (Anand Venkataraman)
Date: Fri, 30 Apr 2004 13:50:32 -0700 (PDT)
Subject: remove
In-Reply-To: <1083356922.4092b6faf0e67@webmail.colorado.edu> (message from Nick Romanyshyn on Fri, 30 Apr 2004 14:28:42 -0600)
Message-ID: <200404302050.NAA13199@stockholm>

You should be able to do this without modifying the code. There are at
least two ways. Create a file with lines containing <s> and </s> and give
this file to ngram-count using -nonevents. Alternatively, you can create
count files first (-write), remove the uninteresting events, and create an
LM from the count file (-read).

&

From jachym at kky.zcu.cz  Fri Apr 30 13:54:38 2004
From: jachym at kky.zcu.cz (Jachym Kolar)
Date: Fri, 30 Apr 2004 22:54:38 +0200
Subject: remove
In-Reply-To: <1083356922.4092b6faf0e67@webmail.colorado.edu>
References: <1083356922.4092b6faf0e67@webmail.colorado.edu>
Message-ID: <1083358478.4092bd0e87eae@webmail.zcu.cz>

Hello Nick,

you should use the script continuous-ngram-count. E.g.:

    continuous-ngram-count order=3 trainingtext | \
    ngram-count -read - -write-vocab vocabulary -tolower -write output -lm lmfile

Regards,
Jachym

Quoting e-mail from Nick Romanyshyn:

> Hi,
>
> I'm using ngram-count to make a language model, but I don't want <s> or
> </s> to be included in the language model. I couldn't find anything in the
> documentation about how to keep this from happening. Could somebody point
> me to the code where <s> and </s> are inserted?
>
> Thanks,
> Nick Romanyshyn

From gaston at gastonrcangiano.net  Sun May  2 19:37:31 2004
From: gaston at gastonrcangiano.net (Gaston R. Cangiano)
Date: Sun, 2 May 2004 19:37:31 -0700 (PDT)
Subject: first install
Message-ID: <20040503023731.27942.qmail@web11306.mail.yahoo.com>

Hi,

I am trying to install the toolkit (v 1.4) on an i686 running RH Linux 9
(kernel 2.4).
I checked for the correct versions of gcc, make and Tcl installed on my
machine, and also updated all the variables in the makefiles correctly
(both top-level and machine-specific). I am not able to build properly;
nothing gets compiled. These are the error messages:

g++: cannot specify -o with -c or -S and multiple compilations
make[2]: *** [.../obj/i686/tclmain.o] Error

for object files qsort.o matherr.o FDiscount.o Lattice.o ngram.o fngram.o
and lattice-tool.o

Can anyone lend a hint? Thank you!

Gaston.

=====
Gaston R. Cangiano
3024 Deakin St. Apt. #5
Berkeley, CA 94705
tel: 510-486-8271
fax: 425-930-1047
gaston at gastonrcangiano.net

From tanel.alumae at aqris.com  Tue May  4 02:32:25 2004
From: tanel.alumae at aqris.com (Tanel Alumäe)
Date: Tue, 04 May 2004 12:32:25 +0300
Subject: Factored LMs and interpolated models
Message-ID: <1083663145.10474.16.camel@markov>

Hello,

I'm experimenting with the factored language modeling implementation in
SRILM. I got some nice results and now want to compare them with the
traditional approach where a word-trigram LM is interpolated with the
parallel class trigram. Is it possible to create a factored LM that
actually implements such traditional interpolation?

Thanks in advance,
Tanel A.

From Caroline.Lavecchia at loria.fr  Tue May  4 07:18:11 2004
From: Caroline.Lavecchia at loria.fr (lavecchia)
Date: Tue, 04 May 2004 16:18:11 +0200
Subject: question about vocabulary
Message-ID: <4097A623.1E8E3B88@loria.fr>

Hello everybody,

I would like to know if it's possible with the SRILM toolkit to generate
a vocabulary with, for example, the 20000 most frequent words of a corpus.

I know that with -write-vocab in ngram-count I can generate a vocabulary,
but only one containing all the words of the corpus.

Thanks in advance and sorry for my bad English,

Caroline L.
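
The idea behind such a frequency-limited vocabulary can be sketched
outside SRILM in a few lines of Python: count every token, rank by
count, keep the first N. This is only an illustration under simple
assumptions (whitespace tokenization, ties broken alphabetically); the
function name top_vocab is made up for the example.

```python
from collections import Counter


def top_vocab(lines, size):
    """Return the `size` most frequent whitespace-separated tokens."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:size]]


corpus = ["the cat sat on the mat", "the dog sat"]
print(top_vocab(corpus, 2))
```

The resulting list, written one word per line, could then be given to
ngram-count through its -vocab option.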

From anand at speech.sri.com  Tue May  4 08:53:13 2004
From: anand at speech.sri.com (Anand Venkataraman)
Date: Tue, 4 May 2004 08:53:13 -0700 (PDT)
Subject: question about vocabulary
In-Reply-To: <4097A623.1E8E3B88@loria.fr> (message from lavecchia on Tue, 04 May 2004 16:18:11 +0200)
Message-ID: <200405041553.IAA23335@clara>

> I would like to know if it's possible with the SRILM toolkit to generate
> a vocabulary with the 20000 most frequent words of a corpus for example.

You should be able to achieve this by using "ngram-count -order 1 -write -",
doing a reverse sort on field 2 and taking the top 20000.

&

From stolcke at speech.sri.com  Tue May  4 08:57:23 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 04 May 2004 08:57:23 PDT
Subject: question about vocabulary
In-Reply-To: Your message of Tue, 04 May 2004 16:18:11 +0200. <4097A623.1E8E3B88@loria.fr>
Message-ID: <200405041557.IAA16703@huge>

In message <4097A623.1E8E3B88 at loria.fr> you wrote:
> Hello everybody,
>
> I would like to know if it's possible with the SRILM toolkit to generate
> a vocabulary with the 20000 most frequent words of a corpus for example.
>
> I know that with -write-vocab in the ngram-count function I can
> generate a vocabulary but only with all the words of the corpus.

How about this:

    ngram-count -order 1 -text CORPUS -write - | \
    sort -k 2,2rn | awk 'NR <= 20000 { print $1 }' > top20000.vocab

--Andreas

From duh at ee.washington.edu  Thu May  6 10:39:27 2004
From: duh at ee.washington.edu (Kevin Duh)
Date: Thu, 06 May 2004 10:39:27 -0700
Subject: Factored LMs and interpolated models
In-Reply-To: <1083663145.10474.16.camel@markov>
References: <1083663145.10474.16.camel@markov>
Message-ID: <409A784F.8060305@ee.washington.edu>

There is no easy way to interpolate word and class ngram models in the
factored language model framework.
Factored language models support only interpolation of an N-gram
probability estimate and its corresponding lower-order estimate, which is
similar to the "interpolate" option in "ngram-count."

You could conceivably treat the word and the class as your factors and
perform interpolation whenever you back off from one set of these
conditioning variables to a subset. However, this backoff nature makes
the interpolation different from the traditional interpolation of
parallel n-grams. Probably the best thing to do is to use the usual
SRILM tools for this.

Hope this helps,
Kevin Duh

Tanel Alumäe wrote:

>Hello,
>
>I'm experimenting with factored language modeling implementation in
>SRILM. I got some nice results and now want to compare them with the
>traditional approach where a word-trigram LM is interpolated with the
>parallel class trigram. Is it possible to create a factored LM that
>actually implements such traditional interpolation?
>
>Thanks in advance,
>Tanel A.
>

From stolcke at speech.sri.com  Thu May  6 11:33:20 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 06 May 2004 11:33:20 PDT
Subject: Factored LMs and interpolated models
In-Reply-To: Your message of Thu, 06 May 2004 10:39:27 -0700. <409A784F.8060305@ee.washington.edu>
Message-ID: <200405061833.LAA15228@huge>

Hi,

you SHOULD be able to do this with

    ngram -factored -bayes 0 ...

followed by the usual options to specify mixtures of LMs.
This is because the -factored option causes all LM components to be
interpreted as factored LMs, and this causes the standard interpolation
mechanism to be wrapped around them.

So, all you have to do is implement your standard word ngram and class
ngrams each separately as FLMs. This is not quite what you asked for
but it should be equivalent.

This is the theory. I don't think we ever tested this, so there might be
glitches. But those could be fixed if necessary; the basic machinery is
there.

There might also be a different approach.
You could engineer the FLM definition so that at the highest level you
always back off. Then you specify interpolation as the backoff strategy,
emulating word and class ngrams as two parallel backoff paths.
I'm not sure this will work with the current functionality; it's just an
idea. Katrin or Jeff should be able to tell you if it's feasible.

--Andreas

In message <409A784F.8060305 at ee.washington.edu> you wrote:
> There is no easy way to interpolate word and class ngram models in the
> factored language model framework. Factored language models support only
> interpolation of an N-gram probability estimate and its corresponding
> lower-order estimate, which is similar to the "interpolate" option in
> "ngram-count."
>
> You could conceivably treat the word and the class as your factors and
> perform interpolation whenever you back off from one set of these
> conditioning variables to a subset. However, this backoff nature makes
> the interpolation different from the traditional interpolation of
> parallel n-grams. Probably the best thing to do is to use the usual
> SRILM tools for this.
>
> Hope this helps,
> Kevin Duh
>
> Tanel Alumäe wrote:
>
> >Hello,
> >
> >I'm experimenting with factored language modeling implementation in
> >SRILM. I got some nice results and now want to compare them with the
> >traditional approach where a word-trigram LM is interpolated with the
> >parallel class trigram. Is it possible to create a factored LM that
> >actually implements such traditional interpolation?
> >
> >Thanks in advance,
> >Tanel A.

From tanel.alumae at aqris.com  Thu May  6 23:52:35 2004
From: tanel.alumae at aqris.com (Tanel Alumäe)
Date: Fri, 07 May 2004 09:52:35 +0300
Subject: Factored LMs and interpolated models
In-Reply-To: <20040506122755.A21243@duck.ee.washington.edu>
References: <409A784F.8060305@ee.washington.edu> <200405061833.LAA15228@huge> <20040506122755.A21243@duck.ee.washington.edu>
Message-ID: <1083912755.8267.7.camel@NOOL2>

> Let me know if this helps or if I have misunderstood your question...
>

Hello,

First, thanks to everybody for the help.

My goal was, as Katrin correctly assumed, "to interpolate a
traditional class-based model and a standard n-gram model but you want
to express this within a single FLM file". This is currently not
possible, but it's not very important because I learned that I can
use:

    ngram -factored -lm -mix-lm

The above really works.

Still, I noticed a strange thing with the perplexity calculation. Namely,
the perplexity figures calculated by fngram and ngram are slightly
different. I used the following options and got the following results:

    fngram -ppl -factor-file tmp/fngram_m.conf

Result:
61 sentences, 1009 words, 26 OOVs
0 zeroprobs, logprob= -2760.87 ppl= 441.076 ppl1= 643.604

    ngram -factored -ppl -lm tmp/fngram_m.conf

Result:
61 sentences, 1009 words, 26 OOVs
0 zeroprobs, logprob= -2761.16 ppl= 441.359 ppl1= 644.042

--

The above is for an FLM that is in fact a standard word trigram. The
difference is very small.

However, when I test an FLM that is a word-given-two-previous-classes
trigram, the difference is much larger:

    fngram -ppl -factor-file tmp/fngram_c.conf

61 sentences, 1009 words, 26 OOVs
0 zeroprobs, logprob= -2826.73 ppl= 510.034 ppl1= 750.963

And the same with ngram:

    ngram -factored -lm tmp/fngram_c.conf -ppl

61 sentences, 1009 words, 26 OOVs
0 zeroprobs, logprob= -2863.71 ppl= 553.378 ppl1= 818.917

As you see, here the difference (ppl1= 750 vs 818) is significant.
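
The ppl and ppl1 figures in these reports can be recomputed from the
logprob and the counts, which makes it easier to see that the gap comes
from the log probabilities themselves. The sketch below assumes the usual
SRILM convention: base-10 logs, OOVs and zeroprobs excluded from the
token count, and one end-of-sentence token per sentence counted for ppl
but not for ppl1. The function name perplexities is made up for this
illustration.

```python
def perplexities(logprob, sentences, words, oovs, zeroprobs=0):
    """Recompute (ppl, ppl1) from the totals printed by a -ppl run.

    Assumes base-10 log probabilities; OOVs and zeroprobs are not scored.
    """
    scored = words - oovs - zeroprobs
    ppl = 10 ** (-logprob / (scored + sentences))  # counts </s> tokens too
    ppl1 = 10 ** (-logprob / scored)               # words only
    return ppl, ppl1


# Totals from the first fngram run quoted above.
ppl, ppl1 = perplexities(-2760.87, sentences=61, words=1009, oovs=26)
print(round(ppl, 3), round(ppl1, 3))  # close to the reported 441.076 and 643.604
```

Since the sentence, word and OOV counts are identical in all four runs,
any difference in ppl must come from a difference in logprob.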
Could this be a configuration issue, a bug, or have I misunderstood
something?

Regards,

Tanel Alumäe

From stolcke at speech.sri.com  Fri May  7 07:02:40 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 07 May 2004 07:02:40 PDT
Subject: Factored LMs and interpolated models
In-Reply-To: Your message of Fri, 07 May 2004 09:52:35 +0300. <1083912755.8267.7.camel@NOOL2>
Message-ID: <200405071402.HAA14786@tonga>

There are a few known bugs in the FLM code as last released.
They will be fixed in the next release (1.4.1), which I expect to be out
in a couple of days.

--Andreas

In message <1083912755.8267.7.camel at NOOL2> you wrote:
>
> > Let me know if this helps or if I have misunderstood your question...
> >
>
> Hello,
>
> First, thanks to everybody for help.
>
> My goal was, as Katrin correctly assumed, "to interpolate a
> traditional class-based model and a standard n-gram model but you want
> to express this within a single FLM file". This is currently not
> possible, but it's not very important because I learned that I can
> use:
>
> ngram -factored -lm -mix-lm
>
> The above really works.
>
> Still, I noticed a strange thing with perplexity calculation. Namely,
> the perplexity figures calculated by fngram and ngram are slightly
> different. I used the following options and got following results:
>
> fngram -ppl -factor-file tmp/fngram_m.conf
>
> Result:
> 61 sentences, 1009 words, 26 OOVs
> 0 zeroprobs, logprob= -2760.87 ppl= 441.076 ppl1= 643.604
>
> ngram -factored -ppl -lm tmp/fngram_m.conf
>
> Result:
> 61 sentences, 1009 words, 26 OOVs
> 0 zeroprobs, logprob= -2761.16 ppl= 441.359 ppl1= 644.042
>
> --
>
> The above is for a FLM that in fact is standard word trigram. The
> difference is very small.
>
> However, when I test a FLM that is a word-given-two-previous-classes
> trigram, the difference is much larger:
>
> fngram -ppl -factor-file tmp/fngram_c.conf
>
> 61 sentences, 1009 words, 26 OOVs
> 0 zeroprobs, logprob= -2826.73 ppl= 510.034 ppl1= 750.963
>
> And the same with ngram:
>
> ngram -factored -lm tmp/fngram_c.conf -ppl
>
> 61 sentences, 1009 words, 26 OOVs
> 0 zeroprobs, logprob= -2863.71 ppl= 553.378 ppl1= 818.917
>
> As you see, here the difference (ppl1= 750 vs 818) is significant. Could
> this be a configuration issue, a bug or have I understood smth wrong?
>
> Regards,
>
> Tanel Alumäe

From katrin at ssli-mail.ee.washington.edu  Fri May  7 10:25:20 2004
From: katrin at ssli-mail.ee.washington.edu (Katrin Kirchhoff)
Date: Fri, 7 May 2004 10:25:20 -0700
Subject: Factored LMs and interpolated models
In-Reply-To: <1083912755.8267.7.camel@NOOL2>; from tanel.alumae@aqris.com on Fri, May 07, 2004 at 09:52:35AM +0300
References: <409A784F.8060305@ee.washington.edu> <200405061833.LAA15228@huge> <20040506122755.A21243@duck.ee.washington.edu> <1083912755.8267.7.camel@NOOL2>
Message-ID: <20040507102520.A10555@duck.ee.washington.edu>

In order to emulate the exact behaviour of ngram with fngram, you need to
use:

    -no-virtual-begin-sentence -nonull

and make sure that the smoothing options (smoothing method, gtmin, gtmax
etc.) in your FLM file correspond to the same values that ngram uses.

E.g. for a standard trigram

    ngram -lm -ppl

and

    fngram -factor-file -ppl -no-virtual-begin-sentence -nonull

should give exactly the same perplexities.

Andreas might be able to say whether these are needed when using ngram
with the -factored option.

Katrin

> Still, I noticed a strange thing with perplexity calculation. Namely,
> the perplexity figures calculated by fngram and ngram are slightly
> different.
> I used the following options and got following results:
>
> fngram -ppl -factor-file tmp/fngram_m.conf
>
> Result:
> 61 sentences, 1009 words, 26 OOVs
> 0 zeroprobs, logprob= -2760.87 ppl= 441.076 ppl1= 643.604
>
> ngram -factored -ppl -lm tmp/fngram_m.conf
>
> Result:
> 61 sentences, 1009 words, 26 OOVs
> 0 zeroprobs, logprob= -2761.16 ppl= 441.359 ppl1= 644.042
>
> --
>
> The above is for a FLM that in fact is standard word trigram. The
> difference is very small.
>
> However, when I test a FLM that is a word-given-two-previous-classes
> trigram, the difference is much larger:
>
> fngram -ppl -factor-file tmp/fngram_c.conf
>
> 61 sentences, 1009 words, 26 OOVs
> 0 zeroprobs, logprob= -2826.73 ppl= 510.034 ppl1= 750.963
>
> And the same with ngram:
>
> ngram -factored -lm tmp/fngram_c.conf -ppl
>
> 61 sentences, 1009 words, 26 OOVs
> 0 zeroprobs, logprob= -2863.71 ppl= 553.378 ppl1= 818.917
>
> As you see, here the difference (ppl1= 750 vs 818) is significant. Could
> this be a configuration issue, a bug or have I understood smth wrong?
>
> Regards,
>
> Tanel Alumäe

--
-----------------------------------------------------------------
Katrin Kirchhoff
Dept of Electrical Engineering, University of Washington
M422 EE/CS Building, Box 352500, Seattle, WA, 98195
Phone: (206) 616 5494
katrin at ee.washington.edu
-----------------------------------------------------------------

From barhaim at cs.technion.ac.il  Thu May 13 08:30:37 2004
From: barhaim at cs.technion.ac.il (Roy Bar Haim)
Date: Thu, 13 May 2004 17:30:37 +0200
Subject: Tagging with disambig
Message-ID: <005a01c438ff$3efbc5c0$34284484@cs.technion.ac.il>

Hi,

I use disambig for POS tagging.

I have two questions:
1. Is there a utility that automatically generates the map file required
   for disambig from a tagged corpus?
2. Suppose I want to assume (for a 'didactic' purpose) that Ti (the i'th
   tag) depends not only on Ti-1 but also on Wi-1. Is there an easy way to
   encode this assumption into the LM file?
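
For question 1, deriving a map file from a tagged corpus takes only a
little scripting. The following is a rough sketch in Python, not an
SRILM utility. It assumes the corpus is already available as lists of
(word, tag) pairs, and that each map line has the layout
"word tag1 p1 tag2 p2 ..." with p taken here as the relative frequency
P(word | tag). Both the layout and the direction of the probabilities
are assumptions that should be checked against the disambig man page;
the function name build_map is made up for this illustration.

```python
from collections import Counter, defaultdict


def build_map(tagged_sentences):
    """Build map-file lines 'word tag1 p1 tag2 p2 ...' from (word, tag) pairs.

    p is estimated as count(word, tag) / count(tag), i.e. P(word | tag).
    """
    pair_counts = Counter()
    tag_counts = Counter()
    tags_of = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            pair_counts[word, tag] += 1
            tag_counts[tag] += 1
            tags_of[word].add(tag)

    lines = []
    for word in sorted(tags_of):
        fields = [word]
        for tag in sorted(tags_of[word]):
            prob = pair_counts[word, tag] / tag_counts[tag]
            fields += [tag, f"{prob:.6g}"]
        lines.append(" ".join(fields))
    return lines


corpus = [
    [("the", "DT"), ("can", "NN")],
    [("the", "DT"), ("can", "MD")],
    [("dogs", "NN")],
]
map_lines = build_map(corpus)
print("\n".join(map_lines))
```

Splitting a real corpus into (word, tag) pairs depends on its markup,
so that step is left out here.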

Thanks,
Roy.

From stolcke at speech.sri.com  Thu May 13 16:59:48 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 13 May 2004 16:59:48 PDT
Subject: Tagging with disambig
In-Reply-To: Your message of Thu, 13 May 2004 17:30:37 +0200. <005a01c438ff$3efbc5c0$34284484@cs.technion.ac.il>
Message-ID: <200405132359.QAA25559@teeny>

In message <005a01c438ff$3efbc5c0$34284484 at cs.technion.ac.il> you wrote:
> Hi,
>
> I use disambig for POS tagging.
>
> I have two questions:
> 1.Is there a utility that automatically generates the map file required
> for disambig from a tagged corpus?

It's very corpus dependent, just like text conditioning for LM training,
so there are no "standard" tools. It should require only a moderate
amount of perl or gawk hacking.

> 2.Suppose I want to assume (for a 'didactic' purpose) that Ti (the i'th
> tag) depends not only on Ti-1 but also on Wi-1. Is there an easy way to
> encode this assumption into the lm file?

Depends on what you consider "easy" ;-).
You can do it by including the words in the states of the HMM.
So the "hidden" vocabulary would consist of pairs (Wi,Ti), and the
observed vocabulary is still the words Wi. The map file would enforce
consistency between the two. In other words the map file just lists the
possible correspondences

    W   w,t1 w,t2 w,t3 ...

(the probabilities can be omitted and default to 1).

If you do this and nothing else you would need an N-gram LM over the
combined (Wi,Ti) sequence. But you say you want a more specific model
of the form

    P(Ti | Wi-1, Ti-1)

This, too, can be done but requires some work. You construct a
trigram count file of 3-grams (Wi-1, Ti-1, Ti) from your training data,
and estimate an LM for it (be sure to specify all the words as non-events
so they don't receive any probability).
Then you construct a bigram LM in terms of the (W,T) tokens, such that
it gives exactly the same probabilities as the more constrained model
you just estimated.
So you have to construct a bigram LM file and make sure that the bigram

    Wi-1,Ti-1 Wi,Ti

gets the probability P(Ti | Wi-1, Ti-1) * P(Wi|Ti), for all Wi-1,Ti-1,Wi,Ti.
You have to write your own program to construct this file in ARPA LM
format, but it's not rocket science once you understand the format.

Then you decode using this LM and disambig.

--Andreas

From barhaim at cs.technion.ac.il  Mon May 17 11:25:52 2004
From: barhaim at cs.technion.ac.il (Roy Bar Haim)
Date: Mon, 17 May 2004 20:25:52 +0200
Subject: FW: A simple question about SRILM
Message-ID: <001701c43c3c$65fc62c0$34284484@cs.technion.ac.il>

Hi,

I have the same problem. I want the LM to give maximum-likelihood
estimates. That is, all the backoff weights should be zero.

I applied the solution below, but I still get backoff weights.

For example, when I build the LM like this:

    ngram-count -order 3 -gt1max 0 -gt2max 0 -gt3max 0 -text corpus.tags -lm corpus.tags.lm

I found that the once-occurring trigrams DO NOT APPEAR in the LM, so
probability mass is still discounted.

When I turned on the debug messages, I saw many messages like:

    warning: 0 backoff probability mass left for "AT SCLN" -- incrementing denominator

Does it mean that smoothing is enforced here?

Is there a way to get a pure maximum-likelihood language model, without
backoff weights at all, using ngram-count?

Thanks,
Roy.

> -----Original Message-----
> From: owner-srilm-user at speech.sri.com
> [mailto:owner-srilm-user at speech.sri.com] On Behalf Of Andreas Stolcke
> Sent: Tuesday, April 06, 2004 6:34 PM
> To: David Picó
> Cc: srilm-user at speech.sri.com; Jorge González
> Subject: Re: A simple question about SRILM
>
>
> The ngram-count man page says
>
>     -gtnmax count
>         where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the
>         maximal count of N-grams of order n that are
>         discounted under Good-Turing. All N-grams more
>         frequent than that will receive maximum likelihood
>         estimates. Discounting can be effectively disabled
>         by setting this to 0.
>
> Therefore, you can disable smoothing with
>
>     ngram-count -gt1max 0 -gt2max 0 -gt3max 0 ...
>
> --Andreas
>
> In message <40726957.3070101 at dsic.upv.es> you wrote:
> > Hello,
> >
> > I also have a little question about SRILM. How can I infer a trigram
> > (or bigram, or tetragram...) with no smoothing at all? I need to do
> > some experiments to check the effect of n-gram smoothing in my models
> > and I need a pure trigram with no probability mass derived to lower
> > levels. Is this possible in SRILM? I need to be sure that I really get
> > a trigram (with the whole trigram probabilities).
> >
> > Thank you very much in advance for your help and attention! David
> >
> > --
> > David Picó-Vila
> > Universitat Politècnica de València
> > Departament de Sistemes Informàtics i Computació
> > València, Spain
> >

From stolcke at speech.sri.com  Mon May 17 10:37:58 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 17 May 2004 10:37:58 PDT
Subject: FW: A simple question about SRILM
In-Reply-To: Your message of Mon, 17 May 2004 20:25:52 +0200. <001701c43c3c$65fc62c0$34284484@cs.technion.ac.il>
Message-ID: <200405171737.KAA17206@huge>

In message <001701c43c3c$65fc62c0$34284484 at cs.technion.ac.il> you wrote:
> Hi,
>
> I have the same problem. I want the LM to give maximum-likelihood estimates.
> That is, all the backoff weights should be zero.
>
> I applied the solution below, but still I get backoff weights.
>
> For example, when I build the lm like this:
> ngram-count -order 3 -gt1max 0 -gt2max 0 -gt3max 0 -text corpus.tags -lm corpus.tags.lm
>
> I found that the once-occurring trigrams DO NOT APPEAR in the lm, so
> probability mass is still discounted.

The default minimum occurrence count for trigrams is 2. Set it to 1 to
include all trigrams:

    -gt3min 1

etc. That's why you still get backoff.
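
The effect of the minimum count can be checked by hand against the
backoff-weight definition given earlier in this archive,
BOW(h) = [1 - sum p(w|h)] / [1 - sum p'(w|h)]. The Python sketch below
uses invented toy probabilities and is only an illustration, not SRILM
code. It assumes that a weight of zero is written as -99 in the LM file;
the helper names backoff_weight and log10_or_minus_99 are made up.

```python
import math


def backoff_weight(higher, lower):
    """BOW(h) from the definition above.

    higher: p(w|h) for the words w kept in the model after history h.
    lower:  lower-order estimates p'(w|h) for every word.
    """
    left_over = 1.0 - sum(higher.values())
    lower_left_over = 1.0 - sum(lower[w] for w in higher)
    return left_over / lower_left_over


def log10_or_minus_99(x):
    # Assumption: log10(0) is represented as -99 in the LM file.
    return -99.0 if x <= 0.0 else math.log10(x)


# Toy lower-order distribution over a four-word vocabulary.
lower = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

# Counts after h are a:3, b:1. ML estimates with every n-gram kept:
ml_all = {"a": 0.75, "b": 0.25}
# Same ML estimates, but the singleton "b" is dropped by a minimum count of 2:
ml_cutoff = {"a": 0.75}
# A heavily discounted estimate that is below the lower-order one:
discounted = {"a": 0.2}

for name, probs in [("all kept", ml_all), ("cutoff", ml_cutoff),
                    ("discounted", discounted)]:
    bow = backoff_weight(probs, lower)
    print(name, bow, log10_or_minus_99(bow))
```

With every n-gram kept the ML estimates sum to one and no mass is left
for backoff; dropping the singleton leaves mass behind, so a nonzero
weight appears. The third case shows how a weight above 1, and hence a
positive log, can arise.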
>
> When I turned on the debug messages, I saw many messages like:
> warning: 0 backoff probability mass left for "AT SCLN" -- incrementing denominator
>
> Does it mean that smoothing is enforced here?
>
> Is there a way to get a pure maximum-likelihood language model, without
> backoff weights at all, using ngram-count?

See above.

--Andreas

From barhaim at cs.technion.ac.il Mon May 17 13:05:31 2004
From: barhaim at cs.technion.ac.il (Roy Bar Haim)
Date: Mon, 17 May 2004 22:05:31 +0200
Subject: FW: A simple question about SRILM
In-Reply-To: <200405171737.KAA17206@huge>
Message-ID: <002701c43c4a$4f810b00$34284484@cs.technion.ac.il>

Hi Andreas,

Thanks for your super-fast reply!

I tried it as you suggested:

ngram-count -order 3 -gt1max 0 -gt1min 1 -gt2max 0 -gt2min 1 -gt3max 0 -gt3min 1 -text corpus.tags -lm corpus.tags.lm2 -debug 1

Many of the backoff weights indeed became 99 (which is good), but many remained non-zero (although small: -6,-7,-8...)

Is there a way to make them all 99?

The debug messages I got are listed below.

Thanks a lot,
Roy.
---------------------------------------------------------------------------------------------
corpus.tags: line 1892: 1892 sentences, 48332 words, 0 OOVs
0 zeroprobs, logprob= 0 ppl= 1 ppl1= 1
Good-Turing discounting 1-grams
GT-count [0] = 0
GT-count [1] = 0
warning: no singleton counts
GT discounting disabled
Good-Turing discounting 2-grams
GT-count [0] = 0
GT-count [1] = 126
GT discounting disabled
Good-Turing discounting 3-grams
GT-count [0] = 0
GT-count [1] = 2142
GT discounting disabled
discarded 1 2-gram contexts containing pseudo-events
discarded 2 3-gram contexts containing pseudo-events
writing 41 1-grams
writing 800 2-grams
writing 5145 3-grams

From stolcke at speech.sri.com Tue May 18 20:03:39 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 18 May 2004 20:03:39 PDT
Subject: FW: A simple question about SRILM
In-Reply-To: Your message of Mon, 17 May 2004 22:05:31 +0200. <002701c43c4a$4f810b00$34284484@cs.technion.ac.il>
Message-ID: <200405190303.UAA16121@huge>

In message <002701c43c4a$4f810b00$34284484 at cs.technion.ac.il> you wrote:
> Hi Andreas,
>
> Thanks for your super-fast reply!
>
> I tried it like you suggested:
> ngram-count -order 3 -gt1max 0 -gt1min 1 -gt2max 0 -gt2min 1 -gt3max 0
> -gt3min 1 -text corpus.tags -lm corpus.tags.lm2 -debug 1
>
> Many of the backoff weights indeed became 99 (which is good), but many
> remained non-zero (although small: -6,-7,-8...)
>
> Is there a way to make them all 99?

This might not be necessary. If the left-over probability mass in some context is 0 (as it should be when using ML estimates) AND the denominator, i.e., 1 minus the sum of the lower-order probabilities for the occurring N-grams, is also 0 (since those are also ML estimates), the backoff weight is 0/0, and due to numerical inaccuracies this may turn out to be one of the values you observed. (The code catches actual 0/0 divisions and generates -99 in those cases.)

However, this is not a problem because the backoff log prob value for one of the non-observed ngrams would be -infinity, and the particular value of the backoff weight that gets applied doesn't matter for the outcome (-infinity plus any value is still -infinity).

To verify that that's the case just feed some of those unobserved ngrams to

ngram -debug 2 -ppl

and make sure the log probabilities are -infinity.

--Andreas
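The 0/0 situation described in this reply can be reproduced with a small Python sketch for the bigram case. It is an illustration only, not SRILM code: `ml_backoff_terms` and the toy corpus are hypothetical, sentence-boundary tags are ignored for brevity, and the two returned values are the numerator and denominator of the backoff-weight formula quoted earlier in this thread.

```python
from collections import Counter


def ml_backoff_terms(sentences, history):
    """Return (numerator, denominator) of BOW(history) for a bigram model
    whose bigram and unigram probabilities are unsmoothed ML estimates."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter((a, b) for s in sentences for a, b in zip(s, s[1:]))
    total = sum(unigrams.values())
    # Count of `history` as a context, i.e. the ML denominator for bigrams.
    hist_count = sum(c for (a, _), c in bigrams.items() if a == history)
    followers = [b for (a, b) in bigrams if a == history]
    numerator = 1.0 - sum(bigrams[(history, w)] / hist_count for w in followers)
    denominator = 1.0 - sum(unigrams[w] / total for w in followers)
    return numerator, denominator


# Toy corpus in which every vocabulary word follows "a" at least once.
num, den = ml_backoff_terms([["a", "b"], ["a", "a"]], "a")
print(num, den)  # both terms vanish, so the backoff weight would be 0/0
```

With ML estimates the numerator is always zero up to rounding; the denominator is zero only when the words seen after the history cover the whole vocabulary. In floating point either term can come out as a tiny non-zero number instead, which would be consistent with the small residual values reported above.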

From Caroline.Lavecchia at loria.fr Wed May 19 01:58:23 2004
From: Caroline.Lavecchia at loria.fr (lavecchia)
Date: Wed, 19 May 2004 10:58:23 +0200
Subject: script select-vocab
Message-ID: <40AB21AF.D96BE599@loria.fr>

Hi,

I would like to use the script "select-vocab" but there is a problem. When I run "select-vocab -heldout corpus.text corpus2.text", the error message is

unkown option -heldout
Usage : select-vocab -heldout corph corp1 corp2 ...

Does anyone know what the problem is?

Thanks,
Caroline

From anand at speech.sri.com Wed May 19 04:39:20 2004
From: anand at speech.sri.com (Anand Venkataraman)
Date: Wed, 19 May 2004 04:39:20 -0700 (PDT)
Subject: script select-vocab
In-Reply-To: <40AB21AF.D96BE599@loria.fr> (message from lavecchia on Wed, 19 May 2004 10:58:23 +0200)
Message-ID: <200405191139.EAA23419@clara>

Caroline

I believe there was a slight mismatch between the program and its man page at one point. You can use "-held" instead (the "-" in "held-out" was not optional). Let me know if this works. (You can look inside the script, btw, to see exactly what the option is).

&

From anair at usc.edu Fri May 21 19:14:36 2004
From: anair at usc.edu (Anish Nair)
Date: Fri, 21 May 2004 19:14:36 -0700
Subject: srilam class library
Message-ID: <47101927874.20040521191436@usc.edu>

hi,

has anyone successfully compiled the srilm class libraries in visual c++? i need to compile on win32 and not cygwin because of other dependencies. it would be nice if someone could send me a pointer to some pre-built library. that plus sample code would be ideal.

thanks,
anish

From solen.quiniou at irisa.fr Thu May 27 08:05:05 2004
From: solen.quiniou at irisa.fr (Solen Quiniou)
Date: Thu, 27 May 2004 17:05:05 +0200
Subject: compiling srilm using visual c++
Message-ID: <40B603A1.7090601@irisa.fr>

Hi,

I don't know how to compile srilm using visual c++ but I'm also interested in knowing how to do that.

Solen.

From tanel.alumae at aqris.com Wed Jun 23 03:44:09 2004
From: tanel.alumae at aqris.com (Tanel Alumäe)
Date: Wed, 23 Jun 2004 13:44:09 +0300
Subject: lattice-tool and trigram
Message-ID: <1087987449.4078.13.camel@pc118.host2.starman.ee>

Hello,

I'm trying to expand word lattices with trigrams (and with factored LMs) and find a best path (using -viterbi-decode). However, I'm quite confused with the lattice-tool, because I cannot really understand what it does - sometimes some parameters and options seem to be ignored.

I have a word lattice produced with HTK (using bigram LM), and a trigram LM in ARPA format. I run:

lattice-tool -lm trigram.arpa -in-lattice -read-htk -viterbi-decode

When I try this, I get following output:

reading 60002 1-grams
reading 4038368 2-grams
reading 1184013 3-grams
Lattice::expandToLM: starting expansion to general LM (maxNodes = 0) ...
Lattice::bestWords: processing
Lattice::bestWords: best path prob = -inf
/home/tanel/devel/data/mfc/fts/sentences/aa/s10000_01.mfc

Is this caused by the fact that there are !NULL nodes at the start and end of the ? I tried adding the -no-htk-nulls -no-nulls options but this doesn't seem to help...

Maybe somebody can provide a correct workflow to get a trigram-scored best path from HTK lattices using lattice-tool?

Thanks and best regards,

Tanel Alumäe

From stolcke at speech.sri.com Wed Jun 23 22:59:52 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 23 Jun 2004 22:59:52 PDT
Subject: lattice-tool and trigram
In-Reply-To: Your message of Wed, 23 Jun 2004 13:44:09 +0300.
<1087987449.4078.13.camel@pc118.host2.starman.ee>
Message-ID: <200406240559.WAA29115@huge>

In message <1087987449.4078.13.camel at pc118.host2.starman.ee> you wrote:
> Hello,
>
> I'm trying to expand word lattices with trigrams (and with factored LMs)
> and find a best path (using -viterbi-decode). However, I'm quite
> confused with the lattice-tool, because I cannot really understand what
> it does - sometimes some parameters and options seem to be ignored.
>
> I have a word lattice produced with HTK (using bigram LM), and a trigram
> LM in ARPA format. I run:
> lattice-tool -lm trigram.arpa -in-lattice -read-htk
> -viterbi-decode
>
> When I try this, I get following output:
> reading 60002 1-grams
> reading 4038368 2-grams
> reading 1184013 3-grams
> Lattice::expandToLM: starting expansion to general LM (maxNodes = 0) ...
> Lattice::bestWords: processing
> Lattice::bestWords: best path prob = -inf
> /home/tanel/devel/data/mfc/fts/sentences/aa/s10000_01.mfc
>
> Is this caused by the fact that there are !NULL nodes at the start and
> end of the ? I tried adding the -no-htk-nulls -no-nulls
> options but this doesn't seem to help...
>
> Maybe somebody can provide a correct workflow to get a trigram-scored
> best path from HTK lattices using lattice-tool?
>
> Thanks and best regards,
>
> Tanel Alumäe

You cannot do lattice expansion and decoding in the same run (the decoding would happen on the original lattices, not the expanded ones). So,

1. expand your lattices, store the results
2. decode 1-best words from the expanded lattices

This is also convenient if you want to play with different score weights in step 2 (since step 1 takes much longer than step 2, typically).

In the first step you might want to also apply some pruning to keep the size and runtime manageable. You CAN do pruning and expansion in the same run, since the pruning happens before the expansion.

The order of application of the various options of lattice-tool needs to be documented better.
One of these days ...

--Andreas

From barhaim at cs.technion.ac.il Tue Jun 29 09:20:20 2004
From: barhaim at cs.technion.ac.il (Roy Bar Haim)
Date: Tue, 29 Jun 2004 18:20:20 +0200
Subject: Lattice tool
Message-ID: <00d701c45df4$f9d70c00$34284484@cs.technion.ac.il>

Hi,

I have a few questions about lattices:

1. Is it possible to get the n-best word sequences from a lattice, according to Viterbi decoding, and not only the 1-best? If not, what is the function of n-best lists in SRILM? How are they created?

2. When applying lattice-expansion in lattice-tool: are the original probabilities in the lattice just ignored?

3. How does lattice-expansion work, for instance, for a trigram backoff model? How do the states and transitions change? I would be grateful to see a toy example that clarifies that, or to get a reference for such an explanation.

4. Just to make sure: if the transition probability is p, should I encode it as 10000.5*log(p) (log is the natural log) in the pfsg?

5. In pfsg format, are n1 n2 in the transition lines state numbers, or the actual words? If they are numbers, does the numbering start with 0 or with 1?

Thanks a lot,
Roy.

From stolcke at speech.sri.com Tue Jun 29 21:34:48 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 29 Jun 2004 21:34:48 PDT
Subject: Lattice tool
In-Reply-To: Your message of Tue, 29 Jun 2004 18:20:20 +0200. <00d701c45df4$f9d70c00$34284484@cs.technion.ac.il>
Message-ID: <200406300434.VAA27321@huge>

In message <00d701c45df4$f9d70c00$34284484 at cs.technion.ac.il> you wrote:
> Hi,
>
> I have a few questions about lattices:
>
> 1. Is it possible to get the n-best word sequences from a lattice,
> according to Viterbi decoding, and not only the 1-best? If not, what is
> the function of n-best lists in SRILM? How are they created?

There is currently no facility in SRILM to generate N-best lists from lattices, although that is a perfectly legitimate thing to want to do.
It just hasn't been a high-priority thing for us because we generate N-best with our recognizer, not from lattices directly. I think HTK has some tool that does it. So at least for HTK lattices you could try that.

> 2. When applying lattice-expansion in lattice-tool: are the original
> probabilities in the lattice just ignored?

Yes, except if you preprocess the lattices in some way that relies on the probabilities, e.g., by pruning. Then the "old" probabilities are used in the pruning step, prior to expansion.

> 3. How does lattice-expansion work, for instance, for a trigram backoff
> model. How do the states and transitions change? I would be grateful to
> see a toy example that clarifies that, or to get a reference for such an
> explanation.

Try the paper referenced in the lattice-tool man page. A postscript file can be found at http://www.speech.sri.com/papers/icslp98-lattices.ps.gz

> 4. Just to make sure: if the transition probability is p, should I encode
> it as 10000.5*log(p) (log is the natural log) in the pfsg?

Correct.

> 5. In pfsg format, are n1 n2 in the transition lines state numbers, or
> the actual words? If they are numbers, does the numbering start with 0
> or with 1?

The numbers are the state indices, starting at 0.

--Andreas
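Answers 4 and 5 can be sketched in a few lines of Python. This is a hedged illustration, not SRILM code: the helper names are hypothetical, rounding to the nearest integer is an assumption (the thread only confirms the 10000.5*log(p) scaling), and placing the weight as the third field of a transition line is likewise assumed.

```python
import math

PFSG_LOG_SCALE = 10000.5  # scale factor confirmed in answer 4


def prob_to_pfsg_weight(p: float) -> int:
    """Encode a transition probability as a scaled natural-log integer.
    Rounding to the nearest integer is an assumption of this sketch."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return round(PFSG_LOG_SCALE * math.log(p))


def pfsg_weight_to_prob(weight: int) -> float:
    """Invert the encoding (up to the rounding error)."""
    return math.exp(weight / PFSG_LOG_SCALE)


def format_transition(n1: int, n2: int, p: float) -> str:
    """Format one transition line; n1 and n2 are 0-based state indices
    (answer 5). The field order is an assumption of this sketch."""
    if n1 < 0 or n2 < 0:
        raise ValueError("state indices start at 0")
    return f"{n1} {n2} {prob_to_pfsg_weight(p)}"


print(format_transition(0, 1, 0.5))
```

A probability of 1 maps to weight 0 and smaller probabilities map to increasingly negative integers, so the encoding loses only the precision discarded by the rounding step.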