From lakshmi at lantana.tenet.res.in Tue Oct 3 22:27:51 2006
From: lakshmi at lantana.tenet.res.in (Lakshmi A)
Date: Wed, 4 Oct 2006 10:57:51 +0530 (IST)
Subject: query regarding usage of SRILM toolkit
In-Reply-To: <200609291604.JAA05487@tonga>
References: <200609291604.JAA05487@tonga>
Message-ID:

Greetings!!!

Thanks for the prompt reply. But the ideas you mentioned seem to be for
boundary marking when the whole sequence is correct. Our recognition
output is only 50% correct. That is, we have a sequence of syllables that
are just 50% correct, from which we need to extract the words. The n-best
results of the recognizer could be used to improve the performance. We can
have a lattice of syllable sequences where each syllable has an n-best list.
Now, the task is to find the best word sequence from this n-best lattice.
Do you have any similar programs? Please do reply.

Thanks in advance.
Regards,
Lakshmi

On Fri, 29 Sep 2006, Andreas Stolcke wrote:

>
> In message you wrote:
>>
>> Greetings!!!
>>
>> We are developing a syllable based isolated style continuous speech
>> recognizer for Indian languages. Currently, our recognizer output is just
>> a sequence of syllables. We want to extract the sequence of words from
>> this syllable sequence using statistical language models and a lexicon.
>> I thought maybe one of the programs in this toolkit must be doing
>> something similar (sub-word sequence to word sequence conversion). But
>> all the programs seem to use word lattices.
>>
>> Is there any program in this toolkit that extracts the word sequence from
>> the sub-word sequence using LM and lexicon?
>
> Lakshmi,
>
> first you have to remember that when the documentation of a program says
> 'words' it doesn't mean you have to use words in the conventional sense.
> You can use any kind of token (phones, syllables, etc.) in your lattices
> etc.
>
> The task you describe sounds like a boundary tagging problem, i.e., given
> a sequence of tokens, you want to label each transition between tokens as
> either a "boundary" or a "non-boundary". There are two tools in SRILM
> that can do this, using different kinds of models. One is
> "hidden-ngram", which performs boundary tagging explicitly.
> The other is "disambig", which tags the tokens themselves, not the
> boundaries between them. But by assigning tags that denote "first token
> in a unit", "token inside a unit", etc. you can perform boundary tagging
> implicitly.
> (The tokens in your case are the syllables, the units would be the words.)
> Both tools use ngram language models to disambiguate the input.
> The model can be trained from syllabified training data, in your case.
>
> I suggest you look up papers on "word segmentation", "sentence segmentation",
> "Mandarin tokenization", "chunk parsing" and "shallow parsing" to
> get a good idea of the existing models for this type of task,
> then study the manual pages for the programs.
>
> --Andreas
>
>
>>
>> Thanks in Advance.
>> Regards,
>> Lakshmi
>

From stolcke at speech.sri.com Wed Oct 4 12:41:47 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 04 Oct 2006 12:41:47 -0700
Subject: query regarding usage of SRILM toolkit
In-Reply-To:
References: <200609291604.JAA05487@tonga>
Message-ID: <45240E7B.9000509@speech.sri.com>

Lakshmi A wrote:
>
> Greetings!!!
>
> Thanks for the prompt reply. But the ideas you mentioned seems to be
> for boundary marking when the whole sequence is correct. Our
> recognition output is only 50% correct. That is we have a sequence of
> syllables that are just 50% correct from which we need to extract the
> words. The n-best results of the recognizer could be used to improve
> the performance. We can have a lattice of syllable sequence where each
> syllable has a n-best list.
> Now, the task is to find the best word sequence from this n-best
> lattice.
> Do you have any similar programs. Please do reply.
>
> Thanks in Advance.
> Regards,
> Lakshmi
>
> On Fri, 29 Sep 2006, Andreas Stolcke wrote:
>

If your output is n-best, you can apply the disambig or hidden-ngram
taggers to each of the hypotheses, and then extract the 1-best
segmentation by some criterion.

If your output is in lattice format, things are more involved. You'd have
to edit the lattices to insert nodes representing the different tagging
choices (e.g., boundary/no-boundary), then rescore the lattice with the
tagging LM to extract the best hypothesis.

Andreas

From chiateek at comp.nus.edu.sg Sat Oct 21 00:54:40 2006
From: chiateek at comp.nus.edu.sg (chiateek at comp.nus.edu.sg)
Date: Sat, 21 Oct 2006 15:54:40 +0800
Subject: Implementation details of -write-ngrams?
Message-ID: <20061021075440.GA3056@localhost.localdomain>

Hello,

Where can I find a detailed description of the algorithm for computing
n-gram counts (-write-ngrams) in SRILM? Thanks!

From stolcke at speech.sri.com Sat Oct 21 09:38:49 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 21 Oct 2006 09:38:49 PDT
Subject: Implementation details of -write-ngrams?
In-Reply-To: Your message of Sat, 21 Oct 2006 15:54:40 +0800.
    <20061021075440.GA3056@localhost.localdomain>
Message-ID: <200610211638.k9LGcos06332@huge>

In message <20061021075440.GA3056 at localhost.localdomain> you wrote:
> Hello,
>
> Where can I find a detailed description of the algorithm for computing
> n-gram counts (-write-ngrams) in SRILM? Thanks!

The concept of posterior ngram counts is explained in section 3.3.2 of the
paper

A. O. Hatch, B. Peskin, and A. Stolcke (2005), Improved Phonetic Speaker
Recognition Using Lattice Decoding, Proc. IEEE ICASSP, Philadelphia,
vol. 1, pp. 169-172.
http://www.speech.sri.com/cgi-bin/run-distill?papers/icassp2005-spkr-phonelats.ps.gz

(where you have to replace "phone" with "word" since the default is to
compute word ngrams). Note this is not a new concept.
The algorithm is a forward-backward computation with on-the-fly lattice
expansion. For further details you'll have to read the source code.

Andreas

From ioparin at yahoo.co.uk Sun Oct 22 10:50:56 2006
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Sun, 22 Oct 2006 18:50:56 +0100 (BST)
Subject: [SRILM] FLM model training on large data
Message-ID: <20061022175056.33004.qmail@web25403.mail.ukl.yahoo.com>

Hi, everybody!

Does anyone have any experience of building a Factored Language Model on
large data? There is still no problem with, say, processing a file in FLM
format containing 5 mln entries, but as soon as I try to feed it a 50 mln
FLM corpus, it needs an infeasible amount of memory (since it loads
everything into memory).

Does anyone know if there are any tricks for training an FLM in this case?
Something like building partial LMs and then merging with standard
ngram-count... What could you suggest as a solution?

best regards,
Ilya

From ioparin at yahoo.co.uk Mon Oct 23 04:58:52 2006
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Mon, 23 Oct 2006 12:58:52 +0100 (BST)
Subject: [SRILM] Some more FLM questions
In-Reply-To: <453C481E.3000204@ee.washington.edu>
Message-ID: <20061023115852.65010.qmail@web25410.mail.ukl.yahoo.com>

Dear Katrin,

thanks for the reply.

There are a couple of other questions for those concerned with FLM
development:

1) Is there any possibility to interpolate FLMs with normal LMs? I tried to
do this with "ngram" using the "-factored" and then "-mix-lm" options, but
it didn't work since it expected even a general (standard) word model to be
factored as well, and I couldn't see how to tell the system that the first
of the interpolated models is conventional, though the others are factored.
In "fngram" there is no such option either, as far as I can tell.
2) Could you please specify how you work with large data?
When I was training the model on 5M data, it was taking 1.2G of memory.
Actually, I work with inflectional languages (Russian and Czech) so the
factors are really "rich": features for each word include its stem,
inflection, detailed morphological tag and lemma. Maybe that's why it takes
so much space? Otherwise I cannot see how you managed to run it for 30G
words in English: in my case, if I want to enlarge the data it seems like
I'll have to switch to a 64-bit architecture. Do SRILM and FLM support
64-bit somehow?
If it's only me that is so lucky with memory loads, what could you suggest
to reduce it?

3) Which parameters does the training time depend on?

Thanks in advance,
regards,
ilya

--- Katrin Kirchhoff wrote:

>
> Ilya,
>
> We have trained FLMs with ~30M words without problems, but yes,
> beyond that it becomes a problem. We are currently working
> on updates to the code that make it possible to use larger
> corpora - these haven't been publicly released yet but
> I'll let you know when they become available.
>
> best,
> Katrin
>
> ilya oparin wrote:
> > Hi, everybody!
> >
> > Does anyone have any experience of building a Factored Language Model on
> > large data? There is still no problem with, say, processing a file in
> > FLM format containing 5 mln entries, but as far as I try to feed a 50
> > mln FLM corpus, it needs unfeasible amount of memory (since it loads
> > everything in memory).
> >
> > Does anyone know if there are any tricks how to train an FLM model in
> > this case? Something like building partial LMs and then merging with
> > standard ngram-count... What could you suggest as a solution?
> >
> >
> > best regards,
> > Ilya
best regards,
Ilya

From stolcke at speech.sri.com Mon Oct 23 15:46:53 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 23 Oct 2006 15:46:53 -0700
Subject: [SRILM] Some more FLM questions
In-Reply-To: <20061023115852.65010.qmail@web25410.mail.ukl.yahoo.com>
References: <20061023115852.65010.qmail@web25410.mail.ukl.yahoo.com>
Message-ID: <453D465D.9020105@speech.sri.com>

ilya oparin wrote:
>
> 2) Could you please specify how you work with large data?
> When I was training the model on 5M data, it was
> taking 1.2G of memory. Actually, I work with
> inflectional languages (Russian and Czech) so the
> factors are really "rich": features for each word
> include its stem, inflection, detailed morphological
> tag and lemma. May be that's why it takes so much
> space? Otherwise I can not get how you managed to run
> it for 30G words in English: in my case if I want to
> enlarge data it seems like I'll have to switch to
> 64-bit architecture. Does SRILM and FLM support 64-bit
> somehow?
> If it's only me that lucky with memory loads, what
> could you suggest to reduce it?
>

Yes, SRILM supports 64-bit Linux (and other) platforms.
For Linux running on AMD64-compatible machines use

    make MACHINE_TYPE=i686-m64

To reduce memory consumption use the strategies described in doc/FAQ.
I'm copying here the relevant bits, many of which apply to FLMs as well.

> Topic: Large data / too little memory issues
>
> 1) I'm getting a message saying (among other things)
>
>       Assertion `body != 0' failed.
>
> A: You are running out of memory. See subsequent questions depending on
>    what you are trying to do.
>    Note: the above message means you are running
>    out of "virtual" memory on your computer, which could be because of
>    limits in swap space, administrative resource limits, or limitations of
>    the machine architecture (a 32-bit machine cannot address more than
>    4GB no matter how many resources your system has).
>    Another symptom of not enough memory is that your program runs, but
>    very, very slowly, i.e., it is "paging" or "swapping" as it tries to
>    use more memory than the machine has RAM installed.
>
> 2) I am trying to count N-grams in a text file and running out of memory.
>
> A: Don't use ngram-count directly to count N-grams. Instead, use the
>    make-batch-counts and merge-batch-counts scripts described in
>    training-scripts(1). That way you can create N-gram counts limited only
>    by the maximum file size on your system.
>
> 3) I am trying to build an N-gram LM and ngram-count runs out of memory.
>
> A: You are running out of memory either because of the size of ngram counts,
>    or of the LM being built. The following are strategies for reducing the
>    memory requirements for training LMs.
>
>    a) Assuming you are using Good-Turing or Kneser-Ney discounting, don't
>       use ngram-count in "raw" form. Instead, use the make-big-lm wrapper
>       script described in the training-scripts(1) man page.
>
>    b) Switch to using the "_c" or "_s" versions of the SRI binaries. For
>       instructions on how to build them, see the INSTALL file.
>       Once built, set your executable search path accordingly, and try
>       make-big-lm again.
>
>    c) Raise the minimum counts for N-grams included in the LM, i.e., the
>       values of the options -gt2min, -gt3min, -gt4min, etc. The higher
>       order N-grams typically get higher minimum counts.
>
>    d) Get a machine with more memory. If you are hitting the limitations of
>       a 32-bit machine architecture, get a 64-bit machine and recompile
>       SRILM to take advantage of the expanded address space.
>       (The "i686-m64" MACHINE_TYPE setting is for systems based on
>       64-bit AMD processors.)
>       Note that the 64-bit pointers require a memory overhead in
>       themselves, so you will need a machine with significantly, not just
>       a little, more memory than 4GB.
>
> 4) I am trying to apply a large LM to some data and am running out of
>    memory.
>
> A: Again, there are several strategies to reduce memory requirements.
>
>    a) Use the "_c" or "_s" versions of the SRI binaries. See 3b) above.
>
>    b) Precompute the vocabulary of your test data and use the
>       ngram -limit-vocab option to load only the N-gram parameters relevant
>       to your data. This approach should allow you to use arbitrarily
>       large LMs provided the data is divided into small enough chunks.
>
>    c) If the LM can be built on a large machine, but then is to be used on
>       machines with limited memory, use ngram -prune to remove the less
>       important parameters of the model. This usually gives huge size
>       reductions with relatively modest performance degradation. The
>       tradeoff is adjustable by varying the pruning parameter.
>

Andreas

From pclouds at gmail.com Tue Oct 31 10:01:42 2006
From: pclouds at gmail.com (Nguyen Thai Ngoc Duy)
Date: Wed, 1 Nov 2006 01:01:42 +0700
Subject: SRILM and GCC 4.1.1
Message-ID:

Hi all,
I tried to compile SRILM with GCC 4.1.1 and got lots of errors (mostly
undefined references). Has anyone tried it with GCC 4.1?

Best regards,
--
Duy

From stolcke at speech.sri.com Tue Oct 31 10:13:39 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 31 Oct 2006 10:13:39 PST
Subject: SRILM and GCC 4.1.1
In-Reply-To: Your message of Wed, 01 Nov 2006 01:01:42 +0700.
Message-ID: <200610311813.k9VIDdu12157@huge>

In message you wrote:
> Hi all,
> I tried to compile SRILM with GCC 4.1.1 and got lots of errors (mostly
> undefined references). Has anyone tried it with GCC 4.1?

It compiles cleanly with gcc 4.1.0 and the right compiler options.
See $SRILM/common/Makefile.machine.i686-gcc4

Andreas

From pclouds at gmail.com Tue Oct 31 11:05:17 2006
From: pclouds at gmail.com (Nguyen Thai Ngoc Duy)
Date: Wed, 1 Nov 2006 02:05:17 +0700
Subject: SRILM and GCC 4.1.1
In-Reply-To: <200610311813.k9VIDdu12157@huge>
References: <200610311813.k9VIDdu12157@huge>
Message-ID:

On 11/1/06, Andreas Stolcke wrote:
> $SRILM/common/Makefile.machine.i686-gcc4

Thank you. After setting MACHINE_TYPE=i686-gcc4, I still got errors:

make[2]: Entering directory `/home/pclouds/tmp/srilm/lm/src'
/usr/bin/g++ -I. -I/home/pclouds/tmp/srilm/include -u matherr
-L/home/pclouds/tmp/srilm/lib/i686-gcc4 -g -O3 -o
../bin/i686-gcc4/ngram ../obj/i686-gcc4/ngram.o
../obj/i686-gcc4/liboolm.a -lm -ldl
/home/pclouds/tmp/srilm/lib/i686-gcc4/libflm.a
/home/pclouds/tmp/srilm/lib/i686-gcc4/libdstruct.a
/home/pclouds/tmp/srilm/lib/i686-gcc4/libmisc.a -ltcl -lm 2>&1 |
c++filt
../obj/i686-gcc4/liboolm.a(Vocab.o): In function `LHash::remove(char const*, bool&)':
/home/pclouds/tmp/srilm/include/LHash.cc:416: undefined reference to `LHash::removedData'
/home/pclouds/tmp/srilm/include/LHash.cc:417: undefined reference to `LHash::removedData'
/home/pclouds/tmp/srilm/include/LHash.cc:424: undefined reference to `LHash::removedData'
/home/pclouds/tmp/srilm/include/LHash.cc:473: undefined reference to `LHash::removedData'

I'm using SRILM 1.5.0

> Andreas
>
>

--
Duy

From stolcke at speech.sri.com Tue Oct 31 11:36:14 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 31 Oct 2006 11:36:14 PST
Subject: SRILM and GCC 4.1.1
In-Reply-To: Your message of Wed, 01 Nov 2006 02:05:17 +0700.
Message-ID: <200610311936.k9VJaEm19959@huge>

Make sure the C++ compiler is invoked with

    -DINSTANTIATE_TEMPLATES

If it is, then there seems to be a strange problem with your linker or
compiler installation that I cannot reproduce.

--Andreas

In message you wrote:
> On 11/1/06, Andreas Stolcke wrote:
> > $SRILM/common/Makefile.machine.i686-gcc4
> Thank you.
> After setting MACHINE_TYPE=i686-gcc4, I still got errors:
>
> make[2]: Entering directory `/home/pclouds/tmp/srilm/lm/src'
> /usr/bin/g++ -I. -I/home/pclouds/tmp/srilm/include -u matherr
> -L/home/pclouds/tmp/srilm/lib/i686-gcc4 -g -O3 -o
> ../bin/i686-gcc4/ngram ../obj/i686-gcc4/ngram.o
> ../obj/i686-gcc4/liboolm.a -lm -ldl
> /home/pclouds/tmp/srilm/lib/i686-gcc4/libflm.a
> /home/pclouds/tmp/srilm/lib/i686-gcc4/libdstruct.a
> /home/pclouds/tmp/srilm/lib/i686-gcc4/libmisc.a -ltcl -lm 2>&1 |
> c++filt
> ../obj/i686-gcc4/liboolm.a(Vocab.o): In function `LHash unsigned int>::remove(char const*, bool&)':
> /home/pclouds/tmp/srilm/include/LHash.cc:416: undefined reference to `LHash::removedData'
> /home/pclouds/tmp/srilm/include/LHash.cc:417: undefined reference to `LHash::removedData'
> /home/pclouds/tmp/srilm/include/LHash.cc:424: undefined reference to `LHash::removedData'
> /home/pclouds/tmp/srilm/include/LHash.cc:473: undefined reference to `LHash::removedData'
>
> I'm using SRILM 1.5.0
>
> > Andreas
> >
> >
>
> --
> Duy

From pclouds at gmail.com Tue Oct 31 22:22:00 2006
From: pclouds at gmail.com (Nguyen Thai Ngoc Duy)
Date: Wed, 1 Nov 2006 13:22:00 +0700
Subject: SRILM and GCC 4.1.1
In-Reply-To: <200610311936.k9VJaEm19959@huge>
References: <200610311936.k9VJaEm19959@huge>
Message-ID:

On 11/1/06, Andreas Stolcke wrote:
>
> make sure the c++ compiler is invoked with
>
> -DINSTANTIATE_TEMPLATES

Obviously I blindly overwrote the CC and CXX variables without looking into
the Makefiles. It works now. Thank you and sorry for the noise.

>
> If it is then there seems to be a strange problem with your linker or
> compiler installation that I cannot reproduce.
>
> --Andreas

--
Duy

From stolcke at speech.sri.com Wed Nov 1 09:27:32 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 01 Nov 2006 09:27:32 PST
Subject: -gt1min
In-Reply-To: Your message of Wed, 01 Nov 2006 08:28:21 +0100.
    <45484C95.4030401@web.de>
Message-ID: <200611011727.kA1HRWr04897@huge>

In message <45484C95.4030401 at web.de> you wrote:
> Andreas Stolcke wrote:
> > In message <45475E03.4040105 at web.de> you wrote:
> >> Hi Andreas,
> >>
> >> ngram-count effectively ignores the -gt1min option, i.e. the cutoff
> >> value for unigrams. Is that the desired behavior?
> >
> > How do you reach this conclusion?
> >
> > Andreas
> >
> >
> e.g.,
> ngram-count -order 1 -gt1min 1 -text -lm lm1
> ngram-count -order 1 -gt1min 5 -text -lm lm5
> both produce the same list of unigrams (same length), just the logprob
> changes. I would have expected unigrams below gt1min being pruned (as
> are ngrams of higher order) and hence the list in lm5 being shorter...
>
> Ronny
>
> --
> ------------------------------------
> Ronny Melz
> IfI, NLP Dept, University of Leipzig
> Augustusplatz 10/11
> 04109 Leipzig, Germany
> ------------------------------------
>

Ronny,

the fact that all words appear in the unigrams does not mean that
-gt1min doesn't work. For historical reasons the unigram list also
serves the purpose of listing the vocabulary of the LM. Therefore
SRILM always includes all words in the unigrams. However, those words
that are excluded by -gt1min get a probability that corresponds
to the zero-order backoff probability. Zero-order backoff probabilities
are obtained by distributing the probability mass left over from unigram
discounting over all words.

If you want to exclude certain words from the LM altogether use the
-vocab option.

Andreas

From ioparin at yahoo.co.uk Wed Nov 8 00:28:43 2006
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Wed, 8 Nov 2006 08:28:43 +0000 (GMT)
Subject: bug in lattice-tool?
Message-ID: <20061108082843.47119.qmail@web25401.mail.ukl.yahoo.com>

Andreas,

We've possibly found a bug in lattice-tool. Here, in Brno, we work with the
Czech language, which has diacritized letters.
So, lattice-tool does everything well with all the calculations until it
comes to matching the best path against the reference file to get the
number of del, subs and ins - and finally WER. It appears that if both
files are in ISO encoding and there is a diacritized letter in the
reference, it can be matched to a non-diacritized word in the output that
is actually a different word. So, the WER goes down significantly from what
it really is (and what is correctly output by HResults in HTK).

best regards,
Ilya

From stolcke at speech.sri.com Wed Nov 8 06:30:00 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 08 Nov 2006 06:30:00 PST
Subject: bug in lattice-tool?
In-Reply-To: Your message of Wed, 08 Nov 2006 08:28:43 +0000.
    <20061108082843.47119.qmail@web25401.mail.ukl.yahoo.com>
Message-ID: <200611081430.kA8EU0Z05065@huge>

SRILM uses the strcmp() C library function to compare strings.
I suspect what you're seeing is a function of locale settings by way of
environment variables such as LANG and LC_COLLATE. This is almost certainly
an OS-dependent issue.

First, I would try setting $LANG to "C" and unsetting any of the LC_*
variables. I would write a little test program that invokes strcmp() and
observe the effect of locale settings on the result.

BTW, I have used SRILM with Spanish, which also has diacritics in the
vocabulary, and it works fine.

--Andreas

In message <20061108082843.47119.qmail at web25401.mail.ukl.yahoo.com> you wrote:
> Andreas,
>
> We've possibly found a bug in lattice-tool. Here, in
> Brno, we work with th Czech language that has
> diacritized letters. So, lattice-tool does everything
> well with all the calculations until it comes to
> matching of the best path with the reference file to
> get number of del, subs and ins - and finally WER.
> It appears that if both files are in ISO encoding and
> there is a diacritized letter in the reference, it can
> be matched to a non-diacritized word in the output,
> that is actually a different word. So, the WER goes
> down significantly from what really is (and what is
> correctly output by HResults in HTK).
>
> best regards,
> Ilya
>

From ioparin at yahoo.co.uk Tue Nov 14 08:17:44 2006
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Tue, 14 Nov 2006 16:17:44 +0000 (GMT)
Subject: [SRILM] lattice-tool LM interpolarion
Message-ID: <20061114161745.10337.qmail@web25415.mail.ukl.yahoo.com>

Hi,

Could anyone give me any hints on the following: when I interpolate
different LMs (with different vocabs) to rescore lattices with lattice-tool
(in HTK format), several links in the output lattice get l=-inf
probability, which makes it impossible to calculate the Viterbi best path
etc.
To me it looks like a log-linear mix is performed, which yields -inf
whenever at least one of the LMs gives this output (which is possible due
to the different vocabs). If so, is there any way to interpolate LMs with
different vocabs in lattice-tool, or should all the LMs have the same vocab
beforehand? Or am I just missing something crucial in the way the whole
thing works?

lattice-tool -in-lattice-list lat.list -read-htk -no-htk-nulls
-htk-words-on-nodes -lm LM1 -mix-lm LM2 -write-htk -htk-logbase 2.71828
-out-lattice-dir out_dir

BTW, Andreas, thanks for the comments on accented characters, it works.

best regards,
Ilya

From stolcke at speech.sri.com Tue Nov 14 08:58:08 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 14 Nov 2006 08:58:08 PST
Subject: [SRILM] lattice-tool LM interpolarion
In-Reply-To: Your message of Tue, 14 Nov 2006 16:17:44 +0000.
    <20061114161745.10337.qmail@web25415.mail.ukl.yahoo.com>
Message-ID: <200611141658.kAEGw8L03438@huge>

In message <20061114161745.10337.qmail at web25415.mail.ukl.yahoo.com> you wrote:
> Hi,
>
> Could anyone give me any hints on the following: when
> I interpolate different LMs (with different vocabs) to
> rescore lattices with lattice-tool (in HTK format), in
> the output lattice several links get l=-inf
> probability, that leads to the fact it is impossible
> to calculate viterbi best path etc.
> For me it looks like the loglinear mix is performed,
> that leads to getting -inf in case at least one of the
> LMs gives this output (that is possible due to
> different vocabs). If so, is there any way to
> interpolate LMs with different vocabs in lattice-tool,
> or all the LMs should have the same vocab beforehand?
> Or I just miss something crucial in the way the whole
> thing works?
>
> lattice-tool -in-lattice-list lat.list -read-htk
> -no-htk-nulls -htk-words-on-nodes -lm LM1 -mix-lm LM2
> -write-htk -htk-logbase 2.71828 -out-lattice-dir
> out_dir

This produces a linear (not loglinear) mixture of models. The vocabulary of
such a model is the union of the vocabularies of the component models. The
-inf scores must be due to words that are not in the union of the
vocabularies of LM1 and LM2, or to probabilities that are explicitly 0 in
the LMs.

>
> BTW, Andeas, thanks for the comments on accented
> characters, it works.

Glad to hear it.

Andreas

From stolcke at speech.sri.com Wed Dec 6 14:07:11 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 06 Dec 2006 14:07:11 PST
Subject: deleted estimation using SRILM
In-Reply-To: Your message of Wed, 06 Dec 2006 14:44:53 -0500.
Message-ID: <200612062207.kB6M7BY25643@huge>

In message you wrote:
> Hello Andreas,
>
> I have the latest SRILM toolkit version and I am trying to implement
> deleted interpolation using ngram/ngram-count but I cannot seem to get it
> to work.
> Would it be possible to get a sample of how the command(s) would
> look like?

The latest version of SRILM implements deleted interpolation as part
of the "count-LM" LM class. Look up the -count-lm option in both the
ngram-count and the ngram man pages.
Then look at $SRILM/test/tests/ngram-count-lm/run-test for an example
of how it all fits together.

Deleted interpolation is not typically as good as other schemes such
as modified Kneser-Ney smoothing, but has some practical advantages
(memory-efficient implementation) when applied to very large count sets.

Andreas

From stolcke at speech.sri.com Thu Dec 7 14:08:01 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 07 Dec 2006 14:08:01 PST
Subject: deleted estimation using SRILM
In-Reply-To: Your message of Wed, 06 Dec 2006 17:21:11 -0500.
Message-ID: <200612072208.kB7M81305989@huge>

In message you wrote:
> Quick question.
> Is there a way to get the deleted-interpolation LM in arpa or fstn format?
>
> thanks,
> -Ghinwa

No, because a deleted-interpolation LM cannot be represented exactly as a
backoff LM in general (short of listing all ngrams).

What you can do, however, is define a set of ngrams and then create
a backoff LM whose probabilities match exactly those of the
deleted-interpolation LM for those ngrams (and use backoff for all others).
This way, most SRILM LM classes can be approximated by backoff LMs.

To do this use the ngram -rescore-ngram option (see man page).

    ngram -rescore-ngram BACKOFF-LM \
        OTHER-LM-OPTIONS \
        -write-lm NEW-BACKOFF-LM

where OTHER-LM-OPTIONS specifies the LM from which the new probabilities
are taken. By choosing the set of ngrams in BACKOFF-LM large or small you
control the goodness of the approximation.

Andreas

>
> On Wed, 6 Dec 2006, Andreas Stolcke wrote:
>
> >
> > In message you wrote:
> >> Hello Andreas,
> >>
> >> I have the latest SRILM toolkit version and I am trying to implement
> >> deleted interpolation using ngram/ngram-count but I cannot seem to get it
> >> to work. Would it be possible to get a sample of how the command(s) would
> >> look like?
> >
> > The latest version of SRILM implements deleted interpolation as part
> > of the "count-LM" LM class. Look up the -count-lm option in both the
> > ngram-count and the ngram man pages.
> > Then look at $SRILM/test/tests/ngram-count-lm/run-test for an example
> > of how it all fits together.
> >
> > Deleted interpolation is not typically as good as other schemes such
> > as modified Kneser Ney smoothing, but has some practical advantages
> > (efficient memory implementation) when applied to very large count sets.
> >
> > Andreas
> >
> >
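
For readers unfamiliar with the deleted interpolation discussed in the
thread above, the following is a minimal sketch of the general idea, not of
SRILM's count-LM implementation. It assumes the simplest textbook case: a
bigram estimate mixed with a unigram estimate through a single weight that
is fitted by EM on held-out data. All function names and the toy data are
hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_ngrams(tokens):
    """Collect unigram, history and bigram counts from a token list."""
    unigrams = Counter(tokens)
    histories = Counter(tokens[:-1])            # tokens that have a successor
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, histories, bigrams


def component_probs(prev, word, counts):
    """Maximum-likelihood unigram and bigram estimates for one event."""
    unigrams, histories, bigrams = counts
    p_uni = unigrams[word] / sum(unigrams.values())
    p_bi = bigrams[(prev, word)] / histories[prev] if histories[prev] else 0.0
    return p_uni, p_bi


def interpolated_prob(prev, word, counts, lam):
    """P(word | prev) = lam * P_ML(word | prev) + (1 - lam) * P_ML(word)."""
    p_uni, p_bi = component_probs(prev, word, counts)
    if counts[1][prev] == 0:                    # unseen history: unigram only
        return p_uni
    return lam * p_bi + (1.0 - lam) * p_uni


def estimate_lambda(heldout, counts, iterations=20):
    """Fit the single interpolation weight by EM on held-out data."""
    lam = 0.5
    # Only events whose history was seen in training involve the weight.
    events = [(p, w) for p, w in zip(heldout, heldout[1:]) if counts[1][p] > 0]
    for _ in range(iterations):
        posterior_sum, used = 0.0, 0
        for prev, word in events:
            p_uni, p_bi = component_probs(prev, word, counts)
            mix = lam * p_bi + (1.0 - lam) * p_uni
            if mix > 0.0:
                # Posterior that the bigram component generated this event.
                posterior_sum += lam * p_bi / mix
                used += 1
        if used == 0:
            break
        lam = posterior_sum / used
    return lam


# Toy data (hypothetical): counts come from one part of the corpus, the
# weight is fitted on a separate part that was held out of the counts.
train = "a b a b a c".split()
heldout = "a b a c a b".split()
counts = count_ngrams(train)
lam = estimate_lambda(heldout, counts)
print("lambda:", lam)
print("P(b | a):", interpolated_prob("a", "b", counts, lam))
```

A single global weight keeps the sketch short; practical systems usually
tie the weights to count ranges or contexts, and for real use the -count-lm
option and the run-test example cited above remain the reference.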