From vancrusoe at hotmail.com Tue Apr 12 14:01:39 2005
From: vancrusoe at hotmail.com (zhou hao)
Date: Wed, 13 Apr 2005 05:01:39 +0800
Subject: about Compiling the SRILM
In-Reply-To: <200503300801.j2U81l805125@huge>
Message-ID:

Dear all,

I am having a problem compiling SRILM under Cygwin.

In my makefile, I now have this:

SRILM = /home/crusoe/srilm
MACHINE_TYPE := $(shell $(SRILM)/sbin/machine-type)

but when I type "make World", the system says

/home/crusoe/srilm/sbin/machine-type: command not found

However, I can clearly see the file in that directory. What could the
problem be?

thanks
crusoe

>From: Andreas Stolcke
>To: "zhou hao"
>CC: srilm-user at speech.sri.com
>Subject: Re: about the ngram -hmm option
>Date: Wed, 30 Mar 2005 00:01:45 PST
>
>In message you wrote:
> > Hey,
> >
> > just got a question in my mind, in the ngram command, it comes with an
> > option -hmm, which needs to take a HMM file as input, so how can I create
> > this file when I train the language model. or should I write some code
> > myself to generate that.
>
>You typically create the file by hand, thus SRILM comes with no
>special tools for this. However, if you are building a large HMM
>structure it is best done by a program or script.
>
>I hope you don't expect SRILM to include some kind of "mini HMM toolkit".
>It doesn't. ngram -hmm is meant for building simple models that
>switch ngram distributions as they generate a sentence.
>
>--Andreas
>

From anand at speech.sri.com Tue Apr 12 14:11:33 2005
From: anand at speech.sri.com (Anand Venkataraman)
Date: Tue, 12 Apr 2005 14:11:33 -0700 (PDT)
Subject: about Compiling the SRILM
Message-ID: <200504122111.j3CLBX024883@clara>

> but when I type "make World", the system says that
> /home/crusoe/srilm/sbin/machine-type: command not found

It probably means that /bin/csh, which is what machine-type is written in,
is not available on your machine. If you don't intend to compile and run
on multiple platforms, as a quick fix, you may just substitute machine-type
with a command that echoes the current machine type -- "i686-cygwin".

From harryking at gmail.com Tue Apr 12 18:50:21 2005
From: harryking at gmail.com (Harry King)
Date: Wed, 13 Apr 2005 09:50:21 +0800
Subject: memory use problem
Message-ID:

Hello.

When making a big LM, I got an error message of
"assertion "body != 0" failed: file
"/home/model/srilm/include/SArray.cc", line 300 ".

I used "OPTION=_c" when making World and used the make-big-lm scripts to
train the LM. The machine has a P4 2.4G CPU and 1G of memory; the OS is
FreeBSD 4.10. When the error message above appeared, the memory used was
a little more than 512M.

What can I do to reduce the memory use? How big an LM can I make?
Thanks.

From stolcke at speech.sri.com Tue Apr 12 19:36:07 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 12 Apr 2005 19:36:07 PDT
Subject: memory use problem
In-Reply-To: Your message of Wed, 13 Apr 2005 09:50:21 +0800.
Message-ID: <200504130236.j3D2a7w10512@huge>

You probably have an OS memory usage limit set. On an Intel machine you
should be able to use at least 2GB of virtual memory. Try the csh "limit"
(or sh "ulimit") command and research how to change the limits if needed.

--Andreas

In message you wrote:
> Hello.
>
> When making a big LM, I got an error message of
> "assertion "body != 0" failed: file
> "/home/model/srilm/include/SArray.cc", line 300 ".
>
> I used "OPTION=_c" when making world and used make-big-lm scripts to
> train the LM.
> And the machine has a P4 2.4G CPU and 1G memory, the OS is FreeBSD
> 4.10. When getting the error message above, the memory used is a
> little more than 512M.
>
> What can I do to reduce the memory use? How big an LM can I make?
> Thanks.
>

From dtkurtz at afterlife.ncsc.mil Mon Apr 18 07:27:10 2005
From: dtkurtz at afterlife.ncsc.mil (Daniel T Kurtz)
Date: Mon, 18 Apr 2005 10:27:10 -0400 (EDT)
Subject: disambig
Message-ID: <200504181426.j3IEQcAu025791@afterlife.ncsc.mil>

I'm looking at the disambig manual, and I'm trying to make sense of general use.
One of the (I'm assuming) required parameters is -map [file]. Is there any way
to generate this map file, or is it a manual process?

Thanks a lot.

-Dan

From vancrusoe at hotmail.com Mon Apr 18 10:55:58 2005
From: vancrusoe at hotmail.com (zhou hao)
Date: Tue, 19 Apr 2005 01:55:58 +0800
Subject: about the disambig function
In-Reply-To: <200504122111.j3CLBX024883@clara>
Message-ID:

Dear all,

I have a question here about using the disambig function. I've made
the map file, and also the lm file. My input is a paragraph of all
lowercased words; I expect the output to be cased words.

In the console, I can see the output is pretty much what I am expecting.
However, I have tried different options to output this result, and all
failed. The outputs are bigrams, single vocabularies, but I want a
paragraph of words.

Can anyone help me with this? Since it can output to the console,
there has got to be some way to output to a file.

By the way, the corners and full stops are displayed as integer numbers
in the console, which is sort of weird.

thanks a lot
crusoe

From lambert at jhu.edu Thu Apr 21 06:16:00 2005
From: lambert at jhu.edu (lambert mathias)
Date: Thu, 21 Apr 2005 09:16:00 -0400
Subject: FW: LM ppl
In-Reply-To:
Message-ID:

> Hi,
>
> When I check perplexity on my bigram LM using the debug 2 switch, I find that
> when the history word is then my LM backs off to the unigram even though
> I have an entry for that bigram in the ARPA model. Of course, I want it to be
> able to pick up this bigram entry. How do I get it to do this?
>
> Lambert

From stolcke at speech.sri.com Mon Apr 25 21:29:16 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 25 Apr 2005 21:29:16 PDT
Subject: FW: LM ppl
In-Reply-To: Your message of Thu, 21 Apr 2005 09:16:00 -0400.
Message-ID: <200504260429.j3Q4TG506643@huge>

In message you wrote:
>
> > Hi,
> >
> > When I check perplexity on my bigram LM using the debug 2 switch, I find that
> > when the history word is then my LM backs off to the unigram even though
> > I have an entry for that bigram in the ARPA model. Of course, I want it to be
> > able to pick up this bigram entry. How do I get it to do this?

Try using the ngram -unk switch.

--Andreas

From stolcke at speech.sri.com Mon Apr 25 22:08:48 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 25 Apr 2005 22:08:48 -0700
Subject: disambig
In-Reply-To: <200504181426.j3IEQcAu025791@afterlife.ncsc.mil>
References: <200504181426.j3IEQcAu025791@afterlife.ncsc.mil>
Message-ID: <426DCCE0.8050906@speech.sri.com>

Daniel T Kurtz wrote:

>I'm looking at the disambig manual, and I'm trying to make sense of general use.
>One of the (I'm assuming) required parameters is -map [file]. Is there any way
>to generate this map file, or is it a manual process?
>
>Thanks a lot.
>
>-Dan
>

Dan,

I think people usually make the map files by writing a script (perl, gawk, etc.)
that generates the required map entries, or translates them from another
representation of the underlying tagging HMM. For very small problems you
might be able to do it by hand, but I don't recommend it.

--Andreas

From stolcke at speech.sri.com Mon Apr 25 22:11:02 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 25 Apr 2005 22:11:02 -0700
Subject: about the disambig function
In-Reply-To:
References:
Message-ID: <426DCD66.80506@speech.sri.com>

zhou hao wrote:

> Dear all,
>
> I have a question here about using the disambig function. I've made
> the map file, and also the lm file. my input is a paragraph of all
> lowercased words, i expect the output to be cased words.
>
> In the console, I can see the output is pretty much I am expecting.
> However, I have tried different options to output this result, all
> failed. the outputs are bigrams, single vocabularies, but i want a
> paragraph of words.
>
> so anyone can help me about this. i mean seems it can output to the
> console, got be some way to output a file.
>
> btw, the corners, full stops are displayed as integer number in the
> console, sort of weird.
> thanks a lot
> crusoe

It looks like you have problems with your operating system, not with
SRILM itself. I suggest you contact a local expert.

--Andreas

From kermorvant at gmail.com Mon May 9 08:55:44 2005
From: kermorvant at gmail.com (Christopher Kermorvant)
Date: Mon, 9 May 2005 17:55:44 +0200
Subject: "format error in lattice file"?
In-Reply-To:
References:
Message-ID:

Hi,

I'm trying to use lattices and language models. I have a lattice of
words coming from a low level decoding process (each word is
associated to a probability).

If I use

lattice-tool.exe -in-lattice my_lattice.pfsg -viterbi-decode

I get the best path in this lattice. So far so good.

Now I want to add a language model to this decoding.
But if I use

lattice-tool.exe -in-lattice my_lattice.pfsg -viterbi-decode -lm my_language_model.bo

it seems that I get the best path according to the language model, not
taking into account the low level probabilities.

Am I right ? Is there a way to decode with both probabilities ?

Thanks in advance,

--
C. Kermorvant

From stolcke at speech.sri.com Mon May 9 10:23:40 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 09 May 2005 10:23:40 PDT
Subject: "format error in lattice file"?
In-Reply-To: Your message of Mon, 09 May 2005 17:55:44 +0200.
Message-ID: <200505091723.j49HNeC02921@huge>

PFSG format supports only a single score so there is no way to combine
acoustic and LM probabilities etc. You should encode your lattices in
HTK format, rescore the lattices with your chosen LM, and decode using
a weighted combination of scores.

--Andreas

In message you wrote:
> Hi,
>
> I'm trying to use lattices and language models. I have a lattice of
> words coming from a low level decoding process (each word is
> associated to a probability).
>
> If I use
>
> lattice-tool.exe -in-lattice my_lattice.pfsg -viterbi-decode
>
> I get the best path in this lattice. So far so good.
>
> Now I want to add a language model to this decoding. But if I use
>
> lattice-tool.exe -in-lattice my_lattice.pfsg -viterbi-decode -lm
> my_language_model.bo
>
> it seems that I get the best path according to the language model, not
> taking into account the low level probabilities.
>
> Am I right ? Is there a way to decode with both probabilities ?
>
> Thanks in advance,
>
> --
> C.
Kermorvant
>

From abhinav.sethy at gmail.com Thu May 12 18:00:56 2005
From: abhinav.sethy at gmail.com (Abhinav Sethy)
Date: Thu, 12 May 2005 18:00:56 -0700
Subject: negative weights for meging language models
Message-ID: <136aa037050512180056587e9b@mail.gmail.com>

Hi

I am trying to implement the anti-language-model approach used in
"THE SRI MARCH 2000 HUB-5 CONVERSATIONAL SPEECH TRANSCRIPTION SYSTEM".

This requires the use of a negative weight on a language model. Is it
possible to do this in the toolkit? The ngram tool lets you specify
the weight on the primary language model but does not allow a negative
weight on the mixture language model. Does some other tool allow
negative weights?

--
Abhinav

From gemma.boleda at upf.edu Tue May 17 13:45:40 2005
From: gemma.boleda at upf.edu (Gemma Boleda)
Date: Tue, 17 May 2005 20:45:40 +0000
Subject: -tagged option?
Message-ID: <200505172045.40590.gemma.boleda@upf.edu>

Hi,

I am using the -tagged option for ngram-count and I am experiencing 2
problems:

a) the slash is taken into account in the ngram counts: taking as input
"la/DT nena/N5 és/V maca/JQ ./PT", the bigrams look as follows:

<s> la 1
<s> /DT 1
la nena 1
nena és 1
és maca 1
/N5 és 1
/N5 /V 1
/V maca 1
/V /JQ 1
/DT nena 1
/DT /N5 1
maca . 1
/JQ . 1
/JQ /PT 1
. </s> 1
/PT </s> 1

Why is the slash considered as part of the tag?

b) as can be seen in the example, the n-grams with tags are only built
left-to-right, e.g. there is no bigram "la /N5", as I would have expected
(and needed).

Can you help me?

Thanks a lot,

Gemma Boleda
Universitat Pompeu Fabra
Barcelona

From udani at streamsage.com Wed May 25 12:18:49 2005
From: udani at streamsage.com (Goldee Udani)
Date: Wed, 25 May 2005 15:18:49 -0400
Subject: LM missing back-off probabilities
Message-ID: <4294CF99.2030006@streamsage.com>

Hi there,

I am sorry if this problem has already been addressed before on this forum.

I am trying to generate a small LM for use in the Sphinx Speech
Recognition system, but the back-off probabilities for every ngram
occurring at the end of sentence(s) are missing.
For example -

<s> we cannot afford to fight the war against poverty with accounting
tricks </s>

For a trigram LM, it doesn't generate back-off probabilities for
"tricks" (unigram) and "accounting tricks " (bigram). This tends to
happen for all the sentences in the test set taken from the corpus.

I am trying to use the "ngram-count" script with Witten-Bell
discounting applied to all n-grams in a trigram model.

If any of you have faced a similar problem before, I would appreciate
it if you could help me out here.

Thanks,
Goldee

From yannick.esteve at lium.univ-lemans.fr Wed May 25 13:57:50 2005
From: yannick.esteve at lium.univ-lemans.fr (Yannick Estève - LIUM)
Date: Wed, 25 May 2005 22:57:50 +0200
Subject: LM missing back-off probabilities
In-Reply-To: <4294CF99.2030006@streamsage.com>
References: <4294CF99.2030006@streamsage.com>
Message-ID: <4294E6CE.3020104@lium.univ-lemans.fr>

I hope this message can help you.

To use CMU Sphinx with an LM estimated with SRILM you have to use two tools
provided with the SRILM toolkit:

-add-dummy-bows: this program adds the 'missing' back-off weights (in
fact, when these weights are equal to 0, ngram-count doesn't print them)
-sort-lm: this program sorts n-grams in lexical order (lm3gdmp works
only if the n-grams are sorted. In fact, 2-3-...-k-grams have to be
sorted in the same order).

These two tools are programmed in awk (awk or gawk has to be installed
on your computer).

-- Yannick

Goldee Udani wrote:

> Hi there,
>
> I am sorry if this problem has already been addressed before on this
> forum.
>
> I am trying to generate a small LM for using in Sphinx Speech
> Recognition system but the back-off probabilities for every ngram
> occurring at the end of sentence(s) are missing.
> For example -
>
> <s> we cannot afford to fight the war against poverty with accounting
> tricks </s>
>
> For a trigram LM, it doesn't generate back-off probabilities for
> "tricks" (unigram) and "accounting tricks " (bigram). This tends to
> happen for all the sentences in the test set taken from the corpus.
>
> I am trying to use the "ngram-count" script with witten bell
> discounting applied to all n-grams in a trigram model.
>
> If any of you have faced a similar problem before, I would appreciate
> it if you could help me out here.
>
> Thanks,
> Goldee
>
>

From stolcke at speech.sri.com Wed May 25 15:49:35 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 25 May 2005 15:49:35 PDT
Subject: LM missing back-off probabilities
In-Reply-To: Your message of Wed, 25 May 2005 22:57:50 +0200.
        <4294E6CE.3020104@lium.univ-lemans.fr>
Message-ID: <200505252249.PAA11184@tonga>

In message <4294E6CE.3020104 at lium.univ-lemans.fr> you wrote:
> I hope this message can help you.
>
> To use CMU Sphinx with LM estimated with SRILM you have to use two tools
> provided with SRILM toolkit :
>
> -add-dummy-bows: this program adds the 'missing' back-off weights (in
> fact, when these weights equal to 0 ngram-count doesn't print them)
> -sort-lm: this program sorts n-grams in lexical order (lm3gdmp works
> only if the n-grams are sorted. In fact, 2-3-...-k-grams have to be
> sorted in the same order).
>
> These two tools are programmed in awk (awk or gawk have to be installed
> on your computer).
>
> -- Yannick

I agree with the above. But I think there is something else going on
in the case described. The default minimum ngram count for trigrams is 2,
so trigrams occurring only once in your data will not show up in the LM.
Use

ngram-count -gt3min 1 ....

and you will (hopefully) find that the trigram "accounting tricks " shows
up in the LM, along with all its prefixes.

--Andreas

> Goldee Udani wrote:
>
> > Hi there,
> >
> > I am sorry if this problem has already been addressed before on this
> > forum.
> >
> > I am trying to generate a small LM for using in Sphinx Speech
> > Recognition system but the back-off probabilities for every ngram
> > occurring at the end of sentence(s) are missing.
> > For example -
> >
> > <s> we cannot afford to fight the war against poverty with accounting
> > tricks </s>
> >
> > For a trigram LM, it doesn't generate back-off probabilities for
> > "tricks" (unigram) and "accounting tricks " (bigram). This tends to
> > happen for all the sentences in the test set taken from the corpus.
> >
> > I am trying to use the "ngram-count" script with witten bell
> > discounting applied to all n-grams in a trigram model.
> >
> > If any of you have faced a similar problem before, I would appreciate
> > it if you could help me out here.
> >
> > Thanks,
> > Goldee
> >
> >
>

From Nachiappan.Nachiappan at postgrads.unisa.edu.au Thu May 26 01:05:19 2005
From: Nachiappan.Nachiappan at postgrads.unisa.edu.au (Nachiappan, Nachiappan - NACNY001)
Date: Thu, 26 May 2005 17:35:19 +0930
Subject: SRILM toolkit
Message-ID: <959E6440BFDDD447AC681EAE9B6F3C5C02AFC5F2@ITUPROD-EXCL2.UniNet.unisa.edu.au>

Hi everyone,

I am new to this group. I am trying to develop an n-gram language model
to eliminate filled pauses. I would like to know whether I can use the
SRILM toolkit to develop the language model. I would also like to know
whether I can train and test the language model with the toolkit.

Thank you

nachi

From Nachiappan.Nachiappan at postgrads.unisa.edu.au Wed Jun 1 18:38:36 2005
From: Nachiappan.Nachiappan at postgrads.unisa.edu.au (Nachiappan, Nachiappan - NACNY001)
Date: Thu, 2 Jun 2005 11:08:36 +0930
Subject: REG- SRILM toolkit
Message-ID: <959E6440BFDDD447AC681EAE9B6F3C5C02AFC807@ITUPROD-EXCL2.UniNet.unisa.edu.au>

Hi there,

I am Nachiappan. I am new to this group. I am trying to develop a
language model to eliminate filled pauses. I have downloaded the SRILM
toolkit and its supporting files (CYGWIN) and I have installed it. After
installation I got only some extra directories like include and bin. I
would like to know how I can use SRILM to develop a language model. I
would also like to know if there is any user manual for the SRILM toolkit.

Looking forward to your reply,

Thank you,

nachiappan

From tanel.alumae at aqris.com Mon Jun 6 09:03:31 2005
From: tanel.alumae at aqris.com (Tanel Alumäe)
Date: Mon, 06 Jun 2005 19:03:31 +0300
Subject: KN discounting and zeroton words
Message-ID: <1118073811.16700.12.camel@localhost>

Hello,

I've noticed that when using -kndiscount, the zeroton words (words that
are in the vocabulary but not in the training corpus) get a higher
unigram LM probability than words that actually occur (rarely) in the
training corpus. Shouldn't the zeroton words get the same unigram
probability as the words that are discounted to 0 using the -gt1min
option?

With GT, WB and natural discounting, everything works as expected:
zeroton words get the same unigram probability as the words discounted
to 0.

Regards,

Tanel A.

From tanel.alumae at aqris.com Mon Jun 6 09:38:59 2005
From: tanel.alumae at aqris.com (Tanel Alumäe)
Date: Mon, 06 Jun 2005 19:38:59 +0300
Subject: KN discounting and zeroton words
In-Reply-To: <1118073811.16700.12.camel@localhost>
References: <1118073811.16700.12.camel@localhost>
Message-ID: <1118075939.16700.23.camel@localhost>

A little correction: also with KN discounting, zeroton words get the
same unigram probability as words discounted to zero (using -gt1min).
What I don't understand is why this probability can be higher than for
words that are not discounted to zero.

E.g. for a very small test set, and using '-gt1min 2', zeroton and
singleton words get a probability of -0.7323937, but a word occurring
twice gets a probability of -1.556303.

I believe this is some magic property of KN discounting, in which case I
apologize for polluting the list and go back to reading the description
of the algorithm.

Regards,

Tanel A.

On Mon, 2005-06-06 at 19:03 +0300, Tanel Alumäe wrote:
> Hello,
>
> I've noticed that when using -kndiscount, the zeroton words (words that
> are in the vocabulary but not in the training corpus) get a higher
> unigram LM probability than words that actually occur (rarely) in the
> training corpus. Shouldn't the zeroton words get the same unigram
> probability as the words that are discounted to 0 using the -gt1min
> option?
>
> With GT, WB and natural discounting, everything works as expected:
> zeroton words get the same unigram probability as the words discounted
> to 0.
>
> Regards,
> Tanel A.
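
One way such an ordering can arise is shown in the toy calculation below.
It is not SRILM's Kneser-Ney code: it assumes plain absolute discounting
on unigram counts, with the freed probability mass shared evenly among
the vocabulary words that have no retained count. The function name, the
counts and the discount value are all made up for illustration.

```python
from fractions import Fraction


def toy_unigram_probs(counts, vocab, discount, min_count=1):
    """Toy absolute-discounting unigram estimate (NOT SRILM's estimator).

    Words whose count is below min_count are treated like unseen words.
    Every retained word loses `discount` from its count, and the freed
    mass is split evenly over all vocabulary words without a retained count.
    """
    total = sum(counts.values())
    kept = {w: c for w, c in counts.items() if c >= min_count}
    unseen = [w for w in vocab if w not in kept]
    probs = {w: (c - discount) / total for w, c in kept.items()}
    leftover = 1 - sum(probs.values())
    if unseen:
        share = leftover / len(unseen)
        for w in unseen:
            probs[w] = share
    return probs


if __name__ == "__main__":
    # Hypothetical counts: "c" and "d" are singletons, dropped by min_count=2.
    counts = {"a": 10, "b": 2, "c": 1, "d": 1}
    probs = toy_unigram_probs(counts, ["a", "b", "c", "d"],
                              Fraction(3, 4), min_count=2)
    print(probs["b"], probs["c"])  # prints: 5/56 1/8
```

With only two words sharing the leftover mass, each of them (1/8) ends up
above the word seen twice (5/56). With a larger unseen vocabulary the
per-word share shrinks and the ordering flips back.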

From shachi at streamsage.com Wed Jun 8 06:29:56 2005
From: shachi at streamsage.com (Shachi Dave)
Date: 08 Jun 2005 09:29:56 -0400
Subject: read/write counts in FLMs
In-Reply-To: <959E6440BFDDD447AC681EAE9B6F3C5C02AFC5F2@ITUPROD-EXCL2.UniNet.unisa.edu.au>
References: <959E6440BFDDD447AC681EAE9B6F3C5C02AFC5F2@ITUPROD-EXCL2.UniNet.unisa.edu.au>
Message-ID: <1118237396.3826.162.camel@sanskruti.streamsage.com>

Hi,

I am trying to build a factored language model (FLM) using "fngram-count"
in the SRILM toolkit.

When I run it using the "-write-counts" and "-lm" options together, it
builds the FLM correctly. But when I try to break it down into two
steps:
(a) only the "-write-counts" option, to write the counts file
(b) the "-read-counts" and "-lm" options, to build the FLM using the
counts file

it gives errors. I checked the debug output; it seems it is getting the
count-of-counts for modified Kneser-Ney discounting wrong in step
(b) above. The counts file generated in step (a) is exactly the same as
the one generated using both "-write-counts" and "-lm" options together.
I tried these steps using a couple of different FLM specifications and
the error is the same. Has anyone faced this problem before? I would
appreciate it if you could help me out here.

Thanks,
Shachi

From tanel.alumae at aqris.com Fri Jun 10 06:46:50 2005
From: tanel.alumae at aqris.com (Tanel Alumäe)
Date: Fri, 10 Jun 2005 16:46:50 +0300
Subject: read/write counts in FLMs
In-Reply-To: <1118237396.3826.162.camel@sanskruti.streamsage.com>
References: <959E6440BFDDD447AC681EAE9B6F3C5C02AFC5F2@ITUPROD-EXCL2.UniNet.unisa.edu.au>
        <1118237396.3826.162.camel@sanskruti.streamsage.com>
Message-ID: <1118411210.27799.16.camel@localhost>

Hello,

As far as I understand, you need both the FLM LM file and the FLM counts
file to actually use the FLM. So you should actually always use both the
-write-counts and the -lm option when building an FLM.

As for -read-counts, I believe that you could use a general counts file
there (i.e. one which counts the occurrences of tagged words rather than
the factors). You can get the general counts file from the tagged corpus
using the ngram-count program, just like for an untagged corpus. The FLM
counts file uses a special format (look into it and you will see), which
probably confuses fngram-count when fed into it using -read-counts.

Hope this helps,

Tanel A.

On Wed, 2005-06-08 at 09:29 -0400, Shachi Dave wrote:
> Hi,
>
> I am trying to build a factored language model(FLM) using "fngram-count"
> in SRILM toolkit.
>
> When I run it using "-write-counts" and "-lm" options together, it
> builds the FLM correctly. But when I try to break it down into two
> steps:
> (a) only "-write-counts" option to write the counts file
> (b) "-read-counts" and "-lm" options to build the FLM using the counts
> file
>
> it gives errors. I checked the debug output; it seems it is getting the
> count-of-counts for modified Kneser-Ney discounting wrong in the step
> (b) above. The counts file generated in step (a) is exactly similar to
> the one generated using both "-write-counts" and "-lm" options together.
> I tried these steps using a couple of different FLM specifications and
> the error is the same. Has anyone faced this problem before? I will
> appreciate if you can help me out here.
>
> Thanks,
> Shachi
>

From stolcke at speech.sri.com Sat Jun 11 20:40:28 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 11 Jun 2005 20:40:28 PDT
Subject: KN discounting and zeroton words
In-Reply-To: Your message of Mon, 06 Jun 2005 19:38:59 +0300.
        <1118075939.16700.23.camel@localhost>
Message-ID: <200506120340.UAA29915@tonga>

In message <1118075939.16700.23.camel at localhost> you wrote:
>
> A little correction: also with KN discounting, zeroton words get the
> same unigram probability as words discounted to zero (using -gt1min).
> What I don't understand, is why can this probability be higher than for
> words that are not discounted to zero? E.g.
>
> E.g. for a very little test set, and using '-gt1min 2', zeroton and
> singleton words get a probability -0.7323937, but a word occurring twice
> gets a probability -1.556303.
>
> I believe this is some magic property of KN discounting, in which case I
> apologize for polluting the list and go back to reading the description
> of the algorithm.

The unigram probabilities for zeroton words are obtained by distributing
the backoff mass left by the non-zeroton words evenly over all the zerotons
(this corresponds to backing off to a uniform distribution).
Now, if the number of zerotons is small they might actually get more
probability than the low-count observed unigrams that way.

The -interpolate1 option should prevent this since it distributes the
backoff mass over ALL unigrams (adding to the probability of those words
that were observed).
Please check if this is the case, and if not, send me a test case so
I can look into why it doesn't work as intended.

--Andreas

From tanel.alumae at aqris.com Mon Jun 13 06:50:00 2005
From: tanel.alumae at aqris.com (Tanel Alumäe)
Date: Mon, 13 Jun 2005 16:50:00 +0300
Subject: KN discounting and zeroton words
In-Reply-To: <200506120340.UAA29915@tonga>
References: <200506120340.UAA29915@tonga>
Message-ID: <1118670600.13946.7.camel@localhost>

> The unigram probabilities for zeroton words are obtained by distributing
> the backoff mass left by the non-zeroton words evenly over all the zerotons
> (this corresponds to backing off to a uniform distribution).
> Now, if the number of zerotons is small they might actually get more
> probability than the low-count observed unigrams that way.
>
> The -interpolate1 option should prevent this since it distributes the
> backoff mass over ALL unigrams (adding to the probability of those words
> that were observed).
> Please check if this is the case, and if not, send me a test case so
> I can look into why it doesn't work as intended.

Yes, the -interpolate1 option prevents this from happening. Thanks for
the help.

Tanel

From tanel.alumae at aqris.com Mon Jun 13 08:20:09 2005
From: tanel.alumae at aqris.com (Tanel Alumäe)
Date: Mon, 13 Jun 2005 18:20:09 +0300
Subject: Ney's absolute discounting and zeroton words
Message-ID: <1118676009.13946.22.camel@localhost>

Hello,

I continue my quest with zeroton words. I want to control the amount of
probability that is distributed over words that are in the vocabulary but
not in the training corpus. It seems that Ney's absolute discounting is
good for that. So, I started experimenting with the constant for Ney's
discounting. Here are the unigram probabilities for an unseen word, for
different discounting constants:

constant     unigram probability (unseen word)
0.1          -1.410174
0.01         -2.410174
0.001        -3.410148
0.0001       -4.410249
0.00001      -5.409665
0.000001     -1.278751
0.0000001    -1.278753

As you see, there is an abrupt increase in probability when the constant
gets to 0.000001, which is unexpected. Is this how it should be, or is it
caused by some numerical problem? I'm using SRILM on a 32-bit x86
processor. The numbers here are given for a small test set, but I've seen
similar behaviour for large sets.

Regards,

Tanel

From stolcke at speech.sri.com Fri Jun 24 16:14:44 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 24 Jun 2005 16:14:44 PDT
Subject: negative weights for meging language models
In-Reply-To: Your message of Thu, 12 May 2005 18:00:56 -0700.
        <136aa037050512180056587e9b@mail.gmail.com>
Message-ID: <200506242314.j5ONEia22298@huge>

In message <136aa037050512180056587e9b at mail.gmail.com> you wrote:
> Hi
>
> I am trying to implement the Anti language model approach used in the
> "THE SRI MARCH 2000 HUB-5 CONVERSATIONAL SPEECH TRANSCRIPTION SYSTEM"
>
> This requires the use of a negative weight on a language model. Is it
> possible to do this in the toolkit?
> The ngram tool lets you specify
> the weight on the primary language model but does not allow a negative
> weight on the mixture language model. Does some other tool allow
> negative weights?

You shouldn't try to (linearly) interpolate the anti-LM with the regular
LM. The combination uses a log-linear interpolation. This is typically
done when rescoring nbest lists or lattices. You just specify the regular
LM and the anti-LM scores separately, and a negative scaling factor for
the latter.

--Andreas

From stolcke at speech.sri.com Fri Jun 24 16:23:30 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 24 Jun 2005 16:23:30 PDT
Subject: -tagged option?
In-Reply-To: Your message of Tue, 17 May 2005 20:45:40 -0000.
        <200505172045.40590.gemma.boleda@upf.edu>
Message-ID: <200506242323.j5ONNU022974@huge>

In message <200505172045.40590.gemma.boleda at upf.edu> you wrote:
> Hi,
>
> I am using the -tagged option for ngram-count and I am experiencing 2
> problems:
>
> a) the slash is taken into account in the ngram counts: taking as input
> "la/DT nena/N5 és/V maca/JQ ./PT", the bigrams look as follows:
>
> <s> la 1
> <s> /DT 1
> la nena 1
> nena és 1
> és maca 1
> /N5 és 1
> /N5 /V 1
> /V maca 1
> /V /JQ 1
> /DT nena 1
> /DT /N5 1
> maca . 1
> /JQ . 1
> /JQ /PT 1
> . </s> 1
> /PT </s> 1
>
> Why is the slash considered as part of the tag?

The / in front of a token signifies that it's a tag, as opposed to a word.
It's just a way to encode word/tags, as well as words and tags
individually, without ambiguity.

>
> b) as can be seen in the example, the n-grams with tags are only built
> left-to-right, e.g. there is no bigram "la /N5", as I would have expected
> (and needed).

The program collects only those N-gram statistics that are required
by the underlying model. Since the goal is to use the tags in backoff,
the statistics needed are asymmetrical.
If you want a different set of N-grams you can probably write a simple
perl script to do the job.

--Andreas
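
The kind of script suggested above could look like the following minimal
sketch, written in Python rather than perl. It is not part of SRILM and
does not reproduce what ngram-count -tagged does: it simply counts all
four word/tag combinations for each pair of adjacent tokens, so that
pairs such as "la /N5" are included. The function names are hypothetical,
the leading "/" on tags only mirrors the notation in the counts shown
earlier, and sentence-boundary tokens are left out.

```python
from collections import Counter


def split_tagged(token):
    """Split 'word/TAG' at the last slash; the tag keeps a leading '/'."""
    word, slash, tag = token.rpartition("/")
    if not slash:
        raise ValueError("token has no tag: %r" % token)
    return word, "/" + tag


def tagged_bigrams(sentence):
    """Count every word/tag combination over adjacent token pairs."""
    pairs = [split_tagged(tok) for tok in sentence.split()]
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(pairs, pairs[1:]):
        for left in (w1, t1):
            for right in (w2, t2):
                counts[(left, right)] += 1
    return counts


if __name__ == "__main__":
    counts = tagged_bigrams("la/DT nena/N5 és/V maca/JQ ./PT")
    for (left, right), n in sorted(counts.items()):
        print(left, right, n)
```

For the five-token example sentence this yields 16 distinct bigrams, each
with count 1, including the word-then-tag pair ("la", "/N5").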