From amelie.deltour at ira.uka.de  Tue Jan 28 08:11:04 2003
From: amelie.deltour at ira.uka.de (Amélie DELTOUR)
Date: Tue, 28 Jan 2003 17:11:04 +0100
Subject: disambig with "open vocabulary" LM
Message-ID: <3E36AB98.3070405@ira.uka.de>

Hi,
I would like to use the disambig program with an open-vocabulary LM
(built with ngram-count and the -unk option).
I get the following error message: "warning: non-zero probability for
<unk> in closed-vocabulary LM" (the LM read by disambig is not
recognized as an open-vocabulary LM).
What is the matter? Are we supposed to use only closed-vocabulary LMs
with disambig?
Can anyone help?
Thanks,

Amélie

PS: is there anywhere I can find an archive of the mailing list?

--
--------------------------------------------------------------------
Amélie DELTOUR
ENSIMAG / Universität Karlsruhe
E-mail : amelie.deltour at ira.uka.de
--------------------------------------------------------------------

From stolcke at speech.sri.com  Tue Jan 28 09:34:06 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 28 Jan 2003 09:34:06 PST
Subject: disambig with "open vocabulary" LM
In-Reply-To: Your message of Tue, 28 Jan 2003 17:11:04 +0100.
             <3E36AB98.3070405@ira.uka.de>
Message-ID: <200301281734.JAA11165@tonga>

In message <3E36AB98.3070405 at ira.uka.de>you wrote:
> Hi,
> I would like to use the disambig program with an open-vocabulary LM
> (built with ngram-count and the -unk option).
> I get the following error message: "warning: non-zero probability for
> <unk> in closed-vocabulary LM" (the LM read by disambig is not
> recognized as an open-vocabulary LM).
> What is the matter? Are we supposed to use only closed-vocabulary LMs
> with disambig?
> Can anyone help?
> Thanks,
>
> Amélie
>
> PS: is there anywhere I can find an archive of the mailing list?
>

Amélie,

this is an omission in disambig: it fails to tell the vocabulary object
that <unk> is to be treated as a regular word.
Please try the following patch:

===================================================================
RCS file: RCS/disambig.cc,v
retrieving revision 1.34
diff -c -r1.34 disambig.cc
*** /tmp/T00M2saV	Tue Jan 28 09:30:49 2003
--- disambig.cc	Tue Jan 28 09:23:02 2003
***************
*** 709,714 ****
--- 709,715 ----

      vocab.toLower = tolower1? true : false;
      hiddenVocab.toLower = tolower2 ? true : false;
+     hiddenVocab.unkIsWord = keepUnk ? true : false;

      if (mapFile) {
  	File file(mapFile, "r");
===================================================================

A similar patch belongs in hidden-ngram.cc:

===================================================================
RCS file: RCS/hidden-ngram.cc,v
retrieving revision 1.37
diff -c -r1.37 hidden-ngram.cc
*** /tmp/T0aSC8P_	Tue Jan 28 09:32:03 2003
--- hidden-ngram.cc	Tue Jan 28 09:24:59 2003
***************
*** 1007,1012 ****
--- 1007,1013 ----
       */
      Vocab vocab;
      vocab.toLower = toLower? true : false;
+     vocab.unkIsWord = keepUnk ? true : false;

      SubVocab hiddenVocab(vocab);
      SubVocab *classVocab = 0;
===================================================================

As to the mailing list archives: send a message to
majordomo at speech.sri.com with "help" in the body. You will receive
instructions on how to retrieve the archives of this mailing list.
(Unfortunately there is no web interface.)

--Andreas

From thomae at ei.tum.de  Fri Feb  7 07:35:48 2003
From: thomae at ei.tum.de (Matthias Thomae)
Date: Fri, 07 Feb 2003 16:35:48 +0100
Subject: sum of out-transition probabilities != 1.0 ?
Message-ID: <3E43D254.80406@ei.tum.de>

Hello SRILM users,

how is it possible that PFSGs generated with ngram-count and
make-ngram-pfsg (default discounting) have nodes whose outgoing
transition probabilities do not sum to 1.0 (but to, e.g., 1.1)?

I thought this was an important constraint, but maybe I am missing some
theory...

Regards.
Matthias

From stolcke at speech.sri.com  Fri Feb  7 11:16:43 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 07 Feb 2003 11:16:43 PST
Subject: sum of out-transition probabilities != 1.0 ?
In-Reply-To: Your message of Fri, 07 Feb 2003 16:35:48 +0100.
             <3E43D254.80406@ei.tum.de>
Message-ID: <200302071916.LAA08493@huge>

In message <3E43D254.80406 at ei.tum.de>you wrote:
> Hello SRILM users,
>
> how is it possible that PFSGs generated with ngram-count and
> make-ngram-pfsg (default discounting) have nodes whose outgoing
> transition probabilities do not sum to 1.0 (but to, e.g., 1.1)?
>
> I thought this was an important constraint, but maybe I am missing some
> theory...

The PFSGs generated by make-ngram-pfsg are meant to be used in a Viterbi
decoding framework. This means the cost along any given path through
the network corresponds to the correct LM probability of the word
string, but there is no need for the probabilities to sum to 1 at any
given node.

Specifically, the network has the following structure, which can lead to
probabilities summing to more than 1. From a given context, you have
transitions corresponding to the explicit N-gram probabilities in the LM
("a b", "a c", "a d", etc. out of context "a"). In addition you have a
transition into a backoff node that carries a "probability" equal to the
backoff weight for the context. The backoff weight multiplied by the
sum of the probabilities of all the words that don't have explicit
N-grams in that context equals the probability mass left over from those
N-grams. The backoff weight itself is not a true probability, and
therefore the sum over transitions out of the context may well exceed 1.
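
The backoff argument above can be checked with a little arithmetic. What follows is a minimal sketch in Python, not anything taken from SRILM: the unigram and bigram probabilities are made-up toy numbers, and `backoff_weight` is a hypothetical helper that applies the relation stated above (backoff weight times the lower-order mass of the words without explicit N-grams equals the left-over mass).

```python
import math

# Toy numbers (assumptions for illustration only, not from a real LM).
unigram = {"a": 0.25, "b": 0.30, "c": 0.30, "d": 0.10, "e": 0.05}
# Explicit bigram probabilities out of context "a": p(b|a) and p(c|a).
bigram_after_a = {"b": 0.5, "c": 0.3}


def backoff_weight(explicit, lower_order):
    """Backoff weight = left-over mass / lower-order mass of unseen words."""
    leftover = 1.0 - sum(explicit.values())
    unseen_mass = sum(p for w, p in lower_order.items() if w not in explicit)
    return leftover / unseen_mass


bow = backoff_weight(bigram_after_a, unigram)

# Weights on the transitions leaving the context node in the PFSG:
# one per explicit bigram, plus one into the backoff node.
pfsg_out_sum = sum(bigram_after_a.values()) + bow

# The actual conditional distribution p(w | a) still sums to 1.
true_sum = sum(bigram_after_a.get(w, bow * unigram[w]) for w in unigram)

print(f"backoff weight     {bow:.3f}")
print(f"sum of PFSG arcs   {pfsg_out_sum:.3f}")
print(f"sum of p(w | a)    {true_sum:.3f}")
```

With these toy numbers the arc weights out of the context node add up to about 1.3, while the word probabilities themselves still add up to 1, which is the situation described in the reply.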

--Andreas

From amelie.deltour at ira.uka.de  Wed Feb 12 07:10:24 2003
From: amelie.deltour at ira.uka.de (Amélie DELTOUR)
Date: Wed, 12 Feb 2003 16:10:24 +0100
Subject: LM for tagged words
Message-ID: <3E4A63E0.3000003@ira.uka.de>

Hi,
What does "LM support for tagged words is incomplete" (in the "Bugs"
section of the help for ngram-count) mean more precisely?
I wanted to use ngram-count with the -tagged option to build a language
model over word/tag pairs, and then use this LM with hidden-ngram to
find hidden tags.
It does not seem to work (no tag is found). Is it because of the LM?
How could I build the LM I need?
Thanks,

Amélie

--
--------------------------------------------------------------------
Amélie DELTOUR
ENSIMAG / Universität Karlsruhe
E-mail : amelie.deltour at ira.uka.de
--------------------------------------------------------------------

From stolcke at speech.sri.com  Thu Feb 13 12:44:26 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 13 Feb 2003 12:44:26 PST
Subject: LM for tagged words
In-Reply-To: Your message of Wed, 12 Feb 2003 16:10:24 +0100.
             <3E4A63E0.3000003@ira.uka.de>
Message-ID: <200302132044.MAA16113@huge>

In message <3E4A63E0.3000003 at ira.uka.de>you wrote:
> Hi,
> What does "LM support for tagged words is incomplete" (in the "Bugs"
> section of the help for ngram-count) mean more precisely?
> I wanted to use ngram-count with the -tagged option to build a language
> model over word/tag pairs, and then use this LM with hidden-ngram to
> find hidden tags.
> It does not seem to work (no tag is found). Is it because of the LM?
> How could I build the LM I need?

Amélie,

the ngram-count -tagged option has nothing to do with the "tagging" done
by hidden-ngram. ngram-count -tagged is used to build an LM that uses
word tags (classes) for estimating backoff probabilities. (This feature
is rather experimental, and hasn't been touched in a long time, hence
the warning in the man page.)

For hidden-ngram you build a standard LM with ngram-count, treating the
event tags as regular words. You just prepare a training text file that
contains data like

	word tag word tag word word ...

(The way hidden-ngram works, it makes sense to have multiple words
without intervening tags, but not to have multiple tags between words.)

You then give this LM to hidden-ngram, together with the list of tags
(-hidden-vocab) and some test data that contains only words but no tags.
It will output the automatically tagged data.

hidden-ngram is rather heavily used and should be working fine. Let me
know if you have problems.

Hope this helps,

--Andreas

From amelie.deltour at ira.uka.de  Tue Feb 25 08:13:00 2003
From: amelie.deltour at ira.uka.de (Amélie DELTOUR)
Date: Tue, 25 Feb 2003 17:13:00 +0100
Subject: Open-vocabulary LM
Message-ID: <3E5B960C.6010704@ira.uka.de>

Hi,
Is it normal that in an open-vocabulary LM (built with the "-unk"
option) the <unk> token is present as a unigram, but not in bigrams and
trigrams?
(Sorry if this is a silly question, but I am not so familiar with
language models, and I was told that it would not be the case with other
toolkits).
Thanks again,

Amélie

--
--------------------------------------------------------------------
Amélie DELTOUR
ENSIMAG / Universität Karlsruhe
E-mail : amelie.deltour at ira.uka.de
--------------------------------------------------------------------

From stolcke at speech.sri.com  Tue Feb 25 09:02:59 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 25 Feb 2003 09:02:59 PST
Subject: Open-vocabulary LM
In-Reply-To: Your message of Tue, 25 Feb 2003 17:13:00 +0100.
             <3E5B960C.6010704@ira.uka.de>
Message-ID: <200302251702.JAA12151@huge>

Amélie,

it is possible if there are no unknown words in your data, or if you
didn't specify a vocabulary file (because then all words are added
implicitly).
It is also possible that you set ngram cutoffs such that all ngrams
involving <unk> fall below the cutoffs and are therefore excluded from
the LM.

To understand what's going on, run ngram-count with -write COUNTFILE (in
addition to the other options you use) and check which ngrams containing
<unk> are generated.

--Andreas

In message <3E5B960C.6010704 at ira.uka.de>you wrote:
> Hi,
> Is it normal that in an open-vocabulary LM (built with the "-unk"
> option) the <unk> token is present as a unigram, but not in bigrams and
> trigrams?
> (Sorry if this is a silly question, but I am not so familiar with
> language models, and I was told that it would not be the case with other
> toolkits).
> Thanks again,
>
> Amélie
>
> --
> --------------------------------------------------------------------
> Amélie DELTOUR
> ENSIMAG / Universität Karlsruhe
> E-mail : amelie.deltour at ira.uka.de
> --------------------------------------------------------------------

From xlm7b at mizzou.edu  Fri Feb 28 14:36:38 2003
From: xlm7b at mizzou.edu (Xiaolong Li)
Date: Fri, 28 Feb 2003 14:36:38 -0800
Subject: the shell language of srilm installation is different from cygwin
Message-ID: <003e01c2df79$db8db7d0$6314ce80@speechwork>

Hi,
I am now trying to install SRILM on my Win XP machine. It is said that
with cygwin I can do that on a PC, but I find that the shell scripts for
installation are written in csh, while cygwin only supports bash. The
two are quite different, so I cannot run "make World" successfully.
Could you tell me how to resolve this problem?

Thanks.

Hans Li

From stolcke at speech.sri.com  Fri Feb 28 12:40:35 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 28 Feb 2003 12:40:35 PST
Subject: the shell language of srilm installation is different from cygwin
In-Reply-To: Your message of Fri, 28 Feb 2003 14:36:38 -0800.
             <003e01c2df79$db8db7d0$6314ce80@speechwork>
Message-ID: <200302282040.MAA21242@caballe>

> Hi,
> I am now trying to install SRILM on my Win XP machine. It is said that
> with cygwin I can do that on a PC, but I find that the shell scripts for
> installation are written in csh, while cygwin only supports bash. The
> two are quite different, so I cannot run "make World" successfully.
> Could you tell me how to resolve this problem?

You need to install the "tcsh" package, and then make a link

	/bin/csh.exe -> /bin/tcsh.exe

This is explained in doc/README.windows.

--Andreas

From xlm7b at mizzou.edu  Fri Feb 28 21:36:38 2003
From: xlm7b at mizzou.edu (Xiaolong Li)
Date: Fri, 28 Feb 2003 21:36:38 -0800
Subject: About two PPLs
Message-ID: <002101c2dfb4$8801ec90$6314ce80@speechwork>

Hi,
I have installed SRILM successfully, thanks a lot! Now I have a small
question about the PPL output: when I run "ngram" to compute the PPL of
a test text, two perplexities are output, ppl and ppl1. What is the
difference between them? (I can't find this in the documentation.)

Thanks a lot.

Hans Li

From stolcke at speech.sri.com  Fri Feb 28 21:38:08 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 28 Feb 2003 21:38:08 PST
Subject: About two PPLs
In-Reply-To: Your message of Fri, 28 Feb 2003 21:36:38 -0800.
             <002101c2dfb4$8801ec90$6314ce80@speechwork>
Message-ID: <200303010538.VAA04063@huge>

In message <002101c2dfb4$8801ec90$6314ce80 at speechwork>you wrote:
> Hi,
> I have installed SRILM successfully, thanks a lot!
> Now I have a small question about the PPL output: when I run "ngram"
> to compute the PPL of a test text, two perplexities are output, ppl
> and ppl1. What is the difference between them?
> (I can't find this in the documentation.)

ppl is the perplexity normalized over all input tokens; ppl1 omits
end-of-sentence tokens from the denominator. ppl1 is more meaningful
for comparing texts that differ in their sentence segmentations.

BTW, this will be documented in the man page for the next release.

--Andreas

From xlm7b at mizzou.edu  Mon Mar  3 22:56:00 2003
From: xlm7b at mizzou.edu (Xiaolong Li)
Date: Mon, 3 Mar 2003 22:56:00 -0800
Subject: About -bayes option in "ngram" command
Message-ID: <023b01c2e21b$1dcc0720$6314ce80@speechwork>

Hi,
I ran into a problem using the "-bayes" option of the "ngram" command
when I wanted to interpolate two LMs (using posterior mixture weights).
The command below fails with the error message "write() method not
implemented".

	ngram -lm LM1 -mix-lm LM2 -bayes 0 -write-lm MIX_LM

It works when I use a priori mixture weights (without the -bayes option).

Could you help me? Thanks a lot.

Hans Li

From stolcke at speech.sri.com  Tue Mar  4 17:22:38 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 04 Mar 2003 17:22:38 PST
Subject: [Q] on mix-lm?
Message-ID: <200303050122.RAA25728@huge>

This message bounced because the sender address is not (no longer?)
subscribed.

I would like to add that while ngram -bayes enables posterior-weighted
mixtures, the special case -bayes 0 gives a weight of zero to the
posterior, effectively resulting in the standard context-independent
linear interpolation of models.
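
For readers unfamiliar with the term, "standard context-independent linear interpolation" just means that the mixture probability is a fixed weighted sum of the component probabilities. The following is a minimal sketch in Python over toy distributions for a single history; the function name `interpolate`, the parameter `lam`, and the numbers are assumptions for illustration and are not SRILM code.

```python
def interpolate(p_main, p_mix, lam):
    """Mix two word distributions for one history with a fixed weight.

    lam is the weight of the main model, (1 - lam) that of the second
    model; the same weight is used regardless of the history.
    """
    words = set(p_main) | set(p_mix)
    return {
        w: lam * p_main.get(w, 0.0) + (1.0 - lam) * p_mix.get(w, 0.0)
        for w in words
    }


# Toy conditional distributions p(w | h) from two hypothetical models.
lm1 = {"a": 0.7, "b": 0.3}
lm2 = {"a": 0.2, "b": 0.5, "c": 0.3}

mixed = interpolate(lm1, lm2, lam=0.5)
for word in sorted(mixed):
    print(f"{word}  {mixed[word]:.3f}")
```

Because both inputs sum to 1 and the two weights sum to 1, the mixture is again a proper distribution. With posterior-weighted mixing the weight would instead depend on the history, which is why such a model cannot simply be written out as a single static N-gram file.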

--Andreas

------- Forwarded Message

Date: Tue, 04 Mar 2003 15:20:59 -0500
From: Woosung Kim
Subject: Re: About -bayes option in "ngram" command
In-reply-to: <023b01c2e21b$1dcc0720$6314ce80 at speechwork>
To: Xiaolong Li
Cc: srilm-user at speech.sri.com
Message-id: <20030304152059.50a95685.woosung at cs.jhu.edu>
Organization: CLSP/JHU
References: <023b01c2e21b$1dcc0720$6314ce80 at speechwork>

This is exactly the same question I asked before, so I can give you the
answer I got from Dr. Stolcke.
Maybe it needs to be mentioned more explicitly in the manual.

- --
Woosung Kim

From: Andreas Stolcke
To: Woosung Kim
Subject: Re: [Q] on mix-lm?
Date: Thu, 03 Oct 2002 09:22:00 PDT

You cannot "-write-lm" a dynamically interpolated LM (this is also
mentioned in the man page). To evaluate such an LM (compute perplexity,
rescore nbest, etc.) you always give the component LMs and the
interpolation weights.

- --Andreas

On Mon, 03 Mar 2003 22:56:00 -0800
Xiaolong Li wrote:

> Hi,
> I ran into a problem using the "-bayes" option of the "ngram" command
> when I wanted to interpolate two LMs (using posterior mixture weights).
> The command below fails with the error message "write() method not
> implemented".
>
> 	ngram -lm LM1 -mix-lm LM2 -bayes 0 -write-lm MIX_LM
>
> It works when I use a priori mixture weights (without the -bayes option).
>
> Could you help me? Thanks a lot.
>
> Hans Li

------- End of Forwarded Message

From mirjam.sepesy at uni-mb.si  Tue Mar 11 09:36:34 2003
From: mirjam.sepesy at uni-mb.si (Mirjam Sepesy Maucec)
Date: Tue, 11 Mar 2003 18:36:34 +0100
Subject: ngram-merge
Message-ID: <3E6E1EA1.B19BCCA4@uni-mb.si>

Hi.

I have problems with ngram-merge when I want to merge 2 huge sorted
6-gram files (the first is about 2G and contains 61M counts and the
second is 700M and contains 21M counts).
At some point ngram-merge gets stuck. The output file does not change any
more, but ngram-merge is still doing something. When I look at the info
of the output file, I see that the time of the last modification is
changing, and there is still space on the disk.
When I split both input files at the critical 6-gram and merge the top
parts and the bottom parts of both files separately, it works well, but
I think this is not the case. I have to do merging many times :-(

One more question. If my count file contains 4-grams and 6-grams and I
use the -recompute option in ngram-count, are 5-grams then recomputed
from 6-grams and 3-grams from 4-grams?

Regards,
Mirjam.

From stolcke at speech.sri.com  Tue Mar 11 11:39:01 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 11 Mar 2003 11:39:01 PST
Subject: ngram-merge
In-Reply-To: Your message of Tue, 11 Mar 2003 18:36:34 +0100.
             <3E6E1EA1.B19BCCA4@uni-mb.si>
Message-ID: <200303111939.LAA10436@huge>

> Hi.
>
> I have problems with ngram-merge when I want to merge 2 huge sorted
> 6-gram files (the first is about 2G and contains 61M counts and the
> second is 700M and contains 21M counts).
> At some point ngram-merge gets stuck. The output file does not change any
> more, but ngram-merge is still doing something. When I look at the info
> of the output file, I see that the time of the last modification is
> changing, and there is still space on the disk.
> When I split both input files at the critical 6-gram and merge the top
> parts and the bottom parts of both files separately, it works well, but
> I think this is not the case.
> I have to do merging many times :-(

Some systems have problems with files exceeding 2GB in size. (This is
because on older systems file offsets are stored as signed 32-bit
integers.) There is nothing SRILM can do about it, because files are
accessed through the stdio library that comes with the system.

However, it could help to compress or gzip the file, even if the file
size stays above 2GB. This is because the I/O routines then effectively
read from a pipe, and there should be no limit on the amount of data
read that way.

> One more question. If my count file contains 4-grams and 6-grams and I
> use the -recompute option in ngram-count, are 5-grams then recomputed
> from 6-grams and 3-grams from 4-grams?

No. All the 1- to 5-grams will be recomputed from the highest-order
counts. The exceptions are 4-grams that don't have a common prefix with
any of your 6-grams. If you have a 6-gram "a b c d e f" then the counts
for

	a
	a b
	a b c
	a b c d
	a b c d e

will all be recomputed.

--Andreas

From melis at cs.utwente.nl  Tue Mar 11 14:21:59 2003
From: melis at cs.utwente.nl (Paul Melis)
Date: Tue, 11 Mar 2003 23:21:59 +0100
Subject: ARPA format (sorting)
Message-ID: <20030311232159.A15739@luistervink.cs.utwente.nl>

Hello Andreas,

Is there any explicit sorting that LMs in ARPA format should have?
Specifically, is there a standard sort order for the words of uni-, bi-
and trigrams? (e.g. first, then diacritics, then alphabetically,
then...).
We've had some problems with ARPA files written by SRILM that the CMU
toolkit can't handle, and we suspect a problem in the sorting of
n-grams.

Regards,
Paul
--
melis at cs.utwente.nl

From stolcke at speech.sri.com  Tue Mar 11 14:33:44 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 11 Mar 2003 14:33:44 PST
Subject: ARPA format (sorting)
In-Reply-To: Your message of Tue, 11 Mar 2003 23:21:59 +0100.
             <20030311232159.A15739@luistervink.cs.utwente.nl>
Message-ID: <200303112233.OAA18513@huge>

I'm not aware of any specific sorting requirements. SRILM outputs the
N-grams in an order that optimizes memory caching behavior (essentially
by proximity in the underlying tree data structure), but of course it
can read N-grams in any order.

However, I have heard that some CMU software (like Sphinx) expects the
N-grams to be sorted lexicographically left-to-right. The latest
release contains a script "sort-lm" that reorders the N-grams in a
manner that should be agreeable to the CMU software. It is documented
in the lm-scripts(1) man page.

--Andreas

In message <20030311232159.A15739 at luistervink.cs.utwente.nl>you wrote:
> Hello Andreas,
>
> Is there any explicit sorting that LMs in ARPA format should have?
> Specifically, is there a standard sort order for the words of uni-, bi-
> and trigrams? (e.g. first, then diacritics, then alphabetically,
> then...).
> We've had some problems with ARPA files written by SRILM that the CMU
> toolkit can't handle, and we suspect a problem in the sorting of
> n-grams.
>
> Regards,
> Paul
> --
> melis at cs.utwente.nl

From stolcke at speech.sri.com  Tue Mar 11 23:48:04 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 11 Mar 2003 23:48:04 PST
Subject: Problems with reading data from STDIN in: SRILM 1.3.3
In-Reply-To: Your message of Tue, 11 Mar 2003 17:04:31 +0100.
Message-ID: <200303120748.XAA21443@tonga>

In message you wrote:
> Hi Andreas,
>
> I installed the new version 1.3.3 of the SRI LM toolkit on a Linux
> machine (Linux 2.4.19, GNU libc 2.2.5, gcc version 2.95.3).
> I have problems with reading data from STDIN in ngram:
>
> With version 1.3.2 and older this worked:
> cat 300classes | ngram -order 2 -ppl DEVTEST.sri -unk -lm 300Klassen.LM -classes -
> file DEVTEST.sri: 515 sentences, 13964 words, 0 OOVs
> 0 zeroprobs, logprob= -30572.9 ppl= 129.28 ppl1= 154.67
>
> This produces a warning with version 1.3.3:
> cat 300classes | ngram -order 2 -ppl DEVTEST.sri -unk -lm 300Klassen.LM -classes -
> warning: '-' used multiple times for input
> file DEVTEST.sri: 515 sentences, 13964 words, 0 OOVs
> 0 zeroprobs, logprob= -78894.3 ppl= 281112 ppl1= 446516
>
> But this works perfectly well with version 1.3.3:
> ngram -order 2 -ppl DEVTEST.sri -unk -lm 300Klassen.LM -classes 300classes
> file DEVTEST.sri: 515 sentences, 13964 words, 0 OOVs
> 0 zeroprobs, logprob= -30572.9 ppl= 129.28 ppl1= 154.67
>
> Is this problem due to my configuration?
>
> Regards,
> Karl

Karl,

what you see is an unfortunate byproduct of the new -limit-vocab
facility. It requires the class definition file to be read multiple
times to work correctly (at least in the current implementation).
However, the simple patch included below avoids the problem when the
-limit-vocab option is not being used (as in your case).

Note that another scenario where the classes file is read multiple times
is when you are mixing several models. The message

	warning: '-' used multiple times for input

at least warns you that something is trying to read stdin multiple
times.

--Andreas

*** /tmp/T00o6hhK	Tue Mar 11 23:40:05 2003
--- lm/src/ngram.cc	Tue Mar 11 23:37:22 2003
***************
*** 369,377 ****
--- 369,379 ----
  	 * the class names (the first column of the class definitions)
  	 * into the vocabulary.
  	 */
+ 	if (limitVocab) {
  	    File file(classesFile, "r");
  	    classVocab->read(file);
  	}
+ 	}

      ngramLM = decipherHack ?
  		    new DecipherNgram(*vocab, order, !decipherNoBackoff) :

From stolcke at speech.sri.com  Fri Mar 14 11:41:10 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 14 Mar 2003 11:41:10 PST
Subject: Some SRILM questions
In-Reply-To: Your message of Fri, 14 Mar 2003 15:17:29 +0200.
Message-ID: <200303141941.LAA01026@huge>

In message you wrote:
> Hi,
>
> I have a couple of questions about the SRILM toolkit and I was hoping you
> would have time to answer them:
>
> 1) Is there any way to make an n-gram model without sentence start and end
> tags (<s>,</s>) ?

Yes. You can supply the counts yourself and make sure no ngrams
containing those symbols are included. Both symbols will still appear
in the unigrams, but if you declare to be a "non-event" (ngram-count
-nonevents) then they will get 0 probability.
(I haven't tried this recently, let me know if you run into problems.)

> 2) I tried training a Kneser-Ney smoothed 5-gram model
> ( -kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -kndiscount5 )
> and got the error
> warning: one of required count-of-counts is zero
> error in discount estimator for order 4
>
> I suppose this is a feature of K-N smoothing. Is there any way around this
> or have I done something stupid ?

KN (as well as GT) discounting requires count-of-counts statistics, and
to work well these need to come from "natural" data, in the sense that
you didn't delete, duplicate, or otherwise manipulate the raw corpus
counts. For example, you might be using a vocabulary that does not
include all the training words, and that would skew the count-of-counts
statistics.

If there is nothing obvious that you did, try using the "make-big-lm"
script, which is a wrapper around ngram-count that avoids truncating the
vocabulary prior to estimating the discounting statistics.
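
To make the role of the count-of-counts concrete, here is a minimal sketch in Python. It uses the textbook modified Kneser-Ney discount estimates of Chen and Goodman (Y = n1 / (n1 + 2 n2), D1 = 1 - 2Y n2/n1, D2 = 2 - 3Y n3/n2, D3+ = 3 - 4Y n4/n3), which is an assumption about the general technique and not a copy of what ngram-count does internally. The helper names and the toy counts are hypothetical.

```python
from collections import Counter


def count_of_counts(ngram_counts, max_r=4):
    """Return [n1, n2, ..., n_max_r]: how many N-grams occur exactly r times."""
    histogram = Counter(ngram_counts.values())
    return [histogram.get(r, 0) for r in range(1, max_r + 1)]


def modified_kn_discounts(n1, n2, n3, n4):
    """Textbook modified Kneser-Ney discounts (D1, D2, D3+)."""
    if 0 in (n1, n2, n3, n4):
        # Every formula below divides by one of these counts.
        raise ValueError("one of the required count-of-counts is zero")
    y = n1 / (n1 + 2 * n2)
    return (1 - 2 * y * n2 / n1, 2 - 3 * y * n3 / n2, 3 - 4 * y * n4 / n3)


# Toy bigram counts: four singletons, two doubletons, one each of 3 and 4.
counts = {
    ("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 1, ("c", "d"): 1,
    ("a", "d"): 2, ("b", "d"): 2, ("c", "a"): 3, ("d", "a"): 4,
}
print(modified_kn_discounts(*count_of_counts(counts)))

# Manipulating the counts (here: dropping the only N-gram seen 4 times)
# leaves n4 == 0, and the estimator can no longer be computed.
manipulated = {g: c for g, c in counts.items() if c != 4}
try:
    modified_kn_discounts(*count_of_counts(manipulated))
except ValueError as err:
    print("error:", err)
```

The second half mimics what happens when counts are filtered or otherwise altered before estimation: one of n1..n4 vanishes and the discount formulas break down, which is the kind of failure reported in the question.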

--Andreas

From stolcke at speech.sri.com  Tue Mar 18 12:52:55 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 18 Mar 2003 12:52:55 PST
Subject: SRILM 1.3.3 released
In-Reply-To: Your message of Tue, 18 Mar 2003 18:47:49 +0100.
             <3E775BC5.726B1494@itc.it>
Message-ID: <200303182052.MAA08145@huge>

In message <3E775BC5.726B1494 at itc.it>you wrote:
> Dear Dr Andreas Stolcke,
> I would like to ask you some questions: does SRILM support confusion
> network operations? I mean, are there functions in Lattice Tool / Word
> Lattice to transform the (posterior) lattice into a confusion network?
> By the way, as I have only read in detail about Lattice Tool (and of
> course did not find the above-mentioned function), could you please
> explain in a few words the difference between Lattice Tool and Word
> Lattice in SRILM?
> Best regards

Vu,

SRILM only supports building confusion networks from N-best lists.
For processing lattices I recommend Lidia Mangu's tool.

The program dealing with confusion networks is "nbest-lattice".

	nbest-lattice -use-mesh -nbest L -write N

will produce a confusion network N from nbest list L. See the man page
for details.

--Andreas

PS. Please do not send email to "srilm-announce"; that list is for
release announcements only. If you want to enlist the help of other
SRILM users, please join the "srilm-user" mailing list. Send mail to
"majordomo at speech.sri.com" with "help" in the body for instructions.

From weilkar at phonetik.uni-muenchen.de  Tue Mar 18 14:37:13 2003
From: weilkar at phonetik.uni-muenchen.de (Karl Weilhammer)
Date: Tue, 18 Mar 2003 23:37:13 +0100 (CET)
Subject: Where have all the 3-grams gone?
Message-ID:

Hi Andreas,

experimenting a little with SRILM, I found that ngram-count does not enter
trigrams that occur only once into the language model, while it does so
with bigrams.
The command

echo "the man hit the ball" | ngram-count -order 3 -text - -cdiscount3 0.5
-cdiscount2 0.5 -cdiscount1 0.5 -unk -lm test_C3gram.lm

results in the following language model:
__________________________________________

\data\
ngram 1=7
ngram 2=6
ngram 3=0

\1-grams:
-1.079181	</s>
-99	<s>	-0.1760913
-0.3802113	<unk>
-1.079181	ball	-0.2632414
-1.079181	hit	-0.1760913
-1.079181	man	-0.2632414
-0.60206	the	-0.2218487

\2-grams:
-0.30103	<s> the
-0.30103	ball </s>
-0.30103	hit the
-0.30103	man hit
-0.60206	the ball
-0.60206	the man

\3-grams:

\end\
_________________________________________

The same command with "-order 2" results in basically the same language
model (only the lines "ngram 3=0" and "\3-grams:" are missing).
Using "-minprune 4" and "-prune 0" did not change the result.

Is there a possibility to get entries for singleton trigrams in the
language model?

Karl

----------------------------------------------------------------------------
Karl Weilhammer
Institut fuer Phonetik und Sprachliche Kommunikation
Ludwig-Maximilians-Universitaet Muenchen    Tel.: +49-(0)89-2180-2454
Schellingstr. 3                             Fax : +49-(0)89-2800362
80799 Muenchen        Email: weilkar at phonetik.uni-muenchen.de
GERMANY               www  : http://www.phonetik.uni-muenchen.de/
----------------------------------------------------------------------------

From stolcke at speech.sri.com  Tue Mar 18 14:43:42 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 18 Mar 2003 14:43:42 PST
Subject: Where have all the 3-grams gone?
In-Reply-To: Your message of Tue, 18 Mar 2003 23:37:13 +0100.
Message-ID: <200303182243.OAA14815@huge>

In message you wrote:
> Hi Andreas,
>
> experimenting a little with SRILM, I found that ngram-count does not enter
> trigrams that occur only once into the language model, while it does so
> with bigrams.
> The command
>
> echo "the man hit the ball" | ngram-count -order 3 -text - -cdiscount3 0.5
> -cdiscount2 0.5 -cdiscount1 0.5 -unk -lm test_C3gram.lm

The default minimum counts are as follows:

	1grams	1
	2grams	1
	3grams	2
	4grams	2

You can use the -gt1min, -gt2min, etc. options to change these
thresholds at will. (Maybe counter-intuitively, these options apply to
all smoothing schemes.)

--Andreas
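
The effect of these thresholds on the one-sentence example from the question can be reproduced with a few lines of counting. This is only an illustrative Python sketch, not how ngram-count is implemented: the helper names are hypothetical, the sentence boundary tokens are added by hand, and the extra <unk> unigram that -unk contributes to the real model is not modelled.

```python
from collections import Counter

# Default minimum counts per N-gram order, as listed in the reply above.
DEFAULT_MIN_COUNTS = {1: 1, 2: 1, 3: 2, 4: 2}


def ngram_counts(tokens, order):
    """Count all N-grams of the given order in one token sequence."""
    return Counter(
        tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)
    )


def surviving_ngrams(tokens, max_order, min_counts=DEFAULT_MIN_COUNTS):
    """Keep only N-grams whose count reaches the threshold for their order."""
    kept = {}
    for n in range(1, max_order + 1):
        counts = ngram_counts(tokens, n)
        kept[n] = {g: c for g, c in counts.items() if c >= min_counts.get(n, 1)}
    return kept


tokens = "<s> the man hit the ball </s>".split()

default = surviving_ngrams(tokens, 3)
relaxed = surviving_ngrams(tokens, 3, {1: 1, 2: 1, 3: 1})

print({n: len(grams) for n, grams in default.items()})  # {1: 6, 2: 6, 3: 0}
print({n: len(grams) for n, grams in relaxed.items()})  # {1: 6, 2: 6, 3: 5}
```

Every trigram in the sentence occurs exactly once, so a trigram threshold of 2 removes all of them while the six bigrams stay, matching the "ngram 2=6" and "ngram 3=0" lines in the posted model. Lowering the trigram threshold to 1, which is what the -gt3min option mentioned above is for, keeps all five.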