From thomae at ei.tum.de Wed Oct 2 06:09:46 2002
From: thomae at ei.tum.de (Matthias Thomae)
Date: Wed, 02 Oct 2002 15:09:46 +0200
Subject: N-Gram without backoff?
Message-ID: <3D9AF01A.4010505@ei.tum.de>

Hello SRILM users,

does anyone know if and how it is possible to construct n-gram language
models without backoff, and to convert them into pfsg format? I could
not find any corresponding option for ngram or ngram-count. I tried
manually deleting the lower-order n-grams from the ARPA format file, but
I am not sure if the weights are still correct then.

Regards.
Matthias

From stolcke at speech.sri.com Wed Oct 2 10:04:56 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 02 Oct 2002 10:04:56 PDT
Subject: N-Gram without backoff?
In-Reply-To: Your message of Wed, 02 Oct 2002 15:09:46 +0200. <3D9AF01A.4010505@ei.tum.de>
Message-ID: <200210021704.KAA04741@huge>

You can disable probability smoothing with

	ngram-count -gt1max 0 -gt2max 0 ...

This will still include lower-order N-grams in the models, but they are
effectively never used because no probability mass is left for backing off.
You could then remove the lower-order ngrams to save space (but leave
the unigrams in). The conversion to pfsg should be unaffected by any of this.

--Andreas

In message <3D9AF01A.4010505 at ei.tum.de>you wrote:
> Hello SRILM users,
>
> does anyone know if and how it is possible to construct n-gram language
> models without backoff, and to convert them into pfsg format? I could
> not find any corresponding option for ngram or ngram-count. I tried
> manually deleting the lower-order n-grams from the ARPA format file, but
> I am not sure if the weights are still correct then.
>
> Regards.
> Matthias
>

From woosung at clsp.jhu.edu Wed Oct 2 21:15:18 2002
From: woosung at clsp.jhu.edu (Woosung Kim)
Date: Thu, 3 Oct 2002 00:15:18 -0400
Subject: [Q] on mix-lm?
Message-ID: <20021003001518.430ac8d8.woosung@clsp.jhu.edu>

Dear Dr.
Stolcke,

I am doing some experiments using interpolated LMs, and
I've noticed that mixed LMs give slightly different PPLs
from the PPLs they should give. I mean, PPLs calculated by taking
weighted sums after getting the respective models' word probs.
Do you have any documentation or explanation of how 'mix-lm'
works in your toolkit or how it is different from the correct way?
Of course, the best way would be to look at the source code,
but I am looking for an easier way.
According to my experiments, mix-lm gives better results when
the baseline model (before mixing) is good (PPL less than 300),
but it gives worse results when it is not good (PPL above 500).

Thanks in advance,
--
Woosung Kim

From anand at speech.sri.com Wed Oct 2 22:49:49 2002
From: anand at speech.sri.com (Anand Venkataraman)
Date: Wed, 2 Oct 2002 22:49:49 -0700 (PDT)
Subject: [Q] on mix-lm?
Message-ID: <200210030549.WAA01531@huge>

Dear Woosung,

>I am doing some experiments using interpolated LMs, and
>I've noticed that mixed LMs give slightly different
>PPLs from PPLs that should be. I mean, PPLs calculated
>by getting weighted sums after getting respective
>models' word probs. Do you have any documentations or
>explanations how that 'mix-lm' works in your toolkit or
>how it is different from the correct way?

There is no one "correct way". But I presume you mean by that the
unmixed estimation procedure. mix-lm simply does

	\sum_i \lambda_i P_i(w|h)

where each P_i(w|h) is the backed-off ngram word level probability of
word w given history h under model i. You can in fact calculate this
value by hand quite easily from the individual ngram -ppl outputs using
the above expression. However, there is a slight nuance involved. One
should generally use lambdas that were estimated to maximize the
likelihood of some held-out data in the domain. The awk script
compute-best-mix will do this for you.

You can also calculate a sentence level mixture similarly interpolated
with tuned weights (see compute-best-sentence-mix).
This uses sentence level probabilities (as for instance obtained from
ngram -debug 1 -ppl).

>experiments, mix-lm gives better results when the
>baseline model (before mixing) is good (PPL less than
>300), but it gives worse results when it is not good
>(PPL above 500).
>

Regardless of the quality of the LMs, the mixed likelihood on the held-out
set should always be at least as high as the likelihood of the most likely
component, because the EM procedure that computes the best weights
maximises this quantity. Of course the test set likelihood (and
consequently -PPL) may not necessarily be higher, but it usually is.

hope this helps.

&

From stolcke at speech.sri.com Thu Oct 3 08:25:58 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 03 Oct 2002 08:25:58 PDT
Subject: [Q] on mix-lm?
In-Reply-To: Your message of Thu, 03 Oct 2002 00:15:18 -0400. <20021003001518.430ac8d8.woosung@clsp.jhu.edu>
Message-ID: <200210031525.IAA20309@tonga>

Woosung,

I suspect that you are noticing the difference between "static" and
"dynamic" interpolation. The former is sometimes called N-gram "merging",
while the latter is the commonly used mixture of probabilities.

	ngram -bayes 0 -mix-lm

performs dynamic interpolation. Without the -bayes option you get
static interpolation. This is also explained in the man page:

	-mix-lm file
		Read a second N-gram model for interpolation purposes.
		The second and any additional interpolated models can
		also be class N-grams (using the same -classes
		definitions), but are otherwise constrained to be
		standard N-grams, i.e., the options -df, -tagged, -skip,
		and -hidden-vocab do not apply to them.
		NOTE: Unless -bayes (see below) is specified, -mix-lm
		triggers a static interpolation of the models in memory.
		In most cases a more efficient, dynamic interpolation
		is sufficient, requested by -bayes 0.
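
To make the "mixture of probabilities" concrete, here is a minimal sketch
of a word-level (dynamic) mixture and its perplexity. It is not SRILM code:
the per-word probabilities and the weight 0.5 are made-up numbers, and the
function names are hypothetical. In practice the probabilities would come
from each model's per-word output and the weight from tuning on held-out
data.

```python
import math

def single_perplexity(probs):
    """Perplexity of one model from its per-token probabilities."""
    return 10.0 ** (-sum(math.log10(p) for p in probs) / len(probs))

def mixture_perplexity(probs_a, probs_b, lam):
    """Perplexity of the word-level mixture lam * P_a + (1 - lam) * P_b.

    probs_a and probs_b hold the probability each model assigns to the
    same sequence of test tokens (hypothetical values, not SRILM output).
    """
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same tokens")
    log_sum = 0.0
    for pa, pb in zip(probs_a, probs_b):
        mixed = lam * pa + (1.0 - lam) * pb  # interpolate probabilities, not logs
        log_sum += math.log10(mixed)
    return 10.0 ** (-log_sum / len(probs_a))

# Made-up per-token probabilities from two models on the same four tokens.
probs_a = [0.10, 0.02, 0.30, 0.05]
probs_b = [0.01, 0.20, 0.25, 0.04]

ppl_a = single_perplexity(probs_a)
ppl_b = single_perplexity(probs_b)
ppl_mix = mixture_perplexity(probs_a, probs_b, 0.5)
print(ppl_a, ppl_b, ppl_mix)
```

Static interpolation instead builds one merged N-gram model in memory, so
its perplexity need not match this token-by-token computation exactly.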
There is some discussion of the two methods in the paper that just
appeared in ICSLP
(http://www.speech.sri.com/cgi-bin/run-distill?papers/icslp2002-srilm.ps.gz,
last paragraph of section 3.2).

--Andreas

In message <20021003001518.430ac8d8.woosung at clsp.jhu.edu>you wrote:
> Dear Dr. Stolcke,
>
> I am doing some experiments using interpolated LMs, and
> I've noticed that mixed LMs give slightly different PPLs
> from PPLs that should be. I mean, PPLs calculated by getting
> weighted sums after getting respective models' word probs.
> Do you have any documentations or explanations how that 'mix-lm'
> works in your toolkit or how it is different from the correct way?
> Of course, the best ways would be to look at the source code,
> but I am looking for an easier way.
> According to my experiments, mix-lm gives better results when
> the baseline model (before mixing) is good (PPL less than 300),
> but it gives worse results when it is not good (PPL above 500).
>
> Thanks in advance,
> --
> Woosung Kim

From mirjam.sepesy at uni-mb.si Tue Oct 8 05:34:37 2002
From: mirjam.sepesy at uni-mb.si (Mirjam Sepesy Maucec)
Date: Tue, 08 Oct 2002 14:34:37 +0200
Subject: class LM
Message-ID: <3DA2D0DD.AE6387DB@uni-mb.si>

Andreas!

Thank you for your answers.

A few more questions:

1.)
I understand the transitions like:

[2gram]POSITION = 2 FROM: <504,NULL> TO: <756 504,NULL> WORD = primeri
PROB = -1.76748 EXPANDPROB = 0.0106105

(504, 756 are classes),

but not the transitions like:

[OOV]POSITION = 2 FROM: <504,NULL> TO: <,NULL> WORD = primeri PROB =
-inf

What does [OOV] mean? These transitions are not present in the test
example of the toolkit.

2.) In which case is the history string cleaned (FROM: <504,NULL> TO:
<,NULL>) ?

3.) Is the vocabulary size in SRI-LM limited?
Thanks a lot,
Mirjam

From stolcke at speech.sri.com Tue Oct 8 08:52:48 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 08 Oct 2002 08:52:48 PDT
Subject: class LM
In-Reply-To: Your message of Tue, 08 Oct 2002 14:34:37 +0200. <3DA2D0DD.AE6387DB@uni-mb.si>
Message-ID: <200210081552.IAA10950@huge>

In message <3DA2D0DD.AE6387DB at uni-mb.si>you wrote:
> Andreas!
>
> Thank you for your answers.
>
> Few more questions:
>
> 1.)
> I understand the transitions like:
>
> [2gram]POSITION = 2 FROM: <504,NULL> TO: <756 504,NULL> WORD = primeri
> PROB = -1.76748 EXPANDPROB = 0.0106105
>
> (504, 756 are classs),
>
> but not the transitions like:
>
> [OOV]POSITION = 2 FROM: <504,NULL> TO: <,NULL> WORD = primeri PROB =
> -inf
>
> What does [OOV] mean? These transitions are not present in the test
> example of the toolkit.

[OOV] means a word was not found even in the unigrams of your model.
The ClassNgram code handles LMs that contain both word and class
ngrams. It therefore always tries to also find an N-gram probability
for each word (without class lookup), and if you don't include all class
member words in your vocabulary when building the LM you will get this
"OOV" condition. But it is harmless since presumably all your words
get some probability by virtue of being members of some class.

> 2.) In which case is the history string cleaned (FROM: <504,NULL> TO:
> <,NULL>) ?

When there are no histories in the LM that start with the given class
(504). The history is kept only as long as it needs to be to
compute subsequent N-gram probabilities (so as to minimize the state space).

>
> 3.) Is the vocabulary size in SRI-LM limited?

To the range of unsigned integers (2^32).

--Andreas

From jachym at kky.zcu.cz Fri Oct 11 04:47:17 2002
From: jachym at kky.zcu.cz (Jáchym Kolář)
Date: Fri, 11 Oct 2002 13:47:17 +0200
Subject: Problem with language-specific characters in segment
Message-ID: <000c01c2711b$f461a1d0$3f2fe493@ui.kky.fav.zcu.cz>

Hi to all!
I have the following problem with the segment tool. In the output of segment
the <unk> token appears instead of words that include language-specific
characters, although in the language model file they are saved correctly
and the input text file has the same coding (ISO-Latin 2) as the training
text.
Does anybody know what the problem is?

The language model was built using:
ngram-count -write-vocab vocabulary -text train2.txt -write probs -lm lmfile2

The segment tool was used with these options:
segment -lm lmfile2 -text test3.txt -unk -posteriors -continuous

When I disable the -unk option I get the right words in the output, but the
posteriors are probably not correct.

Jachym Kolar
Department of Cybernetics
University of West-Bohemia
Pilsen, Czech Republic

From stolcke at speech.sri.com Sun Oct 13 08:20:53 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 13 Oct 2002 08:20:53 -0700
Subject: Problem with language-specific characters in segment
References: <000c01c2711b$f461a1d0$3f2fe493@ui.kky.fav.zcu.cz>
Message-ID: <3DA98F55.3000604@speech.sri.com>

Hi,

sorry to hear about the problems. I think it has to do with the fact
that the locale is never set in segment.cc. Try putting

	setlocale(LC_CTYPE, "");
	setlocale(LC_COLLATE, "");

right at the beginning of main() in segment.cc.
(This applies to several other programs as well, and will be fixed in the
next release.)

BTW, the -unk option only makes sense if your LM was trained with
instances of <unk> (or the ngram-count -unk option). Otherwise unknown
words will get zero probability either way.

--Andreas

Jáchym Kolář wrote:

> Hi to all!
> I have a following problem with segment tool. In the output of segment
> appears <unk> token instead of words including
> language-specific characters - although in language model file they
> are saved correctly and input text file has the same coding (ISO-Latin
> 2) as the training text.
> Does anybody know what's the problem?
>
> Language model was built using:
> ngram-count -write-vocab vocabulary -text train2.txt -write probs -lm
> lmfile2
>
> Segment tool was used with option:
> segment -lm lmfile2 -text test3.txt -unk -posteriors -continuous
>
> Disabling -unk option I got right words in the output but posteriors
> are probably not correct.
>
> Jachym Kolar
> Department of Cybernetics
> University of West-Bohemia
> Pilsen, Czech Republic
>

From iris_jing_2000 at yahoo.com Fri Oct 25 10:47:07 2002
From: iris_jing_2000 at yahoo.com (Bing Jing)
Date: Fri, 25 Oct 2002 10:47:07 -0700 (PDT)
Subject: Q: probabilities calculation
In-Reply-To: <3DA98F55.3000604@speech.sri.com>
Message-ID: <20021025174707.57738.qmail@web12501.mail.yahoo.com>

Hello there,

Does anyone know how the SRI tool generates
unigram probabilities for the words that do NOT
occur in the training transcript but are covered
by the training dictionary? As I read
NgramLM.cc, I think all those words are
assigned a probability of LogP_Zero, but it
seems to me that this value varies across
different LMs.

I used two sets of quite small transcriptions to
train LMs, with the same training dictionary
(46K). The numbers of unique words in trans1 and trans2
are 620 and 700, respectively. And for those words
that are covered by the lexicon but not in the training
trans, the unigram probabilities are -5.337341 and
-5.383736, respectively. I still can't figure out how
these two numbers are generated.

Thanks in advance!

Bing

__________________________________________________
Do you Yahoo!?
Y!
Web Hosting - Let the expert host your web site
http://webhosting.yahoo.com/

From stolcke at speech.sri.com Sun Oct 27 10:15:16 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 27 Oct 2002 10:15:16 -0800
Subject: Q: probabilities calculation
References: <20021025174707.57738.qmail@web12501.mail.yahoo.com>
Message-ID: <3DBC2D34.4050306@speech.sri.com>

Bing,

words with zero unigram counts can still get a non-zero probability as a
result of probability smoothing. The discounting method applied to
unigrams will cause the total probability mass of the observed unigrams
to be less than one. SRILM then effectively implements a backing off to
a "zero-gram" (uniform) distribution. Since the DARPA format has no
provision for such a backoff this is done implicitly:

If there is at least one word with zero counts (sometimes called a
"zeroton") then the left-over unigram probability mass is distributed
evenly over all zeroton words.

If all words in the vocabulary had non-zero counts (i.e., no zerotons)
then the left-over probability is split evenly among all words and added
to the previously estimated unigram probabilities.

This is all implemented in Ngram::distributeProb(), which in turn is
invoked as part of the backoff weight normalization step.

So the short answer is that depending on the discounting method chosen
for unigrams, zerotons get some non-zero probability via backoff to a
uniform distribution. If you want to disable that you just need to
disable unigram discounting (-gt1max 0).

I hope this answers your question.

--Andreas

Bing Jing wrote:

>Hello there,
>
>Does anyone know how the SRI tool generate
>unigram probabilities for the words that NOT
>occur in the training transcript but covered
>by the training dictionary? As I read
>the NgramLM.cc, I think all those words are
>assigned a probability as LogP_Zero, but it
>seems to me that this value is various regarding
>different LMs.
>
>I used two sets of quite small transcription to
>train LMs, and use the same training dictionary (
>46K). The number of unique words in trans1 and trans2
>are 620 and 700, respectively. And for those words
>that covered by the lexicon but now in the training
>trans, the unigram probabilities are -5.337341 and
>-5.383736, respectively. I still can't figure out how
>these two numbers are generated.
>
>Thanks in advance!
>
>Bing
>
>
>
>
>__________________________________________________
>Do you Yahoo!?
>Y! Web Hosting - Let the expert host your web site
>http://webhosting.yahoo.com/
>
>

From stolcke at speech.sri.com Tue Nov 5 16:30:04 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 05 Nov 2002 16:30:04 PST
Subject: SRILM 1.3.2
In-Reply-To: Your message of Tue, 05 Nov 2002 14:00:13 -0500. <3DC8153D.E75FFED2@crim.ca>
Message-ID: <200211060030.QAA10980@huge>

In message <3DC8153D.E75FFED2 at crim.ca>you wrote:
>
> Hi,
>
> I did many tests to find the best suited language model for a given text
> with the "ngram" program with the -prune option and I maybe have
> discovered a bug with the OOV displayed in ngram.
>
> With a command like:
> jarjar jfbeaumo/mlf> ngram -order 3 -vocab vocab20k.txt -unk -lm
> transtalk10.arpa -ppl test.txt
> file test.txt: 635 sentences, 9448 words, 0 OOVs
> 0 zeroprobs, logprob= -17926.9 ppl= 59.9706 ppl1= 78.9647
>
> I am always ending with 0 OOV. The language model does contain the <unk>
> token. I supposed with a sufficient large value for -prune I will begin
> to get OOV word but it is fixed on 0. If I specified an empty vocabulary
> file, again, there is 0 OOV and I suppose this isn't correct. Maybe
> ngram is taking its vocabulary from the LM but then, there will be no
> use for the switch -vocab.
>
> Can you help me? Did I miss something?
>
> Best regards,
>
> JF
> --
> Jean-François Beaumont - Agent de recherche (jfbeaumont at crim.ca)
> CRIM - 550, rue Sherbrooke Ouest Bureau 100 (www.crim.ca)
> Montréal (Québec) H3A 1B9 Tél.: 514.840-1235 #3625

Dear JF,

it is actually a feature (not a bug) that ngram -unk counts OOVs as
regular words. They would only be counted as OOVs in the ppl
output if the LM did not contain the <unk> token, or if it had probability 0.
Of course whether this is what you expect is debatable.

You can get the OOV count you want by grepping the ngram -debug 2 -ppl
output for "p( <unk> | ".

--Andreas

From geetu at clsp.jhu.edu Tue Nov 12 09:29:06 2002
From: geetu at clsp.jhu.edu (Geetu Ambwani)
Date: Tue, 12 Nov 2002 12:29:06 -0500 (EST)
Subject: Class Language Modelling
Message-ID:

Hi,
I am trying to use the SRILM toolkit to calculate perplexity results for
the following language model - a regular trigram model
interpolated with the class model P(w0/CW0,CW1,CW2) * P(Cw0/CW1,CW2) where
CW0,CW1 & CW2 are the equivalence classes for the predicted word and the 2
preceding words respectively. I generated the equivalence classifications
for the words myself and I want to know if it is possible to use the
toolkit to do the perplexity measurements if I input the class files as
data files. Can this be done at all? If any of you know how to do this,
please reply pointing out the relevant sections of the manual I should
look up for this.
Thanks a ton,
Geetu

From yangl at ecn.purdue.edu Tue Nov 12 10:57:37 2002
From: yangl at ecn.purdue.edu (Yang Liu)
Date: Tue, 12 Nov 2002 13:57:37 -0500 (EST)
Subject: Class Language Modelling
In-Reply-To:
Message-ID:

Hi Geetu,

If your own class definition is already in the format of SRILM's
classes-format, then you can easily get the PP using the mixed LMs (word
based and class based) from 'ngram'. Check the manual of ngram for
details.

I'm not sure if I understand your question correctly. If this does not
help, then please wait for the answers from Andreas.

Regards.

Yang

On Tue, 12 Nov 2002, Geetu Ambwani wrote:

>
> Hi,
> I am trying to use the SRILM toolkit to calculate perplexity results for
> the following language model - a regular trigram model
> interpolated with the class model P(w0/CW0,CW1,CW2) * P(Cw0/CW1,CW2) where
> CW0,CW1 & CW2 are the equivalence classes for the predicted word and 2 the
> preceding words respectively. I generated the equivalence classifications
> for the words by myself and i want to know if it is possible to use the
> toolkit to do the perplexity measurements if i input the class files as
> data files. Can this be done at all? If any of you know how to do this,
> please reply pointing out the relevant sections of the manual i should
> look up for this.
> Thanks a ton,
> Geetu
>
>
>

From geetu at clsp.jhu.edu Tue Nov 19 08:18:45 2002
From: geetu at clsp.jhu.edu (Geetu Ambwani)
Date: Tue, 19 Nov 2002 11:18:45 -0500 (EST)
Subject: Class Language Modelling
Message-ID:

Suppose I wish to build a language model P(w0/CW0,CW1,CW2) where CW0, CW1
& CW2 are the equivalence classes for the predicted word and the 2
preceding words respectively, and I wish to use absolute discounting with a
fixed D. The input files I have available are (1) a trigram count file
(format - w0 w1 w2 count) (2) a vocab file (3) 3 class files (in format
classno word1 word2 ....) for the w0, w1 & w2 positions.
Can someone please tell me the syntax of the ngram-count command needed to
build an LM using this sort of class LM, as I am not very sure I
understand it clearly.
Thanks,
Geetu

From stolcke at speech.sri.com Tue Nov 19 20:20:01 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 19 Nov 2002 20:20:01 PST
Subject: Class Language Modelling
In-Reply-To: Your message of Tue, 19 Nov 2002 11:18:45 -0500.
Message-ID: <200211200420.UAA24301@huge>

In message you wrote:
>
> Suppose i wish to build a language model P(w0/CW0,CW1,CW2) where CW0, CW1
> & CW2 are the equivalence classes for the predicted word and the 2
> preceding words respectively amd i wish to use absolute discounting with a
> fixed D. The input files i have available are (1) a trigram count file
> (format - w0 w1 w2 count) (2) a vocab file (3) 3 class files in format
> classno word1 word2 ....) for w0, w1 & w2 positions .
> Can someone please tell me the syntax of the ngram-count command needed to
> build a LM using this sort of a class LM as i am not very sure i
> understand it clearly.
> Thanks,
> Geetu

Geetu,

SRILM does not currently support class LMs with separate class membership
functions for the different positions in an N-gram. All word positions
must share the same class definitions.

Under these constraints, we typically train a class LM as follows:

1. Prepare a class definition file in the format described in the
   classes-format(5) manual page. This can be done by hand or from
   other knowledge sources, or automatically using word clustering
   algorithms (see ngram-class(1)).
   It is a bad idea to use plain numbers as class names. When in doubt
   use names like CLASS1, CLASS2, etc. This avoids confusion in places
   where a token in a file can be either a class name, word, or integer
   count.

2. Condition the training data or counts to replace words with
   class labels, using the "replace-words-with-classes" filter
   (see the training-scripts(1) man page).

3. Run ngram-count on the result of step 2.

Although multiple class definitions for different word positions are
not supported by the above training procedure, or the LM evaluation code,
there is a fairly straightforward way to fake it. I'm assuming now that
classes expand to exactly one word at a time, and that a word has a unique
class in a given ngram position.

You need to write a filter that maps word ngram counts to class ngram
counts (w1 w2 w3 N -> c1 c2 c3 N, and similarly for unigrams and bigrams).
Then you can train and evaluate your class LM by operating on counts rather
than text. To train:

	ngram-count -text DATA -write - | word-to-class-filter | \
	ngram-count -read - -lm LM [smoothing-options]

Similarly, you can map the test data to counts, filter them, and use the
ngram -counts option to compute perplexities and log probabilities
from counts.

There is one detail in LM estimation: you need to prevent class labels
that can only occur in the history portion of an ngram from receiving
backoff probability mass as a result of smoothing. You can accomplish that
by listing those not-to-be-predicted classes in a file, and specifying them
with the ngram-count -nonevents option. See the man page for details.

You also need to keep track of the probabilities incurred for replacing
a word by its class for each word in the test set (the filter script
could do that as a side effect), and add the log probability for
class expansions to the log probability for class ngrams.

hope this helps,

--Andreas

From David.Mas at limsi.fr Wed Nov 27 06:45:17 2002
From: David.Mas at limsi.fr (David Mas)
Date: Wed, 27 Nov 2002 15:45:17 +0100
Subject: Memory issues
Message-ID: <3DE4DA7D.9662ED5E@limsi.fr>

Hi,

I'm a French PhD student, using the toolkit to compute ngram and
class-ngram models on Hub4 and Hub5 data.

I recently tried to mix several models with ngram -mix-lm, which works
fine except for big models (trained on Hub4).

It seems to be a matter of memory. So I used the -memuse option to get an
idea of the memory load.

But this option doesn't reflect the actual memory load. It reports
900M, while top running on the same machine shows 2.5G in use.

So my 2 questions are:
- is it normal that the -memuse option gives a wrong result ?
- is it normal that the toolkit uses so much memory, or have I done
something wrong in the installation ?

Any help is welcome.

David Mas

--
David Mas
LIMSI/CNRS, groupe TLP
Tel : 01 69 85 80 05
http://www.limsi.fr/Individu/mas/

From stolcke at speech.sri.com Wed Nov 27 10:25:49 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 27 Nov 2002 10:25:49 PST
Subject: Memory issues
In-Reply-To: Your message of Wed, 27 Nov 2002 15:45:17 +0100. <3DE4DA7D.9662ED5E@limsi.fr>
Message-ID: <200211271825.KAA29413@huge>

In message <3DE4DA7D.9662ED5E at limsi.fr>you wrote:
> Hi,
>
> I'm a french PhD Student, using the toolkit to compute ngram and
> class-ngram models on Hub4 and Hub5 data.
>
> I recently tried to mix several models with ngram -mix-lm, which works
> fine except for big models (learned on Hub4).
>
> It seems to be matter of memory. So I used the -memuse option to have an
> idea of the memory load.
>
> But this option doesn't reflect the actual load of the memory. It says
> 900M when a top running of the same machine gives a amount a 2,5G used.

That's because -memuse only calculates the memory used by the final model.
For static interpolation with -mix-lm the program needs to temporarily
allocate both the input models and the resulting mixture model, so
2.5 GB doesn't sound too outlandish.
(I know one could implement this operation without requiring all models
to be fully in memory, but I preferred to keep the code simple.)

> So my 2 questions are :
> - is it normal that the -memuse option gives a wrong result ?

See above.

> - is it normal that the toolkit use so much memory, or have I done
> something wrong in the installation ?

The default build optimizes data structures for speed, not space.
That's why you see a significant portion of memory "wasted" (according
to the -memuse output). That's the extra space needed to keep hash tables
sparse.
As of SRILM version 1.3.2, you can build a separate version of the binaries
optimized for space, and that's usually worth it once you start dealing
with Hub4 ;-) Follow the instructions under item 9 in the INSTALL file.

--Andreas

From valsan at sony.de Tue Dec 3 06:13:11 2002
From: valsan at sony.de (Valsan, Zica)
Date: Tue, 3 Dec 2002 15:13:11 +0100
Subject: perplexity evaluation
Message-ID:

Hi all,

I'm a new user of the toolkit and I need a little support in order to
understand how the perplexity is computed and why it is different from the
expected value.

For instance, I have the training data in the file train.text, which
contains only one line:
a b c
and the vocabulary (train.vocab) that contains all these words, and I want
to generate an LM based on unigrams only and to evaluate it on the same
training data. I don't want any discounting strategy to be applied.
Here are the commands I used:

ngram-count -order 1 -vocab train.vocab -text train.text -lm lm.arpa -gt1max 0
ngram -lm out.arpa -debug 2 -vocab train.vocab -ppl train.text > out.ppl

So, according to the theory, the expected value for perplexity is PP=3 if
the context cues are not taken into account. This is also what one gets
using the CMU toolkit.
Using this toolkit and the above commands, what I actually get is PP=4.
Looking inside the created ARPA model, I could see that </s> has the
same probability as any of the real words (a, b, c).
Could anybody explain to me why it is like this? Did I make a mistake or
is there something I am missing?

Thank you in advance for your support,
Zica

From stolcke at speech.sri.com Tue Dec 3 08:48:13 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 03 Dec 2002 08:48:13 PST
Subject: perplexity evaluation
In-Reply-To: Your message of Tue, 03 Dec 2002 15:13:11 +0100.
Message-ID: <200212031648.IAA22088@huge>

In message you wrote:
> Hi all,
>
> I'm a new user of the toolkit and I need a little bit support in order to
> understand how the perplexity is computed and why it is different from the
> expected value.
>
> For instance, I have the training data in the file train.text that contain
> only a line:
> a b c
> and the vocabulary (train.vocab) that contains all these words, and I want
> to generate a LM based on unigram only and to evaluate it on the same
> training data. I don't want any discounting strategy to be applied.
> Here are the commands I used:
>
> ngram-count -order 1 -vocab train.vocab -text train.text -lm lm.arpa gt1max
> 0
> ngram -lm out.arpa -debug 2 -vocab train.vocab -ppl train.text > out.ppl
>
>
> So, according to the theory, the expected value for perplexity is PP=3 if
> the context cues are not taken into account. This is also what one can get
> using CMU toolkit.
> Using this toolkit and the above commands what I've got actually, is PP=4.
> Looking inside of the created arpa model , I could see that </s> has the
> same probability as any of the real word (a, b,c).
> Does anybody could explain me why is like this? Did I make a mistake or is
> something that miss me?

You didn't make a mistake and this is the right answer as far as I can
tell. </s> needs to get a probability in order to be able to compute a
probability for the whole "sentence". Are you saying that the CMU software
doesn't give any probability to </s>? That would be quite odd.
Maybe someone on this list who is more familiar with the CMU toolkit
can contribute an explanation.

--Andreas

From valsan at sony.de Wed Dec 4 00:21:31 2002
From: valsan at sony.de (Valsan, Zica)
Date: Wed, 4 Dec 2002 09:21:31 +0100
Subject: perplexity evaluation
Message-ID:

Thank you for your prompt answer.
I have understood that </s> is taken into account, but the question is why
only it and not the other one, too?
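
The arithmetic behind PP=4 versus PP=3 for this toy example can be checked
with a minimal sketch. It assumes that perplexity is normalized by the
number of words plus sentence ends, that the end-of-sentence token is
predicted like a word, and that the start-of-sentence token is only ever
conditioned on and never predicted. The function name and the ppl/ppl1
labels are illustrative, not toolkit code.

```python
import math

def perplexities(log10_probs, num_words, num_sentences):
    """Return (ppl, ppl1).

    ppl normalizes the total log probability by words + sentence ends,
    ppl1 by words only (an assumed convention for this illustration).
    """
    total = sum(log10_probs)
    ppl = 10.0 ** (-total / (num_words + num_sentences))
    ppl1 = 10.0 ** (-total / num_words)
    return ppl, ppl1

# Unigram model estimated from the single sentence "a b c" with no discounting.
# Four events are predicted: a, b, c and the end-of-sentence token,
# so each gets probability 1/4.
with_eos = [math.log10(1 / 4)] * 4
ppl, ppl1 = perplexities(with_eos, num_words=3, num_sentences=1)

# If the end-of-sentence token were ignored, a, b and c would each get 1/3.
without_eos = [math.log10(1 / 3)] * 3
ppl_no_eos = 10.0 ** (-sum(without_eos) / 3)

print(ppl, ppl1, ppl_no_eos)
```

Under these assumptions the first figure comes out at 4 and the last at 3;
the start-of-sentence token contributes nothing because no probability is
ever computed for it.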
I read papers where people resort to this strategy (choosing only one) but
the reason for doing so is not clear to me.
Regarding the CMU toolkit, I did not say it doesn't output any probabilities
for these context cues, but it outputs the same small value for each of
them (-98.999, very close to the values output by the SRILM toolkit). This is
somehow "equivalent" to saying they are not taken into account for
perplexity computation, I think.

Regards,
Zica

-----Original Message-----
From: Andreas Stolcke [mailto:stolcke at speech.sri.com]
Sent: Tuesday, 3 December 2002 17:48
To: Valsan, Zica
Cc: 'srilm-user at speech.sri.com'
Subject: Re: perplexity evaluation

In message you wrote:
> Hi all,
>
> I'm a new user of the toolkit and I need a little bit support in order to
> understand how the perplexity is computed and why it is different from the
> expected value.
>
> For instance, I have the training data in the file train.text that contain
> only a line:
> a b c
> and the vocabulary (train.vocab) that contains all these words, and I want
> to generate a LM based on unigram only and to evaluate it on the same
> training data. I don't want any discounting strategy to be applied.
> Here are the commands I used:
>
> ngram-count -order 1 -vocab train.vocab -text train.text -lm lm.arpa gt1max
> 0
> ngram -lm out.arpa -debug 2 -vocab train.vocab -ppl train.text > out.ppl
>
>
> So, according to the theory, the expected value for perplexity is PP=3 if
> the context cues are not taken into account. This is also what one can get
> using CMU toolkit.
> Using this toolkit and the above commands what I've got actually, is PP=4.
> Looking inside of the created arpa model , I could see that </s> has the
> same probability as any of the real word (a, b,c).
> Does anybody could explain me why is like this? Did I make a mistake or is
> something that miss me?

You didn't make a mistake and this is the right answer as far as I can tell.
</s> needs to get a probability in order to be able to compute a
probability for the whole "sentence". Are you saying that the CMU software
doesn't give any probability to </s>? That would be quite odd.
Maybe someone on this list who is more familiar with the CMU toolkit
can contribute an explanation.

--Andreas

From melis at cs.utwente.nl Tue Dec 17 05:38:52 2002
From: melis at cs.utwente.nl (Paul Melis)
Date: Tue, 17 Dec 2002 14:38:52 +0100
Subject: Unexpected "ngram-count -recompute" result
Message-ID: <20021217143852.A7495@luistervink.cs.utwente.nl>

Hello,

We just noticed the following when using the -recompute flag of
ngram-count. We're just trying to generate uni- and bigram counts from
trigram counts but some are missing:

[1 - directly summing uni-, bi- and trigram counts of a simple text file]

melis at luistervink:/local/export/melis/lm> cat t
this is a test
melis at luistervink:/local/export/melis/lm> ngram-count -text t -sort
</s> 1
<s> 1
<s> this 1
<s> this is 1
a 1
a test 1
a test </s> 1
is 1
is a 1
is a test 1
test 1
test </s> 1
this 1
this is 1
this is a 1

[2 - only summing trigram counts]

melis at luistervink:/local/export/melis/lm> ngram-count -text t -write-order 3 -sort
<s> this is 1
a test </s> 1
is a test 1
this is a 1

[3 - using the previous trigram counts to generate uni- and bigram counts]

melis at luistervink:/local/export/melis/lm> ngram-count -text t -write-order 3 -sort | ngram-count -recompute -sort -read -
<s> 1
<s> this 1
<s> this is 1
a 1
a test 1
a test </s> 1
is 1
is a 1
is a test 1
this 1
this is 1
this is a 1

We expected the output of 1 and 3 to be the same, but notice the missing
unigrams "</s>" and "test". Also, the bigram "test </s>" is missing.

Is this a bug, or is there something we're missing here? It seems to be
related to the end of sentence symbol.

This is with SRILM 1.3.2, BTW.
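
One reading that would be consistent with listing [3] is that lower-order
counts are regenerated only from the prefixes of the highest-order N-grams.
The following sketch illustrates that assumption; it is not taken from the
ngram-count source, and the function name is made up for this example.

```python
from collections import Counter

def recompute_lower_orders(highest_order_counts):
    """Derive lower-order counts by summing highest-order counts per prefix.

    Only proper prefixes of each N-gram are credited, so anything that
    occurs solely at the end of an N-gram never gets a lower-order count.
    """
    counts = Counter()
    for ngram, count in highest_order_counts.items():
        for length in range(1, len(ngram)):
            counts[ngram[:length]] += count
    return counts

# Trigram counts for the one-sentence file "this is a test".
trigrams = {
    ("<s>", "this", "is"): 1,
    ("a", "test", "</s>"): 1,
    ("is", "a", "test"): 1,
    ("this", "is", "a"): 1,
}

lower = recompute_lower_orders(trigrams)
for ngram in sorted(lower):
    print(" ".join(ngram), lower[ngram])
```

Under this assumption "test", "</s>" and "test </s>" receive no counts,
because none of them is a prefix of any trigram in the file.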

Regards,
Paul
--
melis at cs.utwente.nl

From mirjam.sepesy at uni-mb.si Wed Dec 18 04:58:41 2002
From: mirjam.sepesy at uni-mb.si (Mirjam Sepesy Maucec)
Date: Wed, 18 Dec 2002 13:58:41 +0100
Subject: missing counts
Message-ID: <3E007101.7A413D18@uni-mb.si>

Hi,

I have the following problem.

The n-gram counts are computed from a raw text corpus by using
'ngram-count' and 'ngram-merge'.
I experiment with different vocabularies and bigram and trigram models.
In each experiment I run 'ngram-count -vocab -order' again and make the
language model with 'make-big-lm -trust-totals'.
I tested the language models on my test set and noticed some mistakes. Some
bigrams which are present in the bigram model get lost in the trigram
model. When I omit the -trust-totals option, the results are OK.
Why should I not trust the totals in my case? Are the counts of
different orders made by 'ngram-count' and 'ngram-merge' not in line?

Regards,

Mirjam.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mirjam.sepesy.vcf
Type: text/x-vcard
Size: 302 bytes
Desc: Card for Mirjam Sepesy Maucec
URL:

From stolcke at speech.sri.com Wed Dec 18 22:21:20 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 18 Dec 2002 22:21:20 PST
Subject: missing counts
In-Reply-To: Your message of Wed, 18 Dec 2002 13:58:41 +0100. <3E007101.7A413D18@uni-mb.si>
Message-ID: <200212190621.WAA01439@huge>

In message <3E007101.7A413D18 at uni-mb.si>you wrote:
>
> Hi,
>
> I have the following problem.
>
> The n-gram counts are computed from raw text corpus by using
> 'ngram-count' and 'ngram-merge'.
> I experiment with different vocabularies and bigram and trigram models.
> In each experiment I run again 'ngram-count -vocab -order' and make the
> language model with ' make-big-lm -trust-totals'.
> I test language models on my test set and noticed some mistakes. Some
> bigrams, which are present in the bigram model get lost in the trigram
> model.
> When I omit the -trust-totals option, the results are OK.
> Why should I not trust the totals in my case? Are the counts of
> different orders made by 'ngram-count' and 'ngram-merge' not in line?
>
> Regards,
>
> Mirjam.

This is indeed a little strange. However, the -trust-totals option
is obsolete, as it does not interact well with some discounting
methods (e.g., KN). It was always a hack, and the latest version of
make-big-lm uses a different strategy for saving memory on ngrams discarded by
cutoffs (the ngram-count -meta-tag and -read-with-mincounts options,
see the man page).

Still, if you can reduce your problem to a small test case I could look
at it to understand exactly what's going on.

--Andreas

From mirjam.sepesy at uni-mb.si Fri Dec 20 00:12:33 2002
From: mirjam.sepesy at uni-mb.si (Mirjam Sepesy Maucec)
Date: Fri, 20 Dec 2002 09:12:33 +0100
Subject: missing counts
References: <200212190621.WAA01439@huge>
Message-ID: <3E02D0F1.FF6EBC23@uni-mb.si>

Andreas Stolcke wrote:

> In message <3E007101.7A413D18 at uni-mb.si>you wrote:
> >
> > Hi,
> >
> > I have the following problem.
> >
> > The n-gram counts are computed from raw text corpus by using
> > 'ngram-count' and 'ngram-merge'.
> > I experiment with different vocabularies and bigram and trigram models.
> > In each experiment I run again 'ngram-count -vocab -order' and make the
> > language model with ' make-big-lm -trust-totals'.
> > I test language models on my test set and noticed some mistakes. Some
> > bigrams, which are present in the bigram model get lost in the trigram
> > model. When I omit the -trust-totals option, the results are OK.
> > Why should I not trust the totals in my case? Are the counts of
> > different orders made by 'ngram-count' and 'ngram-merge' not in line?
> >
> > Regards,
> >
> > Mirjam.
>
> This is indeed a little strange. However, the -trust-totals option
> is obsolete, as it does not interact well with some discounting
> methods (e.g., KN).
> It was always a hack, and the latest version of
> make-big-lm uses a different strategy for saving memory on ngrams discarded by
> cutoffs (the ngram-count -meta-tag and -read-with-mincounts options,
> see the man page).
>
> Still, if you can reduce your problem to a small test case I could look
> at it to understand exactly what's going on.
>
> --Andreas

Thank you for answering so quickly.
You are right. I used KN discounting. I see, it's time to switch from
version 1.3.1 to 1.3.2.
I will report the results.

Have nice holidays!

Mirjam

From stolcke at speech.sri.com Fri Dec 20 00:54:21 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 20 Dec 2002 00:54:21 PST
Subject: missing counts
In-Reply-To: Your message of Fri, 20 Dec 2002 09:12:33 +0100. <3E02D0F1.FF6EBC23@uni-mb.si>
Message-ID: <200212200854.AAA21640@huge>

--Andreas

In message <3E02D0F1.FF6EBC23 at uni-mb.si>you wrote:
> Andreas Stolcke wrote:
>
> > In message <3E007101.7A413D18 at uni-mb.si>you wrote:
> > >
> > > Hi,
> > >
> > > I have the following problem.
> > >
> > > The n-gram counts are computed from raw text corpus by using
> > > 'ngram-count' and 'ngram-merge'.
> > > I experiment with different vocabularies and bigram and trigram models.
> > > In each experiment I run again 'ngram-count -vocab -order' and make the
> > > language model with ' make-big-lm -trust-totals'.
> > > I test language models on my test set and noticed some mistakes. Some
> > > bigrams, which are present in the bigram model get lost in the trigram
> > > model. When I omit the -trust-totals option, the results are OK.
> > > Why should I not trust the totals in my case?
> > > Are the counts of
> > > different orders made by 'ngram-count' and 'ngram-merge' not in line?
> > >
> > > Regards,
> > >
> > > Mirjam.
> >
> > This is indeed a little strange. However, the -trust-totals option
> > is obsolete, as it does not interact well with some discounting
> > methods (e.g., KN). It was always a hack, and the latest version of
> > make-big-lm uses a different strategy for saving memory on ngrams discarded by
> > cutoffs (the ngram-count -meta-tag and -read-with-mincounts options,
> > see the man page).
> >
> > Still, if you can reduce your problem to a small test case I could look
> > at it to understand exactly what's going on.
> >
> > --Andreas
>
> Thank you for answering so quick.
> You are right. I used KN discounting. I see, it's time to switch from the
> version 1.3.1 to 1.3.2.
> I will report the results.

And of course KN discounting modifies the lower-order counts, so at a given
cutoff > 1 you might lose ngrams because, after the KN method is applied, the
counts fall below the cutoff. This is consistent with your observation that a
bigram is not in the trigram model while it is in the bigram model.

--Andreas

From stolcke at speech.sri.com Fri Dec 20 02:03:18 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 20 Dec 2002 02:03:18 PST
Subject: Unexpected "ngram-count -recompute" result
In-Reply-To: Your message of Tue, 17 Dec 2002 14:38:52 +0100. <20021217143852.A7495@luistervink.cs.utwente.nl>
Message-ID: <200212201003.CAA23504@huge>

In message <20021217143852.A7495 at luistervink.cs.utwente.nl>you wrote:
> Hello,
>
> We just noticed the following when using the -recompute flag of ngram-count.
> We're just trying to generate uni- and bigram counts from trigram counts but
> some are missing:
>
> [1 - directly summing uni-, bi- and trigram counts of a simple text file]
>
> melis at luistervink:/local/export/melis/lm> cat t
> this is a test
>
> melis at luistervink:/local/export/melis/lm> ngram-count -text t -sort
> </s> 1
> <s> 1
> <s> this 1
> <s> this is 1
> a 1
> a test 1
> a test </s> 1
> is 1
> is a 1
> is a test 1
> test 1
> test </s> 1
> this 1
> this is 1
> this is a 1
>
> [2 - only summing trigram counts]
>
> melis at luistervink:/local/export/melis/lm> ngram-count -text t -write-order 3 -sort
> <s> this is 1
> a test </s> 1
> is a test 1
> this is a 1
>
> [3 - using the previous trigram counts to generate uni- and bigram counts]
>
> melis at luistervink:/local/export/melis/lm> ngram-count -text t -write-order 3 -sort | ngram-count -recompute -sort -read -
> <s> 1
> <s> this 1
> <s> this is 1
> a 1
> a test 1
> a test </s> 1
> is 1
> is a 1
> is a test 1
> this 1
> this is 1
> this is a 1
>
> We expected the output of 1 and 3 to be the same, but notice the missing
> unigrams "</s>" and "test". Also, the bigram "test </s>" is missing.
> Is this a bug, or is there something we're missing here? It seems to be
> related to the end of sentence symbol.
> This is with SRILM 1.3.2, BTW.
>
> Regards,
> Paul
>

It's a bug of sorts, or a feature depending on your point of view.
Because </s> is not followed by anything, discarding unigrams and bigrams
ending in </s> will in fact discard information that is not contained in the
trigrams.

I'm not sure why you are doing what you describe, but a quick solution
would be to introduce "dummy" N-grams that complete the ngrams ending in
</s> to the full length of the counts you want to keep. The little script
below does that. If you call it "complete-eos-ngrams" then

    ngram-count -text t -write - | \
    complete-eos-ngrams | \
    ngram-count -read - -write-order 3 | \
    ngram-count -recompute -sort -read -

will produce the output you expect.
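Alternatively you could tack dummy words onto the end of your input sentences.
In either case you have to delete the dummy ngrams from the final output.

--Andreas

#!/usr/local/bin/gawk -f
BEGIN {
    order = 3;    # highest N-gram order to pad up to
}
{
    print;        # pass every count line through unchanged
}
# The last field is the count, so $(NF - 1) is the final word of the N-gram.
$(NF - 1) == "</s>" {
    count = $NF;
    # Overwrite the count field, then append, DUMMY words up to the full order.
    for (i = NF; i <= order; i ++) {
        $i = "DUMMY";
        print $0, count;
    }
}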
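
The loss of counts in step 3 of Paul's message, and the effect of the dummy padding, can be illustrated outside SRILM. This is a minimal Python sketch, assuming that -recompute rebuilds each lower-order count by summing the highest-order counts that share its prefix. That is an assumption consistent with the outputs quoted in this thread, not a reading of the SRILM source, and all function names here are hypothetical.

```python
from collections import Counter

def count_ngrams(words, order=3):
    """Count every n-gram of length 1..order in one sentence, adding <s> and </s>."""
    tokens = ["<s>"] + words + ["</s>"]
    counts = Counter()
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def recompute_from_highest(counts, order=3):
    """Keep only the highest-order counts; rebuild the rest by summing over prefixes."""
    rebuilt = Counter({ng: c for ng, c in counts.items() if len(ng) == order})
    for n in range(order, 1, -1):
        for ng, c in list(rebuilt.items()):
            if len(ng) == n:
                rebuilt[ng[:-1]] += c
    return rebuilt

def complete_eos_ngrams(counts, order=3, dummy="DUMMY"):
    """Pad n-grams ending in </s> with dummy words up to the full order."""
    padded = Counter(counts)
    for ng, c in counts.items():
        if ng[-1] == "</s>":
            extended = ng
            while len(extended) < order:
                extended += (dummy,)
                padded[extended] += c
    return padded

def drop_dummies(counts, dummy="DUMMY"):
    """Remove every n-gram that contains the dummy word."""
    return Counter({ng: c for ng, c in counts.items() if dummy not in ng})

direct = count_ngrams("this is a test".split())
naive = recompute_from_highest(direct)
missing = sorted(set(direct) - set(naive))
fixed = drop_dummies(recompute_from_highest(complete_eos_ngrams(direct)))

print(missing)          # n-grams lost when only trigrams are kept
print(fixed == direct)  # whether padding recovers the direct counts
```

Under this assumption the n-grams lost without padding are exactly "</s>", "test" and "test </s>", the three that Paul reported, because none of them is a prefix of any trigram. With the padding step they reappear as prefixes of the dummy trigrams and survive once the dummy entries are filtered out.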