From stolcke at speech.sri.com Thu Apr 3 22:05:17 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 03 Apr 2008 22:05:17 -0700
Subject: ngram-class with -incremental + -save-maxclasses
In-Reply-To: <47ED40FD.7020408@cs.brown.edu>
References: <200705300356.l4U3u3R26372@huge> <46B7608D.6080002@speech.sri.com> <47ED40FD.7020408@cs.brown.edu>
Message-ID: <47F5B70D.50202@speech.sri.com>

Matt Lease wrote:
> What is the behavior of -save-maxclasses for ngram-class when
> -incremental is used? My understanding of -incremental is that C as
> specified by -numclasses determines the number of classes for the
> entire run-time (i.e. C+1 for the new word being merged into the
> existing C classes), in which case -save-maxclasses would seem not to
> add anything (ie perhaps it's only intended for V^3 clustering).

Incremental merging works differently. It first makes one class per word
(typically giving a number >> C), then merges the classes starting at C+1
into the first C until only C classes are left. So the -save-maxclasses
option has the intended effect.

>
> If one wanted to get different clusterings with the greedy algorithm
> without re-running each from scratch, it looks like you can use the
> -class-counts option and then feed this counts file into a subsequent
> invocation of ngram-class. For example, run it initially with C=1000,
> then feed the output class counts into a second invocation with C=500,
> say. Is this the correct procedure?

It will work in principle, except that the second run will have no access
to the original word vocabulary, so the class definitions it produces
will be in terms of the class vocabulary produced by the first run.
Also (I haven't checked this), there might be name collisions since the
"words" and "classes" use the same names.

What is really needed (but not implemented so far) is a mechanism for
reading the saved classes and counts from a prior run into ngram-class
and continue the merging from there.
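
The control flow described above can be illustrated with a toy sketch. This is not SRILM's code: the merge criterion used here (merge the two smallest classes by count) is only a stand-in for whatever criterion ngram-class really applies, and the function and parameter names are hypothetical. It only shows that the total number of classes starts at the vocabulary size and drops by one per merge, so a save point between V and C is actually reached.

```python
def incremental_merge(word_counts, num_classes, save_maxclasses=None):
    """Toy control flow only; returns (final_classes, snapshot_or_None)."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    # One singleton class per word, most frequent first (V classes in total).
    ordered = sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    head = [([w], c) for w, c in ordered[:num_classes]]      # the first C classes
    pending = [([w], c) for w, c in ordered[num_classes:]]   # classes C+1, C+2, ...
    snapshot = None
    while pending:
        head.append(pending.pop(0))  # class C+1 joins the first C
        # Stand-in criterion (an assumption): merge the two smallest classes.
        head.sort(key=lambda cls: cls[1], reverse=True)
        words_a, count_a = head.pop()
        words_b, count_b = head.pop()
        head.append((words_a + words_b, count_a + count_b))
        # Total classes = merged head + singletons not yet visited.
        total = len(head) + len(pending)
        if total == save_maxclasses:
            snapshot = [(list(ws), c) for ws, c in head + pending]
    return head, snapshot


counts = {"the": 10, "of": 8, "cat": 3, "dog": 2, "emu": 1}
final, saved = incremental_merge(counts, num_classes=2, save_maxclasses=3)
print(len(final), len(saved))
```

With five words and C=2, the total number of classes goes 5, 4, 3, 2, so a snapshot requested at 3 classes is taken before the run ends.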
Andreas

From stolcke at speech.sri.com Thu Apr 3 22:39:46 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 03 Apr 2008 22:39:46 PDT
Subject: lattice-tool
In-Reply-To: Your message of Wed, 26 Mar 2008 07:04:16 -0700. <790936.73554.qm@web36808.mail.mud.yahoo.com>
Message-ID: <200804040539.m345dkO25188@huge>

Martha,

I looked at the sample data you sent and believe I found your problem.
In your lattice, you have two words that are not in the LM vocabulary:

    !ENTER
    !EXIT

I assume these are non-standard begin/end sentence tokens, and should be
ignored by the LM. A convenient way to achieve this is as follows:

Make a file "ignore.vocab" containing the two words. Then add the option

    lattice-tool -ignore-vocab ignore.vocab ...

to your lattice expansion command.

Alternative approaches would be to remove these tokens from the lattices
somehow, or to include them in the LM.

--Andreas

In message <790936.73554.qm at web36808.mail.mud.yahoo.com> you wrote:
> Dear Andreas Stolcke,
>
> We are trying to rescore lattices in HTK format using
> lattice-tool. The rescoring seems OK. But, when we
> decode using the option -viterbi-decode it gives us
> the following output:
>
> tesfeature/d502026.mfc
>
> We are using srilm version 1.4.6 which addresses the
> problem of htk quoting. We did the rescoring and the
> decoding in two separate steps. The commands we used
> are:
>
> lattice-tool -read-htk -write-htk -in-lattice-list
> dev20klattice/Alllattice.lst -out-lattice-dir
> dev20klattice/Rescoredlattice2g/ -order 2 -lm
> LanguageModels_randselsent/LM_type_N5_KN_INT -no-nulls
>
> and
>
> lattice-tool -read-htk -htk-lmscale 10
> -in-lattice-list
> dev20klattice/Rescoredlattice2g/Alllattice.lst
> -viterbi-decode
>
> The decoding works with the lattice generated by HTK
> (not rescored by lattice-tool), our problem is with
> the rescored lattices.
>
> We could not figure out the problem, the options we
> used seem right.
> We also tried to use -htk-quotes
> option, but without success. Would you please give us
> some hints?
> Thank you,
>
> Martha & Solomon

From goldberg at cs.wisc.edu Sat Apr 5 21:15:09 2008
From: goldberg at cs.wisc.edu (Andrew Goldberg)
Date: Sat, 5 Apr 2008 23:15:09 -0500
Subject: Querying count-based LM for specific n-gram probabilities
Message-ID: <55141B22-A42B-482C-A8B5-D9608AC6CE7E@cs.wisc.edu>

Dear list,

I am using the Google 1T ngram corpus, and have successfully built a
count-based LM as per the instructions on the FAQ. Thanks for those
tips to get started! I have also been able to compute perplexities
for test sentences using the -ppl option of the ngram program, and
got this working with the newer server options, too! Very cool.

However, what I really want to do is to be able to retrieve just the
probabilities for particular n-grams to use them in another
application. In other words, given a word and a history (say, words
h1 h2 h3 h4), I would like to know the LM's probability
P( word | h1 h2 h3 h4 ), after taking into account interpolation, etc.
I know one hack-ish way to do this would be to put "h1 h2 h3 h4 w" in
a test file, and then parse the debug output to get the desired
probability. This would be complicated for higher-order ngrams since
the output truncates the histories with "..."; plus this idea of
parsing the output just seems really messy. Since I'm using the Google
corpus with a count-based model, I don't think it's possible/feasible
to write the model's probabilities to disk, but maybe there's a way
around this using -limit-vocab.

So my question is:
Is there a direct way to query for a specific probability using one
of the existing programs (i.e., to find P( is | my name), specify
some options like -word "is" -history "my name")? Or is my only
option to use the libraries to write my own tool for this purpose? If
so, can you recommend an existing program that would be a good place
to start? What would be especially great is if I could request ngram
probabilities as described here using the LM server options (i.e.,
start the server and load the counts for some limited vocab, then
have a client program that can make requests).

Thanks in advance!

- Andrew

From stolcke at speech.sri.com Sat Apr 5 21:52:22 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 05 Apr 2008 21:52:22 PDT
Subject: Querying count-based LM for specific n-gram probabilities
In-Reply-To: Your message of Sat, 05 Apr 2008 23:15:09 -0500. <55141B22-A42B-482C-A8B5-D9608AC6CE7E@cs.wisc.edu>
Message-ID: <200804060452.m364qMC11900@huge>

Have a look at the ngram -counts option.

--Andreas

From amantrac at ulb.ac.be Tue Apr 8 06:08:19 2008
From: amantrac at ulb.ac.be (Amin Mantrach)
Date: Tue, 8 Apr 2008 15:08:19 +0200
Subject: Text Categorisation using SRILM package
Message-ID:

Dear SRILM users,

I would like your opinion on using the SRILM package for text
categorisation. My goal is to compare, on some well-known data sets
(newsgroups, Reuters, ...) and on other data sets, the classification
performance of the SRILM package against other well-known techniques
(SVMs, decision trees, ...) that give good results.

The one problem I am facing is that the SRILM package is very large,
and I would be embarrassed if a wrong configuration on my part affected
the results.

So I am submitting the methodology I plan to use, in order to get your
advice, suggestions and corrections.

Each data set (pre-processed with stop-word removal and stemming) has a
number of categories.
Each document belongs to exactly one category (multi-class,
single-label). For each category I build a trainingFile containing all
the documents of that category. Then, for each category, I build a model
file using the following command:

    ngram-count -text trainingFile -lm modelFile

I am using 10-fold cross-validation to avoid over-fitting, so each
trainingFile consists of 90% of the documents. The resulting model is
tested on the remaining 10% with the following command:

    ngram -lm modelFile -ppl testFile -debug 0

The output gives me the perplexity as well as the logprob. I take the
logprob to be the log-likelihood of the data, i.e.
log P(document | category).
(Is it OK to use the logprob directly, or should I use the perplexity?
Since each category has its own vocabulary, could OOVs influence the
categorisation?)

For the categorisation I am using Bayes' rule:

    P(category | document) = P(document | category) * P(category) / P(document)

Since P(document) is the same for all categories, I obtain the posterior
probability (up to that constant) simply as
P(document | category) * P(category). I estimate the prior as the
proportion of all documents that belong to that category.

Finally, I assign a document to the category with the maximum posterior
probability P(category | document).

Do you think this simple test is good enough to assess the
classification performance of the SRILM package, or is it necessary to
use other commands to take other factors (such as OOVs, ...) into
account?

Thank you for your contribution. I hope this question will also help
other users later.

@min.
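
A minimal sketch of the decision rule described in this message, assuming the logprob values are base-10 logs of P(document | category), which is how SRILM reports them. The function names, category labels and numbers below are made up for illustration; nothing here is part of SRILM.

```python
import math


def classify(logprobs, doc_counts):
    """Pick the category maximising log10 P(doc | cat) + log10 P(cat).

    logprobs:   category -> log10 P(document | category), one per category model
    doc_counts: category -> number of training documents in that category
    """
    total_docs = sum(doc_counts.values())
    scores = {}
    for category, logprob in logprobs.items():
        prior = doc_counts[category] / total_docs
        scores[category] = logprob + math.log10(prior)
    best = max(scores, key=scores.get)
    return best, scores


def posteriors(scores):
    """Normalise log10 scores into P(category | document) without underflow."""
    top = max(scores.values())
    weights = {c: 10.0 ** (s - top) for c, s in scores.items()}
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}


# Hypothetical numbers: one logprob per category model for the same document.
best, scores = classify({"sports": -120.0, "politics": -118.5},
                        {"sports": 300, "politics": 100})
print(best, posteriors(scores))
```

Staying in the log domain avoids underflow, since 10 to the power of a document-level logprob is far too small to represent directly. Note that the comparison is only fair if every category model scores the same set of tokens, which is exactly the OOV concern raised in the message.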

From ebicici at ku.edu.tr Wed Apr 9 06:57:01 2008
From: ebicici at ku.edu.tr (Ergun Bicici)
Date: Wed, 9 Apr 2008 16:57:01 +0300
Subject: ngram-count -read performance difference for tokens that start with different characters
Message-ID: <4ded78d60804090657g18bbd725t9b4b233fe25196dc@mail.gmail.com>

Dear SRILM List Members,

I am using / augmenting SRILM for our own language modeling purposes. One
decision that I made is to build separate language models for different
types of tokens. In my corpus, one type of token starts with a '+'
character, whereas another does not. Although their counts are exactly
the same, and their respective count files and the language models
generated from them have similar sizes, I am observing a significant
difference in how long the ngram-count command takes to run on each.

For instance, for the tokens that do not start with a '+', ngram-count
(using the -read option) builds a language model from the training data
count file in 6 seconds, whereas for the other type it takes 42 seconds.
Thus there seems to be a 6-7 times difference in ngram-count performance
between count files generated for tokens that start with a '+' and for
the ones that do not.

I am curious whether there is an internal decision that prevents model
building for tokens that start with a '+' character from running as fast
as for tokens of other types. What might be causing this performance
difference?

Thanks,
Ergun

--
Ergun Bicici
Koc University

From stolcke at speech.sri.com Sat Apr 12 00:03:54 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 12 Apr 2008 00:03:54 PDT
Subject: ngram-count -read performance difference for tokens that start with different characters
In-Reply-To: Your message of Wed, 09 Apr 2008 16:57:01 +0300.
    <4ded78d60804090657g18bbd725t9b4b233fe25196dc@mail.gmail.com>
Message-ID: <200804120703.m3C73sE08679@huge>

Ergun,

your problem has nothing to do with the characters in your words.
The problem is in the counts themselves. Your two counts files have
different count values, and that is all that matters.

The counts file containing the '+' characters has a peculiar distribution
of unigram counts (after applying the KN discounting).
In interpolated discounting the uniform distribution is added to the
unigram estimates; for some reason in this case this makes the
probabilities sum to something > 1.
This triggers a "counter-measure" that successively increments the
denominator in the estimator, and in this case this has to be repeated
many, many times to yield a proper unigram probability distribution.
Hence the long run time.

There are two ways to fix this. Either avoid using interpolated KN
discounting for the unigrams, i.e. instead of -interpolate, use

    -interpolate2 -interpolate3

or download the updated SRILM beta release, which has an automatic fix
for this problem.

Andreas

From stolcke at speech.sri.com Tue Apr 15 12:24:14 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 15 Apr 2008 12:24:14 -0700
Subject: SRILM to Sphinx lm.DMP
In-Reply-To: <8746867BF368F94EA9DAAF28B89E9B220297E2@exchange.mediaclipping.local>
References: <8746867BF368F94EA9DAAF28B89E9B220297E2@exchange.mediaclipping.local>
Message-ID: <480500DE.8060009@speech.sri.com>

Christian Schrumpf wrote:
> Dear Mr. Stolcke,
>
> how can I convert an n-gram lm produced with the ngram-count program of
> SRILM to a lm I can use in Sphinx 3?
> Thank you in advance.

I understand Sphinx LMs require the N-grams to be sorted in a certain way.
The "sort-lm" command described in the lm-scripts man page was made for
this reason.

If you google "srilm sphinx" you will find several mentions of apparently
successful use of SRILM in combination with Sphinx.

Andreas

From sopheap.seng at gmail.com Wed Apr 16 03:40:36 2008
From: sopheap.seng at gmail.com (Sopheap SENG)
Date: Wed, 16 Apr 2008 12:40:36 +0200
Subject: SRILM to Sphinx lm.DMP
In-Reply-To: <480500DE.8060009@speech.sri.com>
References: <8746867BF368F94EA9DAAF28B89E9B220297E2@exchange.mediaclipping.local> <480500DE.8060009@speech.sri.com>
Message-ID: <3b7711ea0804160340yf33eb7k8191d79f11f39c54@mail.gmail.com>

Hello,

On the Sphinx website (http://cmusphinx.sourceforge.net/html/cmusphinx.php)
there is a tool called lm3g2dmp that converts a 3-gram LM to the binary
DMP format used by the Sphinx 3 decoder.

The ngram-count program doesn't output n-grams in the right order for
Sphinx's lm3g2dmp utility. You will need to re-sort the LM somehow.

sort-lm could do that, but I used a script written by
fuegen at ira.uka.de to convert the LM before passing it to lm3g2dmp.

If you cannot find this script on the net, please e-mail me.

Best,

Sopheap

--
---------------------------------------------
Sopheap SENG
Laboratoire d'Informatique de Grenoble (LIG)
Equipe GETALP Bureau C118
220, avenue de la Chimie
Campus Scientifique, BP53
38041 GRENOBLE Cedex 9, FRANCE
Tel : (33)-4-76-63-55-81
Fax : (33)-4-76-63-55-52
Email : sopheap.seng at imag.f
URL : http://www-geod.imag.fr
---------------------------------------------
Lecturer
Institut de Technologie du Cambodge
BP 86, Bd de Pochentong
Phnom Penh - Cambodge
Tel : (855)-23-88-03-70/98-24-45
Fax : (855)-23-88-03-69
Email : sopheap.seng at itc.edu.kh
URL : http://www.itc.edu.kh
---------------------------------------------

From yannick.esteve at lium.univ-lemans.fr Wed Apr 16 06:28:59 2008
From: yannick.esteve at lium.univ-lemans.fr (Yannick Estève)
Date: Wed, 16 Apr 2008 15:28:59 +0200
Subject: SRILM to Sphinx lm.DMP
In-Reply-To: <3b7711ea0804160340yf33eb7k8191d79f11f39c54@mail.gmail.com>
References: <8746867BF368F94EA9DAAF28B89E9B220297E2@exchange.mediaclipping.local> <480500DE.8060009@speech.sri.com> <3b7711ea0804160340yf33eb7k8191d79f11f39c54@mail.gmail.com>
Message-ID: <23E3DCB0-D505-402E-845F-28FB6FB68276@lium.univ-lemans.fr>

In fact, I believe it is also necessary to use the "add-dummy-bows"
script, which is part of SRILM. lm3g2dmp expects a backoff value for
every lower-order n-gram, and this script adds the value 0 for missing
back-off weights.

So you have to do something like this:

    #srilm tools
    gunzip -c lm.arpa.gz | gawk -f sort-lm | gzip -c > lm.sorted.arpa.gz
    gunzip -c lm.sorted.arpa.gz | gawk -f add-dummy-bows | gzip -c > lm.sphinx.arpa.gz

and then:

    #cmu sphinx tools
    lm3g2dmp lm.sphinx.arpa.gz .

Notice that only 3-gram LMs work with Sphinx. LIUM distributes an
open-source tool which allows rescoring sphinx3 word lattices with a
4-gram LM. It is available here:
http://www-lium.univ-lemans.fr/tools/index.php?option=com_content&task=blogcategory&id=21&Itemid=47

Best regards,
-Yannick

From stolcke at speech.sri.com Wed Apr 16 10:22:53 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 16 Apr 2008 10:22:53 PDT
Subject: AW: SRILM to Sphinx lm.DMP
In-Reply-To: Your message of Wed, 16 Apr 2008 16:04:12 +0200. <8746867BF368F94EA9DAAF28B89E9B220297E4@exchange.mediaclipping.local>
Message-ID: <200804161722.m3GHMr817618@huge>

>
> Dear Mr. Stolcke,
>
> Thanks for your fast reply. I already tried the "sort-lm" script you
> suggested. Unfortunately using this sorted n-gram lm with the tool
> "lm3g2dmp" results in errors.
> I found out that the LM output of the
> SRILM n-gram tool has no values if the backoff weight is 0. This fact
> causes the errors in "lm3g2dmp". By adding the value 0.0 to every 1-gram
> and 2-gram with no backoff weight, I managed to have the model dumped in
> Sphinx 3 format.
> I wonder why there are so many backoff weights 0? Does this depend on
> these warnings I get?
>
> warning: no singleton counts
> GT discounting disabled
> warning: no singleton counts
> GT discounting disabled
> warning: no singleton counts
> GT discounting disabled
>
> I call the program with:
>
> ngram-count -order 3 -vocab in.vocab -read-with-mincounts -read in.count
> -lm out.lm -gt1min 1 -gt2min 3 -gt3min 3 -gt1max 7 -gt2max 7 -gt3max 7
>
> What do I have to change to avoid these warnings? How can I get
> backoff weights that are not 0 for 1-grams? Example output:
>
> -4.648435 b_aI_n_a:_@ -2.68299
> -6.30688 b_aI_n_a:_m_@_n
> -6.186905 b_aI_n_b_r_U_x -1.056842

You are getting the warnings because -read-with-mincounts discards counts
below your minimum counts, yet those are needed for computing the
discounting factors according to the Good-Turing method.
If memory is not an issue, simply don't use -read-with-mincounts.
If memory is a problem, use the "make-big-lm" script instead of
ngram-count (and read the FAQ on memory issues).

The "missing" backoff weights are not there because they are redundant.
Only ngrams that are a prefix to longer ngrams need a backoff weight.
Due to count cutoffs, you typically have many lower-order ngrams that
don't need backoff weights.

As someone already pointed out, you can use the "add-dummy-bows"
command to insert 0 backoff weights for software that requires them.
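
For readers who want to see what inserting 0 backoff weights amounts to, here is a rough Python sketch of the idea. It is not the add-dummy-bows script itself; the function name is hypothetical, and it assumes a plain ARPA-format file in which each n-gram line is "logprob word1 ... wordN [backoff]" and words contain no whitespace.

```python
import re


def add_dummy_bows(lines):
    """Return ARPA lines with a 0 backoff weight added where one is missing.

    Only n-grams below the highest order are touched, since the highest
    order never carries a backoff weight.
    """
    max_order = 0   # highest N seen in the "ngram N=count" header lines
    order = 0       # order of the section being read (0 = outside any section)
    out = []
    for line in lines:
        text = line.rstrip("\n")
        header = re.match(r"ngram (\d+)=", text)
        section = re.match(r"\\(\d+)-grams:", text)
        if header:
            max_order = max(max_order, int(header.group(1)))
        elif section:
            order = int(section.group(1))
        elif text.startswith("\\"):
            order = 0   # the data or end marker
        elif order and order < max_order:
            fields = text.split()
            # logprob plus N words and nothing else: the backoff is missing
            if len(fields) == order + 1:
                text += "\t0"
        out.append(text)
    return out


sample = [
    "\\data\\", "ngram 1=2", "ngram 2=1", "",
    "\\1-grams:", "-1.0\ta\t-0.5", "-2.0\tb", "",
    "\\2-grams:", "-0.3\ta b", "",
    "\\end\\",
]
print("\n".join(add_dummy_bows(sample)))
```

In the sample, only the unigram "b" gains a trailing 0: "a" already has a backoff weight, and the bigram belongs to the highest order. A backoff weight of 0 in the log domain is a factor of 1, so the model's probabilities are unchanged.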

Andreas

From Dmitriy.Dligach at colorado.edu Thu Apr 17 10:01:29 2008
From: Dmitriy.Dligach at colorado.edu (Dmitriy Dligach)
Date: Thu, 17 Apr 2008 11:01:29 -0600 (MDT)
Subject: wild cards
Message-ID:

Hello,

I am new to language modeling and to SRILM, so I apologize if this
question has already been discussed here:

SRILM can compute the probability of a string such as "a b c d e". I was
wondering if there is a way to compute the probability of a string where
one of the words is a wildcard. E.g., suppose I want to compute the
probability of "a b * d e" where "*" is any word.

I believe this probability P(a, b, *, d, e) can be approximated as
P(a, b) * P(d, e), but I am still wondering whether there is a better way
to compute it (e.g. by passing the wildcard to SRILM).

I need it to compute a conditional probability (e.g. P(c | a, b, d, e)).

Thank you!

Dmitriy

From stolcke at speech.sri.com Thu Apr 17 11:24:05 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 17 Apr 2008 11:24:05 PDT
Subject: wild cards
In-Reply-To: Your message of Thu, 17 Apr 2008 11:01:29 -0600.
Message-ID: <200804171824.m3HIO5318226@huge>

Sorry, no wildcards in SRILM!

What you need to do is collect the ngram counts yourself (simple, using
a gawk or perl script), and structure them such that n(a, b, c, d, e)
and n(a, b, d, e) are given to ngram-count pretending to be for the
ngrams "a b d e c" and "a b d e", respectively. ngram-count will then
compute the desired conditional probabilities for you.

--Andreas

From Dmitriy.Dligach at colorado.edu Wed Apr 23 09:01:38 2008
From: Dmitriy.Dligach at colorado.edu (Dmitriy Dligach)
Date: Wed, 23 Apr 2008 10:01:38 -0600 (MDT)
Subject: begining/end of sentence tags
Message-ID:

Andreas,

First of all I wanted to thank you for your SRILM toolkit; I find it
extremely useful in my research!

Also, I had a question about the beginning/end of sentence tags:

I need to compute probabilities of strings that are *not* complete
sentences. My understanding is that both the 'ngram-count' and 'ngram'
tools automatically add these tags if they are not explicitly present.

Is there any way to prevent the 'ngram' tool from doing so?

Perhaps the '-limit-vocab' option can somehow help by specifying all
words in the vocabulary except for <s> and </s>?

Thanks,

Dima

From jachym at kky.zcu.cz Wed Apr 23 13:53:48 2008
From: jachym at kky.zcu.cz (Jachym Kolar)
Date: Wed, 23 Apr 2008 22:53:48 +0200
Subject: begining/end of sentence tags
In-Reply-To:
References:
Message-ID: <20080423225348.cohg3783y8k0g8oc@webmail.zcu.cz>

Dmitriy,
you can use the "continuous-ngram-count" script to generate counts
not containing sentence boundary tags. It can be combined with
ngram-count, such as

    continuous-ngram-count order=3 train.txt | ngram-count -read - -lm lm3gram

Best,
Jachym

From stolcke at speech.sri.com Thu Apr 24 15:31:07 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 24 Apr 2008 15:31:07 PDT
Subject: begining/end of sentence tags
In-Reply-To: Your message of Wed, 23 Apr 2008 22:53:48 +0200. <20080423225348.cohg3783y8k0g8oc@webmail.zcu.cz>
Message-ID: <200804242231.m3OMV7V29097@huge>

Jachym is right, and you can use a similar approach for testing the LM
(using continuous-ngram-count and ngram -counts).

I suspect that Dmitriy wants to preserve sentences as units, and just
needs to avoid <s> and </s> being added automatically. This is also
possible, by counting the ngrams first, and then filtering out those
that have the start/end tags.
However, it is much easier to use the latest beta version of SRILM
(now on the web server) that has the options -no-sos -no-eos for
ngram and ngram-count.

Andreas

From ioparin at yahoo.co.uk Wed May 7 08:01:59 2008
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Wed, 7 May 2008 15:01:59 +0000 (GMT)
Subject: behaviour of class models
Message-ID: <67936.54080.qm@web25405.mail.ukl.yahoo.com>

Hi,

I have recently found some unexpected behaviour of class-based models in
SRILM. It will probably be useful for other people who also work with
such models for inflectional languages to know about this.

I am working with linguistically motivated classes. What I have is, for
example, a stem-based model. That is, stems are regarded as classes for
wordforms, as encoded in a class definition file that I generate myself;
that's rather straightforward.
Then, when I calculate perplexity with that model, and in interpolation
with the conventional word LM, it comes out much lower than I expect for
the given data and vocabulary. At the same time some stems (that serve
as classes) coincide with some of the wordforms (that is natural), so I
had the feeling that the unexpected numbers result from the fact that
ngram with the -classes option can treat a class LM file as consisting
of both class markers and wordforms (if some entry is not listed among
the classes). That probably skews the results in my case. After I added
a postfix to each stem in both the class definition and the LM file,
which guaranteed that no stem coincides with a wordform, the perplexity
results became much more realistic.

best regards,
Ilya

From stolcke at speech.sri.com Wed May 7 10:14:15 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 07 May 2008 10:14:15 PDT
Subject: behaviour of class models
In-Reply-To: Your message of Wed, 07 May 2008 15:01:59 -0000. <67936.54080.qm@web25405.mail.ukl.yahoo.com>
Message-ID: <200805071714.m47HEFZ22405@huge>

To clarify: A class-based LM can contain N-grams that mix word and class
labels. For example, it might contain the N-gram "a B c D" where
the lower-case tokens are words and the upper-case tokens are classes.
However, you still need to keep words and classes separate.
So "B" should not occur both as a word and as a class label (that is, in both the left- and right-hand columns of a class definitions file). You therefore need some kind of spelling convention that distinguishes words and classes if there are conflicts. Andreas From stolcke at speech.sri.com Fri May 30 09:05:50 2008 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Fri, 30 May 2008 09:05:50 -0700 Subject: [SRILM]: linear interpolation of LMs In-Reply-To: <483FEA4F.9090905@elis.ugent.be> References: <483FEA4F.9090905@elis.ugent.be> Message-ID: <484025DE.4020700@speech.sri.com> Bert Reveil wrote: > Dear Dr. Stolcke, > > I have recently been trying to evaluate linear combinations of LMs > using your SRILM-toolkit. Therefore I used the following command form > > "ngram -debug 0 -lm LM1.arpa -lambda 0.6/0.7/... -mix-lm LM2.arpa > -ppl some_text.txt" > > Although every run of this command returns plausible output, it also > produces the following warning/error-line > > BOW numerator for context "" is -0.1 < 0 > > At first I thought it might have been because I had some double spaces > in my texts, but after correcting that the warning still > remained... I've been looking this problem up on the mailing list, but > I have found no priors, so I'm directing this question to you... have > you got any idea what this warning means and how I can make it > disappear? Maybe I'm not using the 'ngram'-program correctly? The way you invoked ngram, it merges the two LMs into a single new backoff ngram model, and then uses that merged LM (this is also called "static" interpolation). In the merging step the backoff weights are recomputed to normalize the merged probabilities. The message you are seeing indicates that the unigram probabilities add up to something > 1. This could be a problem with your original LMs. Were those created by SRILM as well? If so, we need to investigate.
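A quick way to check whether the unigram probabilities of an ARPA-format LM add up to more than 1, which is the condition the warning points to, is to sum them directly. The following is only a minimal Python sketch and not part of SRILM. It assumes the usual ARPA text layout (a \1-grams: section in which every entry starts with a base-10 log probability); the function name unigram_mass and the toy model are made up for illustration.

```python
def unigram_mass(arpa_lines):
    """Sum the unigram probabilities found in the \\1-grams: section.

    arpa_lines is any iterable of text lines in ARPA format
    (for example an open file). Hypothetical helper, not an SRILM API.
    """
    total = 0.0
    in_unigrams = False
    for raw in arpa_lines:
        line = raw.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            if line.startswith("\\"):
                # Next section (\2-grams: or \end\) reached.
                break
            if not line:
                continue
            # First field is log10(prob); the word and backoff weight follow.
            total += 10.0 ** float(line.split()[0])
    return total


# Toy model whose unigrams deliberately sum to more than 1.
TOY_LM = """\\data\\
ngram 1=2

\\1-grams:
-0.1\ta
-0.1\tb

\\end\\
""".splitlines()

if __name__ == "__main__":
    mass = unigram_mass(TOY_LM)
    print("unigram mass:", mass, "(suspicious)" if mass > 1.0 + 1e-6 else "(ok)")
```

For a real model one would pass an open file object instead of TOY_LM; a result clearly above 1 suggests the LM should be renormalized before static interpolation.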
If you computed LM1 and LM2 by some other means you can use SRILM to renormalize them individually before doing the interpolation: ngram -lm LM1 -renorm -write-lm LM1norm Separate from all this, you can do "dynamic" interpolation where the mixed probabilities are computed on the fly. This is faster. Add the option "-bayes 0" to your ngram options in the command you used. Andreas > > With kind regards, > > Bert From deliverable at gmail.com Sun Jun 1 15:21:29 2008 From: deliverable at gmail.com (Alexy Khrabrov) Date: Sun, 1 Jun 2008 15:21:29 -0700 Subject: command line for make-big-lm In-Reply-To: <484025DE.4020700@speech.sri.com> References: <483FEA4F.9090905@elis.ugent.be> <484025DE.4020700@speech.sri.com> Message-ID: I'm studying training-scripts to estimate a big LM for modified Kneser-Ney. Will this do the job: make-big-lm -name my-kn-model -read my.counts.gz -max-per-file 10000000 -kndiscount 5 -- is -kndiscount all that's needed to trigger KN estimation? And the number is the maximum order N, i.e. we don't need to repeat it from 1 up to N, like -kndiscount 1, -kndiscount 2, ...? -- also, how do I estimate -max-per-file for 16 GB RAM and 5-grams? Cheers, Alexy From deliverable at gmail.com Sun Jun 1 15:58:08 2008 From: deliverable at gmail.com (Alexy Khrabrov) Date: Sun, 1 Jun 2008 15:58:08 -0700 Subject: auxiliary scripts and make-big-lm Message-ID: <3B9C2620-3BF0-4043-984C-2105FD5CE32D@gmail.com> What are the typical situations when some of the training-scripts are useful? E.g., there's get-gt-counts, which produces a few small files out of my huge 5-gram count file. Also there are make-gt-discounts, make-kn-counts, make-kn-discounts. Are these mostly called by make-big-lm, or do they have their own uses? With ngram-count, there's the -kn set of options to read the counts -- when is it useful to save/read them with it? I'd really like to try a few huge models with make-big-lm. Is it by itself sufficient for model estimation, calling the auxiliary scripts on its own?
Cheers, Alexy From stolcke at speech.sri.com Sun Jun 1 19:01:45 2008 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Sun, 01 Jun 2008 19:01:45 PDT Subject: command line for make-big-lm In-Reply-To: Your message of Sun, 01 Jun 2008 15:21:29 -0700. Message-ID: <200806020201.m5221jC09963@huge> In message you wrote: > I'm studying training-scripts to estimate a big LM for modified Kneser- > Ney. Will this do the job: > > make-big-lm -name my-kn-model -read my.counts.gz -max-per-file > 10000000 -kndiscount 5 > -- is -kndiscount all what's needed to trigger KN estimation? And the > number is the maximum order N, i.e. we don't need to repeat it from 1 > up to N, like -kndiscount 1, -kndiscount 2, ...? Not quite: use -kndiscount -order 5 > -- also, how do I estimate -max-per-file for 16 GB RAM and 5-grams? It really depends on your data, so it's hard to predict. 10000000 is the default actually, and 16GB is quite a bit of memory, so you should have no problem. Andreas From stolcke at speech.sri.com Sun Jun 1 19:07:15 2008 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Sun, 01 Jun 2008 19:07:15 PDT Subject: auxiliary scripts and make-big-lm In-Reply-To: Your message of Sun, 01 Jun 2008 15:58:08 -0700. <3B9C2620-3BF0-4043-984C-2105FD5CE32D@gmail.com> Message-ID: <200806020207.m5227F110271@huge> In message <3B9C2620-3BF0-4043-984C-2105FD5CE32D at gmail.com>you wrote: > What are the typical situations when some of the training-scripts are > useful? > > Eg., there're get-gt-counts, which produce a few small files out of > my huge 5-gram count file. Also there're make-gt-discounts, make-kn- > counts, make-kn-discounts. Are these mostly called by make-big-lm, or > have their own uses? With the ngram-count, there's -kn set of options > to read the counts -- when is it useful to save/read them with it? The ngram-count -kn options are used to separate the discount estimation process from the LM building proper. 
They are used by make-big-lm to reduce the maximum amount of memory needed. > > I'd really like to try a few huge models with make-big-lm. Is it by > itself sufficient for model estimation, calling the auxiliary script > on its own? Yes, so you don't really have to know what these scripts do. Some of them are useful by themselves, or they could be used as instructional tools. Andreas From deliverable at gmail.com Sun Jun 1 21:01:42 2008 From: deliverable at gmail.com (Alexy Khrabrov) Date: Sun, 1 Jun 2008 21:01:42 -0700 Subject: command line for make-big-lm In-Reply-To: <200806020201.m5221jC09963@huge> References: <200806020201.m5221jC09963@huge> Message-ID: On Jun 1, 2008, at 7:01 PM, Andreas Stolcke wrote: > -kndiscount -order 5 So what happens when -kndiscount 5 is given but -order 5 is not -- the default order 3? (And what does it mean then?...) Cheers, Alexy From stolcke at speech.sri.com Sun Jun 1 21:10:07 2008 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Sun, 01 Jun 2008 21:10:07 PDT Subject: command line for make-big-lm In-Reply-To: Your message of Sun, 01 Jun 2008 21:01:42 -0700. Message-ID: <200806020410.m524A7p16868@huge> In message you wrote: > On Jun 1, 2008, at 7:01 PM, Andreas Stolcke wrote: > > -kndiscount -order 5 > > So what happens when -kndiscount 5 is given but -order 5 is not -- the > default order 3? (And what does it mean then?...) "-kndiscount 5" is not a meaningful option. I assume you mean "-kndiscount5", which enables KN discounting just for order 5 ngrams. You are right: if no -order option is specified the default is used, which is -order 3. In that case -kndiscount5 is ignored.
Andreas From deliverable at gmail.com Sun Jun 1 21:16:27 2008 From: deliverable at gmail.com (Alexy Khrabrov) Date: Sun, 1 Jun 2008 21:16:27 -0700 Subject: command line for make-big-lm In-Reply-To: <200806020410.m524A7p16868@huge> References: <200806020410.m524A7p16868@huge> Message-ID: <56BB4B80-8462-4970-94FA-6657A7D6B378@gmail.com> On Jun 1, 2008, at 9:10 PM, Andreas Stolcke wrote: > > "-kndiscount 5" is not a meaningful option. > I assume you mean "-kndiscount5", which enables KN discounting just > for > order 5 ngrams. Ah, I see. When I specified it with a space, I see that all order kndiscount were set to 1: kndiscount1=1 kndiscount2=1 kndiscount3=1 kndiscount4=1 kndiscount5=1 kndiscount6=1 kndiscount7=1 kndiscount8=1 kndiscount9=1 -- and with -kndiscount5 I get kndiscount1=0 kndiscount2=0 kndiscount3=0 kndiscount4=0 kndiscount5=1 kndiscount6=0 kndiscount7=0 kndiscount8=0 kndiscount9=0 -- so apparently it ignores the trailing five and does a full KN model if there's a space, right? I'm after the "full" KN model as per Chen & Goodman -- so should I just say -kndiscount without any numbers to get it, with -order 5 for 5-grams maximum? Cheers, Alexy From stolcke at speech.sri.com Sun Jun 1 21:20:55 2008 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Sun, 01 Jun 2008 21:20:55 PDT Subject: command line for make-big-lm In-Reply-To: Your message of Sun, 01 Jun 2008 21:16:27 -0700. <56BB4B80-8462-4970-94FA-6657A7D6B378@gmail.com> Message-ID: <200806020420.m524KtX17531@huge> In message <56BB4B80-8462-4970-94FA-6657A7D6B378 at gmail.com>you wrote: > > -- so apparently it ignores the trailing five and does a full KN model > if there's a space, right? I'm after the "full" KN model as per Chen > & Goodman -- so should I just say -kndiscount without any numbers to > get it, with -order 5 for 5-grams maximum? Indeed. 
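The outcome of this exchange can be written down as a small helper that assembles the make-big-lm command line for a full KN model. This is only an illustrative Python sketch: the function name and the file names are hypothetical, and the only options used are ones that appear in this archive (-name, -read, -lm, -max-per-file, -kndiscount, -order). It does not run SRILM; it just builds and prints the argument list.

```python
import shlex


def make_big_lm_argv(name, counts, lm, order, max_per_file=None):
    """Build an argument list for make-big-lm (hypothetical helper).

    -kndiscount is passed without a number, and the maximum N-gram
    order is given separately through -order, as advised in the thread.
    """
    argv = ["make-big-lm", "-name", name, "-read", counts, "-lm", lm]
    if max_per_file is not None:
        argv += ["-max-per-file", str(max_per_file)]
    argv += ["-kndiscount", "-order", str(order)]
    return argv


if __name__ == "__main__":
    # File names below are placeholders, not taken from a real setup.
    argv = make_big_lm_argv(
        name="my-kn-model",
        counts="my.counts.gz",
        lm="my-kn-model.lm.gz",
        order=5,
        max_per_file=10000000,
    )
    print(shlex.join(argv))
```

Keeping -kndiscount and -order as separate list items avoids the "-kndiscount 5" pitfall discussed above, where the trailing number is not interpreted as the order.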
Andreas From deliverable at gmail.com Thu Jun 5 01:40:30 2008 From: deliverable at gmail.com (Alexy Khrabrov) Date: Thu, 5 Jun 2008 01:40:30 -0700 Subject: format error in kncounts.gz Message-ID: <5AAD9631-3ACC-4690-82DE-0CADA3C058E3@gmail.com> I've created a KN model with SRILM 1.5.5, and then wanted to query it remotely. Got 1.5.7-beta for it. When loading into ngram -server-port, got this error: [alexyk at corptech]/data/press% ngram -lm model.kncounts.gz: line 1064849030: reached EOF before \end\ format error in lm file Do I need any other options for ngram, or is it a discrepancy between ngram-count from 1.5.5 and ngram from 1.5.7-beta? Cheers, Alexy From ioparin at yahoo.co.uk Thu Jun 5 05:46:13 2008 From: ioparin at yahoo.co.uk (ilya oparin) Date: Thu, 5 Jun 2008 12:46:13 +0000 (GMT) Subject: format error in kncounts.gz In-Reply-To: <5AAD9631-3ACC-4690-82DE-0CADA3C058E3@gmail.com> Message-ID: <653928.42500.qm@web25401.mail.ukl.yahoo.com> That usually means you're loading something other than an LM in the ARPA format. Have you visually checked your model.kncounts.gz? best regards, Ilya --- On Thu, 5/6/08, Alexy Khrabrov wrote: > From: Alexy Khrabrov > Subject: format error in kncounts.gz > To: "srilm-user" > Date: Thursday, 5 June, 2008, 12:40 PM > I've created a KN model with SRILM 1.5.5, and then > wanted to query it > remotely. Got 1.5.7-beta for it. When loading into ngram > -server-port, got this error: > > [alexyk at corptech]/data/press% ngram -lm model.kncounts.gz: > line > 1064849030: reached EOF before \end\ format error > in lm file > > Do I need any other options for ngram, or is it a > discrepancy between > ngram-count from 1.5.5 and ngram from 1.5.7-beta? > > Cheers, > Alexy
From deliverable at gmail.com Thu Jun 5 07:46:58 2008 From: deliverable at gmail.com (Alexy Khrabrov) Date: Thu, 5 Jun 2008 07:46:58 -0700 Subject: format error in kncounts.gz In-Reply-To: <653928.42500.qm@web25401.mail.ukl.yahoo.com> References: <653928.42500.qm@web25401.mail.ukl.yahoo.com> Message-ID: <19AD3854-B878-4CA1-8917-05D56C137BB7@gmail.com> Hmm -- I've run make-big-lm, and got a few small files, a .kndir, and that kncounts.gz -- which looks just like counts and is a few gigabytes, so I thought that's my model. I've posted my command line earlier when figuring out exactly the way to get a Kneser-Ney model... The kncounts.gz looks just like a counts file. The counts I fed to make-big-lm with -read are the ones I got with make/merge-batch-counts -order 5 for 5-grams. Should I have done anything extra before or after? Cheers, Alexy On Jun 5, 2008, at 5:46 AM, ilya oparin wrote: > That usually means you're loading something else than a LM in the > ARPA format. Have you visually checked your model.kncounts.gz? From ioparin at yahoo.co.uk Thu Jun 5 08:36:42 2008 From: ioparin at yahoo.co.uk (ilya oparin) Date: Thu, 5 Jun 2008 15:36:42 +0000 (GMT) Subject: format error in kncounts.gz In-Reply-To: <19AD3854-B878-4CA1-8917-05D56C137BB7@gmail.com> Message-ID: <857861.48200.qm@web25403.mail.ukl.yahoo.com> You have probably set the wrong parameters for make-big-lm or taken the wrong output file. make-big-lm -name name -read counts -lm new-model [ -trust-totals ] [-max-per-file M ] [ -ngram-filter filter ] [ ngram-options ... ] Could it be that you took the counts file (from the manual: "The -name parameter is used to name various auxiliary files. counts contains the raw N-gram counts; it may be (and usually is) a compressed file."), instead of the resulting LM file generated by the script (the name of which you put after the -lm option)?
Basically, a counts file is used to generate LMs that are subsequently read with "ngram -lm my_LM ...". A counts file is not a language model on its own. best regards, Ilya --- On Thu, 5/6/08, Alexy Khrabrov wrote: > From: Alexy Khrabrov > Subject: Re: format error in kncounts.gz > To: ioparin at yahoo.co.uk > Cc: "srilm-user" > Date: Thursday, 5 June, 2008, 6:46 PM > Hmm -- I've run make-big-lm, and got a few small files, > a .kndir, and > that kncounts.gz -- which looks just like counts and is a > few > gigabytes, so I thought that's my model. I've > posted my command line > earlier when figuring out exactly the way to get a > Kneser-Ney > model... The kncounts.gz looks just like a counts file. > > The counts I fed to make-big-lm with -read are the ones I > got with > make/merge-batch-counts -order 5 for 5-grams. Should I > have done > anything extra before or after? > > Cheers, > Alexy > > On Jun 5, 2008, at 5:46 AM, ilya oparin wrote: > > > That usually means you're loading something else > than a LM in the > > ARPA format. Have you visually checked your > model.kncounts.gz? From deliverable at gmail.com Thu Jun 5 14:51:58 2008 From: deliverable at gmail.com (Alexy Khrabrov) Date: Thu, 5 Jun 2008 14:51:58 -0700 Subject: format error in kncounts.gz In-Reply-To: <857861.48200.qm@web25403.mail.ukl.yahoo.com> References: <857861.48200.qm@web25403.mail.ukl.yahoo.com> Message-ID: Hmm -- I might have omitted the -lm new-model switch, and as a result got only the kncounts.gz file in that case. How would I get a KN model out of it, and is it faster than rerunning make-big-lm from scratch? Cheers, Alexy On Jun 5, 2008, at 8:36 AM, ilya oparin wrote: > You have probably set wrong parameters to make-big-lm or took wrong > output file.
> > make-big-lm -name name -read counts -lm new-model [ -trust-totals ] > [-max-per-file M ] [ -ngram-filter filter ] [ ngram-options ... ] > > May it happen that took counts file (from manual: "The -name > parameter is used to name various auxiliary files. counts contains > the raw N-gram counts; it may be (and usually is) a compressed file. > "), instead of the resulting LM file generated by the script (the > name of which you put after -lm option)? Basically a count file is > used to generate LMs that are subsequently read with "ngram -lm > my_LM ...". Counts file is not a language model on its own. > > > best regards, > Ilya > > > --- On Thu, 5/6/08, Alexy Khrabrov wrote: > >> From: Alexy Khrabrov >> Subject: Re: format error in kncounts.gz >> To: ioparin at yahoo.co.uk >> Cc: "srilm-user" >> Date: Thursday, 5 June, 2008, 6:46 PM >> Hmm -- I've run make-big-lm, and got a few small files, >> a .kndir, and >> that kncounts.gz -- which looks just like counts and is a >> few >> gigabytes, so I thought that's my model. I've >> posted my command line >> earlier when figuring out exactly the way to get a >> Kneser-Ney >> model... The kncounts.gz looks just like a counts file. >> >> The counts I fed to make-big-lm with -read are the ones I >> got with >> make/merge-batch-counts -order 5 for 5-grams. Should I >> have done >> anything extra before or after? >> >> Cheers, >> Alexy >> >> On Jun 5, 2008, at 5:46 AM, ilya oparin wrote: >> >>> That usually means you're loading something else >> than a LM in the >>> ARPA format. Have you visually checked your >> model.kncounts.gz? > > > __________________________________________________________ > Sent from Yahoo! Mail. 
> From stolcke at speech.sri.com Thu Jun 5 15:06:52 2008 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Thu, 05 Jun 2008 15:06:52 -0700 Subject: format error in kncounts.gz In-Reply-To: References: <857861.48200.qm@web25403.mail.ukl.yahoo.com> Message-ID: <4848637C.5060406@speech.sri.com> Alexy Khrabrov wrote: > Hmm -- I might have omitted the -lm new-model switch, and as a result > got only kncounts.gz file in that case. How would I get a KN model > out of it, and is it faster than rerunning make-big-lm from scratch? make-big-lm will reuse the files already generated provided you use the same -name argument. Andreas From deliverable at gmail.com Thu Jun 5 15:25:33 2008 From: deliverable at gmail.com (Alexy Khrabrov) Date: Thu, 5 Jun 2008 15:25:33 -0700 Subject: format error in kncounts.gz In-Reply-To: <4848637C.5060406@speech.sri.com> References: <857861.48200.qm@web25403.mail.ukl.yahoo.com> <4848637C.5060406@speech.sri.com> Message-ID: So if I already have a file my.kncounts.gz and a series of my.kn1,...,my.kn5 files, and just need a model out of them, should I restart with just make-big-lm -name my -lm my -max-per-file 100000000 -kndiscount -order 5 -read my.kncounts.gz -- how will it know the counts are already kncounts? Or do I need to supply the intermediate results with the -kn options? -- or should I instead give it the original counts to -read, and it will find the intermediate results? Cheers, Alexy On Jun 5, 2008, at 3:06 PM, Andreas Stolcke wrote: > Alexy Khrabrov wrote: >> Hmm -- I might have omitted the -lm new-model switch, and as a >> result got only kncounts.gz file in that case. How would I get a >> KN model out of it, and is it faster than rerunning make-big-lm >> from scratch? > make-big-lm will reuse the files already generated provided you use > the same -name argument.
> > Andreas > > > From deliverable at gmail.com Thu Jun 5 15:51:49 2008 From: deliverable at gmail.com (Alexy Khrabrov) Date: Thu, 5 Jun 2008 15:51:49 -0700 Subject: format error in kncounts.gz In-Reply-To: <4848637C.5060406@speech.sri.com> References: <857861.48200.qm@web25403.mail.ukl.yahoo.com> <4848637C.5060406@speech.sri.com> Message-ID: <98B7E56D-E81D-4CEB-BBAF-54F6C72197F3@gmail.com> Ah -- when saying only -read my.kncounts.gz, I get: using existing lm-press-kn5.kncounts.gz using existing gtcounts -- and it goes straight into ngram-count -lm estimation. Will -max-per-file still apply here, so if swapping hits, I can decrease and rerun from here again? Cheers, Alexy On Jun 5, 2008, at 3:06 PM, Andreas Stolcke wrote: > Alexy Khrabrov wrote: >> Hmm -- I might have omitted the -lm new-model switch, and as a >> result got only kncounts.gz file in that case. How would I get a >> KN model out of it, and is it faster than rerunning make-big-lm >> from scratch? > make-big-lm will reuse the files already generated provided you use > the same -name argument. > > Andreas > > > From cwsunshine at gmail.com Thu Jun 5 20:34:31 2008 From: cwsunshine at gmail.com (wei chen) Date: Fri, 6 Jun 2008 11:34:31 +0800 Subject: question about lattice-tool Message-ID: Hi all, I have just studied the functions of lattice-tool, and I cannot understand -split-multiwords well. (1) What does multiwords mean? (2) How do I use -split-multiwords? Is there any need to add another function, -multiword-dictionary? Thanks! From liuchangliang at hccl.ioa.ac.cn Thu Jun 5 22:34:47 2008 From: liuchangliang at hccl.ioa.ac.cn (liuchangliang) Date: Fri, 6 Jun 2008 13:34:47 +0800 Subject: Re: question about lattice-tool In-Reply-To: References: Message-ID: <008a01c8c797$08a7e870$19f7b950$@ioa.ac.cn> A multiword is, for example, "red_apple". A node in the lattice may be a multiword.
"-split-multiwords" means splitting the multiword node into single word nodes when loading the lattice: "red_apple" -> "red"-"apple" From: owner-srilm-user at speech.sri.com [mailto:owner-srilm-user at speech.sri.com] On behalf of wei chen Sent: 2008-06-06 11:35 To: srilm-user at speech.sri.com Subject: question about lattice-tool Hi all, I have just studied the functions of lattice-tool, and I cannot understand -split-multiwords well. (1) What does multiwords mean? (2) How do I use -split-multiwords? Is there any need to add another function, -multiword-dictionary? Thanks! From stolcke at speech.sri.com Thu Jun 5 23:10:49 2008 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Thu, 05 Jun 2008 23:10:49 -0700 Subject: format error in kncounts.gz In-Reply-To: <98B7E56D-E81D-4CEB-BBAF-54F6C72197F3@gmail.com> References: <857861.48200.qm@web25403.mail.ukl.yahoo.com> <4848637C.5060406@speech.sri.com> <98B7E56D-E81D-4CEB-BBAF-54F6C72197F3@gmail.com> Message-ID: <4848D4E9.30209@speech.sri.com> Alexy Khrabrov wrote: > Ah -- when saying only -read my.kncounts.gz, I get: > > using existing lm-press-kn5.kncounts.gz > using existing gtcounts > > -- and it goes straight into ngram-count -lm estimation. > > Will -max-per-file still apply here, so if swapping hits, I can > decrease and rerun from here again? -max-per-file is only used in generating the kncounts. So once they are generated the option is ignored. Andreas From sopheap.seng at gmail.com Thu Jun 5 23:23:15 2008 From: sopheap.seng at gmail.com (Sopheap SENG) Date: Fri, 6 Jun 2008 08:23:15 +0200 Subject: question about lattice-tool In-Reply-To: References: Message-ID: <3b7711ea0806052323s70554dd1xbc4d8e029800539c@mail.gmail.com> Hi, > I have just studied the functions of lattice-tool, and I cannot > understand -split-multiwords well. > (1) What does multiwords mean?
> A multiword is a node in the lattice that contains 2 or more words linked by "_", for example w1_w2. > (2) How do I use -split-multiwords? Is there any need to add another > function, -multiword-dictionary? > > When the option -split-multiwords is used, lattice-tool will split the node w1_w2 into 2 nodes: w1 and w2. Note that lattice-tool does not do anything about the score and the time frame of the new nodes. (The multiword-dictionary may do this.) Sopheap -- --------------------------------------------- Sopheap SENG Laboratoire d'Informatique de Grenoble (LIG) GETALP team Office C118 220, avenue de la Chimie Campus Scientifique, BP53 38041 GRENOBLE Cedex 9, FRANCE Tel: (33)-4-76-63-55-81 Fax: (33)-4-76-63-55-52 Email: sopheap.seng at imag.f URL: http://www-geod.imag.fr --------------------------------------------- Lecturer Institut de Technologie du Cambodge BP 86, Bd de Pochentong Phnom Penh - Cambodia Tel: (855)-23-88-03-70/98-24-45 Fax: (855)-23-88-03-69 Email: sopheap.seng at itc.edu.kh URL: http://www.itc.edu.kh --------------------------------------------- From cwsunshine at gmail.com Tue Jun 10 01:33:38 2008 From: cwsunshine at gmail.com (wei chen) Date: Tue, 10 Jun 2008 16:33:38 +0800 Subject: lattice tool on VC Message-ID: Hi all, I have just studied lattice-tool and know a little about it, so I want to build a project in VC 6.0. But it seems difficult to compile. Is there another way to study the code of lattice-tool? Thanks a lot. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From stolcke at speech.sri.com Tue Jun 10 07:09:56 2008 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Tue, 10 Jun 2008 07:09:56 -0700 Subject: lattice tool on VC In-Reply-To: References: Message-ID: <484E8B34.3050101@speech.sri.com> wei chen wrote: > Hi all, > I have just studied lattice-tool and know a little about it, > so I want to build a project in VC 6.0. But it seems difficult to > compile. Is there another way to study the code of lattice-tool? Thanks > a lot. I only know that it compiles with VC++ 8, which is available for free from microsoft.com as "Visual C++ 2005 Express Edition". Before trying anything else I would upgrade to that version. Andreas From deliverable at gmail.com Wed Jun 11 17:37:05 2008 From: deliverable at gmail.com (Alexy Khrabrov) Date: Wed, 11 Jun 2008 17:37:05 -0700 Subject: ppl1 Message-ID: What's the definition of ppl1 vs ppl as reported by ngram -debug 2 ? Cheers, Alexy From ioparin at yahoo.co.uk Thu Jun 12 02:14:28 2008 From: ioparin at yahoo.co.uk (ilya oparin) Date: Thu, 12 Jun 2008 09:14:28 +0000 (GMT) Subject: ppl1 In-Reply-To: Message-ID: <212333.97549.qm@web25402.mail.ukl.yahoo.com> ppl = 10^(-logprob / (ntokens - noov + nsentences)) ppl1 = 10^(-logprob / (ntokens - noov)) You can find information of this kind in the mailing list archive on the SRILM site. best regards, Ilya --- On Thu, 12/6/08, Alexy Khrabrov wrote: > From: Alexy Khrabrov > Subject: ppl1 > To: "srilm-user" > Date: Thursday, 12 June, 2008, 4:37 AM > What's the definition of ppl1 vs ppl as reported by > ngram -debug 2 ? > > Cheers, > Alexy
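The two formulas quoted in the ppl1 thread can be tried out with a few lines of Python. This is a minimal sketch that follows the definitions exactly as given in the message (ppl counts one end-of-sentence event per sentence in the denominator, ppl1 does not); the function name and the sample numbers are made up for illustration, and logprob is assumed to be the total base-10 log probability of the text.

```python
def perplexities(logprob, ntokens, noov, nsentences):
    """Return (ppl, ppl1) from summary statistics of a test text.

    logprob    -- total log10 probability of the text
    ntokens    -- number of word tokens
    noov       -- number of out-of-vocabulary tokens
    nsentences -- number of sentences
    """
    # ppl also counts one end-of-sentence token per sentence.
    ppl = 10.0 ** (-logprob / (ntokens - noov + nsentences))
    # ppl1 normalizes by the scored word tokens only.
    ppl1 = 10.0 ** (-logprob / (ntokens - noov))
    return ppl, ppl1


if __name__ == "__main__":
    # Made-up statistics: 2 sentences, 8 tokens, no OOVs, logprob -20.
    ppl, ppl1 = perplexities(logprob=-20.0, ntokens=8, noov=0, nsentences=2)
    print("ppl =", ppl, "ppl1 =", ppl1)
```

Because the ppl denominator is larger by nsentences, ppl is never higher than ppl1 for the same text.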
From stolcke at speech.sri.com Thu Jun 12 07:40:44 2008 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Thu, 12 Jun 2008 07:40:44 -0700 Subject: ppl1 In-Reply-To: <212333.97549.qm@web25402.mail.ukl.yahoo.com> References: <212333.97549.qm@web25402.mail.ukl.yahoo.com> Message-ID: <4851356C.6020106@speech.sri.com> ilya oparin wrote: > ppl = 10^(-logprob / (ntokens - noov + nsentences)) > ppl1 = 10^(-logprob / (ntokens - noov)) > > You can basically find info of such kind in the mailing archive on the SRILM site. > Or simply use Google. Searching for "srilm ppl1" will find the message that explains what ppl1 is and why it can be useful. Andreas From deliverable at gmail.com Thu Jun 12 23:53:52 2008 From: deliverable at gmail.com (Alexy Khrabrov) Date: Thu, 12 Jun 2008 23:53:52 -0700 Subject: Reproducing SuperARV paper with SRILM Message-ID: I'd like to reproduce the 2007 SuperARV paper for Russian. I have a parser and a corpus. Do I need extra things in addition to the parser and SRILM? Is the SuperARV equivalent to FLMs -- or if not, how was it implemented in SRILM (sample command lines would be good to look at)? Cheers, Alexy From save.climate at gmail.com Fri Jun 13 12:57:27 2008 From: save.climate at gmail.com (Kamadev Bhanuprasad) Date: Fri, 13 Jun 2008 21:57:27 +0200 Subject: Reproducing SuperARV paper with SRILM In-Reply-To: References: Message-ID: <244d59a50806131257u47c7c502id196f398f699b987@mail.gmail.com> Hi, On Fri, Jun 13, 2008 at 8:53 AM, Alexy Khrabrov wrote: > I'd like to reproduce the 2007 SuperARV paper for Russian. I have a parser > and a corpus. Do I need extra things in addition to the parser and SRILM? Yes, you also need a brain to think about it. > Is the SuperARV equivalent to FLMs -- or if not, how was it implemented in > SRILM (sample command lines would be good to look at)? > I really 'love' the way you use this forum.
Maybe you could try to think a bit or try to search Google before you send a question here. Then you will also figure out easily whether FLM and SARV are the same models. Best, Kamadev From save.climate at gmail.com Sun Jun 15 00:58:47 2008 From: save.climate at gmail.com (Kamadev Bhanuprasad) Date: Sun, 15 Jun 2008 09:58:47 +0200 Subject: Reproducing SuperARV paper with SRILM In-Reply-To: <3992F9BE-9991-41C5-B19C-A332145A1D4A@gmail.com> References: <244d59a50806131257u47c7c502id196f398f699b987@mail.gmail.com> <3992F9BE-9991-41C5-B19C-A332145A1D4A@gmail.com> Message-ID: <244d59a50806150058n7585cdb7v158cebfaee1c8abd@mail.gmail.com> Well, I guess we should have more exact rules for using the list. Or maybe having a web-based forum instead of the mailing list would be more useful and less annoying. Kamadev On Fri, Jun 13, 2008 at 11:08 PM, Alexy Khrabrov wrote: > Dude, stop spamming the list by CCing it with non-informative content -- > use your brain too, OK? > Nobody said you can't ask on the list whatever you want, however you want > it. Don't like it -- tough sh*t, as they say in the US. > > Best, > Alexy > > On Jun 13, 2008, at 12:57 PM, Kamadev Bhanuprasad wrote: > > Hi, > > On Fri, Jun 13, 2008 at 8:53 AM, Alexy Khrabrov > wrote: > >> I'd like to reproduce the 2007 SuperARV paper for Russian. I have a >> parser and a corpus. Do I need extra things in addition to the parser and >> SRILM? > > > yes, you also need a brain to think about it. > > >> Is the SuperARV equivalent to FLMs -- or if not, how was it implemented >> in SRILM (sample command lines would be good to look at)? >> > > I really 'love' the way you use this forum. Maybe you could try to think a > bit or try to inspect google before you send a question here. Then you will > also figure out easily whether FLM and SARV are the same models.
> > Best, > Kamadev