From abbas.malik at gmail.com Wed Jul 1 07:17:22 2009
From: abbas.malik at gmail.com (Abbas Malik)
Date: Wed, 1 Jul 2009 16:17:22 +0200
Subject: MAP File
Message-ID: <5462500907010717l2121903fm9e6ac0c305db88a3@mail.gmail.com>

Dear All,

I am running a statistical system using the 'disambig' command with a map file. In the map file, I want to map an empty string onto words from the corpus:

nothing w1 w2 w3...

That is, I want to establish a link between an empty string of V1 (nothing) and multiple choices from the data in V2. A normal entry in the map file looks like this:

w w1 w2 ...

Is it possible to just delete the w, put a space at the start of the line, and give the possible word list after this first space? I do not know whether such a line will establish links between the EMPTY STRING and the word list that follows the space at the start of the line.

I hope that one of you can help me.

Thank you in advance,
best regards,

--
---
M G Abbas Malik
Doctorant (PhD Student)
Université Joseph Fourier,
Groupe d'Etude pour la Traduction Automatique et le Traitement Automatisé des Langues et de la Parole (GETALP)
Laboratoire d'Informatique de Grenoble (LIG) / Grenoble Informatics Laboratory
GETALP, LIG-Campus, BP53
385 Rue de la Bibliothèque, 38041 Grenoble Cedex 9, France
Off: +33 (0)4 76 51 48 17
Mob: +33 (0)6 74 50 46 01
e-mail: abbas.malik at imag.fr abbas.malik at gmail.com
URL: www.puran.info

From abbas.malik at gmail.com Wed Jul 1 07:41:07 2009
From: abbas.malik at gmail.com (Abbas Malik)
Date: Wed, 1 Jul 2009 16:41:07 +0200
Subject: MAP File
In-Reply-To: <5462500907010717l2121903fm9e6ac0c305db88a3@mail.gmail.com>
References: <5462500907010717l2121903fm9e6ac0c305db88a3@mail.gmail.com>
Message-ID: <5462500907010741v4bf1b139p756917973dd64467@mail.gmail.com>

Dear All,

Issue 1: I am running a statistical system using the 'disambig' command with a map file.
In the map file, I want to map an empty string onto words from the corpus:

nothing w1 w2 w3...

That is, I want to establish a link between an empty string of V1 (nothing) and multiple choices from the data in V2. A normal entry in the map file looks like this:

w w1 w2 ...

Is it possible to just delete the w, put a space at the start of the line, and give the possible word list after this first space? I do not know whether such a line will establish links between the EMPTY STRING and the word list that follows the space at the start of the line.

I hope that one of you can help me.

Issue 2: In the map file, is it possible to map one word of V1 onto multiple words of V2? I mean that if we encounter a word w from V1, it is replaced or transformed by both words [w1 w2] of V2, such that w maps onto the sequence [w1 w2] as a whole and does not map onto w1 and w2 separately.

I hope that I have made my point clear.

Thank you in advance,

---
M G Abbas Malik
Doctorant (PhD Student)
Université Joseph Fourier,
Groupe d'Etude pour la Traduction Automatique et le Traitement Automatisé des Langues et de la Parole (GETALP)
Laboratoire d'Informatique de Grenoble (LIG) / Grenoble Informatics Laboratory
GETALP, LIG-Campus, BP53
385 Rue de la Bibliothèque, 38041 Grenoble Cedex 9, France
Off: +33 (0)4 76 51 48 17
Mob: +33 (0)6 74 50 46 01
e-mail: abbas.malik at imag.fr abbas.malik at gmail.com
URL: www.puran.info

From fsanchez at dlsi.ua.es Wed Jul 1 08:50:52 2009
From: fsanchez at dlsi.ua.es (Felipe Sánchez Martínez)
Date: Wed, 01 Jul 2009 16:50:52 +0100
Subject: Lattice Viterbi decoding
Message-ID: <1246463452.6600.43.camel@pipe>

Hi all,

I am using SRILM to score a set of translation candidates of a given sentence.
The sentence is divided into chunks, some of them having a fixed translation and others having different alternatives:

text1 | text2 | text3.1 or text3.2 | text4 | text5.1 or text5.2

As the number of combinations is exponential in the length of the sentences, I have been trying to use lattice-tool to compute the Viterbi path, but I am not able to make it work. I am using the following command line:

$ lattice-tool -viterbi-decode -in-lattice lattice.pfsg -lm model.lm -order 5 -debug 1

but I get exactly the same result with an n-gram order of 5, 3 or even 0. In addition, with the example sentence I am working with, I get a different path if I use SRILM in the usual way by scoring all possible translations of the sentence.

What am I doing wrong?

Thank you very much in advance.

PS: I am using srilm 1.5.7

--
Felipe Sánchez Martínez
Departamento de Lenguajes y Sistemas Informáticos
Universidad de Alicante, E-03071 Alicante (Spain)
Tel.: +34 965 903 400, ext: 2966 Fax: +34 965 909 326
http://www.dlsi.ua.es/~fsanchez

From stolcke at speech.sri.com Tue Jul 28 16:21:00 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 28 Jul 2009 16:21:00 PDT
Subject: Question Concerning ARPA-Format
In-Reply-To: Your message of Thu, 23 Jul 2009 01:49:10 -0700. <454548.84797.qm@web63405.mail.re1.yahoo.com>
Message-ID: <200907282321.n6SNL0d23219@ns2>

In message <454548.84797.qm at web63405.mail.re1.yahoo.com> you wrote:
>
> Dear Andreas Stolcke,
> I have a question concerning your toolkit/arpa-format. I know that this question could probably be answered by doing research - but after exhaustive research I found no real answer...
>
> I want to include a list of, say, syntactically equal words in an ARPA-slm, if possible as an external file. With this, my input sentences would look like this, e.g.:
>
> "Please give me the OBJECT"
> "Can I have the OBJECT"
>
> OBJECT: spoon, book, remote-control ...
(these in an external file)
>
>
> Can you have such an external reference with ARPA and your toolkit - or do you have to copy the sentences, like this:
>
> "Please give me the spoon"
> "Please give me the book"
> "Please give me the remote-control"
>
> "Can I have the spoon"
> "Can I have the book"
> "Can I have the remote-control"
>
>
> It would be great if you could give me a brief answer.

What you are describing is known as a "class-based" ngram LM. It is supported by SRILM. The steps are roughly:

1. Define the classes and their membership. The format is defined in the classes-format(5) man page. You can create one by hand, or induce word classes from a corpus based on bigram cooccurrence statistics, using the ngram-class(1) program.

2. Preprocess your training corpus to replace words with classes. See the replace-words-with-classes script described in the training-scripts(1) man page.

3. Train a standard ngram on the processed data, using ngram-count(1).

4. Test the class-based LM using ngram or another tool, supplying both the LM file and the class definitions file (from step 1), via the -classes option. See the ngram(1) man page.

Andreas

From kereoz at kereoz.org Thu Aug 6 00:11:14 2009
From: kereoz at kereoz.org (Christophe)
Date: Thu, 6 Aug 2009 16:11:14 +0900
Subject: Acoustic model
Message-ID: <20090806071110.GL10626@puredyne.hil.t.u-tokyo.ac.jp>

(I might have sent a similar message last week, but I don't think it actually worked - sorry if it did).

Hello,

I would like to use SRILM to compute the most likely sequence of words given an acoustic model and a language model.

My acoustic model is a simple matrix of observations. It corresponds to a sequence of observations along with the probabilities that they match words from the vocabulary. The vocabulary itself is composed of 24 words, so the matrix is Nx24 with N being the length of the sequence. In other words, I can get the probability of each node from this matrix.
My language model is an n-gram model generated by ngram-count. It gives me transition probabilities between nodes.

I think that the lattice-tool from SRILM can do the Viterbi decoding, but I can't figure out how to import the acoustic model into it. As far as I understand, it is expected to be in the pfsg format. Is there any tool that would allow me to generate such a lattice from what I have?

Thank you.

Kind regards,

--
Christophe
http://www.kereoz.org

From stolcke at speech.sri.com Wed Aug 12 10:47:11 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 12 Aug 2009 10:47:11 -0700
Subject: language models
In-Reply-To: <939770.48418.qm@web38005.mail.mud.yahoo.com>
References: <821248.89886.qm@web38007.mail.mud.yahoo.com> <4A53321D.9050807@speech.sri.com> <785361.72829.qm@web38004.mail.mud.yahoo.com> <4A664D26.5000805@speech.sri.com> <939770.48418.qm@web38005.mail.mud.yahoo.com>
Message-ID: <4A83001F.30006@speech.sri.com>

Md. Akmal Haidar wrote:
> Dear Andreas,
> Thanks for your reply.
> Should the sum of n-gram probabilities sharing a common (n-1)-gram be equal to 1?

No, because smoothing results in some probability mass being assigned to ngrams not observed in the training data (and hence not in the LM). This probability mass is then assigned to the unobserved ngrams via the backoff formula.

> if yes,
> Is there any tool to normalize the language model probabilities such that the sum of n-gram probabilities sharing a common (n-1)-gram is equal to 1?

To make the probabilities of only the observed ngrams add up to 1 you need to disable smoothing, and also make sure all observed ngrams are included in the model. Try ngram-count with these options:

-gt3min 1 -gt4min 1 (etc.) -gt1max 0 -gt2max 0 -gt3max 0 -gt4max 0 (etc.
up to the order of ngram you need) For more details on smoothing check http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html Andreas Andreas > 1 > Thanks > Best Regards > Akmal > > > ------------------------------------------------------------------------ > *From:* Andreas Stolcke > *To:* Md. Akmal Haidar > *Sent:* Tuesday, July 21, 2009 7:20:06 PM > *Subject:* Re: language models > > Md. Akmal Haidar wrote: > > Dear Andreas, > > Thanks for your reply. > > what is the difference between language model creating from a text > file and a count file. > > if i use like -text textfile -lm lmfile & -read countfile -vocab > vocabfile -lm lmfile. the first one gives smaller perplexity. > The difference is probably due to use of the -vocab option. It limits > the vocabulary of the LM. > If you use it in both cases, or not at all your should get the same > results. > > Andreas > > > Could you please tell me what's the reason? > > Thanks & Regards > > Akmal > > > > ------------------------------------------------------------------------ > > *From:* Andreas Stolcke > > > *To:* Md. Akmal Haidar > > > *Sent:* Tuesday, July 7, 2009 7:31:41 AM > > *Subject:* Re: Mixing several topic models > > > > Md. Akmal Haidar wrote: > > > Hi, > > > > > > I am new in srilm. > > > > > > I am working for language model adaptation using LDA. I need to mix > > > several topic models through weighting factor. > > > > Is there any way in srilm to mix several language models? > > Read the ngram(1) man page, specifically about the options -mix-lm, > > -mix-lm2, etc. 
> > > > Andreas > > > > > > > > Thanks > > > > > > Kind Regards > > > Akmal > > > > > > > > > > > > From jmcrego at limsi.fr Mon Aug 17 06:05:53 2009 From: jmcrego at limsi.fr (Josep Maria Crego) Date: Mon, 17 Aug 2009 15:05:53 +0200 Subject: using FLM library Message-ID: <409a8e0c0908170605y2f542550i215c724b5cbaa768@mail.gmail.com> dear users, I am trying to use the factored LM library classes (version 1.5.8) directly in my code. Mainly, I would like to use the flm function which is equivalent to *wordProb(ixWord, history)* in standard lm's of srilm. So, my question is: does it exist for FLM's a function equivalent to *wordProb(ixWord, history)* where ixWord and history consist of a vector of factors??? does anyone have an example of code employing ngram probabilities from a sequence of factors (Ex: W-wrd1:T-pos1 W-wrd2:T-pos2 ... W-wrdN:T-posN) according to a FLM description file??? thanks in advance, jm -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at speech.sri.com Mon Aug 17 14:26:18 2009 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Mon, 17 Aug 2009 14:26:18 -0700 Subject: srilm-user list changes Message-ID: <4A89CAFA.6000709@speech.sri.com> Hi all, we'll be switching the srilm-user mailing list management software from Majordomo to GNU Mailman some time later today. The list will be offline until then. All current members will be added to the new list, and you'll be getting a welcome message with instructions on how to manage your list membership.. Sorry for the troubles we've been having with the software in the recent past, and for the temporary disruption. 
Andreas From stolcke at speech.sri.com Mon Aug 17 14:40:56 2009 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Mon, 17 Aug 2009 14:40:56 PDT Subject: SRILM-USER list changes Message-ID: <200908172140.n7HLeu410817@ns2> Hi all, we will be switching the srilm-user mailing list management software from Majordomo to GNU Mailman some time later today. The list will be offline until then. All current members will be added to the new list, and you'll be getting a welcome message with instructions on how to manage your list membership.. Sorry for the troubles we've been having with the software in the recent past, and for the temporary disruption. Andreas From stolcke at speech.sri.com Mon Aug 17 15:47:00 2009 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Mon, 17 Aug 2009 15:47:00 -0700 Subject: [SRILM User List] Testing new srilm-user list Message-ID: <4A89DDE4.5070709@speech.sri.com> From stolcke at speech.sri.com Wed Aug 19 17:05:40 2009 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Wed, 19 Aug 2009 17:05:40 -0700 Subject: [SRILM User List] language models In-Reply-To: <595005.19488.qm@web38006.mail.mud.yahoo.com> References: <200908131724.n7DHOfW29493@ns2> <595005.19488.qm@web38006.mail.mud.yahoo.com> Message-ID: <4A8C9354.4010901@speech.sri.com> Md. Akmal Haidar wrote: > Hi, > I have three 3 lm file. > The first one i got by ngram-count. > The second one is by applying some matlab programming on the first. > The third one is by renormalizing the second one using ngram -renorm > option. > > In creating the third one, i faced some message like: BOW denominator > for context "been has" is -0.382151<=0, numerator is 0.846874 That's expected if you changed the probabilities such that they sum to > 1 for a given context. ngram -renorm cannot deal with this. It simply recomputes the backoff weights to normalize the model, but it won't change the existing ngram probabilities. 
Obviously if just the explicit ngram probabilities sum to > 1 there is no way to assign backoff weights such that the model is normalized, hence the above message. > > The second and third one gives too lowest perplexity(7.53 & 5.70) . > The first one gives 73.73 That's right, if your probabilities don't sum to 1 (over the entire vocabulary, for all contexts) perplexities are meaningless. You can run ngram -debug 3 -ppl to check that probabilities are normalized for all contexts occurring in your test set. I don't have a simple solution for your problem. Since you manipulated the probabilities you have to figure out a way to get them normalized ! I suggest you use the srilm-user mailing list if you want to seek further advice this. But you would first have to explain in more detail how you assign your probabilities. Andreas > > Could you please tell me whats the meaning of these message? > > Thanks & Regards > Haidar > > > ------------------------------------------------------------------------ > *From:* Andreas Stolcke > *To:* Md. Akmal Haidar akmalcuet00 at yahoo.com > > *Sent:* Thursday, August 13, 2009 1:24:41 PM > *Subject:* Re: language models > > > In message <92580.94445.qm at web38002.mail.mud.yahoo.com > >you wrote: > > > > Dear Andreas, > > I attahced 2 lm file. > > here, train3.lm is the original lm file which i got by applying > ngram-count. > > So does that file have probabilities summing to 1? > I would think not. > > > ntrain3.lm is the modified lm which i got by some matlab > programming. But, he > > re sum the of seen 2-gram probabilities sharing common 1 gram is > greater than > > 1. > > I cannot help you debugging you matlab script if that's what's giving > you unnormalized probabilities. > > > > > If i changed the 1 gram back off weight to make the sum of > 2-gram(seen & unse > > en) proability sharing common 1 gram is equal to 1, is the method > will correc > > t? > > yes. > > ngram -renorm will also do this for you. 
> > Andreas > > From stolcke at speech.sri.com Fri Aug 21 10:45:03 2009 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Fri, 21 Aug 2009 10:45:03 -0700 Subject: [SRILM User List] a question about FLMs and SRILM In-Reply-To: <409a8e0c0908210905q6e3be46bi4200c48915d5d95d@mail.gmail.com> References: <409a8e0c0908210905q6e3be46bi4200c48915d5d95d@mail.gmail.com> Message-ID: <4A8EDD1F.8080200@speech.sri.com> Josep Maria Crego wrote: > dear Andreas, > > My name is Josep M. Crego, a post-doc working on SMT at LIMSI-CNRS > (France) > > I am trying to use the factored LM library classes (version 1.5.8) > directly in my code. Mainly, I would like to use an flm function > equivalent to /*wordProb(ixWord, history)*/ for standard lm's of srilm. > > So, my question is: does it exist for FLM's a function equivalent to > /*wordProb(ixWord, history)*/ where ixWord and history consist of a > vector of factors??? it would be perfect if you could send me an > example of code employing ngram probabilities from a sequence of > factors (Ex: W-wrd1:T-pos1 W-wrd2:T-pos2 ... W-wrdN:T-posN) according > to a FLM description file. There is a wordProb function for FLMs, since FLMs are just a special kind of LM class. You need create an LM object of class ProductNgram and then invoke the wordProb function in it. Look in lm/src/ngram.cc for an example (look in the places where the variable "factored" is used). > > thanks in advance, > jm > > PS: sorry for sending directly the question to you... I don't know why > I couldn't use the srilm mailing list There were problems with the mailing list admin software. We solved those and there is now an easy way to join/leave the list. Just go to http://www.speech.sri.com/mailman/listinfo/srilm-user/ and follow the instructions there. 
Andreas

>
> --
> Josep-Maria Crego
> LIMSI-CNRS
> Phone: +33/0 1 69 85 80 68
> Postmail: BP 133, 91403 Orsay (France)

From shl.thcn at yahoo.com.cn Thu Aug 27 00:21:25 2009
From: shl.thcn at yahoo.com.cn (海龙 史)
Date: Thu, 27 Aug 2009 00:21:25 -0700 (PDT)
Subject: [SRILM User List] A confusion of the interpolated language model
Message-ID: <942355.92368.qm@web15307.mail.cnb.yahoo.com>

I am a new student user of SRILM from Asia. I used the command below to construct an interpolated modified Kneser-Ney discounted language model:

~ ngram-count -read merge_counts_1994-2003.gz -gt1min 0 -gt2min 0 -gt3min 2 -kndiscount -interpolate -order 3 -vocab ChWord.lexno -lm 1994-2003_lm_all_pruned.lm

However, in my model the back-off weight (bow) of several N-grams appears to be greater than 1. That is, in the text LM file, I have a line (here we just use a kind of index to represent a Chinese word):

-6.457229 1635 0.1270406

in which the log10(bow) is greater than 0. We don't think a normal interpolated discounting method can produce an N-gram bow greater than 1; besides, this only occurred for a few (fewer than 5) different N-grams. So I am confused and would like to ask whether anyone has encountered this or happens to know what is wrong.

Thank you very much!

Hailoon Shi
w63,EE Dpt.Thu Univ.PRC
-------------- next part --------------
An HTML attachment was scrubbed...
From yannick.esteve at lium.univ-lemans.fr Thu Aug 27 01:19:44 2009
From: yannick.esteve at lium.univ-lemans.fr (Yannick Estève)
Date: Thu, 27 Aug 2009 10:19:44 +0200
Subject: [SRILM User List] A confusion of the interpolated language model
In-Reply-To: <942355.92368.qm@web15307.mail.cnb.yahoo.com>
References: <942355.92368.qm@web15307.mail.cnb.yahoo.com>
Message-ID: <42586A5A-C583-4404-9018-EF2C0193C5F9@lium.univ-lemans.fr>

Hi,

Back-off weights are not probabilities: they can be greater than 1. So your values are normal.

You can find some explanation of the back-off weight computation here, particularly for the modified Kneser-Ney discounting method:
http://www.speech.sri.com/projects/srilm/manpages/pdfs/chen-goodman-tr-10-98.pdf

Regards,

Yannick Estève
LIUM - University of Le Mans
France

On 27 Aug 09, at 09:21, 海龙 史 wrote:

> I am a new student user of SRILM from Asia. I used the command below to construct an interpolated modified Kneser-Ney discounted language model:
> ~ ngram-count -read merge_counts_1994-2003.gz -gt1min 0 -gt2min 0 -gt3min 2 -kndiscount -interpolate -order 3 -vocab ChWord.lexno -lm 1994-2003_lm_all_pruned.lm
>
> However, in my model the back-off weight (bow) of several N-grams appears to be greater than 1. That is, in the text LM file, I have a line (here we just use a kind of index to represent a Chinese word):
> -6.457229 1635 0.1270406
> in which the log10(bow) is greater than 0. We don't think a normal interpolated discounting method can produce an N-gram bow greater than 1; besides, this only occurred for a few (fewer than 5) different N-grams. So I am confused and would like to ask whether anyone has encountered this or happens to know what is wrong.
> Thank you very much!
>
> Hailoon Shi
> w63,EE Dpt.Thu Univ.PRC
>
> __________________________________________________
> http:// > cn.mail.yahoo.com_______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at speech.sri.com Thu Aug 27 13:38:35 2009 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Thu, 27 Aug 2009 13:38:35 -0700 Subject: [SRILM User List] language models In-Reply-To: <2037.50270.qm@web38003.mail.mud.yahoo.com> References: <200908131724.n7DHOfW29493@ns2> <595005.19488.qm@web38006.mail.mud.yahoo.com> <4A8C9354.4010901@speech.sri.com> <2037.50270.qm@web38003.mail.mud.yahoo.com> Message-ID: <4A96EECB.4090107@speech.sri.com> Md. Akmal Haidar wrote: > > Hi, > Thanks for your reply. > I need to mix 20 topic models. srilm provide 10 LM file one at a time. > I use the following command:(t:topic,w:topic weight) > ngram -lm t1.lm w1 -mix-lm t2.lm w2 -mix-lm2 t3.lm w3 > .............-mix-lm9 t10.lm w10 -write-lm t1to10.lm > ngram -lm t11.lm w11 -mix-lm t12.lm w12 -mix-lm2 t13.lm w13 > .............-mix-lm9 t20.lm w20 -write-lm t11to20.lm > ngram -lm t1to10.lm .5 -mix-lm t11to20.lm .5 -write-lm t1to20.lm You can mix the models recursively. To mix three models L1 L2 L3 with weights w1 w2 w3 (w1 + w2+ w3 = 1) you first build L12 = w1/(w1+w2) L1 + w2/(w1+w2) L2 and then L = (w1 + w2) L12 + w3 L3. I'll leave it to you to generalize this to a larger number of models. Please direct future questions of this nature to the srilm-user mailing list. Andreas > > could you please tell me is the command correct for mixing LM file? > > Thanks > Akmal > > ------------------------------------------------------------------------ > *From:* Andreas Stolcke > *To:* Md. Akmal Haidar > *Cc:* srilm-user > *Sent:* Wednesday, August 19, 2009 8:05:40 PM > *Subject:* Re: language models > > Md. Akmal Haidar wrote: > > Hi, > > I have three 3 lm file. > > The first one i got by ngram-count. 
> > The second one is by applying some matlab programming on the first. > > The third one is by renormalizing the second one using ngram -renorm > option. > > In creating the third one, i faced some message like: BOW > denominator for context "been has" is -0.382151<=0, numerator is 0.846874 > That's expected if you changed the probabilities such that they sum to > > 1 for a given context. > ngram -renorm cannot deal with this. It simply recomputes the backoff > weights to normalize the model, but it won't change the existing ngram > probabilities. Obviously if just the explicit ngram probabilities sum > to > 1 there is no way to assign backoff weights such that the model > is normalized, hence the above message. > > The second and third one gives too lowest perplexity(7.53 & 5.70) . > The first one gives 73.73 > That's right, if your probabilities don't sum to 1 (over the entire > vocabulary, for all contexts) perplexities are meaningless. > > You can run ngram -debug 3 -ppl to check that probabilities are > normalized for all contexts occurring in your test set. > > I don't have a simple solution for your problem. Since you > manipulated the probabilities you have to figure out a way to get them > normalized ! I suggest you use the srilm-user mailing list if you > want to seek further advice this. But you would first have to explain > in more detail how you assign your probabilities. > > Andreas > > > Could you please tell me whats the meaning of these message? > > Thanks & Regards > > Haidar > > > > > ------------------------------------------------------------------------ > > *From:* Andreas Stolcke > > > *To:* Md. Akmal Haidar akmalcuet00 at yahoo.com > > > > *Sent:* Thursday, August 13, 2009 1:24:41 PM > > *Subject:* Re: language models > > > > > > In message <92580.94445.qm at web38002.mail.mud.yahoo.com > > >>you wrote: > > > > > > Dear Andreas, > > > I attahced 2 lm file. > > > here, train3.lm is the original lm file which i got by applying > ngram-count. 
> > > > So does that file have probabilities summing to 1? > > I would think not. > > > > > ntrain3.lm is the modified lm which i got by some matlab > programming. But, he > > > re sum the of seen 2-gram probabilities sharing common 1 gram is > greater than > > > 1. > > > > I cannot help you debugging you matlab script if that's what's giving > > you unnormalized probabilities. > > > > > > > > If i changed the 1 gram back off weight to make the sum of > 2-gram(seen & unse > > > en) proability sharing common 1 gram is equal to 1, is the method > will correc > > > t? > > > > yes. > > > > ngram -renorm will also do this for you. > > > > Andreas > > > > > > From akmalcuet00 at yahoo.com Fri Aug 28 09:50:07 2009 From: akmalcuet00 at yahoo.com (Md. Akmal Haidar) Date: Fri, 28 Aug 2009 09:50:07 -0700 (PDT) Subject: [SRILM User List] different perplexity Message-ID: <553811.97317.qm@web38005.mail.mud.yahoo.com> Hi, i faced a problem in perplexity calculation.. when i used the commands: 1) ngram -lm l1.lm -ppl t.txt? ? ??????????????????????????????????????? 2) ngram -lm l2.lm -lambda 0 -mix-lm l1.lm -ppl? t.txt the first gives lowest perplexity that the second one. Should the above commands give the different perplexity? thanks Akmal -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at speech.sri.com Fri Aug 28 10:39:45 2009 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Fri, 28 Aug 2009 10:39:45 -0700 Subject: [SRILM User List] different perplexity In-Reply-To: <553811.97317.qm@web38005.mail.mud.yahoo.com> References: <553811.97317.qm@web38005.mail.mud.yahoo.com> Message-ID: <4A981661.9050306@speech.sri.com> Md. Akmal Haidar wrote: > Hi, > > i faced a problem in perplexity calculation.. > when i used the commands: 1) ngram -lm l1.lm -ppl t.txt > 2) ngram -lm l2.lm -lambda 0 > -mix-lm l1.lm -ppl t.txt > > the first gives lowest perplexity that the second one. > Should the above commands give the different perplexity? 
They may, though not by much.

Realize that ngram -mix-lm WITHOUT the -bayes option performs an "ngram merging" that APPROXIMATES the result of interpolating the two LMs according to the classical formula. This is described in the SRILM paper:

> The ability to approximate class-based and interpolated N-gram
> LMs by a single word N-gram model deserves some discussion.
> Both of these operations are useful in situations where
> other software (e.g., a speech recognizer) supports only standard
> N-grams. Class N-grams are approximated by expanding class labels
> into their members (which can contain multiword strings) and
> then computing the marginal probabilities of word N-gram strings.
> This operation increases the number of N-grams combinatorially,
> and is therefore feasible only for relatively small models.
> An interpolated backoff model is obtained by taking the union
> of N-grams of the input models, assigning each N-gram the
> weighted average of the probabilities from those models (in some
> of the models this probability might be computed by backoff), and
> then renormalizing the new model. We found that such interpolated
> backoff models consistently give slightly lower perplexities
> than the corresponding standard word-level interpolated models.
> The reason could be that the backoff distributions are themselves
> obtained by interpolation, unlike in standard interpolation, where
> each component model backs off individually.

So the result may differ because the merging process introduces new backoff nodes into the LM, and that may change some probabilities arrived at through backing off.

However, if you use

ngram -lm l2.lm -lambda 0 -mix-lm l1.lm -ppl t.txt -bayes 0

you get exact interpolation and then the perplexities should be identical. But you cannot save such an interpolated model back into a single ngram LM.

In practice the difference should not matter (at least in my experience).
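For readers who want to see why exact interpolation with a main-model weight of 0 should give the same perplexity as the second model alone, here is a minimal Python sketch of the classical interpolation formula. It is only an illustration under stated assumptions: the two toy unigram distributions, the test words, and the helper names interpolate and perplexity are all hypothetical, and nothing here reproduces SRILM's backoff handling or its ngram-merging approximation.

```python
import math

def interpolate(p_main, p_mix, lam):
    """Classical linear interpolation: lam * p_main(w) + (1 - lam) * p_mix(w)."""
    vocab = set(p_main) | set(p_mix)
    return {w: lam * p_main.get(w, 0.0) + (1.0 - lam) * p_mix.get(w, 0.0)
            for w in vocab}

def perplexity(model, words):
    """Perplexity of a word sequence under a unigram distribution."""
    log_prob = sum(math.log(model[w]) for w in words)
    return math.exp(-log_prob / len(words))

# Hypothetical toy unigram models standing in for l2.lm (main) and l1.lm (mixed in).
l2 = {"a": 0.7, "b": 0.2, "c": 0.1}
l1 = {"a": 0.2, "b": 0.5, "c": 0.3}
test_words = ["a", "b", "b", "c"]

# With weight 0 on the main model, the mixture is just the mixed-in model.
ppl_l1 = perplexity(l1, test_words)
ppl_mixed = perplexity(interpolate(l2, l1, 0.0), test_words)
print(ppl_l1, ppl_mixed)
```

In this toy setting the two printed perplexities agree, which mirrors the "exact interpolation" case described above; any difference seen with real models would have to come from the merging step, not from the interpolation formula itself.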
Andreas > > thanks > > Akmal > > > > ------------------------------------------------------------------------ > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From akmalcuet00 at yahoo.com Fri Aug 28 12:17:10 2009 From: akmalcuet00 at yahoo.com (Md. Akmal Haidar) Date: Fri, 28 Aug 2009 12:17:10 -0700 (PDT) Subject: [SRILM User List] different perplexity In-Reply-To: <4A981661.9050306@speech.sri.com> References: <553811.97317.qm@web38005.mail.mud.yahoo.com> <4A981661.9050306@speech.sri.com> Message-ID: <309353.14233.qm@web38002.mail.mud.yahoo.com> ? Hi, Thanks for your reply. I need to compare two lm file by perplexity evaluation. ? 1. i) ngram -lm general.lm -lambda .5 -mix-lm l1.lm -ppl test1.txt ??? ii) ngram -lm general.lm -lambda .5 -mix-lm l1.lm -ppl test1.txt -bayes 0 ??????? in both commands it gives same perplexity but when 2. i) ngram -lm general.lm -lambda .5 -mix-lm l2.lm -ppl test1.txt ??????? ppl=460 ?? ii)ngram -lm general.lm -lambda .5 -mix-lm l2.lm -ppl test1.txt -bayes 0 ????? ppl=148 ??? the 2(ii)??command gives lower perplexity. ? could you please tell me why the second one gives lower perplexity? ? thanks akmal ??? Md. Akmal Haidar wrote: > Hi, >? i faced a problem in perplexity calculation.. > when i used the commands: 1) ngram -lm l1.lm -ppl t.txt? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 2) ngram -lm l2.lm -lambda 0 -mix-lm l1.lm -ppl? t.txt >? the first gives lowest perplexity that the second one. > Should the above commands give the different perplexity? They may, though not by much. Realize that ngram -mix-lm WITHOUT the -bayes option performs an "ngram merging" that APPROXIMATES the result of interpolating the two LMs according to the classical formula.? This is describe in the the SRILM paper: > The ability to approximate class-based and interpolated Ngram > LMs by a single word N-gram model deserves some discussion. 
> Both of these operations are useful in situations where > other software (e.g., a speech recognizer) supports only standard > N-grams. Class N-grams are approximated by expanding class labels > into their members (which can contain multiword strings) and > then computing the marginal probabilities of word N-gram strings. > This operation increases the number of N-grams combinatorially, > and is therefore feasible only for relatively small models. > An interpolated backoff model is obtained by taking the union > of N-grams of the input models, assigning each N-gram the > weighted average of the probabilities from those models (in some > of the models this probability might be computed by backoff), and > then renormalizing the new model. We found that such interpolated > backoff models consistently give slightly lower perplexities > than the corresponding standard word-level interpolated models. > The reason could be that the backoff distributions are themselves > obtained by interpolation, unlike in standard interpolation, where > each component model backs off individually. So the result may differ because because the merging process introduces new backoff nodes into the LM and that may change some probabilities arrived at through backing off. However, if you use ? ngram -lm l2.lm -lambda 0 -mix-lm l1.lm -ppl? t.txt -bayes 0 you get exact interpolation and then the perplexities should be identical. But you cannot save such an interpolated model back into a single ngram LM. In practice the difference should not matter (at least in my experience). Andreas >? thanks >? Akmal > >? > ------------------------------------------------------------------------ > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user ________________________________ From: Andreas Stolcke To: Md. 
Akmal Haidar
Cc: srilm-user
Sent: Friday, August 28, 2009 1:39:45 PM
Subject: Re: [SRILM User List] different perplexity
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From akmalcuet00 at yahoo.com Fri Aug 28 14:01:36 2009
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Fri, 28 Aug 2009 14:01:36 -0700 (PDT)
Subject: [SRILM User List] different perplexity
In-Reply-To: <4A984434.9090909@speech.sri.com>
References: <553811.97317.qm@web38005.mail.mud.yahoo.com> <4A981661.9050306@speech.sri.com> <309353.14233.qm@web38002.mail.mud.yahoo.com> <4A984434.9090909@speech.sri.com>
Message-ID: <477380.80961.qm@web38002.mail.mud.yahoo.com>

The perplexity for 1(i) is 450 and for 1(ii) is 450; both are the same.

By the way, some back-off weights for l2.lm are greater than 1.

thanks
Akmal

________________________________
From: Andreas Stolcke
To: Md. Akmal Haidar
Sent: Friday, August 28, 2009 4:55:16 PM
Subject: Re: [SRILM User List] different perplexity

Md. Akmal Haidar wrote:
> Hi,
> Thanks for your reply.
> I need to compare two LM files by perplexity evaluation.
> 1. i)  ngram -lm general.lm -lambda .5 -mix-lm l1.lm -ppl test1.txt
>    ii) ngram -lm general.lm -lambda .5 -mix-lm l1.lm -ppl test1.txt -bayes 0
>    Both commands give the same perplexity, but with
> 2. i)  ngram -lm general.lm -lambda .5 -mix-lm l2.lm -ppl test1.txt
>        ppl=460
>    ii) ngram -lm general.lm -lambda .5 -mix-lm l2.lm -ppl test1.txt -bayes 0
>        ppl=148
>    the 2(ii) command gives a lower perplexity.

That is quite odd. What is the perplexity for 1(i) and 1(ii)?

andreas

> Could you please tell me why the second one gives a lower perplexity?
> thanks
> akmal

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From stolcke at speech.sri.com Fri Aug 28 18:04:55 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 28 Aug 2009 18:04:55 -0700
Subject: [SRILM User List] different perplexity
In-Reply-To: <477380.80961.qm@web38002.mail.mud.yahoo.com>
References: <553811.97317.qm@web38005.mail.mud.yahoo.com> <4A981661.9050306@speech.sri.com> <309353.14233.qm@web38002.mail.mud.yahoo.com> <4A984434.9090909@speech.sri.com> <477380.80961.qm@web38002.mail.mud.yahoo.com>
Message-ID: <4A987EB7.5000003@speech.sri.com>

Md. Akmal Haidar wrote:
> the perplexity for 1(i)=450, 1(ii)=450. both are same
>
> by the way, some back-off weights for l2.lm are greater than 1.

My guess would be that l2.lm is not properly normalized. Try running it with ngram -debug 3 -ppl on some test data. When you interpolate with -bayes 0, no normalization is applied to the resulting model (it should be automatically normalized assuming the component models are normalized), so the resulting model will also be unnormalized and give a bogus low perplexity.
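The effect described above can be sketched with toy numbers. This is only an illustration under stated assumptions: the tables below are hypothetical unigram "models", not SRILM LMs, and the helper names mix and perplexity are made up for this sketch. The weight lam plays the role of -lambda, i.e. the weight of the first (main) model. Mixing two properly normalized tables gives a normalized table, while mixing in a table whose entries sum to more than 1 gives an unnormalized mixture with an artificially low perplexity.

```python
import math

def mix(p1, p2, lam):
    """Classical linear interpolation; lam is the weight of the first model."""
    return {w: lam * p1[w] + (1.0 - lam) * p2[w] for w in p1}

def perplexity(model, words):
    """Perplexity of a toy unigram table over a list of test words."""
    log10_sum = sum(math.log10(model[w]) for w in words)
    return 10.0 ** (-log10_sum / len(words))

# Hypothetical toy models; none of these numbers come from the thread.
general = {"a": 0.5, "b": 0.3, "c": 0.2}   # sums to 1
good = {"a": 0.2, "b": 0.5, "c": 0.3}      # sums to 1
broken = {"a": 0.4, "b": 0.9, "c": 0.5}    # sums to 1.8, so not a distribution

test_words = ["a", "b", "b", "c"]

results = {}
for name, other in (("good", good), ("broken", broken)):
    mixed = mix(general, other, 0.5)
    # Store (total probability mass, perplexity) for each mixture.
    results[name] = (sum(mixed.values()), perplexity(mixed, test_words))
    print(name, "total mass:", round(results[name][0], 3),
          "ppl:", round(results[name][1], 3))
```

In this sketch the "broken" mixture carries more than one unit of probability mass, which is why its perplexity comes out lower than that of the "good" mixture without the model being any better. With lam set to 0 the mixture reduces to the second table alone, which matches the expectation that -lambda 0 with exact interpolation reproduces the perplexity of the -mix-lm model.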
Andreas

From shl.thcn at yahoo.com.cn Fri Aug 28 22:21:25 2009
From: shl.thcn at yahoo.com.cn (海龙 史)
Date: Fri, 28 Aug 2009 22:21:25 -0700 (PDT)
Subject: [SRILM User List] Re: A confusion of the interpolated language model
In-Reply-To: <42586A5A-C583-4404-9018-EF2C0193C5F9@lium.univ-lemans.fr>
References: <942355.92368.qm@web15307.mail.cnb.yahoo.com> <42586A5A-C583-4404-9018-EF2C0193C5F9@lium.univ-lemans.fr>
Message-ID: <549193.81220.qm@web15303.mail.cnb.yahoo.com>

Hi,
Thanks for your concern! I do know that a back-off weight is not a probability, but in the interpolated mod-kn smoothing method, bows are not supposed to be greater than 1. In the SRILM man page ngram-discount.7.html, I found this:

For back-off smoothing, there is

(1) p(a_z) = (c(a_z) > 0) ?
f(a_z) : bow(a_) p(_z)

where f(a_z) depends on the smoothing method and bow(a_) is calculated as follows:

    Sum_Z p(a_z) = 1
    Sum_Z1 f(a_z) + Sum_Z0 bow(a_) p(_z) = 1

(2) bow(a_) = (1 - Sum_Z1 f(a_z)) / Sum_Z0 p(_z)
            = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 p(_z))
            = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))

but for interpolated smoothing, there is

(3) f(a_z) = g(a_z) + bow(a_) p(_z)
(4) p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)

and

    Sum_Z p(a_z) = 1
    Sum_Z1 g(a_z) + Sum_Z bow(a_) p(_z) = 1

(5) bow(a_) = 1 - Sum_Z1 g(a_z)

(where Z is the set of all words in the vocabulary, Z0 is the set of all words with c(a_z) = 0, and Z1 is the set of all words with c(a_z) > 0)

However, in the SRILM source code it seems that the interpolated bows are calculated using (5), then the probs and bows are transferred into a back-off model using (3), and then the back-off version of the bows is recomputed using (2). I just don't understand why SRILM does not use the bow calculated using (5) directly.

Besides, I previously used the entropy-pruning method to construct a language model:

~ngram-count -read merge_counts_1994-2003.gz -gt1min 0 -gt2min 0 -gt3min 0 -kndiscount -interpolate -prune 0.000000001 -order 3 -vocab ChWord.lexno -lm 1994-2003_lm_pruned1e-9.lm

and there was definitely no bow greater than 1. So this problem is weird, and I wonder if any of you knows the cause. Also, was the command I used to build the mod-kn discount language model (where I want to exclude the 3-grams with a count of 1) correct?

~ ngram-count -read merge_counts_1994-2003.gz -gt1min 0 -gt2min 0 -gt3min 2 -kndiscount -interpolate -order 3 -vocab ChWord.lexno -lm 1994-2003_lm_all_pruned.lm

Thank you very much!

Hailoon Shi
w63,EE Dpt.Tinghua.Unv.Beijing.China

________________________________
From: Yannick Estève
To: 海龙 史
Cc: srilm-user at speech.sri.com
Sent: 2009/8/27, 4:19:44
Subject: Re: [SRILM User List] A confusion of the interpolated language model

Hi,

Back-off weights are not probabilities: they can be greater than 1.
So, your values are normal.
You can find some explanations of back-off weight computation here, particularly for the modified Kneser-Ney discounting method:
http://www.speech.sri.com/projects/srilm/manpages/pdfs/chen-goodman-tr-10-98.pdf

Regards,

Yannick Estève
LIUM - University of Le Mans
France

On 27 August 2009 at 09:21, 海龙 史 wrote:

> I am a new student user of SRILM from Asia. I used the command below to construct an interpolated mod-kn discount language model:
> ~ ngram-count -read merge_counts_1994-2003.gz -gt1min 0 -gt2min 0 -gt3min 2 -kndiscount -interpolate -order 3 -vocab ChWord.lexno -lm 1994-2003_lm_all_pruned.lm
>
> However, in my model the back-off weight (bow) of several N-grams appears to be greater than 1. That is, in the text LM file, I've got a line:
> -6.457229 1635 0.1270406
> (Here we just use a kind of index to represent a Chinese word)
> in which the log10(bow) is greater than 0. We don't think a normal interpolated discount method can produce an N-gram bow greater than 1; besides, this only occurred for a few (fewer than 5) different N-grams. So I am confused and would like to ask if anyone has encountered this or happens to know what is wrong.
> Thank you very much!
>
> Hailoon Shi
> w63,EE Dpt.Thu Univ.PRC
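As a small worked example of the point made in this thread, equation (2) from the ngram-discount man page can be evaluated on toy numbers. This is a sketch under stated assumptions: the three-word vocabulary, all probabilities, and the function name backoff_weight are hypothetical and are not taken from any real model or from the SRILM source. It shows that a bow greater than 1 is compatible with a properly normalized distribution, which happens whenever the higher-order estimates for the seen words have been discounted below their lower-order probabilities.

```python
def backoff_weight(f_high, p_low):
    """Equation (2): bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 p(_z)).

    f_high: f(a_z) for the words z seen after context a (the set Z1).
    p_low:  lower-order probabilities p(_z) for the whole vocabulary Z.
    """
    left_over = 1.0 - sum(f_high.values())           # mass not used by seen words
    uncovered = 1.0 - sum(p_low[z] for z in f_high)  # lower-order mass of unseen words
    return left_over / uncovered

# Hypothetical numbers chosen only for illustration.
p_low = {"x": 0.5, "y": 0.3, "z": 0.2}   # lower-order distribution, sums to 1
f_high = {"x": 0.3, "y": 0.1}            # discounted estimates for seen words

bow = backoff_weight(f_high, p_low)

# Rebuild p(a_z) with equation (1): seen words use f, unseen words back off.
p_high = {z: f_high[z] if z in f_high else bow * p_low[z] for z in p_low}
total = sum(p_high.values())

print("bow:", round(bow, 3), "total mass:", round(total, 3))
```

Here the seen words keep 0.4 of the mass in context a although they hold 0.8 of the lower-order mass, so the remaining 0.6 has to be spread over unseen words whose lower-order mass is only 0.2. The weight therefore comes out at about 3, and the resulting distribution still sums to 1.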
From fsanchez at dlsi.ua.es Mon Aug 31 00:43:32 2009
From: fsanchez at dlsi.ua.es (Felipe Sánchez Martínez)
Date: Mon, 31 Aug 2009 09:43:32 +0200
Subject: [SRILM User List] Lattice Viterbi decoding
Message-ID: <1251704612.8359.2.camel@pipe>

Hi all,

I am using SRILM to score a set of translation candidates of a given sentence. The sentence is divided into chunks, some of them having a fixed translation and others having different alternatives:

text1 | text2 | text3.1 or text3.2 | text4 | text5.1 or text5.2

As the number of combinations is exponential in the length of the sentence, I have been trying to use lattice-tool to compute the Viterbi path, but I am not able to make it work. I am using the following command line:

$ lattice-tool -viterbi-decode -in-lattice lattice.pfsg -lm model.lm -order 5 -debug 1

but I get exactly the same result with an n-gram order of 5, 3 or even 0. In addition, for the example sentence I am working with, I get a different path if I use SRILM in the usual way by scoring all possible translations of the sentence.

What am I doing wrong?

Thank you very much in advance.

PS: I am using srilm 1.5.7
--
Felipe

From beleira at gmail.com Tue Sep 8 10:34:31 2009
From: beleira at gmail.com (Manuel Alves)
Date: Tue, 8 Sep 2009 18:34:31 +0100
Subject: [SRILM User List] Ngram Command
Message-ID: <495c9ccd0909081034j60f99470lb1cbd97dcfe1534a@mail.gmail.com>

Thanks Andreas.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From stolcke at speech.sri.com Thu Sep 10 08:25:54 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 10 Sep 2009 08:25:54 -0700
Subject: [SRILM User List] Question on SRILM Toolkit
In-Reply-To:
References:
Message-ID: <4AA91A82.50703@speech.sri.com>

Saeedeh Momtazi wrote:
> Dear Andreas Stolcke,
>
> I, Saeedeh Momtazi, have been using the SRILM toolkit for a while. The main part that I use from this toolkit is "ngram-class". So far, I had no problem with this toolkit.
> However, recently I tried to cluster the terms that I have based on a count file which is about 6 GB. I got an error message that I copy here:
>
> ngram-class: ../../include/LHash.cc:138: void LHash<KeyT, DataT>::alloc(unsigned int) [with KeyT = unsigned int, DataT = Trie]: Assertion `body != 0' failed.
> /var/torque/mom_priv/jobs/53195.maste.SC : line 39: 25464 Aborted

You are simply running out of memory. You need more memory or swap space, and probably you need to switch to a 64bit machine. However, first you should make sure to use the memory-optimized version of the tools (compiled with make OPTION=_c).

You can always sample your data, or simply prune the count file by eliminating low-count ngrams. This might not change your results much. When inducing word classes, the words with low counts are not handled robustly anyway. I found it best to replace all words with low counts with an "Infrequent word" class label ahead of time. As a by-product, this will dramatically reduce the number of distinct bigrams because most of the bigrams involve rare words (Zipf's law etc.).

Andreas

> I would appreciate it if you could let me know how I can solve this problem.
> To be more precise, my vocabulary is about 35000 words and I want to cluster them into 3000 classes. The input items that I use when calling ngram-class are the vocab file (-vocab), the count file (-counts) and the number of classes (-numclasses). The only output that I need is a mapping between words and classes (-classes).
>
> Looking forward to hearing from you.
>
> Thanks in advance,
> Saeedeh Momtazi

From heintz.38 at osu.edu Sun Sep 20 12:03:55 2009
From: heintz.38 at osu.edu (Ilana Heintz)
Date: Sun, 20 Sep 2009 15:03:55 -0400 (EDT)
Subject: [SRILM User List] vocab size from make-batch-counts
Message-ID:

Hello,

I am wondering about what type of ngram pruning is done in the make-batch-counts training script, and if it can be handled with flags.
I've looked through the code and man pages but I'm not sure whether I can pass the right argument. I discovered that the pruning happens because, when I vary the batch size, the resulting vocabulary size changes. For instance, on a small development corpus:

> make-batch-counts files.list 10 xmlfilter.sh counts_10perbatch
> merge-batch-counts counts_10perbatch
> ngram-count -read counts_10perbatch/files.list-1.ngrams.gz -write-vocab 10perbatch.vocab
> wc 10perbatch.vocab
2763 2763 32999 10perbatch.vocab

> make-batch-counts files.list 1 xmlfilter.sh counts_1perbatch
> merge-batch-counts counts_1perbatch
> ngram-count -read counts_1perbatch/merge-iter2-1.ngrams.gz -write-vocab 1perbatch.vocab
> wc 1perbatch.vocab
5923 5923 72237 1perbatch.vocab

I get the same sort of result when I use a larger corpus or other batch sizes; the vocab decreases with an increase in the size of the batch. I have tried experimenting with -gtmin to change the output, without success. I'm confused as to why batch size would make a difference here.

I am using version 1.5.5.

Thanks,
Ilana

Ilana Heintz
Department of Linguistics
Ohio State University
http://www.ling.ohio-state.edu/~bromberg

From stolcke at speech.sri.com Mon Sep 21 01:07:36 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 21 Sep 2009 01:07:36 -0700
Subject: [SRILM User List] vocab size from make-batch-counts
In-Reply-To:
References:
Message-ID: <4AB73448.6010902@speech.sri.com>

Ilana Heintz wrote:
> Hello,
>
> I am wondering about what type of ngram pruning is done in the make-batch-counts training script, and if it can be handled with flags. I've looked through the code and man pages but I'm not sure whether I can pass the right argument. I discovered that the pruning happens because, when I vary the batch size, the resulting vocabulary size changes.
> For instance, on a small development corpus:
>
>> make-batch-counts files.list 10 xmlfilter.sh counts_10perbatch
>> merge-batch-counts counts_10perbatch
>> ngram-count -read counts_10perbatch/files.list-1.ngrams.gz -write-vocab 10perbatch.vocab

What you are doing is not working as intended. make-batch-counts passes the -write-vocab option to ngram-count, but each ngram-count invocation will dump only the vocabulary of the batch it is seeing (hence the result you observed). To get the combined vocab of your data, run

    ngram-count -order 1 -read COUNTS -write-vocab VOCAB

on the final count file.

Andreas

From sylvain.raybaud at crans.org Mon Sep 21 05:43:16 2009
From: sylvain.raybaud at crans.org (Sylvain Raybaud)
Date: Mon, 21 Sep 2009 14:43:16 +0200
Subject: [SRILM User List] problem getting srilm
Message-ID: <200909211443.16924.sylvain.raybaud@crans.org>

Hello everyone

Getting an archive of srilm toolkit from the website has been near-impossible for me for some months now...
After I fill in the form at http://www.speech.sri.com/projects/srilm/download.html and accept the license, the download begins but is incredibly slow (throughput varying between 0 and 1 kilobyte per second). After a few hours I get a message saying that the connection was closed or something like that. Did I miss something?

thanks a lot.

regards,
--
Sylvain Raybaud
LORIA/Nancy/France

From stolcke at speech.sri.com Mon Sep 21 07:49:22 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 21 Sep 2009 07:49:22 -0700
Subject: [SRILM User List] problem getting srilm
In-Reply-To: <200909211443.16924.sylvain.raybaud@crans.org>
References: <200909211443.16924.sylvain.raybaud@crans.org>
Message-ID: <4AB79272.6000102@speech.sri.com>

Sylvain Raybaud wrote:
> Hello everyone
>
> Getting an archive of srilm toolkit from the website has been near-impossible for me for some months now... After I fill in the form at http://www.speech.sri.com/projects/srilm/download.html and accept the license the download begins but is incredibly slow (throughput varying between 0 and 1 kilo bytes per second). After a few hours I get a message saying that the connection was closed or something like that. Did I miss something? thanks a lot.

I'm sure I would have heard lots of complaints if this was a general issue. Also, I just downloaded the package from my home cable connection, at about 300kB/s. So I would first investigate possible issues with your internet connection.
A good site for measuring network bandwidth is http://www.speakeasy.net/speedtest/ .

Andreas

From sylvain.raybaud at crans.org Tue Sep 22 01:45:09 2009
From: sylvain.raybaud at crans.org (Sylvain Raybaud)
Date: Tue, 22 Sep 2009 10:45:09 +0200
Subject: [SRILM User List] problem getting srilm
In-Reply-To: <4AB79272.6000102@speech.sri.com>
References: <200909211443.16924.sylvain.raybaud@crans.org> <4AB79272.6000102@speech.sri.com>
Message-ID: <200909221045.09723.sylvain.raybaud@crans.org>

On Monday 21 September 2009 16:49:22 Andreas Stolcke wrote:
> I'm sure I would have heard lots of complaints if this was a general issue. Also, I just downloaded the package from my home cable connection, at about 300kB/s. So I would first investigate possible issues with your internet connection.
> A good site to for measuring network bandwidth is http://www.speakeasy.net/speedtest/ .
>
> Andreas

Hello

You are right, it seems that the problem only occurs when attempting to download from my institute... puzzling... maybe it is because we use ipv6? I've encountered problems with ipv6-enabled servers before.
Anyway, the problem is probably not related to sri, sorry about the fuss for nothing :)

thanks a lot, keep up the good work,
--
Sylvain

From paul.a.johnston at manchester.ac.uk Wed Sep 23 00:52:32 2009
From: paul.a.johnston at manchester.ac.uk (Paul Johnston)
Date: Wed, 23 Sep 2009 08:52:32 +0100
Subject: [SRILM User List] Compiling srilm
Message-ID:

Hi,

on my first attempt to build srilm version 1.5.9 I get lots of messages like

gcc -mtune=pentium3 -Wreturn-type -Wimplicit -Wimplicit-int -D_FILE_OFFSET_BITS=64 -I. -I../../include -c -g -O3 -o ../obj/i686/option.o option.c
option.c:1: error: CPU you selected does not support x86-64 instruction set

Therefore make World fails. Looking into it, the return value of machine-type is

>/home/CO/mcasspj/srilm_dir/sbin/machine-type
i686

And the actual machine is

uname -a
Linux servalan.humanities.manchester.ac.uk 2.6.18-128.el5 #1 SMP Wed Dec 17 11:41:38 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

More specifically

cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.3 (Tikanga)

Has anyone seen this, and does anyone have ideas for solving the problem?

As I intend to build a machine dedicated to this system, can anyone recommend a system they use, i.e. Solaris, any of the BSDs or a variety of Linux, just not Windows or a Mac :-)

Many thanks!

Paul Johnston
Humanities ICT (Infrastructure)
Samuel Alexander Building Room W1.19
e-mail Paul.Johnston at manchester.ac.uk
web http://web-1.humanities.manchester.ac.uk/prjs/mcasspj/

Tuzoqlar granatalardan yuksak darajali portlovchi moddalardan yoki bosshqa narslardan qilingan?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sylvain.raybaud at crans.org Wed Sep 23 01:15:43 2009
From: sylvain.raybaud at crans.org (Sylvain Raybaud)
Date: Wed, 23 Sep 2009 10:15:43 +0200
Subject: [SRILM User List] Compiling srilm
In-Reply-To:
References:
Message-ID: <200909231015.43259.sylvain.raybaud@crans.org>

On Wednesday 23 September 2009 09:52:32 Paul Johnston wrote:
> Hi on my first attempt to build srilm version 1.5.9
> I get lots of messages like
>
> gcc -mtune=pentium3 -Wreturn-type -Wimplicit -Wimplicit-int -D_FILE_OFFSET_BITS=64 -I. -I../../include -c -g -O3 -o ../obj/i686/option.o option.c
> option.c:1: error: CPU you selected does not support x86-64 instruction set
>
> Therefore make World fails, looking into it, the return value of machine-type is
> i686

Hello

I modified sbin/machine-type and added a common/Makefile.machine.x86_64-gcc4 so that it compiles for 64 bits. It works on my machine, no guarantee it will work anywhere else... You will find them attached. Hope this helps.

regards,
--
Sylvain

A: You see !
> Q: You think ?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting annoying in email?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: machine-type
Type: application/x-shellscript
Size: 4126 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Makefile.machine.x86_64-gcc4
Type: text/x-makefile
Size: 1744 bytes
Desc: not available
URL:

From stolcke at speech.sri.com Wed Sep 23 09:56:49 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 23 Sep 2009 09:56:49 -0700
Subject: [SRILM User List] Compiling srilm
In-Reply-To: <200909231015.43259.sylvain.raybaud@crans.org>
References: <200909231015.43259.sylvain.raybaud@crans.org>
Message-ID: <4ABA5351.50403@speech.sri.com>

Sylvain Raybaud wrote:
> Hello
>
> I modified sbin/machine-type and added a common/Makefile.machine.x86_64-gcc4 so that it compiles for 64 bits. It works on my machine, no guarantee it will work anywhere else... You will find them attached. Hope this helps.

The idea is that the default even for 64bit i686 machines is 32bit compilation. That's why machine-type returns i686 by default. I think the problem you saw can be avoided by adding -m32 in common/Makefile.machine.i686.

To build 64bit binaries use

    make MACHINE_TYPE=i686-m64

Andreas