From kutlak.roman at gmail.com  Mon Apr  2 12:50:19 2012
From: kutlak.roman at gmail.com (Kutlak Roman)
Date: Mon, 2 Apr 2012 20:50:19 +0100
Subject: [SRILM User List] SRILM source code
Message-ID:

Hello Andreas,

I was compiling SRILM using clang and libc++ (C++11) and got an error when
compiling LatticeIndex.cc on line 78. The problem is that makeArray tries to
create a variable-length array, and the compiler does not support this.
Could you please add an appropriate exception for clang in Array.h, line 92?
Or, perhaps better yet, just define it for all compilers as
StaticArray<T> A(n).

In case someone else tries to do the same, there is one more problem. For
some reason clang does not like the call to bind() in LM.cc, line 876 (or
thereabouts). This may be an error on the part of the library implementer,
since bind() is supposed to return int, but clang did not like it. I just
removed the error checking and did not use the return value, and that works.

Kind regards,
Roman

From stolcke at icsi.berkeley.edu  Mon Apr  2 15:08:54 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 02 Apr 2012 15:08:54 -0700
Subject: [SRILM User List] Question of replace-words-with-classes
Message-ID: <4F7A2376.1030807@icsi.berkeley.edu>

On 3/31/2012 8:00 PM, Meng Chen wrote:
> Hi, I ran into a question when training a class-based language model with
> the replace-words-with-classes command. My commands are as follows:
>
>  * ngram-class -vocab wlist -text training_set -numclasses 200
>    -incremental -classes output.classes
>  * replace-words-with-classes classes=output.classes training_set
>    > training_set_classes
>
> After these two steps, I found that there are both words and classes in
> training_set_classes. These words are OOVs in wlist; however, I don't need
> them at all. Shouldn't these words belong to CLASS-00001? So I wonder how
> to handle this situation. Does SRILM provide a script to map these OOVs to
> CLASS-00001, or do I need to write one myself?

It must be the case that wlist does not contain all the words in
training_set, and therefore output.classes does not cover the entire
vocabulary. In that case replace-words-with-classes will only operate on
words contained in the class definitions.

You can easily augment the class definitions to add an extra class that
catches all your OOV words. The format should be self-explanatory, or check
the classes-format(5) man page.

Andreas
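For illustration, a minimal sketch of the catch-all class described above.
The class label CLASS-OOV and the file names are hypothetical; the appended
lines follow classes-format(5), one expansion (here, one word) per line:

    # words that occur in the training data but not in wlist:
    tr -s ' ' '\n' < training_set | sort -u > train.words
    sort -u wlist > wlist.sorted
    comm -23 train.words wlist.sorted > oov.words
    # append one expansion line per OOV word to the class definitions:
    awk '{print "CLASS-OOV", $1}' oov.words >> output.classes
    replace-words-with-classes classes=output.classes training_set \
        > training_set_classes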
From chenmengdx at gmail.com  Fri Apr  6 09:01:43 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Sat, 7 Apr 2012 00:01:43 +0800
Subject: [SRILM User List] How to interpolate two class-based language models
Message-ID:

Hi, I have a question about interpolating two class-based language models.
Suppose I have two class-based language models trained from two different
corpora, and each class-based LM has its own class definition file. For
example, the class definition file for class-lm1 is lm1.classes, and
lm2.classes for class-lm2. So my question is: how do I interpolate these two
different class-based language models? Can you give me the steps, preferably
with commands?

 - Do I need to use the -classes option when interpolating them?
 - Do I need to use the -bayes 0 option to interpolate them dynamically?

I am also confused about the class expansion operation. If I expand the
class-based language model into a word-based language model, does the
perplexity change on the same test set?

Thanks!
Meng Chen

From stolcke at icsi.berkeley.edu  Fri Apr  6 11:09:38 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 06 Apr 2012 11:09:38 -0700
Subject: [SRILM User List] How to interpolate two class-based language models
Message-ID: <4F7F3162.1030105@icsi.berkeley.edu>

On 4/6/2012 9:01 AM, Meng Chen wrote:
> Hi, I have a question about interpolating two class-based language models.
> [...]
> * Do I need to use the -classes option when interpolating them?

You need to merge the class definitions for both LMs, making sure that there
are no name conflicts. If necessary, rename class labels CLASS01234 to
LM1_CLASS01234 etc., in both the LM and the class definition files; then
combine the two class definitions into one file, then interpolate the
models.

> * Do I need to use the -bayes 0 option to interpolate them dynamically?

Yes, you want to use something like

    ngram -lm LM1 -mix-lm LM2 -lambda L \
        -classes MERGED_CLASS_DEFINITIONS -bayes 0

> I am also confused about the class expansion operation. If I expand the
> class-based language model into a word-based language model, does the
> perplexity change on the same test set?

ngram -expand-classes is an approximation, so you won't get exactly the same
ppl, but something close.

Andreas
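A concrete sketch of the recipe above (file names are hypothetical, and the
sed renaming assumes the default CLASS-... labels produced by ngram-class):

    # make the class labels disjoint, in the LMs and class files alike:
    sed 's/CLASS-/LM1CLASS-/g' lm1.classes > lm1.ren.classes
    sed 's/CLASS-/LM1CLASS-/g' class-lm1   > class-lm1.ren
    sed 's/CLASS-/LM2CLASS-/g' lm2.classes > lm2.ren.classes
    sed 's/CLASS-/LM2CLASS-/g' class-lm2   > class-lm2.ren
    cat lm1.ren.classes lm2.ren.classes > merged.classes
    # dynamic interpolation with weight 0.5, e.g. for perplexity:
    ngram -lm class-lm1.ren -mix-lm class-lm2.ren -lambda 0.5 \
        -bayes 0 -classes merged.classes -ppl test.txt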
From stolcke at icsi.berkeley.edu  Sat Apr  7 16:22:27 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sat, 07 Apr 2012 16:22:27 -0700
Subject: [SRILM User List] A question about lattice-tool
In-Reply-To: <77E512AE-098E-49C5-B28F-5E1665A7F720@ccls.columbia.edu>
Message-ID: <4F80CC33.9060502@icsi.berkeley.edu>

On 4/7/2012 3:40 PM, Nizar Habash wrote:
> Hi Andreas, Jing,
>
> We use your lattice-tool for weight assignment and n-best decoding of a
> lattice including simple paraphrases. We're trying to process a huge
> number of sentences (lattices) with this tool. Our understanding is that
> if we want to handle many sentences with one command (loading the LM
> once), we should put each sentence in a file, pass a list of these files
> with the -in-lattice-list option, and expect the output as files in the
> directory passed with the -out-lattice-dir option. If we process millions
> of sentences (i.e. lattices) this way, where they end up being millions of
> files, we're afraid the file system may crash. Is there a way to pass all
> the source lattices to lattice-tool in one file and get the output in one
> file?
>
> Thanks
> Nizar and Wael

Sorry, there is no such provision in lattice-tool. But usually the problem
with flaky filesystems is having too many files within the same directory.
So I would recommend you split your file list into batches of a few thousand
each, and direct the output for each batch into a separate directory. Even
on robust filesystems this is a good idea, because very large directories
have slow access times in many filesystem implementations.

Andreas
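One way to implement the batching suggested above in a shell script (names
and batch size are illustrative; the lattice-tool options are the ones from
the question):

    split -l 5000 all-lattices.list batch.
    for b in batch.*; do
        mkdir -p out.$b
        # add whatever other lattice-tool options you normally use:
        lattice-tool -in-lattice-list $b -out-lattice-dir out.$b
    done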
From bulusheva at speechpro.com  Tue Apr 10 00:21:04 2012
From: bulusheva at speechpro.com (bulusheva)
Date: Tue, 10 Apr 2012 11:21:04 +0400
Subject: [SRILM User List] How does the option "-gtmin" work in ngram-count?
Message-ID: <4F83DF60.2000601@speechpro.com>

Hi, I have two questions:

1. If I generate a language model with Kneser-Ney smoothing (or modified
Kneser-Ney), why does the parameter "-gtnmin" apply to the already modified
counts?

For example, if in the training data the 2-gram "markov model" occurs only
in the context "hidden markov model" and gt2min = 2, then the modified count
for "markov model" = n(* markov model) = 1 < gt2min, and

    prob("markov model") = bow("markov") * prob("model")

instead of

    prob("markov model") = (n(* markov model) - D) / n(* markov *)

2. Let's say I use ngram-count to generate a language model as follows:

    ngram-count -text text.txt -vocab vocab.txt -gt1min 5 -lm sri.lm

Suppose the word "hello" exists in vocab.txt and occurs 4 times in text.txt.
Then the probability of "hello" is calculated as the probability of a
zeroton. Is that correct?

Thanks,
Anna Bulusheva

From saman_2004 at yahoo.com  Tue Apr 10 01:29:37 2012
From: saman_2004 at yahoo.com (Saman Noorzadeh)
Date: Tue, 10 Apr 2012 01:29:37 -0700 (PDT)
Subject: [SRILM User List] how are the probabilities computed in ngram-count
Message-ID: <1334046577.74819.YahooMailNeo@web162005.mail.bf1.yahoo.com>

Hello,
I am getting confused about the models that ngram-count makes:

    ngram-count -order 2 -write-vocab vocabulary.voc -text mytext.txt \
        -write model1.bo
    ngram-count -order 2 -read model1.bo -lm model2.BO

For example (the text is very large and these words are just a sample):

in model1.bo:
    cook 14
    cook was 1

in model2.BO:
    -1.904738 cook was

My question is that the probability of the 'cook was' bigram should be
log10(1/14), but the ngram-count result shows log10(1/80) == -1.9047.
How are these probabilities computed?

From stolcke at icsi.berkeley.edu  Tue Apr 10 16:41:15 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 10 Apr 2012 16:41:15 -0700
Subject: [SRILM User List] How does the option "-gtmin" work in ngram-count?
In-Reply-To: <4F83DF60.2000601@speechpro.com>
Message-ID: <4F84C51B.7010405@icsi.berkeley.edu>

On 4/10/2012 12:21 AM, bulusheva wrote:
> 1. If I generate a language model with Kneser-Ney smoothing (or modified
> Kneser-Ney), why does the parameter "-gtnmin" apply to the already
> modified counts?
> [...]

That's how it is currently implemented. It is debatable how the minimum
count should be applied in the case of the lower-order distributions in KN
models. The way it currently works is natural from an implementation
perspective, because the lower-order counts are physically modified before
the discounting is applied (you can examine them by adding -write COUNTS).
But you are raising a good point. It might make more sense to have the
-gtXmin values be interpreted independently of the discounting method.
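For instance, the modified lower-order counts mentioned above can be dumped
alongside the model like this (file names are illustrative; only the options
themselves appear in ngram-count(1)):

    ngram-count -text train.txt -order 3 -kndiscount -interpolate \
        -write kn-modified.counts -lm kn.lm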
> 2. Let's say I use ngram-count to generate a language model as follows:
>
>     ngram-count -text text.txt -vocab vocab.txt -gt1min 5 -lm sri.lm
>
> Suppose the word "hello" exists in vocab.txt and occurs 4 times in
> text.txt. Then the probability of "hello" is calculated as the
> probability of a zeroton. Is that correct?

That is correct, but the ARPA format doesn't allow you to prune unigrams, so
the unigrams will always appear explicitly listed in the LM, even if their
probabilities might be obtained by backing off to a uniform distribution.

Andreas

From stolcke at icsi.berkeley.edu  Tue Apr 10 16:46:54 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 10 Apr 2012 16:46:54 -0700
Subject: [SRILM User List] how are the probabilities computed in ngram-count
In-Reply-To: <1334046577.74819.YahooMailNeo@web162005.mail.bf1.yahoo.com>
Message-ID: <4F84C66E.9060003@icsi.berkeley.edu>

On 4/10/2012 1:29 AM, Saman Noorzadeh wrote:
> My question is that the probability of the 'cook was' bigram should be
> log10(1/14), but the ngram-count result shows log10(1/80) == -1.9047.
> How are these probabilities computed?
> [...]

It's called "smoothing" or "discounting", and it ensures that ngrams never
seen in the training data receive nonzero probability. Please consult any of
the basic LM tutorial sources listed at
http://www.speech.sri.com/projects/srilm/manpages/, or specifically
http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html .

To obtain the unsmoothed probability estimates that you are expecting, you
need to change the parameters. Try ngram-count -addsmooth 0 ....

Andreas
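Spelled out, the unsmoothed variant differs only in the discounting option
(file names as in the question above):

    # plain maximum-likelihood estimates; no probability mass is
    # reserved for unseen bigrams:
    ngram-count -order 2 -read model1.bo -addsmooth 0 -lm model2.mle.BO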
From saman_2004 at yahoo.com  Wed Apr 11 05:48:54 2012
From: saman_2004 at yahoo.com (Saman Noorzadeh)
Date: Wed, 11 Apr 2012 05:48:54 -0700 (PDT)
Subject: [SRILM User List] how are the probabilities computed in ngram-count
In-Reply-To: <4F84C66E.9060003@icsi.berkeley.edu>
Message-ID: <1334148534.73034.YahooMailNeo@web162006.mail.bf1.yahoo.com>

Thank you, -cdiscount 0 works perfectly. But now that I have read about
smoothing and the different methods of discounting, I have another question,
and I would like to know your thoughts on this problem:

I want to build a model from a text and then predict what the user is typing
(a word prediction approach). At any moment I will predict what the next
character would be according to my bigrams. Do you think the discounting and
smoothing methods are useful in treating the training data, or is it more
appropriate if I just disable them?

Thank you,
Saman

________________________________
From: Andreas Stolcke
To: Saman Noorzadeh
Cc: Srilm group
Sent: Wednesday, April 11, 2012 1:46 AM
Subject: Re: [SRILM User List] how are the probabilities computed in ngram-count

> [earlier message quoted in full]

From stolcke at icsi.berkeley.edu  Wed Apr 11 10:00:59 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 11 Apr 2012 10:00:59 -0700
Subject: [SRILM User List] how are the probabilities computed in ngram-count
In-Reply-To: <1334148534.73034.YahooMailNeo@web162006.mail.bf1.yahoo.com>
Message-ID: <4F85B8CB.5010701@icsi.berkeley.edu>

On 4/11/2012 5:48 AM, Saman Noorzadeh wrote:
> I want to build a model from a text and then predict what the user is
> typing (a word prediction approach).
> [...]

It probably won't make a difference, because in an application like this you
are interested in finding the most probable next tokens, and smoothing helps
you with the least probable tokens.

However, this type of LM application has been studied extensively, and you
should look online at what others have done. Try
http://scholar.google.com/scholar?q=character+prediction+typing&hl=en&btnG=Search&as_sdt=1%2C5&as_sdtp=on

Andreas

From julia_hancke at yahoo.com  Thu Apr 12 01:42:08 2012
From: julia_hancke at yahoo.com (Julia Hancke)
Date: Thu, 12 Apr 2012 01:42:08 -0700 (PDT)
Subject: [SRILM User List] all tests differ but only minimally
Message-ID: <1334220128.28413.YahooMailNeo@web113503.mail.gq1.yahoo.com>

Hi,

I'm new to SRILM. After installation I ran the tests, and the stdout and
stderr differed for all files. I checked a few files with the Unix diff
command and found that some numbers are off by small amounts, like this:

    < 0 zeroprobs, logprob= -4.23739 ppl= 25.8501 ppl1= 131.43
    > 0 zeroprobs, logprob= -4.23739 ppl= 25.8502 ppl1= 131.43

Is this normal? What else could I do, or should I just live with it? This is
my machine: i686 athlon i386 GNU/Linux. Any help is appreciated.

Regards,
Julia

From stolcke at icsi.berkeley.edu  Thu Apr 12 11:11:39 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 12 Apr 2012 11:11:39 -0700
Subject: [SRILM User List] all tests differ but only minimally
In-Reply-To: <1334220128.28413.YahooMailNeo@web113503.mail.gq1.yahoo.com>
Message-ID: <4F871ADB.9050001@icsi.berkeley.edu>

On 4/12/2012 1:42 AM, Julia Hancke wrote:
> Is this normal? What else could I do, or should I just live with it?
> [...]

Yes, this is normal, and shouldn't cause the tests to fail. The outputs of
test runs are not compared using diff. Instead, the script
$SRILM/sbin/compare-outputs is used. Try

    compare-outputs REFERENCE OUTPUT

and that should not complain about small numerical differences such as the
ones you show above. If it does, there is some kind of bug.

Andreas
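To re-check a single test result by hand, something along these lines should
work (the file locations are illustrative guesses; only the compare-outputs
invocation itself is from the message above):

    cd $SRILM/test
    $SRILM/sbin/compare-outputs reference/ngram-count-gt.stdout \
        outputs/ngram-count-gt.stdout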
From julia_hancke at yahoo.com  Thu Apr 12 12:07:18 2012
From: julia_hancke at yahoo.com (Julia Hancke)
Date: Thu, 12 Apr 2012 12:07:18 -0700 (PDT)
Subject: [SRILM User List] all tests differ but only minimally
In-Reply-To: <4F871ADB.9050001@icsi.berkeley.edu>
Message-ID: <1334257638.49001.YahooMailNeo@web113503.mail.gq1.yahoo.com>

Hi Andreas,

thank you very much! The problem was this: I had run "make test", which gave
me a DIFFER for every test. Then I used the command line diff to take a look
at things. After you explained which script should be used, I took a look at
it and noticed that it uses gawk; trying to run it, I found out that I had
neglected to install gawk. Running the tests again after installing gawk
gives me IDENTICAL everywhere :)

Regards,
Julia

From: Andreas Stolcke
Sent: Thursday, April 12, 2012 8:11 PM
Subject: Re: [SRILM User List] all tests differ but only minimally

> [earlier message quoted in full]

From nshmyrev at yandex.ru  Sun Apr 22 07:46:13 2012
From: nshmyrev at yandex.ru (Nickolay V. Shmyrev)
Date: Sun, 22 Apr 2012 17:46:13 +0300
Subject: [SRILM User List] Lattice tool crash
Message-ID: <1335105973.5055.12.camel@localhost.localdomain>

Hello,

lattice-tool crashes when trying to read the attached lattice with

    lattice-tool -in-lattice problem.lat -read-htk

The problem is that it fails to find the final node, so finalNode->word on
line 1218 in the HTK lattice code causes an access violation.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: problem.lat.gz
Type: application/x-gzip
Size: 689 bytes

From stolcke at icsi.berkeley.edu  Mon Apr 23 10:36:28 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 23 Apr 2012 10:36:28 -0700
Subject: [SRILM User List] Lattice tool crash
In-Reply-To: <1335105973.5055.12.camel@localhost.localdomain>
Message-ID: <4F95931C.4060803@icsi.berkeley.edu>

On 4/22/2012 7:46 AM, Nickolay V. Shmyrev wrote:
> The problem is that it fails to find the final node.
> [...]

Indeed. Your lattice is missing an end= specification, and the heuristic
used to determine a final node in this case fails (because your de-facto
final node has a self-loop). The same could happen with the start node, if
undefined.

The attached patch will issue an error in these cases. However, you should
still fix your input lattices (include start/end attributes).

Andreas

-------------- next part --------------
Index: lattice/src/HTKLattice.cc
===================================================================
RCS file: /home/srilm/CVS/srilm/lattice/src/HTKLattice.cc,v
retrieving revision 1.53
diff -c -r1.53 HTKLattice.cc
*** lattice/src/HTKLattice.cc	24 Dec 2011 05:25:54 -0000	1.53
--- lattice/src/HTKLattice.cc	23 Apr 2012 17:27:32 -0000
***************
*** 1164,1169 ****
--- 1164,1171 ----
      // search for start node: the one without incoming transitions
      LHashIter nodeIter(nodes);
      NodeIndex nodeIndex;
+ 
+     initialNode = 0;
      while (LatticeNode *node = nodeIter.next(nodeIndex)) {
  	if (node->inTransitions.numEntries() == 0) {
  	    initial = nodeIndex;
***************
*** 1172,1177 ****
--- 1174,1185 ----
  	}
      }
  
+     if (!initialNode) {
+ 	file.position() << "could not find start node\n";
+ 	if (!useNullNodes) vocab.remove(HTKNodeDummy);
+ 	return false;
+     }
+ 
      // now find the HTK node info associated with first node
      LHashIter nodeMapIter(nodeMap);
      unsigned htkNode;
***************
*** 1213,1218 ****
--- 1221,1228 ----
      // search for end node: the one without outgoing transitions
      LHashIter nodeIter(nodes);
      NodeIndex nodeIndex;
+ 
+     finalNode = 0;
      while (LatticeNode *node = nodeIter.next(nodeIndex)) {
  	if (node->outTransitions.numEntries() == 0) {
  	    final = nodeIndex;
***************
*** 1221,1226 ****
--- 1231,1242 ----
  	}
      }
  
+     if (!finalNode) {
+ 	file.position() << "could not find final node\n";
+ 	if (!useNullNodes) vocab.remove(HTKNodeDummy);
+ 	return false;
+     }
+ 
      // now find the HTK node info associated with final node
      LHashIter nodeMapIter(nodeMap);
      unsigned htkNode;
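For reference, a well-formed HTK SLF lattice names its endpoints in the
header, so the heuristic is never needed. A toy sketch (all values
illustrative, from the standard SLF conventions rather than from this
thread):

    VERSION=1.0
    N=3 L=2
    start=0 end=2
    I=0  t=0.00  W=<s>
    I=1  t=0.10  W=hello
    I=2  t=0.50  W=</s>
    J=0  S=0  E=1  a=-120.3  l=-1.2
    J=1  S=1  E=2  a=-87.6   l=-0.9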
From jonathanmcchow at hotmail.com  Tue Apr 24 15:27:24 2012
From: jonathanmcchow at hotmail.com (Chow Jonathan)
Date: Wed, 25 Apr 2012 06:27:24 +0800
Subject: [SRILM User List] Installing SRILM
Message-ID:

Hi,

I was trying to compile SRILM on Mac OS X Lion 10.7. It kept throwing me the
error

    LatticeIndex.cc:78:6: error: variable length array of non-POD element
    type 'NBestWordInfo'
        makeArray(NBestWordInfo, roundedNgram, len + 1);

Can anyone help me with this please?

Thanks,
Jonathan

From stolcke at icsi.berkeley.edu  Tue Apr 24 20:33:53 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 24 Apr 2012 20:33:53 -0700
Subject: [SRILM User List] Installing SRILM
Message-ID: <4F9770A1.90705@icsi.berkeley.edu>

On 4/24/2012 3:27 PM, Chow Jonathan wrote:
> LatticeIndex.cc:78:6: error: variable length array of non-POD element
> type 'NBestWordInfo'
> [...]

In case you are using the clang compiler, the following change (from
kutlak.roman at gmail.com) should fix your problem. In dstruct/src/Array.h,
replace the line

    #if !defined(DEBUG) && defined(__GNUC__) && (!defined(__INTEL_COMPILER) || __INTEL_COMPILER >=900)

with

    #if !defined(DEBUG) && defined(__GNUC__) && !defined(__clang__) && (!defined(__INTEL_COMPILER) || __INTEL_COMPILER >=900)

Andreas
From mvp-songyoung at 163.com  Mon May 14 01:41:51 2012
From: mvp-songyoung at 163.com (mvp-songyoung)
Date: Mon, 14 May 2012 16:41:51 +0800 (CST)
Subject: [SRILM User List] How to decode with an interpolated class-based LM with lattice-tool
Message-ID: <660b26a4.15837.1374a82bf15.Coremail.mvp-songyoung@163.com>

Hi, I have a question about lattice rescoring with an interpolated
class-based LM with lattice-tool. This class-based LM was trained by
interpolating three other class-based LMs: LM1 contains 3500 words merged
into 350 classes; LM2 contains 2500 words merged into 250 classes; LM3
contains 110 words merged into 10 classes. I renamed the class definitions
for the three class-based LMs before training and interpolating them, and I
also merged the class definitions into a single file before decoding. My
decoding command is as follows:

    lattice-tool -read-htk -viterbi-decode -order 4 -lm class-4gram.lm \
        -classes -in-lattice-list lattice.scp \
        -htk-wdpenalty $PENALTY -htk-lmscale $LMSCALE

But I found that the decoding process was very slow and memory consuming. I
wonder why this happens and how to handle it. Did I do any of the steps
incorrectly? Please tell me the right steps. Thank you.

From stolcke at icsi.berkeley.edu  Mon May 14 10:01:28 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 14 May 2012 13:01:28 -0400
Subject: [SRILM User List] How to decode with an interpolated class-based LM with lattice-tool
In-Reply-To: <660b26a4.15837.1374a82bf15.Coremail.mvp-songyoung@163.com>
Message-ID: <4FB13A68.80404@icsi.berkeley.edu>

On 5/14/2012 4:41 AM, mvp-songyoung wrote:
> But I found that the decoding process was very slow and memory consuming.
> [...]

The -classes option leads to an LM that no longer uses only a finite history
to evaluate the probability of the next word. This means that during lattice
expansion all histories need to be kept distinct.

You should try the -simple-classes option, assuming your models satisfy its
requirements:

  -classes file
      Interpret the LM as an N-gram over word classes. The expansions of the
      classes are given in file in classes-format(5). Tokens in the LM that
      are not defined as classes in file are assumed to be plain words, so
      that the LM can contain mixed N-grams over both words and word
      classes.

  -simple-classes
      Assume a "simple" class model: each word is a member of at most one
      word class, and class expansions are exactly one word long.

Hope this helps,
Andreas
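Put together, the suggested invocation would look roughly like this (the
merged.classes name is hypothetical; the remaining options are from the
original command):

    lattice-tool -read-htk -in-lattice-list lattice.scp \
        -order 4 -lm class-4gram.lm \
        -classes merged.classes -simple-classes \
        -viterbi-decode -htk-wdpenalty $PENALTY -htk-lmscale $LMSCALE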
From mvp-songyoung at 163.com  Mon May 14 20:23:32 2012
From: mvp-songyoung at 163.com (mvp-songyoung)
Date: Tue, 15 May 2012 11:23:32 +0800 (CST)
Subject: [SRILM User List] How to decode with an interpolated class-based LM with lattice-tool
In-Reply-To: <4FB13A68.80404@icsi.berkeley.edu>
Message-ID: <66ce2627.76d5.1374e85ac03.Coremail.mvp-songyoung@163.com>

I have tried the -simple-classes option. It seems that my models do not
satisfy its requirements, as I get warnings like these:

    ...
    ./LM/xx.class: line 6122: word holidays has multiple class memberships
    ./LM/xx.class: line 6122: word still has multiple class memberships
    ./LM/xx.class: line 6122: word five has multiple class memberships
    ./LM/xx.class: line 6122: word form has multiple class memberships
    ...

I merged the word classes for LM1, LM2, and LM3 from three different corpora
separately, so duplicate words between them are unavoidable. I still want to
use the interpolated class-based LM in my decoding task. How should I
proceed? Thank you.

At 2012-05-15 01:01:28, "Andreas Stolcke" wrote:
> [earlier message quoted in full]
From rico.sennrich at gmx.ch  Tue May 15 03:19:14 2012
From: rico.sennrich at gmx.ch (Rico Sennrich)
Date: Tue, 15 May 2012 12:19:14 +0200
Subject: [SRILM User List] nan in language model
In-Reply-To: <4F5E254B.9040103@icsi.berkeley.edu>
Message-ID: <1337077154.2052.9.camel@rico-work>

On Mon, 2012-03-12 at 09:33 -0700, Andreas Stolcke wrote:
> On 3/12/2012 6:10 AM, Rico Sennrich wrote:
> > Hi list,
> >
> > Occasionally, I get 'nan' as probability or backoff weight in LMs
> > trained with SRILM. This is not expected in an ARPA file and eventually
> > leads to crashes / undefined behaviour in other programs that use the
> > model.
> It's certainly not supposed to happen. In your case it looks like 5-grams
> end up with nan probabilities, which would then lead to BOWs also being
> computed as NaNs.
>
> I have never seen this, actually. It would help to try a few things:

Sorry for the late reply. The short answer is that a negative kndiscount
(discount3+ in biglm.kn5) is the problem. I guess it's a known problem that
Kneser-Ney smoothing behaves weirdly for data with lots of duplicates, but
I'd rather have an error message than have SRILM silently build a corrupt
LM.

> - see if it only happens with -kndiscount.

With -kndiscount -interpolate I get NaNs (as described before).
With -kndiscount and without -interpolate, the last step (ngram-count)
crashes.
With default smoothing (no smoothing option specified), training seems to
hang at some point.

There are no errors or warnings in any of these cases.

> - see if those ngram counts have any special properties.

The corpus the models were trained on is the News Crawl corpus
( http://www.statmt.org/wmt11/translation-task.html ), and there are quite a
few duplicate sentences in it (which explains the negative kndiscount). The
affected ngrams all seem to stem from these duplicate sentences.

Rico

From stolcke at icsi.berkeley.edu  Tue May 15 10:27:20 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 15 May 2012 10:27:20 -0700
Subject: [SRILM User List] How to decode with an interpolated class-based LM with lattice-tool
In-Reply-To: <1af95c8d.6d67.1374e72c757.Coremail.mvp-songyoung@163.com>
Message-ID: <4FB291F8.9090603@icsi.berkeley.edu>

On 5/14/2012 8:02 PM, mvp-songyoung wrote:
> I have tried the -simple-classes option. It seems that my models do not
> satisfy its requirements, as I get warnings like these:
> [...]
> I still want to use the interpolated class-based LM in my decoding task.
> How should I proceed?

In that case you could try

 - converting the class-based LM into a purely word-based LM
   (ngram -expand-classes), or
 - generating N-best lists from the lattices and rescoring those instead.

Andreas
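A rough sketch of the first option (the expansion-order argument and file
names are assumptions; check ngram(1) for the exact semantics of
-expand-classes):

    # expand class N-grams into word N-grams once, offline:
    ngram -lm class-4gram.lm -classes merged.classes \
        -expand-classes 4 -write-lm word-4gram.lm
    # then rescore with a plain word LM; no -classes needed:
    lattice-tool -read-htk -in-lattice-list lattice.scp \
        -order 4 -lm word-4gram.lm -viterbi-decode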
From egrefen at gmail.com  Wed May 16 09:25:56 2012
From: egrefen at gmail.com (Edward Grefenstette)
Date: Wed, 16 May 2012 17:25:56 +0100
Subject: [SRILM User List] Compiling srilm with gcc 4.6.1 (and possibly any versions >= 4.3): patched Lattice sources
Message-ID:

Dear srilm users,

As mentioned in another email, I've encountered and later resolved some
problems building srilm with gcc v4.6. The fault, it seems, lies with the
makeArray declarations used in LatticeIndex.cc and LatticeNgrams.cc, found
in ./lattice/src/ of the srilm folder.

I've managed to get srilm to compile by "cheating" and using an older
version of g++ passed to make with the CXX flag, but ideally it'd be better
to fix the source to be compliant with C++0x, as enforced by gcc versions
>= 4.3 (I think).

I attach to this email the modified LatticeIndex.cc and LatticeNgrams.cc
files from srilm 1.6.0 (diffs reproduced at the end of the email), which
allowed me to compile srilm using gcc 4.6.1 without passing an older g++
using the CXX flag. Could someone please sanity check the changes? If
they're good, it'd be nice to see these files updated in the main
distribution so that others don't encounter this frustrating problem when
they update their compilers and decide to (re)build srilm.

I've included the output of 'make test' as well (srilm-tests.txt), which
shows some occasional differences, although I don't know if that's a result
of these modifications or something else on my system.

Best,
Ed

Diff of original LatticeIndex.cc (srilm 1.6.0) with patched version
===============================================

    $ diff Old/LatticeIndex.cc New/LatticeIndex.cc
    78c78
    <     makeArray(NBestWordInfo, roundedNgram, len + 1);
    ---
    >     NBestWordInfo *roundedNgram = new NBestWordInfo[len + 1];
    102a103
    >     delete[] roundedNgram;

===============================================

Diff of original LatticeNgrams.cc (srilm 1.6.0) with patched version
===============================================

    $ diff Old/LatticeNgrams.cc New/LatticeNgrams.cc
    75c75
    <     makeArray(NBestWordInfo, extendedContext, contextLength + 2);
    ---
    >     NBestWordInfo *extendedContext = new NBestWordInfo[contextLength + 2];
    165a166,168
    >
    >     delete[] extendedContext;
    >
    182c185
    <     makeArray(VocabIndex, words, NBestWordInfo::length(ngram)+1);
    ---
    >     VocabIndex *words = new VocabIndex[NBestWordInfo::length(ngram)+1];
    186a190,191
    >
    >     delete[] words;

===============================================

-------------- next part --------------
A non-text attachment was scrubbed...
Name: LatticeIndex.cc
Type: application/octet-stream
Size: 5714 bytes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: LatticeNgrams.cc
Type: application/octet-stream
Size: 7543 bytes
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: srilm-tests.txt
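The "older g++ via the CXX flag" workaround mentioned above would look
something like this (the compiler names and MACHINE_TYPE value are
illustrative; adjust both for your platform):

    cd $SRILM
    make World MACHINE_TYPE=i686-m64 CXX=g++-4.4 CC=gcc-4.4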
From john at dowding.net  Wed May 16 09:56:35 2012
From: john at dowding.net (John Dowding)
Date: Wed, 16 May 2012 09:56:35 -0700
Subject: [SRILM User List] Any experiences using SRILM with Microsoft's Dictation Resource Kit?
Message-ID: <045801cd3384$db8e7b20$92ab7160$@dowding.net>

Just curious if anyone has had any experience using SRILM language models in
the Microsoft speech recognizer via their Dictation Resource Kit:

    http://www.microsoft.com/en-us/download/details.aspx?id=23262

It looks like it ought to be plausible; one of the steps takes ARPA-style
ngram LMs as an input. Can anyone encourage me, or warn me off trying this?

Thanks

From stolcke at icsi.berkeley.edu  Wed May 16 11:05:29 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 16 May 2012 11:05:29 -0700
Subject: [SRILM User List] Compiling srilm with gcc 4.6.1 (and possibly any versions >= 4.3): patched Lattice sources
Message-ID: <4FB3EC69.5030308@icsi.berkeley.edu>

On 5/16/2012 9:25 AM, Edward Grefenstette wrote:
> As mentioned in another email, I've encountered and later resolved some
> problems building srilm with gcc v4.6.
> [...]

SRILM compiles fine on gcc 4.5.x, which is the latest version I've verified
myself.

It is possible that gcc 4.6.x changed (removed) the support for
stack-allocated arrays with size computed at run-time, to be in line with
standard C++. In that case a simple change in Array.h (the conditional
definition of makeArray()) should suffice.

I will try to get my hands on gcc 4.6.x to verify this, but feel free to
submit a patch along those lines.

Andreas
From egrefen at gmail.com  Wed May 16 15:49:13 2012
From: egrefen at gmail.com (Edward Thomas Grefenstette)
Date: Wed, 16 May 2012 23:49:13 +0100
Subject: [SRILM User List] Compiling srilm with gcc 4.6.1 (and possibly any versions >= 4.3): patched Lattice sources
In-Reply-To: <4FB3EC69.5030308@icsi.berkeley.edu>
Message-ID:

After further investigation, it turns out you are correct. I was also having
compilation problems with Apple's llvm-gcc 4.2 (on an older system), which I
would not encounter if I explicitly asked to use the non-Apple,
non-llvm-based g++-4.2 on the same system.

The fault lies with llvm-gcc/clang not liking the non-static definition in
srilm/dstruct/src/Array.h:

    #if !defined(DEBUG) && defined(__GNUC__) && (!defined(__INTEL_COMPILER) || __INTEL_COMPILER >=900)
    # define makeArray(T, A, n)	T A[n]
    #else
    # define makeArray(T, A, n)	StaticArray<T> A(n)
    #endif

The first option (T A[n]) is selected for llvm-gcc, when in fact it should
be the second. If I replace the first line of the section above with the
following line

    #if !defined(DEBUG) && defined(__GNUC__) && (!defined(__INTEL_COMPILER) || __INTEL_COMPILER >=900) && !defined(__llvm__)

then srilm compiles fine with llvm-gcc-4.x, without any need to modify
LatticeIndex.cc and LatticeNgrams.cc, as it correctly detects llvm-gcc or
clang (which do not generally support variable length arrays) and defines
makeArray appropriately.

How do I submit a patch?

Best,
Ed

On 16 May 2012, at 19:05, Andreas Stolcke wrote:
> [earlier message quoted in full]
URL: From stolcke at icsi.berkeley.edu Wed May 16 16:06:09 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 16 May 2012 16:06:09 -0700 Subject: [SRILM User List] Compiling srilm with gcc 4.6.1 (and possibly any versions >= 4.3): patched Lattice sources In-Reply-To: References: <4FB3EC69.5030308@icsi.berkeley.edu> Message-ID: <4FB432E1.7000300@icsi.berkeley.edu> This was previously reported at http://www.speech.sri.com/pipermail/srilm-user/2012q2/001197.html . The current (not yet released) code in Array.h has #if !defined(DEBUG) && defined(__GNUC__) && !defined(__clang__) && (!defined(__INTEL_COMPILER) || __INTEL_COMPILER >=900) and that should take care of your problem. Andreas On 5/16/2012 3:49 PM, Edward Thomas Grefenstette wrote: > After further investigation, it turns out you are correct. I was also > having compilation problems with apple's llvm-gcc 4.2 (on an older > system), which I would not encounter if I explicitly asked to use the > non-apple non-llvm-based g++-4.2 on the same system. > > The fault lies with the llvm-gcc/clang not liking the non-static > definition in srlim/dstruct/src/Array.h: > > #if !defined(DEBUG) && defined(__GNUC__) && > (!defined(__INTEL_COMPILER) || __INTEL_COMPILER >=900) > # define makeArray(T, A, n)T A[n] > #else > # define makeArray(T, A, n)StaticArray A(n) > #endif > > The first option (T A[n]) is selected for llvm-gcc, when in fact it > should be the second. > > If you I replace the whole section above in Array.h with the following > line > > #if !defined(DEBUG) && defined(__GNUC__) && > (!defined(__INTEL_COMPILER) || __INTEL_COMPILER >=900) && > !defined(__llvm__) > > then srlim compiles fine with llvm-gcc-4.x, without need for > modification of LatticeIndex.cc and LatticeNgrams.cc, as it correctly > detects llvm-gcc or clang which do not generally support variable > length arrays , and > defines makearray appropriately. > > How do I submit a patch? > > Best, > Ed > > On 16 May 2012, at 19:05, Andreas Stolcke wrote: > >> On 5/16/2012 9:25 AM, Edward Grefenstette wrote: >>> Dear srilm users, >>> >>> As mentioned in another email, I've encountered and later resolved >>> some problems building srilm with gcc v4.6 >>> . >>> The fault, it seems, lays with the makearray declarations used >>> in LatticeIndex.cc and LatticeNgrams.cc found in ./lattice/src/ of >>> the srilm folder. >>> >>> I've managed to get srilm to compile by "cheating" and using an >>> older version of g++ passed to make with CXX flag, but ideally it'd >>> be better to fix the source to be compliant with C++0x, as enforced >>> by gcc versions >= 4.3 (I think). >>> >>> I attach to this email the modified LatticeIndex.cc >>> and LatticeNgrams.cc files from srilm 1.6.0 (diffs reproduced at the >>> end of the email), which allowed me to compile srilm using gcc 4.6.1 >>> without passing an older g++ using the CXX flag. Could someone >>> please sanity check the changes? If they're good, it'd be nice to >>> see these files updated in the main distribution so that others >>> don't encounter this frustrating problem when they update their >>> compilers and decide to (re)build srilm. >> SRILM compiles fine on gcc 4.5.x, which is the latest version I've >> verified myself. >> >> It is possible that gcc 4.6.x changed (removed) the support for >> stack-allocated arrays with size computed at run-time, to be inline >> with standard C++. In that case a simple change in Array.h (the >> conditional definition of makeArray()) should suffice. 
From egrefen at gmail.com  Wed May 16 16:11:33 2012
From: egrefen at gmail.com (Edward Thomas Grefenstette)
Date: Thu, 17 May 2012 00:11:33 +0100
Subject: [SRILM User List] Compiling srilm with gcc 4.6.1 (and possibly any versions >= 4.3): patched Lattice sources
In-Reply-To: <4FB432E1.7000300@icsi.berkeley.edu>
Message-ID: <8FC19A41-6107-4782-85D9-6BE5CCD134B2@gmail.com>

Just tested a build with this macro line instead of the one I suggested, and
I can confirm it builds on both of the Macs I tested it on (llvm-gcc 4.2 and
llvm-gcc 4.6.1). Good to hear it's fixed already...

Best,
Ed

On 17 May 2012, at 00:06, Andreas Stolcke wrote:
> [earlier message quoted in full]
From nshmyrev at yandex.ru  Thu May 17 05:25:29 2012
From: nshmyrev at yandex.ru (Nickolay V. Shmyrev)
Date: Thu, 17 May 2012 15:25:29 +0300
Subject: [SRILM User List] Lattice tool issue
Message-ID: <1337257529.498.9.camel@localhost.localdomain>

Hello,

I've found another corner case in lattice-tool processing from SRILM 1.6.0.
If weights are not quite accurate (contain -inf or -nan),
lattice-tool -viterbi-decode can crash. The stack trace looks like this:

    0x000000000044ad84 in cmpPath (p1=0x170bc58, p2=0x170bc68) at LatticeDecode.cc:169
    169         float diff = (*p1)->m_Prob - (*p2)->m_Prob;
    Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.9.x86_64 libgcc-4.4.6-3.el6.x86_64 libstdc++-4.4.6-3.el6.x86_64
    (gdb) bt
    #0  0x000000000044ad84 in cmpPath (p1=0x170bc58, p2=0x170bc68) at LatticeDecode.cc:169
    #1  0x00000000004be418 in qsort (base=0x170bc60 "H.\313\001", n=6, size=8, compar=0x44ad6d) at qsort.c:75
    #2  0x000000000044bcd4 in Lattice::decode (this=0x7fffffffd4c0, contextLen=3, lm=0x144c530, finalPosition=214, sortedNodes=0x1ca7040, beamwidth=0, lmscale=9.5, nbest=0, maxFanIn=0, logP_floor=-inf, maxPaths=0) at LatticeDecode.cc:481
    #3  0x000000000044c3dd in Lattice::decode1Best (this=0x7fffffffd4c0, winfo=0x7ffff0969018, maxWords=50000, ignoreWords=..., lm=0x144c530, contextLen=3, beamwidth=0, logP_floor=-inf, maxPaths=0) at LatticeDecode.cc:612
    #4  0x000000000044bfdc in Lattice::decode1Best (this=0x7fffffffd4c0, words=0x7ffff7e33010, maxWords=50000, ignoreWords=..., lm=0x144c530, contextLen=3, beamwidth=0, logP_floor=-inf,

The problem is in both qsort and the cmpPath function, which is used to
compare path scores.

This part of qsort assumes that qcmp returns 0 for hi == min, which is not
always true:

    while (qcmp(hi -= qsz, min) > 0)
        /* void */;

This function:

    // sort in descending order
    int cmpPath(const LatticeDecodePath ** p1, const LatticeDecodePath ** p2)
    {
        float diff = (*p1)->m_Prob - (*p2)->m_Prob;

        if (diff > 0)
            return -1;
        else if (diff == 0)
            return 0;
        else
            return 1;
    }

doesn't always return 0 properly if diff is nan or inf.

So I suggest two things:

1. Add an additional check in the qsort while loop for hi >= min.
2. Warn about diff being nan, or compare it with 0 properly.
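The second suggestion might look like this (a sketch, not the shipped code;
type and member names are taken from the report above):

    // Compare the probabilities directly instead of subtracting them,
    // so that comparing an element with itself returns 0 even when
    // m_Prob is -inf ((-inf) - (-inf) yields NaN and breaks qsort).
    int cmpPath(const LatticeDecodePath **p1, const LatticeDecodePath **p2)
    {
        float a = (*p1)->m_Prob;
        float b = (*p2)->m_Prob;

        if (a > b)
            return -1;          // sort in descending order
        else if (a < b)
            return 1;
        else
            return 0;           // also reached when either value is NaN
    }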
Message-ID: <4FB53C81.5050608@mit.edu> Hi, I was wondering how the perplexity is calculated given different test sentences to a single language model. 1) For example, does SRI calculate 2^-H(p) no matter what the input sentence is? 2) Or does it calculate the perplexity based on the cross-entropy between the model and the input sentence? ie 2^-H(p,q) where p is the language model and q is (not sure what it would be) Best, Burkay From stolcke at icsi.berkeley.edu Fri May 18 10:18:07 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 18 May 2012 10:18:07 -0700 Subject: [SRILM User List] How do you calculate perplexity given a test sentence? In-Reply-To: <4FB53C81.5050608@mit.edu> References: <4FB53C81.5050608@mit.edu> Message-ID: <4FB6844F.801@icsi.berkeley.edu> On 5/17/2012 10:59 AM, Burkay Gur wrote: > Hi, > > I was wondering how the perplexity is calculated given different test > sentences to a single language model. > > 1) For example, does SRI calculate 2^-H(p) no matter what the input > sentence is ? > > 2) Or does it calculate the perplexity based on the cross-entropy > between the model and the input sentence? ie 2^-H(p,q) where p is the > language model and q is (not sure what it would be) Perplexity is always computed by evaluating the model on the test data. So the "q" in H(p,q) is approximated by taking an average over the test data (which is assumed to be a sample from the true "q" distribution). So the estimate used is H(p,q) = - 1/N \sum_i log p(w_i|h_i) where N is the number of tokens, w_i is the i-th token and h_i its history. Andreas From stolcke at icsi.berkeley.edu Fri May 18 13:45:48 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 18 May 2012 13:45:48 -0700 Subject: [SRILM User List] How do you calculate perplexity given a test sentence? In-Reply-To: <20120518133929.80znm9a3wkg4g8ws@webmail.mit.edu> References: <4FB53C81.5050608@mit.edu> <4FB6844F.801@icsi.berkeley.edu> <20120518133929.80znm9a3wkg4g8ws@webmail.mit.edu> Message-ID: <4FB6B4FC.9000608@icsi.berkeley.edu> On 5/18/2012 10:39 AM, Burkay Gur wrote: > This is still not clear to me. When we calculate the perplexity of a > language > model alone, we just take p as the language model itself. This tells > us how > perplexed that language model is. > > This is H(p) = - Sum_i(p_i*log(p_i)) > > Now when we introduce a test sentence, I am not sure what we are > calculating. In > your example you are not mentioning q in the equation. > > H(p,q) = -Sum_i(p_i * log(q_i)) First, exchange p and q, if p is your LM, so you have H(p,q) = -Sum_i(q_i * log(p_i)) q_i is approximated by the empirical distribution of words in the test data. So effectively, q_i = number of occurrences of word i / length of test corpus. Of course for many (most) words q_i will be zero (they don't occur in the test data). With this approximation you get H(p,q) = - 1/N Sum_j log(p_j) where N is the length of the test corpus, j now ranges over the word occurrences (tokens, not types) in the test set, and p_j is the probability of the j-th word.
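To make the arithmetic concrete, the ppl that ngram reports can be recomputed from its own summary output. Here is a small sketch (the figures are borrowed from an ngram -ppl summary that appears later in this list; SRILM logprobs are base 10, and the scored-token count is words - OOVs - zeroprobs + sentences):

    # ngram -ppl printed: 427 sentences, 2433 words, 1184 OOVs
    #                     0 zeroprobs, logprob= -5075.52 ppl= 1067.47
    awk 'BEGIN {
        logprob = -5075.52;
        n = 2433 - 1184 - 0 + 427;   # = 1676 scored tokens (incl. end-of-sentence)
        print 10^(-logprob / n);     # ~1067.5, matching the reported ppl=
    }'

The sentence count enters because each end-of-sentence token is also scored.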
Andreas From stolcke at icsi.berkeley.edu Fri May 18 22:29:42 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 18 May 2012 22:29:42 -0700 Subject: [SRILM User List] nan in language model In-Reply-To: <1337077154.2052.9.camel@rico-work> References: <1331557800.12711.25.camel@rico-work> <4F5E254B.9040103@icsi.berkeley.edu> <1337077154.2052.9.camel@rico-work> Message-ID: <4FB72FC6.1030200@icsi.berkeley.edu> Attached is a patch that catches negative discounts when using make-big-lm. The discount estimator built into ngram-count (Discount.cc) already had this check, but for some reason it was not in the make-kn-discounts script. Andreas On 5/15/2012 3:19 AM, Rico Sennrich wrote: > On Mon, 2012-03-12 at 09:33 -0700, Andreas Stolcke wrote: >> On 3/12/2012 6:10 AM, Rico Sennrich wrote: >>> Hi list, >>> >>> Occasionally, I get 'nan' as probability or backoff weight in LMs >>> trained with SRILM. This is not expected in an ARPA file and eventually >>> leads to crashes / undefined behaviour in other programs that use the >>> model. >> It's certainly not supposed to happen. >> In your case it looks like 5-grams end up with nan probabilities, which >> would then lead to BOWs also being computed as NaNs. >> >> I have never seen this, actually. It would help to try a few things: > Sorry for the late reply. The short answer is that a negative kndiscount > (discount3+ in biglm.kn5) is the problem. I guess it's a known problem > that Kneser-Ney smoothing behaves weirdly for data with lots of > duplicates, but I'd rather have an error message than for SRILM to > silently build a corrupt LM. > >> - see if it only happens with -kndiscount. > with -kndiscount -interpolate I get NaNs (as described before) > with -kndiscount and without -interpolate, the last step (ngram-count) > crashes. > with default smoothing (no smoothing option specified), training seems > to hang up at some point. > > There's no errors or warnings in any of these cases. > >> - see if those ngram counts have any special properties. > The corpus the models were trained on is the News Crawl corpus > http://www.statmt.org/wmt11/translation-task.html , and there are quite > a few duplicate sentences in this corpus (which explains the negative > kndiscount). The affected ngrams seem to all stem from these duplicate > sentences. > > Rico > -------------- next part --------------

    Index: utils/src/make-kn-discounts.gawk
    ===================================================================
    RCS file: /home/srilm/CVS/srilm/utils/src/make-kn-discounts.gawk,v
    retrieving revision 1.4
    diff -c -r1.4 make-kn-discounts.gawk
    *** utils/src/make-kn-discounts.gawk   17 Jun 2007 01:21:18 -0000   1.4
    --- utils/src/make-kn-discounts.gawk   19 May 2012 05:10:19 -0000
    ***************
    *** 95,102 ****
          Y = countOfCounts[1]/(countOfCounts[1] + 2 * countOfCounts[2]);

          print "mincount", min;
    !     print "discount1", 1 - 2 * Y * countOfCounts[2] / countOfCounts[1];
    !     print "discount2", 2 - 3 * Y * countOfCounts[3] / countOfCounts[2];
    !     print "discount3+", 3 - 4 * Y * countOfCounts[4] / countOfCounts[3];
      }
    --- 95,114 ----
          Y = countOfCounts[1]/(countOfCounts[1] + 2 * countOfCounts[2]);

    +     discount1 = 1 - 2 * Y * countOfCounts[2] / countOfCounts[1];
    +     discount2 = 2 - 3 * Y * countOfCounts[3] / countOfCounts[2];
    +     discount3plus = 3 - 4 * Y * countOfCounts[4] / countOfCounts[3];
    +
          print "mincount", min;
    !     print "discount1", discount1;
    !     print "discount2", discount2;
    !     print "discount3+", discount3plus;
    !
    !     # check for invalid values after output, so we see where the problem is
    !     if (discount1 < 0 || discount2 < 0 || discount3plus < 0) {
    !         printf "error: one of modified KneserNey discounts is negative\n" \
    !                 >> "/dev/stderr";
    !         exit(2);
    !     }
      }
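For reference, the discount formulas the patch guards can be evaluated by hand. A small sketch with made-up counts of counts (n1..n4 are hypothetical; heavily duplicated data tends to inflate some count-of-count buckets, such as n4 relative to n3, which is what drives a discount negative):

    awk 'BEGIN {
        n1 = 1000; n2 = 600; n3 = 100; n4 = 700;  # hypothetical counts of counts
        Y = n1 / (n1 + 2 * n2);
        print "discount1 ", 1 - 2 * Y * n2 / n1;  # 0.45
        print "discount2 ", 2 - 3 * Y * n3 / n2;  # 1.77
        print "discount3+", 3 - 4 * Y * n4 / n3;  # -9.73: would trigger the new check
    }'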
From burkay at mit.edu Sun May 20 20:28:49 2012 From: burkay at mit.edu (Burkay Gur) Date: Sun, 20 May 2012 23:28:49 -0400 Subject: [SRILM User List] Probability of Unknown Words - Kneser Ney? Message-ID: <4FB9B671.9080604@mit.edu> Hi! I was wondering how we calculate the probability of unk words while using unmodified Kneser-Ney. I know that Kneser-Ney never assigns zero probs. How is that possible with words that are never seen? Or words that are in the dictionary but not in the training corpus? Best, Burkay From stolcke at icsi.berkeley.edu Mon May 21 11:35:15 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Mon, 21 May 2012 11:35:15 -0700 Subject: [SRILM User List] Probability of Unknown Words - Kneser Ney? In-Reply-To: <4FB9B671.9080604@mit.edu> References: <4FB9B671.9080604@mit.edu> Message-ID: <4FBA8AE3.2070709@icsi.berkeley.edu> On 5/20/2012 8:28 PM, Burkay Gur wrote: > Hi! > > I was wondering how we calculate the probability of unk words while > using unmodified Kneser Ney. I know that Kneser Ney never assigns zero > probs. How is that possible with words that are never seen? Or words > that are in the dictionary but not in the training corpus? There is nothing special that KN smoothing does with unknown words. Like all smoothing methods, unknown words are either ignored (assigned 0 probability) or modeled by a designated <unk> token, depending on how your data is prepared and the ngram-count -unk option. For more information see the FAQ page http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html and look for "unknown". Andreas From stolcke at icsi.berkeley.edu Tue May 22 17:05:33 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 22 May 2012 17:05:33 -0700 Subject: [SRILM User List] Does keep-unk work with lattice-tool and htk format? In-Reply-To: <9D363DB8-11DF-4534-AEB5-058E96E3A74C@upc.edu> References: <4FB9B671.9080604@mit.edu> <4FBA8AE3.2070709@icsi.berkeley.edu> <9D363DB8-11DF-4534-AEB5-058E96E3A74C@upc.edu> Message-ID: <4FBC29CD.3010601@icsi.berkeley.edu> On 5/22/2012 10:56 AM, Lluís Formiga i Fanals wrote: > Hi, > > I was trying to execute the following command: > > lattice-tool -in-lattice-list lattice_lists.txt -read-htk -lm > /veu4/usuaris24/lluisf/EMS/misspelling2012/lm/interpolated-lm.en > -write-mesh-dir out -keep-unk > > but I find that unks ("<unk>") are still on the written CN (-write-mesh). > > Does the -keep-unk option work only for lattice output? Am I doing > something wrong? No, the code is working as intended. The option is described as -keep-unk Treat out-of-vocabulary words as <unk> but preserve their labels in lattice output. What you are outputting is confusion networks, not lattices. In the CN building process, lattice nodes that are mapped to <unk> are treated as equivalent, and the word information is lost in the process. I would suggest that you simply do your lattice rescoring with -keep-unk, output the rescored lattices, and then run lattice-tool a second time without -keep-unk and without the -vocab option, so all word labels are preserved (all words are implicitly added to the vocabulary).
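Concretely, the two passes might look like this sketch (list and directory names are placeholders; the options are the ones discussed in this thread):

    # pass 1: rescore with the LM, keeping OOV word labels in the lattices
    lattice-tool -in-lattice-list lattice_lists.txt -read-htk \
        -lm interpolated-lm.en -keep-unk -out-lattice-dir rescored

    # pass 2: build confusion networks without -keep-unk and without -vocab,
    # so every word label is preserved as-is
    lattice-tool -in-lattice-list rescored_lists.txt -read-htk \
        -write-mesh-dir out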
Andreas > > Thanks, > > Lluís > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 8771 bytes Desc: not available URL: From dmytro.prylipko at ovgu.de Mon May 28 04:15:36 2012 From: dmytro.prylipko at ovgu.de (Dmytro Prylipko) Date: Mon, 28 May 2012 13:15:36 +0200 Subject: [SRILM User List] Rescoring lattices Message-ID: Dear Andreas, As far as I know, a convenient way to combine HTK and SRILM for speech recognition is: 1. Generate lattices with HTK. 2. Rescore them with lattice-tool and an LM built with the SRILM toolkit. 3. Decode the rescored lattices using, e.g., the Viterbi decoding procedure, and finally obtain the most likely hypothesis under the new language model. On the first step I also get a 1-best utterance(s) decoded with the HTK bigram model (built using HLStats and HBuild). And I found that the recognition accuracy of the SRILM trigram is much worse than the accuracy of the initial output obtained with the HTK bigram. For instance, 75.99% vs. 83.14%, 65.51% vs. 71.58%. The reason for this is the different training sets for bigrams and trigrams. For decoding, HVite uses not a real back-off N-gram but a word network, and this network contains just those sequences found in the training data. In order to ensure that each test utterance is present in the initial lattice (and thus can be recognized with 100% accuracy), I cheated: I built the network (initial lattice) using the training as well as the test data. In contrast, the trigrams were trained just on the training material but with the full vocabulary. However, the gap seems too large to me. Could you tell whether the scheme I described is correct? Another strangeness is the behavior of the models during incremental adaptation. In supervised mode the progress looks like:

    HTK    SRILM
    81.07  75.99
    82.61  75.72
    82.61  76.32
    83.14  75.99
    83.21  75.52
    82.74  75.85
    83.41  76.39
    83.55  76.79
    83.68  76.59

6 improvements and 1 worsening (in comparison to the previous result) with HTK and 4:4 with SRILM. 2.61% vs. 0.8% absolute improvement, and 3.12% vs. 1.04% relative. This looks pretty strange to me: I supposed the dynamics would be roughly the same. Do you have any suggestions why the trigram performs so badly? Maybe the language scale factor is too large (I use 10.0)? I would greatly appreciate any help.
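For concreteness, steps 2 and 3 can be done in one lattice-tool call, along these lines (a sketch only; file names are placeholders, and -htk-lmscale is my assumption for where the scale factor mentioned above would be set):

    # rescore HTK lattices with the SRILM trigram and Viterbi-decode them
    lattice-tool -in-lattice-list lats.list -read-htk \
        -lm trigram.lm -order 3 -htk-lmscale 10.0 \
        -viterbi-decode > hypotheses.txt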
Yours, Dmytro Prylipko. -------------- next part -------------- An HTML attachment was scrubbed... URL: From yhifny at yahoo.com Thu May 31 10:51:47 2012 From: yhifny at yahoo.com (yasser hifny) Date: Thu, 31 May 2012 10:51:47 -0700 (PDT) Subject: [SRILM User List] gtnmin, gtnmax and smoothing methods Message-ID: <1338486707.87220.YahooMailNeo@web161505.mail.bf1.yahoo.com> Hello, I have a question related to the SRILM implementation. I found that gtnmin and gtnmax affect the results of some smoothing methods (absolute, wb, kn). According to the documentation (http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html), gtnmin or gtnmax should affect the GT smoothing method only. Can anyone explain this strange behavior for me? thanks in advance best regards, Yasser -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Thu May 31 15:20:20 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 31 May 2012 15:20:20 -0700 Subject: [SRILM User List] gtnmin, gtnmax and smoothing methods In-Reply-To: <1338486707.87220.YahooMailNeo@web161505.mail.bf1.yahoo.com> References: <1338486707.87220.YahooMailNeo@web161505.mail.bf1.yahoo.com> Message-ID: <4FC7EEA4.10604@icsi.berkeley.edu> On 5/31/2012 10:51 AM, yasser hifny wrote: > Hello, > I have a question related to the SRILM implementation. I found that > gtnmin and gtnmax affect the results of some smoothing methods (absolute, wb, > kn). According to the documentation > (http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html) , > gtnmin or gtnmax should affect the GT smoothing method only. Can anyone > explain this strange behavior for me? gtNmax affects only GT discounting. gtNmin affects all methods. This is documented in the ngram-count man page: > -gtnmin count > where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the > minimal count of N-grams of order n that will be included in the LM. > All N-grams with frequency lower than that will > effectively be discounted to 0. If n is omitted, the parameter for > N-grams of order > 9 is set. > NOTE: This option affects not only the default > Good-Turing discounting but the alternative discounting methods described > below as well. (see the NOTE). Andreas > thanks in advance > best regards, > Yasser > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From neetisonth at gmail.com Fri Jun 1 00:22:56 2012 From: neetisonth at gmail.com (NEETI SONTH) Date: Fri, 1 Jun 2012 12:52:56 +0530 Subject: [SRILM User List] need help with SRILM installation Message-ID: Hi, I am Neeti. I am new to this SRILM toolkit and I am having problems installing it. Please help me. uname -a: Linux localhost.localdomain 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:02 EDT 2007 i686 i686 i386 GNU/Linux gcc -v: Reading specs from /usr/lib/gcc/i386-redhat-linux/3.4.6/specs Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --disable-checking --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-java-awt=gtk --host=i386-redhat-linux Thread model: posix gcc version 3.4.6 20060404 (Red Hat 3.4.6-8) The following difficulties are encountered by me. 1) Are we supposed to install the srilm toolkit and start running commands in csh/tcsh and not in bash? Will we get some error if we do the installation in bash? I have downloaded the entire srilm toolkit along with the free third party software in the path: /home/nsonth/srilm. 5) when I ran '/sbin/machine-type', then I got the return script as 'i686'. But when I ran '$SRILM/sbin/machine-type', as seen from your FAQ, I got the following error--- SRILM: Undefined variable. 6) I have csh/tcsh installed on my computer. Path is /sbin/tcsh or /sbin/csh. But when I run 'make', I get the following error-- make: /sbin/machine-type: Command not found cat: RELEASE: No such file or directory Makefile:13: /common/Makefile.common.variables: No such file or directory make: *** No rule to make target `/common/Makefile.common.variables'. Stop. Earnestly waiting for your reply. Thanking you regards, Neeti From aliasghar.toraby at gmail.com Fri Jun 1 06:04:07 2012 From: aliasghar.toraby at gmail.com (Ali Asghar Toraby Parizy) Date: Fri, 1 Jun 2012 17:34:07 +0430 Subject: [SRILM User List] Using SRILM for text classification Message-ID: Hi I wanna use SRILM for text classification. I've successfully compiled srilm and I could reach the classes and utilities in my own project by including header files in include folder and adding libraries in lib folder. I'm also familiar with concepts of language modeling and text categorization but I don't know where to start for using srilm in this regard. I need to create some language models from the corpus that I have and then guess the best model for a new text file using perplexity. Can anybody give me a review of classes and utilities or possibly a document that explains the class hierarchies? I don't have enough time to explore all the code to find out how to use it! Thanks.
-------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Fri Jun 1 13:35:01 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 01 Jun 2012 13:35:01 -0700 Subject: [SRILM User List] need help with SRILM installation In-Reply-To: References: Message-ID: <4FC92775.3050208@icsi.berkeley.edu> Set the SRILM variable in your environment, or on the make command line: make SRILM=/path/to/SRILM World Andreas On 6/1/2012 12:22 AM, NEETI SONTH wrote: > Hi, > I am Neeti. I am new to this SRILM toolkit and I am having problems > installing it. Please help me. > uname -a: > Linux localhost.localdomain 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:02 > EDT 2007 i686 i686 i386 GNU/Linux > gcc -v: > Reading specs from /usr/lib/gcc/i386-redhat-linux/3.4.6/specs > Configured with: ../configure --prefix=/usr --mandir=/usr/share/man > --infodir=/usr/share/info --enable-shared --enable-threads=posix > --disable-checking --with-system-zlib --enable-__cxa_atexit > --disable-libunwind-exceptions --enable-java-awt=gtk > --host=i386-redhat-linux > Thread model: posix > gcc version 3.4.6 20060404 (Red Hat 3.4.6-8) > > The following difficulties are encountered by me. > > 1) Are we supposed to install the srilm toolkit and start running commands > in csh/tcsh and not in bash? Will we get some error if we do the > installation in bash? > I have downloaded the entire srilm toolkit along with the free third > party software in the path: /home/nsonth/srilm. > 5) when I ran '/sbin/machine-type', then I got the return script as 'i686'. > But when I ran '$SRILM/sbin/machine-type', as seen from your FAQ, I > got the following error--- > SRILM: Undefined variable. > > 6) I have csh/tcsh installed on my computer. Path is /sbin/tcsh or > /sbin/csh. But when I run 'make', I get the following error-- > make: /sbin/machine-type: Command not found > cat: RELEASE: No such file or directory > Makefile:13: /common/Makefile.common.variables: No such file or directory > make: *** No rule to make target `/common/Makefile.common.variables'. Stop. > > Earnestly waiting for your reply. > > Thanking you > regards, > Neeti > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From stolcke at icsi.berkeley.edu Fri Jun 1 15:09:08 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 01 Jun 2012 15:09:08 -0700 Subject: [SRILM User List] Using SRILM for text classification In-Reply-To: References: Message-ID: <4FC93D84.5010409@icsi.berkeley.edu> On 6/1/2012 6:04 AM, Ali Asghar Toraby Parizy wrote: > Hi > I wanna use SRILM for text classification. I've successfully compiled > srilm and I could reach the classes and utilities in my own project by > including header files in include folder and adding libraries in lib > folder. > I'm also familiar with concepts of language modeling and text > categorization but I don't know where to start for using srilm in this > regard. > I need to create some language models from the corpus that I have and > then guess the best model for a new text file using perplexity. > Can anybody give me a review of classes and utilities or possibly a > document that explains the class hierarchies? I don't have enough time > to explore all the code to find out how to use it! You probably don't need to link into the C++ API to do what you want.
Instead, you can operate at the command line, train your LMs, and postprocess the output of ngram -debug 1 -ppl ... to obtain the model likelihoods on your test data. The file $SRILM/doc/lm-intro should contain all the info you need to get that going.
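For instance, such a classification run might look like the following sketch (the class names, file names, and training setup are hypothetical; the awk step keys off the summary line that ngram -ppl prints):

    # train one LM per class (hypothetical class names)
    for class in sports politics business; do
        ngram-count -text $class.train.txt -lm $class.lm
    done

    # score a test document against each model and pick the best
    for class in sports politics business; do
        ngram -lm $class.lm -ppl doc.txt |
            awk -v c=$class '/logprob=/ { print $4, c }'
    done | sort -rn | head -1   # highest logprob wins

The highest (least negative) total logprob identifies the best-matching class model.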
Andreas From stolcke at icsi.berkeley.edu Sat Jun 2 10:06:10 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sat, 02 Jun 2012 10:06:10 -0700 Subject: [SRILM User List] need help with SRILM installation In-Reply-To: References: <4FC92775.3050208@icsi.berkeley.edu> Message-ID: <4FCA4802.9060005@icsi.berkeley.edu> I meant for you to replace "/path/to/SRILM" with the actual path on your system where you unpacked the software. ;-) Andreas On 6/2/2012 9:21 AM, NEETI SONTH wrote: > Hi.. as you had said, > I gave the command 'make SRILM=/path/to/SRILM World' on the make > command line. > But I got the following error: > make: /path/to/SRILM/sbin/machine-type: Command not found > Makefile:13: /path/to/SRILM/common/Makefile.common.variables: > No such file or directory > make: *** No rule to make > target '/path/to/SRILM/common/Makefile.common.variables'. Stop > > > > > Earnestly waiting for your reply. > > Thanking you > regards, > Neeti > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From spiketg at hotmail.com Sun Jun 3 03:21:16 2012 From: spiketg at hotmail.com (Tommy Gorman) Date: Sun, 3 Jun 2012 11:21:16 +0100 Subject: [SRILM User List] G++ error when compiling Message-ID: Hi there, Im trying to install SRILM on windows using CYGWIN and I keep getting the following error "g++: cannot specify -o with -c, -S or -E with multiple files". I have looked all over the internet and cannot find an answer as to why this is happening. I would greatly appreciate any help you can give me to figure out the solution. I should add that I am fairly new to using CYGWIN etc. so it could be a simple mistake. Thanks, Tommy Gorman Here is the error in context.

    make release-programs
    make[1]: Entering directory `/cygdrive/c/users/spiketg/downloads/srilm'
    for subdir in misc dstruct lm flm lattice utils; do \
        (cd $subdir/src; make SRILM=/cygdrive/c/users/spiketg/downloads/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC=yes release-programs) || exit 1; \
    done
    make[2]: Entering directory `/cygdrive/c/users/spiketg/downloads/srilm/misc/src'
    make[2]: Nothing to be done for `release-programs'.
    make[2]: Leaving directory `/cygdrive/c/users/spiketg/downloads/srilm/misc/src'
    make[2]: Entering directory `/cygdrive/c/users/spiketg/downloads/srilm/dstruct/src'
    make[2]: Nothing to be done for `release-programs'.
    make[2]: Leaving directory `/cygdrive/c/users/spiketg/downloads/srilm/dstruct/src'
    make[2]: Entering directory `/cygdrive/c/users/spiketg/downloads/srilm/lm/src'
    g++ -Wall -Wno-unused-variable -Wno-uninitialized -fpic -DINSTANTIATE_TEMPLATES /cygdrive/c/users/spiketg/downloads/msys_mingw8/msys/opt/tcl/include/tcl.h -I. -I../../include -c -g -O2 -o ../obj/cygwin/ngram-count.o ngram-count.cc
    g++: cannot specify -o with -c, -S or -E with multiple files
    /cygdrive/c/users/spiketg/downloads/srilm/common/Makefile.common.targets:93: recipe for target `../obj/cygwin/ngram-count.o' failed
    make[2]: *** [../obj/cygwin/ngram-count.o] Error 1
    make[2]: Leaving directory `/cygdrive/c/users/spiketg/downloads/srilm/lm/src'
    Makefile:105: recipe for target `release-programs' failed
    make[1]: *** [release-programs] Error 1
    make[1]: Leaving directory `/cygdrive/c/users/spiketg/downloads/srilm'
    Makefile:54: recipe for target `World' failed
    make: *** [World] Error 2

-------------- next part -------------- An HTML attachment was scrubbed... URL: From aliasghar.toraby at gmail.com Sun Jun 3 12:40:15 2012 From: aliasghar.toraby at gmail.com (Ali Asghar Toraby Parizy) Date: Mon, 4 Jun 2012 00:10:15 +0430 Subject: [SRILM User List] Using SRILM for text classification In-Reply-To: <4FC93D84.5010409@icsi.berkeley.edu> References: <4FC93D84.5010409@icsi.berkeley.edu> Message-ID: Hi Thanks for your reply. I'm trying to use the ngram program to compute perplexity for several files in a directory.
As you said I'm trying to build a simple shell > script for that. ngram prints a large output but I only need > perplexity as a number then I can save those numbers in a loop for > every model and then compare those numbers. Something like this: > > for j in $models > do > echo model: $j > ngram -lm $j -ppl $i > done > > How can I adjust ngram to print only a number instead of this kind of > output: > > file testfiles/test.test: 427 sentences, 2433 words, 1184 OOVs > 0 zeroprobs, logprob= -5075.52 ppl= 1067.47 ppl1= 11578.9 > > I need only number 1067.47 in this case! Use any of a number of Unix/Linux text processing tools, like awk, perl, python, etc.
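For example, one awk way to pull just the ppl figure out of the summary line quoted above (a sketch that assumes the exact output format shown; $models and $i are the variables from the script above):

    for j in $models; do
        echo model: $j
        # the summary line reads "... logprob= ... ppl= 1067.47 ppl1= ...",
        # so the ppl value is the sixth whitespace-separated field
        ngram -lm $j -ppl $i | awk '/ppl=/ { print $6 }'
    done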
Andreas From stolcke at icsi.berkeley.edu Sun Jun 3 19:23:58 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sun, 03 Jun 2012 19:23:58 -0700 Subject: [SRILM User List] G++ error when compiling In-Reply-To: References: Message-ID: <4FCC1C3E.60103@icsi.berkeley.edu> On 6/3/2012 3:21 AM, Tommy Gorman wrote: > Hi there, > > Im trying to install SRILM on windows using CYGWIN and I keep getting > the following error "g++: cannot specify -o with -c, -S or -E with > multiple files". > I have looked all over the internet and cannot find an answer as to > why this is happening. I would greatly appreciate any help you can > give me to figure out the solution. > I should add that I am fairly new to using CYGWIN etc. so it could be > a simple mistake. The error message comes from the fact that /cygdrive/c/users/spiketg/downloads/msys_mingw8/msys/opt/tcl/include/tcl.h is passed to the compiler, instead of the -I option that allows the compiler to find the tcl.h file. Change TCL_INCLUDE to -I/cygdrive/c/users/spiketg/downloads/msys_mingw8/msys/opt/tcl/include . Andreas > > Thanks, > Tommy Gorman > > Here is the error in context. > > make release-programs > make[1]: Entering directory `/cygdrive/c/users/spiketg/downloads/srilm' > for subdir in misc dstruct lm flm lattice utils; do \ > (cd $subdir/src; make > SRILM=/cygdrive/c/users/spiketg/downloads/srilm MACHINE_TYPE=cygwin > OPTION= MAKE_PIC=yes release-programs) || exit 1; \ > done > make[2]: Entering directory > `/cygdrive/c/users/spiketg/downloads/srilm/misc/src' > make[2]: Nothing to be done for `release-programs'. > make[2]: Leaving directory > `/cygdrive/c/users/spiketg/downloads/srilm/misc/src' > make[2]: Entering directory > `/cygdrive/c/users/spiketg/downloads/srilm/dstruct/src' > make[2]: Nothing to be done for `release-programs'. > make[2]: Leaving directory > `/cygdrive/c/users/spiketg/downloads/srilm/dstruct/src' > make[2]: Entering directory > `/cygdrive/c/users/spiketg/downloads/srilm/lm/src' > g++ -Wall -Wno-unused-variable -Wno-uninitialized -fpic > -DINSTANTIATE_TEMPLATES > /cygdrive/c/users/spiketg/downloads/msys_mingw8/msys/opt/tcl/include/tcl.h > -I. -I../../include -c -g -O2 -o ../obj/cygwin/ngram-count.o > ngram-count.cc > g++: cannot specify -o with -c, -S or -E with multiple files > /cygdrive/c/users/spiketg/downloads/srilm/common/Makefile.common.targets:93: > recipe for target `../obj/cygwin/ngram-count.o' failed > make[2]: *** [../obj/cygwin/ngram-count.o] Error 1 > make[2]: Leaving directory > `/cygdrive/c/users/spiketg/downloads/srilm/lm/src' > Makefile:105: recipe for target `release-programs' failed > make[1]: *** [release-programs] Error 1 > make[1]: Leaving directory `/cygdrive/c/users/spiketg/downloads/srilm' > Makefile:54: recipe for target `World' failed > make: *** [World] Error 2 > > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From trinity.sneh at gmail.com Mon Jun 4 09:34:58 2012 From: trinity.sneh at gmail.com (trinity s) Date: Mon, 4 Jun 2012 12:34:58 -0400 Subject: [SRILM User List] The bug in sys/wait.h with Cygwin 4.5 In-Reply-To: References: Message-ID: Hello I came across this post about the error in LM.cc http://www.speech.sri.com/pipermail/srilm-user/2012q1/001162.html I am facing the same issue but the fix suggested does not seem to work for me. I continue to get the exact same error message. I am trying to install SRILM 1.6 using cygwin 4.5.3 on Windows 7. Here's the error message I see: /usr/lib/gcc/i686-pc-cygwin/4.5.3/../../../../i686-pc-cygwin/bin/ld: cannot find -ltcl84 collect2: ld returned 1 exit status make[2]: *** [../bin/cygwin/maxalloc.exe] Error 1 make[1]: *** [release-programs] Error 1 make: *** [World] Error 2 Thank you. SJ From stolcke at icsi.berkeley.edu Mon Jun 4 11:37:06 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Mon, 04 Jun 2012 11:37:06 -0700 Subject: [SRILM User List] The bug in sys/wait.h with Cygwin 4.5 In-Reply-To: References: Message-ID: <4FCD0052.4010506@icsi.berkeley.edu> On 6/4/2012 9:34 AM, trinity s wrote: > Hello > I came across this post about the error in LM.cc > http://www.speech.sri.com/pipermail/srilm-user/2012q1/001162.html > > I am facing the same issue but the fix suggested does not seem to work > for me. I continue to get the exact same error message. I am trying to > install SRILM 1.6 using cygwin 4.5.3 on Windows 7. > Here's the error message I see: > > /usr/lib/gcc/i686-pc-cygwin/4.5.3/../../../../i686-pc-cygwin/bin/ld: > cannot find -ltcl84 > collect2: ld returned 1 exit status > make[2]: *** [../bin/cygwin/maxalloc.exe] Error 1 > make[1]: *** [release-programs] Error 1 > make: *** [World] Error 2 The subject line of the message is misleading. Your problem has nothing to do with wait.h, but with setting up the proper Tcl installation. 1. Install tcl/tk within cygwin, Ubuntu, ..., whatever your distribution is. 2. Find what the name of the installed library is: look for /usr/lib/libtcl.a, /usr/lib/libtcl8.5.a, etc. 3. Assuming the library is called libtcl8.5, edit common/Makefile.machine.$MACHINE_TYPE to use TCL_LIBRARY = -ltcl8.5 Andreas From trinity.sneh at gmail.com Mon Jun 4 11:51:33 2012 From: trinity.sneh at gmail.com (trinity s) Date: Mon, 4 Jun 2012 14:51:33 -0400 Subject: [SRILM User List] The bug in sys/wait.h with Cygwin 4.5 In-Reply-To: <4FCD0052.4010506@icsi.berkeley.edu> References: <4FCD0052.4010506@icsi.berkeley.edu> Message-ID: Yes, I just figured out I was getting mixed up between two issues. Sorry for the confusion.
I turned off Tcl support by adding NO_TCL=X following the SRILM build FAQs. Hasn't broken anything for me so far. On Mon, Jun 4, 2012 at 2:37 PM, Andreas Stolcke wrote: > On 6/4/2012 9:34 AM, trinity s wrote: >> >> Hello >> I came across this post about the error in LM.cc >> http://www.speech.sri.com/pipermail/srilm-user/2012q1/001162.html >> >> I am facing the same issue but the fix suggested does not seem to work >> for me. I continue to get the exact same error message. I am trying to >> install SRILM 1.6 using cygwin 4.5.3 on Windows 7. >> Here's the error message I see: >> >> /usr/lib/gcc/i686-pc-cygwin/4.5.3/../../../../i686-pc-cygwin/bin/ld: >> cannot find -ltcl84 >> collect2: ld returned 1 exit status >> make[2]: *** [../bin/cygwin/maxalloc.exe] Error 1 >> make[1]: *** [release-programs] Error 1 >> make: *** [World] Error 2 > > > The subject line of the message is misleading. Your problem has nothing to > do with wait.h, but with setting up the proper Tcl installation. > > 1. Install tcl/tk within cygwin, Ubuntu, ..., whatever your distribution is. > 2. Find what the name of the installed library is: look for > /usr/lib/libtcl.a, /usr/lib/libtcl8.5.a, etc. > 3. Assuming the library is called libtcl8.5, edit > common/Makefile.machine.$MACHINE_TYPE to use > > TCL_LIBRARY = -ltcl8.5 > > Andreas > From trinity.sneh at gmail.com Mon Jun 4 11:52:29 2012 From: trinity.sneh at gmail.com (trinity s) Date: Mon, 4 Jun 2012 14:52:29 -0400 Subject: [SRILM User List] The bug in sys/wait.h with Cygwin 4.5 In-Reply-To: References: <4FCD0052.4010506@icsi.berkeley.edu> Message-ID: Just tried using -ltcl8.5. That worked for me too. Thanks for pointing that out Andreas. On Mon, Jun 4, 2012 at 2:51 PM, trinity s wrote: > Yes, I just figured out I was getting mixed up between two issues. > Sorry for the confusion. > I turned off Tcl support by adding NO_TCL=X following the SRILM build > FAQs. Hasn't broken anything for me so far. > > On Mon, Jun 4, 2012 at 2:37 PM, Andreas Stolcke > wrote: >> On 6/4/2012 9:34 AM, trinity s wrote: >>> >>> Hello >>> I came across this post about the error in LM.cc >>> http://www.speech.sri.com/pipermail/srilm-user/2012q1/001162.html >>> >>> I am facing the same issue but the fix suggested does not seem to work >>> for me. I continue to get the exact same error message. I am trying to >>> install SRILM 1.6 using cygwin 4.5.3 on Windows 7. >>> Here's the error message I see: >>> >>> /usr/lib/gcc/i686-pc-cygwin/4.5.3/../../../../i686-pc-cygwin/bin/ld: >>> cannot find -ltcl84 >>> collect2: ld returned 1 exit status >>> make[2]: *** [../bin/cygwin/maxalloc.exe] Error 1 >>> make[1]: *** [release-programs] Error 1 >>> make: *** [World] Error 2 >> >> >> The subject line of the message is misleading. Your problem has nothing to >> do with wait.h, but with setting up the proper Tcl installation. >> >> 1. Install tcl/tk within cygwin, Ubuntu, ..., whatever your distribution is. >> 2. Find what the name of the installed library is: look for >> /usr/lib/libtcl.a, /usr/lib/libtcl8.5.a, etc. >> 3. Assuming the library is called libtcl8.5, edit >> common/Makefile.machine.$MACHINE_TYPE to use >> >> TCL_LIBRARY = -ltcl8.5 >> >> Andreas >> From shammurchowdhury at gmail.com Tue Jun 5 01:37:25 2012 From: shammurchowdhury at gmail.com (Shammur Absar Chowdhury) Date: Tue, 5 Jun 2012 10:37:25 +0200 Subject: [SRILM User List] class based language model Message-ID: Hello I am new to srilm and at the same time I am recently learning about language modeling.
My aim was to build a class based language model with a given class definition. So far I have used the below 3 commands from http://www.speech.sri.com/pipermail/srilm-user/2010q1/000843.html

    1. ngram-class -vocab vocab.txt \
           -text LM.txt \
           -numclasses 16 \
           -classes classfile
    2. replace-words-with-classes classes=classfile LM.txt > Output_text_with_classes
    3. ngram-count -text Output_text_with_classes -lm Class_based_model

But as far as I can tell, the first command here induces the classes. Now what if I want srilm to use my assigned class tags and their word lists to make the class model; how will I do it? I mean, I tried formatting my class tags in the classes format and then running the second step, but in the format I am supposed to assign a probability, p, which I can't assign in my manually created class file. Could anyone please help me, give a direction, or suggest some reading for me. Thank you. Shammur Absar Chowdhury -------------- next part -------------- An HTML attachment was scrubbed... URL: From trinity.sneh at gmail.com Tue Jun 5 07:24:21 2012 From: trinity.sneh at gmail.com (trinity s) Date: Tue, 5 Jun 2012 10:24:21 -0400 Subject: [SRILM User List] About SRILM extensions Message-ID: Hello Is this the correct forum to discuss issues around SRILM extensions? I am running into problems trying to get the maxent package to work. Thanks. Regards, SJ From stolcke at icsi.berkeley.edu Tue Jun 5 09:31:22 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 05 Jun 2012 09:31:22 -0700 Subject: [SRILM User List] About SRILM extensions In-Reply-To: References: Message-ID: <4FCE345A.4040506@icsi.berkeley.edu> On 6/5/2012 7:24 AM, trinity s wrote: > Hello > > Is this the correct forum to discuss issues around SRILM extensions? > I am running into problems trying to get the maxent package to work. I think it's fine to discuss SRILM extensions, but there is no guarantee that the authors of the extension will see the discussion. You should also contact the author directly, and if you learn something valuable that way post it back to this list. Thanks Andreas From stolcke at icsi.berkeley.edu Tue Jun 5 15:26:49 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 05 Jun 2012 15:26:49 -0700 Subject: [SRILM User List] class based language model In-Reply-To: References: Message-ID: <4FCE87A9.4090909@icsi.berkeley.edu> You can build class-based LMs using your own class assignments. Step 2 works with a classfile with or without probabilities (the probs are optional in the format). For step 3, you need some probability distribution over the words to obtain a proper language model. For example, use the "uniform-classes" script to insert uniform probabilities for those class assignments that don't have any. If you have a large training set, you can run replace-words-with-classes classes=CLASSFILE addone=1 normalize=1 outfile=OUTPUT TEXTFILE to count the number of times each word occurs and estimate class expansion probabilities (written to OUTFILE).
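Putting those pieces together, the whole recipe might look like this sketch (file names are placeholders):

    # estimate class expansion probabilities from the training text
    replace-words-with-classes classes=my.classes addone=1 normalize=1 \
        outfile=expanded.classes train.txt > train_with_classes.txt

    # train the class N-gram on the class-replaced text
    ngram-count -text train_with_classes.txt -lm class.lm

    # apply it, supplying the class expansions at run time
    ngram -lm class.lm -classes expanded.classes -ppl test.txt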
Andreas On 6/5/2012 1:37 AM, Shammur Absar Chowdhury wrote: > Hello > > I am new to srilm and at the same time I am recently learning about > language modeling. My aim was to build a class based language model with > a given class definition. > > So far I have used the below 3 commands from > http://www.speech.sri.com/pipermail/srilm-user/2010q1/000843.html > > > 1. ngram-class -vocab vocab.txt \ > -text LM.txt \ > -numclasses 16 \ > -classes classfile > 2. replace-words-with-classes classes=classfile LM.txt > > Output_text_with_classes > 3. ngram-count -text Output_text_with_classes -lm Class_based_model > > > But as far as I can tell, the first command here induces the classes. Now > what if I want srilm to use my assigned class tags and their word > lists to make the class model; how will I do it? I mean, I tried > formatting my class tags in the classes format and then running the second > step, but in the format I am supposed to assign a probability, p, > which I can't assign in my manually created class file. > > Could anyone please help me, give a direction, or suggest some > reading for me. > Thank you. > > Shammur Absar Chowdhury > > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From shammurchowdhury at gmail.com Wed Jun 6 03:36:50 2012 From: shammurchowdhury at gmail.com (Shammur Absar Chowdhury) Date: Wed, 6 Jun 2012 12:36:50 +0200 Subject: [SRILM User List] class based language model In-Reply-To: <4FCE87A9.4090909@icsi.berkeley.edu> References: <4FCE87A9.4090909@icsi.berkeley.edu> Message-ID: Thank You sir for your help. I have actually another very silly question. After I get the probability distribution over words, I build another language model, and when I try to find a difference between my previous LM (where I used my class definition with no [p] value) and my recently created LM, I found no difference. I might have an understanding problem in basic theory [as I just read about it in books] or am I doing anything wrong in the steps. My recent steps that I am following: [1] replace-words-with-classes classes=atis_sphinx.def addone=1 normalize=1 outfile=countExpansion compound_LM.txt [2] replace-words-with-classes classes=countExpansion compound_LM.txt > output_text_with_classes [3] ngram-count -text output_text_with_classes classes=countExpansion -lm class_based_model_2.lm also tried ngram-count -text output_text_with_classes -lm class_based_model_2.lm Please do suggest where I am wrong. And really sorry for my basic question. Thank You On Wed, Jun 6, 2012 at 12:26 AM, Andreas Stolcke wrote: > You can build class-based LMs using your own class assignments. > > Step 2 works with a classfile with or without probabilities (the probs are > optional in the format). > > For step 3, you need some probability distribution over the words to > obtain a proper language model. > For example, use the "uniform-classes" script to insert uniform > probabilities for those class assignments that don't have any. > If you have a large training set, you can run > > replace-words-with-classes classes=CLASSFILE addone=1 normalize=1 > outfile=OUTPUT TEXTFILE > > to count the number of times each word occurs and estimate class expansion > probabilities (written to OUTFILE). > > Andreas > > > On 6/5/2012 1:37 AM, Shammur Absar Chowdhury wrote: > > Hello > > I am new to srilm and at the same time I am recently learning about > language modeling. My aim was to build a class based language model with a > given class definition. > > So far I have used the below 3 commands from > http://www.speech.sri.com/pipermail/srilm-user/2010q1/000843.html > > > 1. ngram-class -vocab vocab.txt \ > -text LM.txt \ > -numclasses 16 \ > -classes classfile > 2.
replace-words-with-classes classes=classfile LM.txt > > Output_text_with_classes > 3. ngram-count -text Output_text_with_classes -lm Class_based_model > > > But as far as I can tell, the first command here induces the classes. Now > what if I want srilm to use my assigned class tags and their word > lists to make the class model; how will I do it? I mean, I tried > formatting my class tags in the classes format and then running the second > step, but in the format I am supposed to assign a probability, p, > which I can't assign in my manually created class file. > > Could anyone please help me, give a direction, or suggest some > reading for me. > Thank you. > > Shammur Absar Chowdhury > > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user > > > -- Shammur Absar Chowdhury -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Wed Jun 6 14:27:08 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 06 Jun 2012 14:27:08 -0700 Subject: [SRILM User List] class based language model In-Reply-To: References: <4FCE87A9.4090909@icsi.berkeley.edu> Message-ID: <4FCFCB2C.60305@icsi.berkeley.edu> On 6/6/2012 3:36 AM, Shammur Absar Chowdhury wrote: > Thank You sir for your help. > > I have actually another very silly question. > After I get the probability distribution over words, I build another > language model, and when I try to find a difference between my previous > LM (where I used my class definition with no [p] value) and my > recently created LM, I found no difference. > > I might have an understanding problem in basic theory [as I just read > about it in books] or am I doing anything wrong in the steps.
>> >> My recent steps that I am following: >> >> [1] replace-words-with-classes classes=atis_sphinx.def addone=1 >> normalize=1 outfile=countExpansion compound_LM.txt >> >> [2] replace-words-with-classes classes=countExpansion compound_LM.txt > >> output_text_with_classes >> > Verify the output of these two steps. Do the class definitions and > modified text look okay? > > >> [3] ngram-count -text output_text_with_classes classes=countExpansion >> -lm class_based_model_2.lm >> >> also tried ngram-count -text output_text_with_classes -lm >> class_based_model_2.lm >> > The second form is correct. There is no need to specify the class > definitions with ngram-count. > > You should be able to use the final LM using > > ngram -lm class_based_model_2.lm -classes countExpansion (...other > options ...) > > Andreas > > -- Shammur Absar Chowdhury -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Wed Jun 13 11:33:38 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 13 Jun 2012 11:33:38 -0700 Subject: [SRILM User List] lattice-tool -timeout bugfix [was: need help with SRILM installation] In-Reply-To: References: <4FC92775.3050208@icsi.berkeley.edu> <4FCA4802.9060005@icsi.berkeley.edu> <4FCC1B60.5080004@icsi.berkeley.edu> <4FCCFD7B.5040505@icsi.berkeley.edu> <4FCF8520.7080702@icsi.berkeley.edu> Message-ID: <4FD8DD02.5050702@icsi.berkeley.edu> On 6/9/2012 3:46 AM, NEETI SONTH wrote: > > >> Hi. > > When I run the command 'lattice-tool -max-time 4 -write-ngrams > -in-lattice-list ' > the command limits the maximum time of operation for just the first > lattice file in the list of lattice files. For remaining > lattice-files, it doesn't limit the time of operation. The command says > "LIMITS THE MAXIMUM TIME OF OPERATION PER LATTICE" ... So why isn't it > doing so??? > > thanks. > Neeti > It seems that in Linux and compatible systems, unlike in Solaris, where the code was originally developed, the SIGALRM handler needs to use sigsetjmp/siglongjmp() instead of just plain setjmp/longjmp, or else subsequent alarms won't invoke the handler due to signal mask modification. The effect was that lattice-tool -timeout would work only for the first lattice triggering the timeout on Linux and Cygwin systems. The attached patch should fix the problem. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part --------------

    Index: lattice-tool.cc
    ===================================================================
    RCS file: /home/srilm/CVS/srilm/lattice/src/lattice-tool.cc,v
    retrieving revision 1.156
    diff -c -r1.156 lattice-tool.cc
    *** lattice-tool.cc    21 Apr 2011 06:12:49 -0000    1.156
    --- lattice-tool.cc    13 Jun 2012 18:17:25 -0000
    ***************
    *** 376,386 ****
      #endif
      typedef void (*sighandler_t)(_sigargs);

    ! static jmp_buf thisContext;

      void catchAlarm(int signal)
      {
    !     longjmp(thisContext, 1);
      }

      #endif /* !NO_TIMEOUT */
    --- 376,386 ----
      #endif
      typedef void (*sighandler_t)(_sigargs);

    ! static sigjmp_buf thisContext;

      void catchAlarm(int signal)
      {
    !     siglongjmp(thisContext, 1);
      }

      #endif /* !NO_TIMEOUT */
    ***************
    *** 484,490 ****
      #ifndef NO_TIMEOUT
          if (maxTime) {
              alarm(maxTime);
    !         if (setjmp(thisContext)) {
                  cerr << "WARNING: processing lattice " << inLat
                       << " aborted after " << maxTime << " seconds\n";
                  return;
    --- 484,490 ----
      #ifndef NO_TIMEOUT
          if (maxTime) {
              alarm(maxTime);
    !         if (sigsetjmp(thisContext, 1)) {
                  cerr << "WARNING: processing lattice " << inLat
                       << " aborted after " << maxTime << " seconds\n";
                  return;

From stolcke at icsi.berkeley.edu Thu Jun 14 09:21:37 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 14 Jun 2012 09:21:37 -0700 Subject: [SRILM User List] help with keyword spotting In-Reply-To: References: Message-ID: <4FDA0F91.1010003@icsi.berkeley.edu> The lattice-tool -write-ngram-index option was created for keyword (and keyphrase) spotting, but would typically be used with word-based lattices. However, you could write a phone ngram index (using a fairly high -order value) and then do approximate matching of your pronunciation against this index. The output format is described in the man page. In your case you might do better writing out a phone confusion network (lattice-tool -write-mesh) and then matching against that. Either way, you won't find a complete ready-made solution. You have to postprocess the lattice-tool output using an appropriate matching function. Andreas On 6/14/2012 2:55 AM, NEETI SONTH wrote: > Andreas > I want to do keyword-spotting in srilm. I have a lattice-file in > htk-format generated from a single sentence utterance. The > lattice-file is a phonetic lattice. Now I want to search/spot a word in > the lattice-file. I have the phonetic decomposition of the word I am > spotting for. Can you brief me on the necessary steps and srilm > commands for the same? > I tried using 'lattice-tool -read-htk -in-lattice -ppl > ' . > > sentence.file just contains the keyword I am spotting for. But what > I observed from one of your other user-mails is that -ppl only works > when the phonetic decomposition of the keyword exactly matches with > that of a path in the lattice. However, it is highly improbable that, > when I speak, the lattice file generated has a phonetic path which > exactly matches the phonetic decomposition of the keyword. > How then would we spot a keyword? > I also want to know how the "-write-ngram-index" option helps in > keyword spotting? > Please help. > thanks > > with regards, > Neeti Sonth > > > On Wed, Jun 13, 2012 at 11:33 AM, Andreas Stolcke > > wrote: > > On 6/9/2012 3:46 AM, NEETI SONTH wrote: >> >> >>> Hi. >> >> When I run the command 'lattice-tool -max-time 4 -write-ngrams >> -in-lattice-list ' >> the command limits the maximum time of operation for just the >> first lattice file in the list of lattice files. For remaining >> lattice-files, it doesn't limit the time of operation. The command >> says "LIMITS THE MAXIMUM TIME OF OPERATION PER LATTICE" ... So >> why isn't it doing so??? >> >> thanks. >> Neeti Sonth >> > > It seems that in Linux and compatible systems, unlike in Solaris, > where the code was originally developed, the SIGALRM handler needs > to use sigsetjmp/siglongjmp() instead of just plain > setjmp/longjmp, or else subsequent alarms won't invoke the handler > due to signal mask modification. The effect was that lattice-tool > -timeout would work only for the first lattice triggering the > timeout on Linux and Cygwin systems. > > The attached patch should fix the problem. > > Andreas > > > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From neetisonth at gmail.com Thu Jun 14 20:49:01 2012 From: neetisonth at gmail.com (NEETI SONTH) Date: Fri, 15 Jun 2012 09:19:01 +0530 Subject: [SRILM User List] help with keyword spotting In-Reply-To: <4FDA0F91.1010003@icsi.berkeley.edu> References: <4FDA0F91.1010003@icsi.berkeley.edu> Message-ID: Hi Andreas. As per your previous mail regarding the same issue, I tried using the 'lattice-tool -write-mesh' option for keyword spotting (my lattice file is a phonetic lattice, if you remember). However, this command works only if we have a reference file. I wrote the phonetic decomposition of the keyword in this reference file. As a result, the 'lattice-tool -in-lattice -ref-file -write-mesh' command generated a word confusion network giving information about the phonetic words it has aligned. However, it deletes those phones which have zero posterior probability during alignment. I require those phones as they are very much present in my keyword. How do I solve this issue??? Also, if I want to convert my mesh file into pfsg format, is the following command correct? 'wlat-to-pfsg mesh.file > pfsg.file' (in a linux environment) I mean, does this command generate the correct pfsg file? I want even zero probability phones present in this pfsg file. What do I do? On 6/14/12, Andreas Stolcke wrote: > > The lattice-tool -write-ngram-index option was created for keyword (and > keyphrase) spotting, but would typically be used with word-based > lattices. However, you could write a phone ngram index (using a fairly > high -order value) and then do approximate matching of your > pronunciation against this index. The output format is described in > the man page. > > In your case you might do better writing out a phone confusion network > (lattice-tool -write-mesh) and then matching against that. > > Either way, you won't find a complete ready-made solution. You have to > postprocess the lattice-tool output using an appropriate matching function. > > Andreas > > On 6/14/2012 2:55 AM, NEETI SONTH wrote: >> Andreas >> I want to do keyword-spotting in srilm. I have a lattice-file in >> htk-format generated from a single sentence utterance. The >> lattice-file is a phonetic lattice. Now I want to search/spot a word in >> the lattice-file. I have the phonetic decomposition of the word I am >> spotting for. Can you brief me on the necessary steps and srilm >> commands for the same? >> I tried using 'lattice-tool -read-htk -in-lattice -ppl >> ' . >> >> sentence.file just contains the keyword I am spotting for. But what >> I observed from one of your other user-mails is that -ppl only works >> when the phonetic decomposition of the keyword exactly matches with >> that of a path in the lattice. However, it is highly improbable that, >> when I speak, the lattice file generated has a phonetic path which >> exactly matches the phonetic decomposition of the keyword. >> How then would we spot a keyword? >> I also want to know how the "-write-ngram-index" option helps in >> keyword spotting? >> Please help. >> thanks >> >> with regards, >> Neeti Sonth >> >> >> On Wed, Jun 13, 2012 at 11:33 AM, Andreas Stolcke >> > wrote: >> >> On 6/9/2012 3:46 AM, NEETI SONTH wrote: >>> >>> >>>> Hi. >>> >>> When I run the command 'lattice-tool -max-time 4 -write-ngrams >>> -in-lattice-list ' >>> the command limits the maximum time of operation for just the >>> first lattice file in the list of lattice files. For remaining >>> lattice-files, it doesn't limit the time of operation.
>>>> The command
>>>> description says "LIMITS THE MAXIMUM TIME OF OPERATION PER LATTICE",
>>>> so why isn't it doing so?
>>>>
>>>> thanks.
>>>> Neeti Sonth
>>>
>>> It seems that in Linux and compatible systems, unlike in Solaris,
>>> where the code was originally developed, the SIGALRM handler needs
>>> to use sigsetjmp()/siglongjmp() instead of plain setjmp()/longjmp(),
>>> or else subsequent alarms won't invoke the handler, due to signal
>>> mask modification. The effect was that lattice-tool -max-time would
>>> work only for the first lattice triggering the timeout on Linux and
>>> Cygwin systems.
>>>
>>> The attached patch should fix the problem.
>>>
>>> Andreas

From stolcke at icsi.berkeley.edu  Fri Jun 15 15:16:57 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 15 Jun 2012 15:16:57 -0700
Subject: [SRILM User List] help with keyword spotting
In-Reply-To:
References: <4FDA0F91.1010003@icsi.berkeley.edu>
Message-ID: <4FDBB459.2090801@icsi.berkeley.edu>

On 6/14/2012 8:49 PM, NEETI SONTH wrote:
> Hi Andreas.
> As per your previous mail regarding the same issue, I tried using the
> 'lattice-tool -write-mesh' option for keyword spotting (my lattice file
> is a phonetic lattice, if you remember). However, this command works only
> if we have a reference file. I wrote the phonetic decomposition of the
> keyword in this reference file.
> As a result, the 'lattice-tool -in-lattice ... -ref-file ... -write-mesh'
> command generated a word confusion network giving information about the
> phonetic words it has aligned. However, it deletes those phones which
> have zero posterior probability during alignment. I require those phones,
> as they are very much present in my keyword. How do I solve this issue?

The missing phones won't prevent your reference words from aligning. A
non-present phone and one that has posterior probability = 0 are
equivalent to the alignment algorithm. You simply need to take "deleted"
phones into account when computing a matching score between your target
word and the lattice.

I wasn't actually suggesting that you use the reference alignment
mechanism (the -ref-file option) to match target words to confusion
networks, although that is not a bad idea. The problem I see with this
approach is that the alignment cost function has no bias toward keeping
the phones of the reference string "together". The target (= reference)
word phones are typically going to cover only a subset of the entire
utterance, and the alignment won't prefer to put the necessary deletions
(= the portions of the utterance that are outside the target word
instance) all before or after the reference, rather than somewhere inside
the target phone sequence.

What I did have in mind is that you yourself write a postprocessing
function that looks for all possible positions in the confusion network
and evaluates the match of the target phone string at each one (see the
sketch at the end of this message). Since the CN has a simple linear
structure, that should not be hard.

> Also, if I want to convert my mesh file into PFSG format, is the
> following command correct?
> 'wlat-to-pfsg mesh.file > pfsg.file' (in a Linux environment)
> I mean, does this command generate the correct PFSG file? I want even
> zero-probability phones present in this PFSG file. What do I do?

The conversion should work as you suggest. I'm not sure what you hope to
gain from it, though.
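As a starting point, such a matcher could look like the following (a
minimal sketch under simplifying assumptions, not SRILM code: the CN is
taken to be a plain sequence of slots mapping phones to posteriors,
null/epsilon hypotheses and insertions are ignored, and a phone that is
absent from a slot gets a fixed log-score floor, playing the role of the
"deleted phone" penalty discussed above):

    #include <cmath>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    typedef std::map<std::string, double> Slot;   // phone -> posterior

    // Score the target placed at CN position `start`: sum of log posteriors,
    // with a floor for phones that are absent (equivalent to posterior 0).
    double matchAt(const std::vector<Slot> &cn,
                   const std::vector<std::string> &target,
                   size_t start, double logFloor = -10.0)
    {
        double score = 0.0;
        for (size_t i = 0; i < target.size(); i++) {
            Slot::const_iterator it = cn[start + i].find(target[i]);
            score += (it == cn[start + i].end() || it->second <= 0.0)
                         ? logFloor : std::log(it->second);
        }
        return score;
    }

    int main()
    {
        // Toy CN for an utterance; phones and posteriors are made up.
        std::vector<Slot> cn(4);
        cn[0]["sil"] = 0.9; cn[0]["k"] = 0.1;
        cn[1]["k"]   = 0.7; cn[1]["g"] = 0.3;
        cn[2]["ae"]  = 0.8; cn[2]["eh"] = 0.2;
        cn[3]["t"]   = 0.6; cn[3]["d"] = 0.4;

        std::vector<std::string> target;   // keyword "cat" = k ae t
        target.push_back("k");
        target.push_back("ae");
        target.push_back("t");

        // Scan all possible start positions and keep the best match.
        double best = -1e30;
        size_t bestPos = 0;
        for (size_t s = 0; s + target.size() <= cn.size(); s++) {
            double sc = matchAt(cn, target, s);
            if (sc > best) { best = sc; bestPos = s; }
        }
        std::cout << "best match at slot " << bestPos
                  << " with log score " << best << "\n";
        return 0;
    }

A real implementation would likely also need to handle the null
hypotheses that lattice-tool writes into CN slots, and could convert the
best log score into a detection threshold for the keyword.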
Andreas

From gregor.donaj at uni-mb.si  Fri Jun 22 02:50:35 2012
From: gregor.donaj at uni-mb.si (Gregor Donaj)
Date: Fri, 22 Jun 2012 11:50:35 +0200
Subject: [SRILM User List] rescoring with fngram and separate probabilities
Message-ID: <4FE43FEB.404@uni-mb.si>

Hi,

I have two questions about the fngram tool. I used it to re-score n-best
lists of factored sentences. I took a look at the man pages, but I
couldn't find my answers.
1)
After taking a close look at the probabilities, I realized that the
scores already seem to be weighted by a factor of 8. Is there any option
to change this factor? How about the ngram tool?
2)
Can fngram also write the original language model score to the output? I
mean, not just replace the original language score with the re-scored
probability, but write both to the output file?

Thanks

-- 
Gregor Donaj, univ. dipl. inž. el., univ. dipl. mat.
Laboratorij za digitalno procesiranje signalov
Fakulteta za elektrotehniko, računalništvo in informatiko
Smetanova ulica 17, 2000 Maribor
Tel.: 02/220 72 05
E-mail: gregor.donaj at uni-mb.si

From stolcke at icsi.berkeley.edu  Fri Jun 22 20:17:40 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sat, 23 Jun 2012 11:17:40 +0800
Subject: [SRILM User List] rescoring with fngram and separate probabilities
In-Reply-To: <4FE43FEB.404@uni-mb.si>
References: <4FE43FEB.404@uni-mb.si>
Message-ID: <4FE53554.6070905@icsi.berkeley.edu>

On 6/22/2012 5:50 PM, Gregor Donaj wrote:
> Hi,
>
> I have two questions about the fngram tool. I used it to re-score
> n-best lists of factored sentences. I took a look at the man pages,
> but I couldn't find my answers.
> 1)
> After taking a close look at the probabilities, I realized that the
> scores already seem to be weighted by a factor of 8. Is there any
> option to change this factor? How about the ngram tool?

I would not use fngram for nbest rescoring, lack of documentation being
one problem. Also, this program has not been updated in a while.

The better approach is to use fngram-count to train FLMs, but then use
ngram -factored to apply the LM to data. So you would use ngram -factored
with -nbest, -nbest-files, or -rescore (see the ngram(1) man page).

> 2)
> Can fngram also write the original language model score to the output?
> I mean, not just replace the original language score with the re-scored
> probability, but write both to the output file?

Well, you can always save the original nbest lists and use their LM
scores as an additional input to the score combination.

Using more than the standard three scores (AM, LM, and word count)
requires extra work, some of which is supported by the wrapper scripts
described in the nbest-scripts(5) man page. The typical way to do this
would be (a command-level sketch follows the list):

1) Use the rescore-decipher wrapper script with the -lm-only option (in
addition to -factored -lm ...) to produce score files that contain only
the FLM scores.

2) Use nbest-optimize (on a held-out tuning set) to determine the optimal
score weightings (see the man page).

3) Use rescore-reweight to combine all scores and output new 1-best
hypotheses.
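For concreteness, the two routes might look like the following (every
file and directory name here is a placeholder, and the exact argument
layout of the wrapper scripts is an assumption; check the ngram(1) and
nbest-scripts(5) man pages before use):

    # direct FLM rescoring: replace the LM scores inside the nbest lists
    ngram -factored -lm my.flmspec -rescore my.nbest > my.nbest.rescored

    # separate-score route: write FLM scores into their own score
    # directory, tune weights on held-out data, then re-rank
    rescore-decipher -lm-only -factored -lm my.flmspec nbest.list flm-scores
    nbest-optimize -nbest-files tune.nbest-list -refs tune.refs flm-scores
    rescore-reweight weights.file nbest.list flm-scores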
Andreas