From stolcke at speech.sri.com Wed Jan 2 09:49:07 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 02 Jan 2008 09:49:07 PST
Subject: SRILM 1.5.6 released
Message-ID: <200801021749.m02Hn7B14128@huge>

Happy New Year!

The latest version of SRILM is available from
http://www.speech.sri.com/projects/srilm/download.html .
This release features much enhanced support for server-based LMs and
improved documentation.

Enjoy,

Andreas

-----------------------------------------------------------------------------

1.5.6   2 January 2008

Functionality:

* New ngram -use-server option to run the client side of a network LM
  server as implemented by ngram -server-port. Optionally, probabilities
  may be cached in the client (option -cache-served-ngrams). Mixtures of
  one or more network and file-based LMs are also possible.
* Likewise, disambig, hidden-gram, and lattice-tool understand the
  -use-server option.
* New LMClient class to implement the above (a stub LM subclass that
  queries a server for LM probabilities).
* ngram -server-port now behaves like a true server daemon: it handles
  multiple simultaneous or sequential clients, and never exits (unless
  killed). The number of simultaneous clients may be limited with the
  -server-maxclients option.
* Support for 7-zip compressed files (suggested by Alexy Khrabrov).
* lattice-tool -split-multiwords will now print a warning message about
  multiwords that were not split because their LM probability was
  non-zero.
* LoglinearMix LM class supports n-way mixtures directly, giving a more
  efficient implementation for n > 2 than recursive object construction
  in ngram (contributed by Tanel Alumae).

Bug fixes:

* MultiwordLM now implicitly adds all words to the vocabulary, so that
  previously unseen multiwords get split. This has the side effect that
  OOVs will appear as zeroprob words.

Documentation:

* The doc/FAQ file has been expanded and reformatted as a man page.
  It can be viewed with "man srilm-faq" or online at
  http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.html .
  The major content additions are questions about the build process,
  how to build a "Google N-gram LM", smoothing issues, and OOV-handling
  (the latter by Deniz Yuret). Corrections and additions to this
  document are most welcome!
* A new manual page ngram-discount(7) gives a detailed overview of
  smoothing methods found in SRILM (contributed by Deniz Yuret).
* The conversion of man pages to html has been enhanced to better handle
  code samples and nested itemized lists.

From gelbart at icsi.berkeley.edu Mon Jan 14 17:23:42 2008
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Mon, 14 Jan 2008 17:23:42 -0800 (PST)
Subject: SRILM BOW denominator warning
Message-ID:

Hello,

I am trying to build a trigram LM for the OGI Numbers corpus, in which
utterances are spoken strings of numbers such as 'eighty nine eighty
eight'. Since there are no singletons, I am using Witten-Bell
discounting instead of Good-Turing. ngram-count displays "BOW
denominator for context... is zero" warnings. Does this mean the LM
is broken? If I try adding "-gt3min 1 -gt2min 1" to the ngram-count
options, I still see these warnings.
Here is the ngram-count output:

$ ngram-count -wbdiscount -text /u/gelbart/tmp/train.trans -order 3 \
    -lm /u/gelbart/tmp/numbers-wb.lm
BOW denominator for context "seven" is zero; scaling probabilities to sum to 1
BOW denominator for context "six" is zero; scaling probabilities to sum to 1
BOW denominator for context "four" is zero; scaling probabilities to sum to 1
BOW denominator for context "two" is zero; scaling probabilities to sum to 1

In the generated language model, the log BOWs are zero for those four
words:

-1.156247 four 0
-1.09725 seven 0
-1.203041 six 0
-1.029482 two 0

Thanks,
David

From stolcke at speech.sri.com Thu Jan 17 18:25:11 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 17 Jan 2008 18:25:11 -0800
Subject: SRILM BOW denominator warning
In-Reply-To:
References:
Message-ID: <47900E07.8000904@speech.sri.com>

David Gelbart wrote:
> Hello,
>
> I am trying to build a trigram LM for the OGI Numbers corpus, in which
> utterances are spoken strings of numbers such as 'eighty nine eighty
> eight'. Since there are no singletons, I am using Witten-Bell
> discounting instead of Good-Turing. ngram-count displays "BOW
> denominator for context... is zero" warnings. Does this mean the LM
> is broken? If I try adding "-gt3min 1 -gt2min 1" to the ngram-count
> options, I still see these warnings.
> Here is the ngram-count output:
>
> $ ngram-count -wbdiscount -text /u/gelbart/tmp/train.trans -order 3 \
>     -lm /u/gelbart/tmp/numbers-wb.lm
> BOW denominator for context "seven" is zero; scaling probabilities to
> sum to 1
> BOW denominator for context "six" is zero; scaling probabilities to
> sum to 1
> BOW denominator for context "four" is zero; scaling probabilities to
> sum to 1
> BOW denominator for context "two" is zero; scaling probabilities to
> sum to 1
>
> In the generated language model, the log BOWs are zero for those four
> words:
>
> -1.156247 four 0
> -1.09725 seven 0
> -1.203041 six 0
> -1.029482 two 0
>

This happens when you have a small vocabulary and all words are observed
in a given context, so there is no backoff mass to distribute over
unseen words. There is no need to do anything; the LM will work just
fine.

This should probably be included in the FAQ for smoothing issues.

Andreas

> Thanks,
> David

From sai_tang_huang at hotmail.com Sun Jan 20 18:37:51 2008
From: sai_tang_huang at hotmail.com (SAI TANG HUANG)
Date: Mon, 21 Jan 2008 03:37:51 +0100
Subject: new to srilm, first ngram-count with discount coeff out of range warning
Message-ID:

Hi SRILM team,

My name is Sai Tang and I am a student at the University of Brighton.
My project involves using SRILM to create language models as well as
calculating entropies.

I have just managed to get my first LM file after running ngram-count on
a text file as follows:

sai at uk-notebook:~/Desktop$ ngram-count -order 2 -text Merged_File.txt -lm file

Here "file" is my LM file. The file was created, but I got a warning
message during the ngram-count run:

warning: discount coeff 1 is out of range: -2.26158e-17

My experience in NLP is limited, and this is also the first time I have
used SRILM. I would appreciate it a lot if someone could help me with
this.

Also, is this the address I have to write to in order to post to the
mailing list?
Kind regards,

Sai

From sai_tang_huang at hotmail.com Sun Jan 20 18:41:45 2008
From: sai_tang_huang at hotmail.com (SAI TANG HUANG)
Date: Mon, 21 Jan 2008 03:41:45 +0100
Subject: how to use ngram -counts-entropy
Message-ID:

Hiya,

My name is Sai and I'm trying to get the entropy of my LM. I have
managed to run ngram -counts successfully, but when I type

ngram -lm myLMfile -counts-entropy

nothing happens.

I have read the ngram manual, but I can't work out why this command
isn't working.

Thanks,

Sai

From sai_tang_huang at hotmail.com Mon Jan 21 04:59:49 2008
From: sai_tang_huang at hotmail.com (SAI TANG HUANG)
Date: Mon, 21 Jan 2008 13:59:49 +0100
Subject: can't get right counts-entropy
Message-ID:

Hi,

I have created a counts file and a back-off LM file from a text file of
sentences with the following command:

sai at uk-notebook:~/Desktop$ ngram-count -text Merged_File.txt -lm lm_file -write count_file

Then I ran the ngram program with -counts. Here is the output:

sai at uk-notebook:~/Desktop$ ngram -lm lm_file -counts count_file
file count_file: 23640 sentences, 460074 words, 0 OOVs
7880 zeroprobs, logprob= -1.03103e+06 ppl= 146.821 ppl1= 190.575
sai at uk-notebook:~/Desktop$

I fail to understand the output. I read that the -counts option does
something with a counts file (that would be my count_file). I don't
understand why there are 7880 zeroprobs.
When I run ngram with -ppl I get:

sai at uk-notebook:~/Desktop$ ngram -lm lm_file -debug 0 -ppl Merged_File.txt
file Merged_File.txt: 7880 sentences, 153358 words, 0 OOVs
0 zeroprobs, logprob= -270778 ppl= 47.7932 ppl1= 58.2985
sai at uk-notebook:~/Desktop$

Why does -ppl yield 0 zeroprobs while -counts gives me 7880 zeroprobs?
Also, why are the ppl and ppl1 values different from those given by
-ppl?

If there is a more detailed manual or document describing these values
then I'm willing to read it.

Thanks a lot,

Sai

From stolcke at speech.sri.com Tue Jan 22 15:36:32 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 23 Jan 2008 01:36:32 +0200
Subject: can't get right counts-entropy
In-Reply-To:
References:
Message-ID: <47967E00.9080500@speech.sri.com>

SAI TANG HUANG wrote:
> Hi,
>
> I have created a counts file and a back-off LM file from a text file
> of sentences with the following command:
>
> sai at uk-notebook:~/Desktop$ ngram-count -text Merged_File.txt -lm lm_file -write count_file
>
> Then I ran the ngram program with -counts. Here is the output:
>
> sai at uk-notebook:~/Desktop$ ngram -lm lm_file -counts count_file
> file count_file: 23640 sentences, 460074 words, 0 OOVs
> 7880 zeroprobs, logprob= -1.03103e+06 ppl= 146.821 ppl1= 190.575
> sai at uk-notebook:~/Desktop$
>
> I fail to understand the output. I read that the -counts option does
> something with a counts file (that would be my count_file). I don't
> understand why there are 7880 zeroprobs. When I run ngram with -ppl
> I get:
>

The 7880 zeroprobs are probably due to the <s> tokens output by the
ngram-count program. You cannot use the ngram-count output directly as
input to ngram -counts. See below.
> sai at uk-notebook:~/Desktop$ ngram -lm lm_file -debug 0 -ppl Merged_File.txt
> file Merged_File.txt: 7880 sentences, 153358 words, 0 OOVs
> 0 zeroprobs, logprob= -270778 ppl= 47.7932 ppl1= 58.2985
> sai at uk-notebook:~/Desktop$
>
> Why does -ppl yield 0 zeroprobs while -counts gives me 7880 zeroprobs?
> Also, why are the ppl and ppl1 values different from those given by
> -ppl?
>
> If there is a more detailed manual or document describing these values
> then I'm willing to read it.
>

This is not yet well documented. To use ngram -counts correctly you
must only feed those N-grams that correspond to "events" in the LM, not
those that only appear as "context". That means you need to filter the
ngram-count output and retain only ngrams that

- are of the highest order (e.g., trigrams for a trigram LM), OR
- start with <s> (but not the <s> unigram, see above).

For example, the sentence "a b c" in conjunction with a trigram LM
should generate only the ngrams

<s> a
<s> a b
a b c
b c </s>

You can do this filtering with a small perl or gawk script.

Sounds like another topic for the FAQ file.

Andreas

> Thanks a lot,
>
> Sai

From amantrac at ulb.ac.be Wed Jan 23 09:20:40 2008
From: amantrac at ulb.ac.be (Amin Mantrach)
Date: Wed, 23 Jan 2008 18:20:40 +0100
Subject: From logproba on sentences to logproba on words
Message-ID:

Hi,

I wonder whether it is possible to initialize an LM directly from log
probabilities of sentences, rather than from n-gram counts or a text
file. If so, which command should I use? (I don't see in the help how
to do that with ngram or ngram-count.)

Alternatively, how can I obtain log probabilities for n-grams, given the
log probabilities of all the sentences in a set of documents?

Thanks a lot.
From stolcke at speech.sri.com Mon Jan 28 14:44:41 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 28 Jan 2008 14:44:41 -0800
Subject: can't get right counts-entropy
In-Reply-To:
References: <47967E00.9080500@speech.sri.com>
Message-ID: <479E5AD9.5060802@speech.sri.com>

SAI TANG HUANG wrote:
> Hi,
>
> Thanks for your answer. However I'm still a bit confused. I ran the
> abc example you used in your last email and I got a count file of:
>
> sai at uk-notebook:~/Desktop$ ngram-count -text abc.txt
> 1
> a 1
> a b 1
> a 1
> a b 1
> a b c 1
> b 1
> b c 1
> b c 1
> c 1
> c 1
> 1
>
> According to you, I should only keep
>
> a
> a b
> a b c
> b c
>
> because the rest of them are "context". I understand that all the
> unigrams are contexts, but why do I keep " a" and chuck away all the
> other bigrams?
>
> Also, in what context do you mean by "context" and "event"? In
> P(a|b) = P(a,b)/P(b).
>
> Is P(b) the context and P(a,b) the event then?
>

It's hard to make sense of this because somehow the <s> and </s> tags
got deleted in the mail. I will therefore use SB for "sentence begin"
and SE for "sentence end".

I will try to explain better which ngrams are relevant for computing
perplexity.

By "events" I mean the tokens that the LM predicts. Those are all the
tokens except the SB:

a
b
c
SE

Now, you add the context for each of those tokens, as used by the LM to
condition the predictions. These are the preceding two tokens, except
in those cases where you are near the beginning of the sentence, since
you cannot go beyond the SB token. So the contexts together with the
predicted tokens are

SB a
SB a b
a b c
b c SE

Those are the ngrams that you need to feed to ngram -counts to get a
valid perplexity.

I hope this makes more sense now.
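
The selection rule described above (keep n-grams of the highest order,
plus shorter ones that start with the sentence-begin token, but not the
sentence-begin unigram itself) can be sketched in a few lines of Python.
This is only an illustration and not part of SRILM: the function name
filter_event_ngrams is made up, and it assumes the usual ngram-count
output layout of whitespace-separated words followed by a count, with
<s> as the sentence-begin tag.

```python
def filter_event_ngrams(count_lines, order, bos="<s>"):
    """Keep only count lines whose n-gram is a predicted "event".

    Assumes each line is "w1 w2 ... wn <count>" (assumed ngram-count
    layout). An n-gram is kept if it has the full order, or if it is
    shorter but starts with the sentence-begin token (excluding the
    sentence-begin unigram).
    """
    kept = []
    for line in count_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        words = fields[:-1]  # the last field is the count
        if len(words) == order:
            kept.append(line)
        elif words[0] == bos and len(words) > 1:
            kept.append(line)
    return kept


# Counts for the single sentence "a b c", as a trigram count file would list them.
counts = [
    "<s>\t1", "<s> a\t1", "<s> a b\t1",
    "a\t1", "a b\t1", "a b c\t1",
    "b\t1", "b c\t1", "b c </s>\t1",
    "c\t1", "c </s>\t1", "</s>\t1",
]
events = filter_event_ngrams(counts, order=3)
for line in events:
    print(line)
```

For the "a b c" example this keeps exactly the four n-grams listed
above, with <s> and </s> in place of SB and SE.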

Andreas

> Thanks a lot,
>
> Sai

From amantrac at ulb.ac.be Tue Jan 29 10:35:25 2008
From: amantrac at ulb.ac.be (Amin Mantrach)
Date: Tue, 29 Jan 2008 19:35:25 +0100
Subject: Fw: From logproba on sentences to logproba on words
In-Reply-To:
References:
Message-ID: <29269D3B-A655-4ACA-B160-6A805F5AF58E@ulb.ac.be>

Apparently my question has not received an answer, so I'll reformulate
it to be clearer.

I want to create an LM with a command like

# ngram-count -text textfile -lm lmfile

In my case I have the log probability of occurrence of every sentence,
the same kind of value that you can obtain from

# ngram -lm lm_file -debug 1 -ppl testfile

What do I want to do? Create a new LM file built from the sentence
probabilities I have.

Current ideas:

1/ Produce a text file with the sentences. Each sentence can appear in
the file multiple times; in fact it appears exactly n times, where
n = exp(log-proba of the sentence) * 1000 (rounded to an integer).
And then simply:

ngram-count -text newtextsentences -lm new_lm

2/ Produce a count file (with only the counts needed, i.e. of the
highest order, etc.) and, for each n-gram, multiply the number of
occurrences by the sum of the probabilities of the sentences it belongs
to. This method is clearly not fair.

Can you tell me whether either of those ideas is correct? If not, how
should I proceed?

I hope the question is now clear enough.

Thanks a lot for your help.
Amin.

From amantrac at ulb.ac.be Wed Jan 30 04:49:32 2008
From: amantrac at ulb.ac.be (Amin Mantrach)
Date: Wed, 30 Jan 2008 13:49:32 +0100
Subject: From logproba on sentences to logproba on words
In-Reply-To:
References: <29269D3B-A655-4ACA-B160-6A805F5AF58E@ulb.ac.be>
Message-ID:

On 30 Jan 2008, at 09:32, Marcello Federico wrote:

> I will try to answer. You basically want to generate a corpus from
> some prior statistics over sentences, then estimate an n-gram LM
> over such corpus. Correct?
>

Yes, I want to estimate or re-estimate an n-gram LM over that corpus
with prior probabilities on sentences.

> I do not see anything wrong with that, but you have to keep in mind
> that:
>
> - your corpus might not match the typical properties of real life
>   texts
>   (namely the distribution of ngrams could be very different);
>

So repeating each sentence in the corpus a number of times proportional
to its prior probability does not preserve a correct distribution over
the n-grams.

> - you might not be able to apply all smoothing methods, such as
>   Kneser-Ney, just because your corpus will not generate proper
>   statistics of 'rare' ngrams, for the reasons stated above.
>
> Again, I do not see anything wrong with that, just pay attention to
> the smoothing method you use. My suggestion would be to use a
> simple technique like Witten-Bell smoothing.
>
>
> Greetings,
> Marcello

Thanks for your answer.

From amantrac at ulb.ac.be Wed Jan 30 06:45:09 2008
From: amantrac at ulb.ac.be (Amin Mantrach)
Date: Wed, 30 Jan 2008 15:45:09 +0100
Subject: From logproba on sentences to logproba on words
In-Reply-To:
References:
Message-ID:

Thanks Eric for your response.

The problem with doing that is that it assumes the probability is
distributed equally over all n-grams of a sentence. Adding 1 for a
unigram and 1 for a bigram means that the two n-grams contribute equally
to the sentence probability, which is not true.

Maybe I should first compute the probability of each word.

1/ ngram-count -text corpus.txt -lm wordmodel.lm
2/ ngram -lm wordmodel.lm -ppl corpus.txt -debug 2

That way I obtain the log probability of each word of the corpus
(without taking OOV words into account).

Then, to take the prior probabilities of the sentences into account, I
would simply multiply each word's probability by the sum of the
probabilities of the sentences it appears in.

Do you agree with that idea?

On 29 Jan 2008, at 20:37, Joanis, Eric wrote:

> Dear Amin,
>
> I would use a variant of 2): produce a count file, and *replace* the
> counts by the sum of probabilities of the sentences where a given
> n-gram occurs.
>
> The default way to count adds 1 for each occurrence, which makes sense
> when the distribution is assumed to be uniform over the observed data.
> With your data, you can replace these 1's by the actual probability
> figures you have. You may have to worry about underflow issues when
> tallying small numbers, but otherwise the process should be simple
> enough. You may also need to renormalize all the counts so that the
> smallest count is equal to 1, depending on which discounting scheme
> you use. Not all discounting methods take float counts, so rounding
> may also be necessary.
>
> By the way, with your modified definition of the problem, I would
> probably write my own program to build the count file, and then invoke
> the SRILM utilities afterwards for building the LM from the counts.
>
> Cheers,
>
> Eric
>
> ____________________________________________________
> Eric Joanis
> CNRC - ITI - GTLI | NRC - IIT - ILT

From stolcke at speech.sri.com Wed Jan 30 23:11:44 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 30 Jan 2008 23:11:44 -0800
Subject: Another detail about the ngram-count
In-Reply-To:
References:
Message-ID: <47A174B0.8060504@speech.sri.com>

Sai Tang Huang wrote:
> Hi Andreas,
>
> Another detail probably worth mentioning is that when I run
> ngram-count to get the counts and create the LM I get a coeff out of
> range warning:
>
> warning: discount coeff 1 is out of range: -3.33329e-17
>
> I read that this was a bug somewhere in the mailing list archive.

It's not a bug (there was a bug related to this message back in 2003,
but it's long fixed). What it means is that your corpus statistics are
such that Good-Turing discounting is not applicable, specifically,
leading to a discounting factor that is effectively 0.
The effect is that discounting is disabled for this order of n-gram.
For reasons and countermeasures please check the FAQ man page or web
page.

>
> Could this be affecting the ngram -counts?

Only indirectly, in that the LM will be suboptimal.

Andreas

From bplank at science.uva.nl Sat Mar 1 13:08:19 2008
From: bplank at science.uva.nl (B. Plank)
Date: Sat, 1 Mar 2008 22:08:19 +0100 (CET)
Subject: problems compiling
Message-ID: <2124.82.73.146.219.1204405699.squirrel@webmail.science.uva.nl>

Hi to all,

sorry, I have a small question. When trying to install/compile SRILM
1.5.6 I keep having problems compiling the lm files (I cannot compile
ngram, ngram-count, etc.). In more detail, this is what I get when
running "make World":

ERROR: File to be installed (../bin/i686/fngram-count) does not exist.
ERROR: File to be installed (../bin/i686/fngram-count) is not a plain file.
Usage: decipher-install <mode> <file1> ... <fileN> <directory>
mode: file permission mode, in octal
file1 ... fileN: files to be installed
directory: where the files should be installed

I found an older posting suggesting that the TCL library was the cause.
I have now tried both 1) leaving TCL_INCLUDE and TCL_LIBRARY empty and
setting NO_TCL=X, and 2) setting the include and library to

TCL_INCLUDE = -I/usr/include/tcl8.4
TCL_LIBRARY = -L/usr/lib/tcl8.4 -ltcl8.4

I'm running debian linux (i686). I also tried compiling the files in
the subdirectory itself, but then I get undefined reference errors.

Thanks in advance,
Barbara

From stolcke at speech.sri.com Sun Mar 2 23:14:38 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 02 Mar 2008 23:14:38 -0800
Subject: problems compiling
In-Reply-To: <2124.82.73.146.219.1204405699.squirrel@webmail.science.uva.nl>
References: <2124.82.73.146.219.1204405699.squirrel@webmail.science.uva.nl>
Message-ID: <47CBA55E.5000103@speech.sri.com>

B. Plank wrote:
> Hi to all,
>
> sorry, I have a small question.
> When trying to install/compile SRILM 1.5.6 I keep having problems
> compiling the lm files (I cannot compile ngram, ngram-count, etc.).
> In more detail, this is what I get when running "make World":
>
> ERROR: File to be installed (../bin/i686/fngram-count) does not exist.
> ERROR: File to be installed (../bin/i686/fngram-count) is not a plain file.
> Usage: decipher-install <mode> <file1> ... <fileN> <directory>
> mode: file permission mode, in octal
> file1 ... fileN: files to be installed
> directory: where the files should be installed
>
> I found an older posting suggesting that the TCL library was the
> cause. I have now tried both 1) leaving TCL_INCLUDE and TCL_LIBRARY
> empty and setting NO_TCL=X, and 2) setting the include and library to
>
> TCL_INCLUDE = -I/usr/include/tcl8.4
> TCL_LIBRARY = -L/usr/lib/tcl8.4 -ltcl8.4
>
> I'm running debian linux (i686). I also tried compiling the files in
> the subdirectory itself, but then I get undefined reference errors.
>

You need to send me the output of the make command. There is no way of
telling what went wrong otherwise.

Andreas

From sopheap.seng at gmail.com Mon Mar 3 04:54:15 2008
From: sopheap.seng at gmail.com (Sopheap SENG)
Date: Mon, 3 Mar 2008 13:54:15 +0100
Subject: Rescore HTK lattice
Message-ID: <3b7711ea0803030454j46fce354r9816ac5b538cac13@mail.gmail.com>

Hello,

I need HTK lattices in my experiments, but the sphinx3 decoder I use
cannot generate HTK lattices. So I have to convert sphinx lattices to
HTK lattices.

My problem is that the lattice generated by the sphinx3 decoder provides
only the acoustic score of word transitions; I did not find an option to
obtain the lmscore in the sphinx lattice.

In order to obtain an HTK lattice with lmscore, I first converted the
sphinx lattice to HTK SLF lattice format (I added l=0 as lmscore; the
acoustic score is kept as it is).

Then I used lattice-tool (Srilm V 1.5.2) to rescore the lattice with an
LM:

> lattice-tool -in-lattice in.slf -read-htk -lm LM.BO \
    -htk-lmscale 9.5 -htk-wdpenalty 0.7 -htk-logbase 1.0003 \
    -out-lattice out.slf -write-htk

(the lmscale, wdpenalty and logbase are the values that I used during
lattice generation with sphinx3; the LM is the same as in sphinx3)

In the output I obtained a lattice with acoustic scores and new
lmscores. What I observed is that the acoustic score in the output
lattice is recalculated using the logbase.

In order to verify that the output lattice in HTK format is equivalent
to the original sphinx lattice, I generated 200-best lists from these
two lattices:

- for the sphinx lattice I used sphinx3_astar to generate the N-best
- for the rescored HTK lattice, I used lattice-tool:

> lattice-tool -in-lattice out.slf -read-htk -lm lm.BO \
    -htk-lmscale 9.5 -htk-wdpenalty 0.7 -htk-logbase 1.0003 \
    -nbest-decode 200 -out-nbest-dir OUT/

The problem is that the order of the hypotheses in the two N-best lists
is not the same. The 1-best given by sphinx3_astar can usually be found
in the 200-best given by lattice-tool, but with a much lower rank, and
sometimes it is not found at all. However, I always find the 1-best of
sphinx3_astar in a bigger N-best list from lattice-tool (N=2000).

I am convinced that this is a problem of score normalization between
sphinx and lattice-tool. If the scores were correctly normalized, I
should get the same N-best on both sides.

Could you please give me any clues on this issue?

Thanks in advance.

Sopheap

--
---------------------------------------------
Sopheap SENG
Laboratoire d'Informatique de Grenoble (LIG)
Equipe GETALP
Bureau C118
220, avenue de la Chimie
Campus Scientifique, BP53
38041 GRENOBLE Cedex 9, FRANCE
Tel : (33)-4-76-63-55-81
Fax : (33)-4-76-63-55-52
Email : sopheap.seng at imag.f
URL : http://www-geod.imag.fr
---------------------------------------------
Lecturer
Institut de Technologie du Cambodge
BP 86, Bd de Pochentong
Phnom Penh - Cambodia
Tel : (855)-23-88-03-70/98-24-45
Fax : (855)-23-88-03-69
Email : sopheap.seng at itc.edu.kh
URL : http://www.itc.edu.kh
---------------------------------------------

From stolcke at speech.sri.com Mon Mar 3 09:38:17 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 03 Mar 2008 09:38:17 -0800
Subject: Rescore HTK lattice
In-Reply-To: <3b7711ea0803030454j46fce354r9816ac5b538cac13@mail.gmail.com>
References: <3b7711ea0803030454j46fce354r9816ac5b538cac13@mail.gmail.com>
Message-ID: <47CC3789.4060006@speech.sri.com>

Sopheap SENG wrote:
> Hello,
>
> I need HTK lattices in my experiments, but the sphinx3 decoder I use
> cannot generate HTK lattices. So I have to convert sphinx lattices to
> HTK lattices.
>
> My problem is that the lattice generated by the sphinx3 decoder
> provides only the acoustic score of word transitions; I did not find
> an option to obtain the lmscore in the sphinx lattice.
>
> In order to obtain HTK lattice with lmscore, first I converted sphinx
> lattice to HTK SLF lattice format (I added l=0 as lmscore, the
> acoustic score is kept as it is)
>
> Then I used lattice tool (Srilm V 1.5.2) to rescore the lattice by
> giving a LM :
>
> > lattice-tool -in-lattice in.slf -read-htk -lm LM.BO
> -htk-lmscale 9.5 -htk-wdpenalty 0.7 -htk-logbase
> 1.0003 -out-lattice out.slf -write-htk
>
> (the lmscale, wdpenalty and logbase are the values that I used during
> lattice generation with sphinx3, the LM is the same as in sphinx3)
>
> I obtained in the output a lattice with acoustic score and new
> lmscore. What I observed is that the acoustic score in the output
> lattice is recalculated using the logbase.

My first thought is that you have to make sure the logbase specified in the header of your converted sphinx lattices reflects the base used by the actual scores. This should be obvious, but maybe not.

>
> In order to verify that the output lattice in HTK format is equivalent
> to the original sphinx lattice, I generated 200-Best lists from
> these two lattices.
>
> - for sphinx lattice I used sphinx3_astar to generate N-best
> - for the rescored HTK lattice, I used lattice-tool :
>
> >lattice-tool -in-lattice out.slf -read-htk -lm lm.BO
> -htk-lmscale 9.5 -htk-wdpenalty 0.7 -htk-logbase 1.0003 -nbest-decode
> 200 -out-nbest-dir OUT/

Do you have a way of generating the total (combined acoustic and LM) scores of the sphinx system? If so, try comparing them to the lattice-tool output and make sure they are the same (or nearly so, up to numerical issues). If they are not, repeat the comparison for each component score, setting the weights of all but one score (including wdpenalty) to zero. That way you should be able to pinpoint the source of any discrepancy. Note that wdpenalty is also sensitive to logbase.
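
A minimal sketch of the bookkeeping behind that comparison, in Python. The function names and the sample numbers are made up for illustration, and the weighted sum is only the usual HTK-style combination (acoustic + lmscale * lm + wdpenalty per word); whether sphinx3 and lattice-tool combine scores in exactly this way is an assumption to check against their documentation.

```python
import math


def to_natural_log(score, base):
    """Convert a log score expressed in log base `base` to a natural-log score."""
    return score * math.log(base)


def combined_score(acoustic, lm, n_words, lmscale, wdpenalty, acscale=1.0):
    """Weighted sum of component log scores (assumed HTK-style combination).

    All scores, and the word penalty, must already be in the same log base.
    """
    return acscale * acoustic + lmscale * lm + wdpenalty * n_words


# Hypothetical component scores for one 3-word hypothesis, in base 1.0003.
acoustic_ln = to_natural_log(-250000, 1.0003)
lm_ln = to_natural_log(-30000, 1.0003)

# Full combination, then the same hypothesis with only the LM weight left on.
print(combined_score(acoustic_ln, lm_ln, 3, lmscale=9.5, wdpenalty=0.7))
print(combined_score(acoustic_ln, lm_ln, 3, lmscale=9.5, wdpenalty=0.0, acscale=0.0))
```

Zeroing all weights but one isolates a single component, so a mismatch between the two systems can be traced to one score or to the log base it is expressed in.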
Andreas

From sai_tang_huang at hotmail.com Mon Mar 10 12:36:21 2008
From: sai_tang_huang at hotmail.com (SAI TANG HUANG)
Date: Mon, 10 Mar 2008 20:36:21 +0100
Subject: Entropy going smaller as corpus goes smaller.
Message-ID:

Hi everyone,

I have computed the entropy of my model with the following command:

ngram -lm small_1.lm -counts small_1.cnt -counts-entropy

where small_1.lm is a trigram model with wbdiscount created by ngram-count, and small_1.cnt is a count file that includes only the events we want to predict. The output is this:

file small_1.cnt: 0 sentences, 1 words, 0 OOVs
0 zeroprobs, logprob= -4.69264 ppl= 49276.1 ppl1= 49276.1

This model is actually trained on a subset of my TRAIN.txt corpus. It also gives the following ppl on an unseen test set:

file ../TEST.txt: 840 sentences, 15700 words, 1289 OOVs
0 zeroprobs, logprob= -38595.4 ppl= 339.378 ppl1= 476.644

On the other hand, I have another model, also trained on a subset of my TRAIN.txt but a different subset from the one used for small_1.lm, with entropy as follows:

file small_2.cnt: 0 sentences, 1 words, 0 OOVs
0 zeroprobs, logprob= -4.03253 ppl= 10777.8 ppl1= 10777.8

and a perplexity on the same unseen test set of:

file ../TEST.txt: 840 sentences, 15700 words, 1289 OOVs
0 zeroprobs, logprob= -38792.5 ppl= 349.627 ppl1= 491.891

So my question is: why is the entropy bigger for the model whose ppl is actually the smaller of the two? I thought that both measures could be used to assess the performance or quality of a language model. How can the two numbers be so inconsistent?

By the way, my TRAIN.lm (the model created from the whole training corpus) has an entropy of:

file TRAIN_EVENTS.cnt: 0 sentences, 1 words, 0 OOVs
0 zeroprobs, logprob= -11.5557 ppl= 3.59464e+11 ppl1= 3.59464e+11

which is humongous! I am a complete beginner in this field and this is really not making any sense. Any help will be greatly appreciated.

Regards to all,
Sai
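
For reference, the ppl and ppl1 figures in the outputs above can be reproduced from the printed totals. The sketch below assumes the relation given in the SRILM FAQ (base-10 logprob, normalized by the number of scored words plus sentence ends for ppl, and by scored words alone for ppl1); the function name is made up for illustration.

```python
def perplexities(logprob, words, sentences, oovs, zeroprobs=0):
    """Return (ppl, ppl1) from the totals that ngram prints.

    logprob is a base-10 log probability; OOVs and zeroprob words are
    excluded from the normalization, sentence ends are included for ppl only.
    """
    scored = words - oovs - zeroprobs
    ppl = 10 ** (-logprob / (scored + sentences))
    ppl1 = 10 ** (-logprob / scored)
    return ppl, ppl1


# Totals copied from the small_1 outputs quoted above.
print(perplexities(-38595.4, words=15700, sentences=840, oovs=1289))  # TEST.txt
print(perplexities(-4.69264, words=1, sentences=0, oovs=0))           # small_1.cnt
```

With the -counts-entropy output the normalization is by a single "word", so the printed ppl is simply 10 raised to the negated logprob, which is why it is on such a different scale from the test-set perplexity.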

From syaman at ece.gatech.edu Tue Mar 25 12:29:08 2008
From: syaman at ece.gatech.edu (Sibel Yaman)
Date: Tue, 25 Mar 2008 15:29:08 -0400
Subject: Optimizing Weights in Log-Linear Interpolation
Message-ID: <006301c88eae$820cf4c0$4b95d78f@ece.gatech.edu>

Hello,

I was wondering how I can train the weights in log-linear interpolation of several language models (as in Klakow's paper). I have successfully used the "compute-best-mix" script to estimate linear interpolation weights, but I do not see how to modify the process to optimize log-linear interpolation weights so that the perplexity is minimized on a cross-validation set.

Thank you,
Sibel Yaman

From: Andreas Stolcke