From gelbart at icsi.berkeley.edu Mon Oct 8 18:32:33 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Mon, 8 Oct 2007 18:32:33 -0700 (PDT)
Subject: SRILM and LC_ALL
In-Reply-To:
References: <46E81B83.7050603@speech.sri.com>
Message-ID:

On July 19 2007, Andreas Stolcke wrote:
> David Brodbeck wrote:
> > I'm trying to build SRILM 1.5.2 on Redhat Enterprise Linux Server 5.
> > The machine type is i686_m64. Everything builds all right, but
> > the tests fail for make-ngram-pfsg, ngram-class, and
> > ngram-count-lm-limit-vocab.
> >
> > make-ngram-pfsg is the most obvious one, so I'll tackle that one
> > first. I get the following in the stderr file:
> > gawk: /opt/srilm/bin/i686-m64/add-pauses-to-pfsg:22: fatal: Invalid
> > collation character: /[[:lower:]-?]/
> >
> > Has anyone else run into this? I'm using GNU Awk 3.1.5, and the
> > locale is set to en_US.UTF-8.
>
> This is odd since we're also using gawk 3.1.5 and I cannot replicate
> the problem even when setting LANG to en_US.UTF-8. It seems that the
> interpretation of gawk regular expressions should not depend on the
> OS release version, but of course there may always be bugs.

Hi Andreas,

Are you sure you used gawk 3.1.5 when you tried to replicate this? The
reason I ask is that at ICSI, the SRILM tools seem to invoke gawk 3.1.3,
not gawk 3.1.5:

$ head -1 `which add-pauses-to-pfsg`
#!/usr/bin/gawk -f
$ /usr/bin/gawk --version | head -1
GNU Awk 3.1.3
$ which gawk
/usr/local/bin/gawk
$ /usr/local/bin/gawk --version | head -1
GNU Awk 3.1.5

My default locale is en_US. With this locale, I do not see the error David
Brodbeck did, even if I use gawk 3.1.5.
If I set LANG=en_US.UTF-8 and use gawk 3.1.5, then I see the error:

$ /usr/local/bin/gawk -f `which add-pauses-to-pfsg`
gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22: fatal:
Invalid collation character: /[[:lower:]-?]/

Setting LC_ALL=C as suggested in the SRILM INSTALL file does not solve the
problem:

tmp$ export LC_ALL=C
tmp$ /usr/local/bin/gawk -f `which add-pauses-to-pfsg`
gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22: fatal:
Invalid collation character: /[[:lower:]-?]/

The compute-oov-rate script gives a similar error.

David Brodbeck, if you're reading this, did setting LC_ALL=C solve your
problem with add-pauses-to-pfsg? This was not clear to me from reading
your July 23 email to Andreas.

Thanks,
David

From gelbart at icsi.berkeley.edu Mon Oct 8 22:24:49 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Mon, 8 Oct 2007 22:24:49 -0700 (PDT)
Subject: SRILM and LC_ALL
In-Reply-To:
References: <46E81B83.7050603@speech.sri.com>
Message-ID:

> My default locale is en_US. With this locale, I do not see the error David
> Brodbeck did, even if I use gawk 3.1.5. If I set LANG=en_US.UTF-8 and use
> gawk 3.1.5, then I see the error:
>
> $ /usr/local/bin/gawk -f `which add-pauses-to-pfsg`
> gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22: fatal:
> Invalid collation character: /[[:lower:]-?]/

A followup:

At home, I'm running gawk 3.1.15 under Fedora Core 3 and my default
locale is en_US.UTF-8:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

If I use the default locale, I get the "Invalid collation character"
error. If I set LANG=C, I get the same error.
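
The locale listing above reflects the POSIX lookup order for locale
variables: a non-empty LC_ALL wins, then the category-specific variable
(LC_COLLATE for sorting and regex ranges), then LANG. A minimal sketch of
that lookup as a shell function follows; the name effective_collate is
hypothetical and is not part of SRILM or gawk, and the fallback to "C" when
nothing is set is an assumption about the system default.

```shell
# Hypothetical helper: report which locale name would govern collation.
# Arguments stand in for the environment so the demo does not touch the
# real locale: $1 = LC_ALL, $2 = LC_COLLATE, $3 = LANG ("" means unset).
effective_collate() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"        # LC_ALL overrides everything else
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"        # category-specific setting
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"        # general default
    else
        printf 'C\n'              # assumed system default
    fi
}

# LC_ALL set: an LC_COLLATE=C assignment is ignored.
effective_collate "en_US" "C" "en_US.UTF-8"
# LC_ALL unset: LC_COLLATE=C takes effect even though LANG is UTF-8.
effective_collate "" "C" "en_US.UTF-8"
```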

If I set LC_ALL=en_US, that error goes away but the make-ngram-pfsg
test fails with the message "make-ngram-pfsg: stdout output DIFFERS".
I think this is because when LC_ALL is set it overrides the other LC_*
variables (http://opengroup.org/onlinepubs/007908799/xbd/envvar.html).
This means that the line in test/tests/make-ngram-pfsg/run-test which
sets LC_COLLATE=C has no effect when LC_ALL is set.

If I set LANG=en_US and leave LC_ALL unset, then the
"Invalid collation character" error goes away and the make-ngram-pfsg
test passes.

So it appears that the gawk locale tips in the SRILM INSTALL file may
need to be updated to reflect gawk 3.1.15's behavior. Please let me
know if there's anything else I could do to help with this.

Regards,
David

From briannalaugher at toggletext.com Mon Oct 15 01:52:26 2007
From: briannalaugher at toggletext.com (Brianna Laugher)
Date: 15 Oct 2007 18:52:26 +1000
Subject: Some SRILM test errors
Message-ID: <1192438345.4943.33.camel@lilah>

Hello,

After a day of experimenting I finally managed to get SRILM to compile.
I thought I would share my settings to help others.

I tried in vain on a RedHat 9 machine with gcc 3.2.2. Eventually I gave
up and tried a different machine.

On CentOS 4 with machine type i686 and gcc 3.4.6 I made these changes:

in common/Makefile.machine.i686:
- fix CC and CXX paths
- remove -mtune=pentium3 from GCC_FLAGS
- add NO_TCL=X and blank the other TCL things

I have gawk 3.1.3.

When running the tests I had DIFFERS for these files:
nbest-rover-acoustic stdout
ngram-class stdout
ngram-count-lm-limit-vocab stdout & stderr

I read in the archives that ngram-class is "very fickle" so not to worry
about it... below is a diff between the last one's stdout.

I'm really just trying to have a play with Moses. It would be nice to
know if these tests are all unimportant and thus I can ignore their
failings.
:)

thanks,
Brianna

[brianna at riley test]$ diff output/ngram-count-lm-limit-vocab.unknown.stdout reference/ngram-count-lm-limit-vocab.stdout
1,4c1,4
< file ../ngram-count-gt/eval97.text: 5290 sentences, 38238 words, 37429 OOVs
< 0 zeroprobs, logprob= -7154.91 ppl= 14.898 ppl1= 6.98462e+08
< file ../ngram-count-gt/eval97.text: 5290 sentences, 38238 words, 37429 OOVs
< 0 zeroprobs, logprob= -6929.8 ppl= 13.6842 ppl1= 3.68034e+08
---
> file ../ngram-count-gt/eval97.text: 5290 sentences, 38238 words, 681 OOVs
> 0 zeroprobs, logprob= -86868 ppl= 106.512 ppl1= 205.572
> file ../ngram-count-gt/eval97.text: 5290 sentences, 38238 words, 681 OOVs
> 0 zeroprobs, logprob= -85654.5 ppl= 99.788 ppl1= 190.833

From gelbart at icsi.berkeley.edu Tue Oct 16 21:26:34 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Tue, 16 Oct 2007 21:26:34 -0700 (PDT)
Subject: Some SRILM test errors
In-Reply-To: <1192438345.4943.33.camel@lilah>
References: <1192438345.4943.33.camel@lilah>
Message-ID:

Hi Brianna,

> I have gawk 3.1.3.
>
> When running the tests I had DIFFERS for these files:
> nbest-rover-acoustic stdout
> ngram-class stdout
> ngram-count-lm-limit-vocab stdout & stderr

The nbest-rover-acoustic test is broken in SRILM 1.5.3. For more info
on that see
www.speech.sri.com/projects/srilm/mail-archive/srilm-user/2007-September/10.html

I can duplicate the output you got for the ngram-count-lm-limit-vocab
test if I put gawk 3.1.5 in my PATH instead of 3.1.3.

(It's possible that something else is the reason other than the gawk
version. I changed the environments in a way that may have changed
more than just the gawk version.)

Are you sure you don't have 3.1.5 installed somewhere where SRILM
scripts might be finding it? I believe some of the SRILM tools find
gawk using your PATH, while others will use the value of GAWK set in
common/Makefile.machine.whatever.

Please let us know if you learn anything more.
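
Checking for a stray gawk, as suggested above, can be sketched with two
small shell helpers: one reads the interpreter named on a script's "#!"
line, the other lists every executable of a given name along PATH in
search order. Both function names are hypothetical and are not SRILM
tools; the GAWK value in common/Makefile.machine.* would still have to be
checked by hand.

```shell
# Hypothetical helper: print the interpreter path from a script's "#!" line.
shebang_interpreter() {
    head -n 1 "$1" | sed -n 's/^#![[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# Hypothetical helper: list every executable named "$1" found along PATH.
# The body runs in a subshell so the IFS change does not leak to the caller.
all_on_path() (
    IFS=:
    for dir in $PATH; do
        if [ -x "${dir:-.}/$1" ]; then
            printf '%s\n' "${dir:-.}/$1"
        fi
    done
)

# Report the version of each gawk on PATH (prints nothing if none is found).
for g in $(all_on_path gawk); do
    printf '%s: ' "$g"
    "$g" --version | head -n 1
done

# Show which interpreter an installed SRILM script would actually run.
if script=$(command -v add-pauses-to-pfsg); then
    shebang_interpreter "$script"
fi
```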

Regards,
David

From gelbart at icsi.berkeley.edu Tue Oct 16 21:47:51 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Tue, 16 Oct 2007 21:47:51 -0700 (PDT)
Subject: Some SRILM test errors
In-Reply-To:
References: <1192438345.4943.33.camel@lilah>
Message-ID:

On Tue, 16 Oct 2007, David Gelbart wrote:

> I can duplicate the output you got for the ngram-count-lm-limit-vocab test if
> I put gawk 3.1.5 in my PATH instead of 3.1.3.
>
> (It's possible that something else is the reason other than the gawk version.
> I changed the environments in a way that may have changed more than just the
> gawk version.)

I can also make the ngram-class test start failing by switching to 3.1.5
(the caveat in parentheses above still applies).

I have no experience with Moses, but if you just want to play with it my
guess is that you can ignore these test failures.

Regards,
David

From briannalaugher at toggletext.com Tue Oct 16 22:55:10 2007
From: briannalaugher at toggletext.com (Brianna Laugher)
Date: 17 Oct 2007 15:55:10 +1000
Subject: Some SRILM test errors
In-Reply-To:
References: <1192438345.4943.33.camel@lilah>
Message-ID: <1192600510.6600.4.camel@lilah>

On Wed, 2007-10-17 at 14:26, David Gelbart wrote:

> I can duplicate the output you got for the ngram-count-lm-limit-vocab
> test if I put gawk 3.1.5 in my PATH instead of 3.1.3.
>
> (It's possible that something else is the reason other than the gawk
> version. I changed the environments in a way that may have changed
> more than just the gawk version.)
>
> Are you sure you don't have 3.1.5 installed somewhere where SRILM
> scripts might be finding it? I believe some of the SRILM tools find
> gawk using your PATH, while others will use the value of GAWK set in
> common/Makefile.machine.whatever.

Hi David, thanks for your reply.

I double-checked and both the gawk in my config file and the gawk in my
path are 3.1.3. No evidence that I could find of a 3.1.5 lurking...
For kicks I tried running it again on a gawk 3.1.1, and EVERYTHING
broke. :) Oh well, that's life.

regards,
Brianna

From lfdharo at die.upm.es Wed Oct 17 01:56:11 2007
From: lfdharo at die.upm.es (Luis Fernando D'Haro)
Date: Wed, 17 Oct 2007 10:56:11 +0200
Subject: Adding-One smoothing
Message-ID: <20071017085610.GC26232@die.upm.es>

Hello everyone:

I just want to ask if the SRILM toolkit allows the creation of an LM using
Lidstone's smoothing technique (i.e. adding-one or adding-delta). I want to
compare the results obtained with a proprietary SW that works with this
smoothing and the SRILM. I know that this technique is not the best one, but
unfortunately we have a small corpus (around 5K sentences) and, at the
moment, the performance of the other techniques has not been really good
when compared with Lidstone's (at least using this SW).

BTW: In our SW we use deleted interpolation; I know that SRILM just accepts
backoff models. In a previous email on the user's list, I saw an explanation
about how to use it, but it was not totally clear to me. Could you (Prof.
Stolcke) expand a little more on the example you wrote? Or could anyone who
has experience with that explain it to me again?

Thanks in advance.

Sincerely,

Luis Fernando D'Haro

From gelbart at icsi.berkeley.edu Wed Oct 17 20:14:03 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Wed, 17 Oct 2007 20:14:03 -0700 (PDT)
Subject: Some SRILM test errors
In-Reply-To:
References: <1192438345.4943.33.camel@lilah>
Message-ID:

Hi Brianna,

> I can duplicate the output you got for the ngram-count-lm-limit-vocab test if
> I put gawk 3.1.5 in my PATH instead of 3.1.3.
>
> (It's possible that something else is the reason other than the gawk version.
> I changed the environments in a way that may have changed more than just the
> gawk version.)

I just noticed that another aspect of the environment change I made is
that I no longer had LC_ALL=C as recommended in the INSTALL file.

If I set LC_ALL=C, the ngram-class and ngram-count-lm-limit-vocab tests
pass for me regardless of whether the gawk version in my PATH is 3.1.3 or
3.1.5.

Does that fix your problem?

Regards,
David

From briannalaugher at toggletext.com Wed Oct 17 20:57:56 2007
From: briannalaugher at toggletext.com (Brianna Laugher)
Date: 18 Oct 2007 13:57:56 +1000
Subject: Some SRILM test errors
In-Reply-To:
References: <1192438345.4943.33.camel@lilah>
Message-ID: <1192679875.6773.0.camel@lilah>

On Thu, 2007-10-18 at 13:14, David Gelbart wrote:

> I just noticed that another aspect of the environment change I made is
> that I no longer had LC_ALL=C as recommended in the INSTALL file.
>
> If I set LC_ALL=C, the ngram-class and ngram-count-lm-limit-vocab
> tests pass for me regardless of whether the gawk version in my PATH is
> 3.1.3 or 3.1.5.
>
> Does that fix your problem?

Aha! Thank you. When in doubt, read the instructions... then read them
again... and again. :)

cheers,
Brianna

From stolcke at speech.sri.com Thu Oct 18 09:35:01 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 18 Oct 2007 09:35:01 -0700
Subject: Adding-One smoothing
In-Reply-To: Your message of Wed, 17 Oct 2007 10:56:11 +0200. <20071017085610.GC26232@die.upm.es>
Message-ID: <200710181635.l9IGZ1d11264@speech.sri.com>

In message <20071017085610.GC26232 at die.upm.es> you wrote:
> Hello everyone:
>
> I just want to ask if the SRILM toolkit allows the creation of an LM using
> Lidstone's smoothing technique (i.e. adding-one or adding-delta). I want to
> compare the results obtained with a proprietary SW that works with this
> smoothing and the SRILM. I know that this technique is not the best one, but
> unfortunately we have a small corpus (around 5K sentences) and, at the
> moment, the performance of the other techniques has not been really good
> when compared with Lidstone's (at least using this SW).

Add-delta smoothing is implemented in the latest version of SRILM.
Try downloading the 1.5.4 (beta) version. The options are

    -addsmooth d
    -addsmooth1 d
    -addsmooth2 d
    etc.

where d is the constant to add to each count.

>
> BTW: In our SW we use deleted interpolation; I know that SRILM just accepts
> backoff models. In a previous email on the user's list, I saw an explanation
> about how to use it, but it was not totally clear to me. Could you (Prof.
> Stolcke) expand a little more on the example you wrote? Or could anyone who
> has experience with that explain it to me again?

I'm not sure exactly what method you are asking about, but deleted
interpolation is implemented as the smoothing method used by the
ngram-count -count-lm option. ngram -count-lm is used to evaluate such
an LM. Read the ngram man page to find a description of the file format.
You prepare a descriptor file for -count-lm, estimate the interpolation
weights with ngram-count, and then give the resulting file to ngram.
An example of all this is in $SRILM/test/tests/ngram-count-lm/run-test .

Andreas

>
> Thanks in advance.
>
> Sincerely,
>
>
> Luis Fernando D'Haro

From lfdharo at die.upm.es Thu Oct 18 10:27:37 2007
From: lfdharo at die.upm.es (Luis Fernando D'Haro)
Date: Thu, 18 Oct 2007 19:27:37 +0200
Subject: Adding-One smoothing
In-Reply-To: <200710181635.l9IGZ1d11264@speech.sri.com>
References: <20071017085610.GC26232@die.upm.es> <200710181635.l9IGZ1d11264@speech.sri.com>
Message-ID: <20071018172737.GE1499@die.upm.es>

> Add-delta smoothing is implemented in the latest version of SRILM.
> Try downloading the 1.5.4 (beta) version. The options are
>
> -addsmooth d
> -addsmooth1 d
> -addsmooth2 d
> etc.
>
> where d is the constant to add to each count.

Thanks Prof. for this new release and your quick answer. I will test it.

> I'm not sure exactly what method you are asking about, but deleted
> interpolation is implemented as the smoothing method used by the
> ngram-count -count-lm option. ngram -count-lm is used to evaluate such
> an LM.

Currently the SW we have implements something like this:

P(w|h) = lambda_trig * P_3(w|h) + (1-lambda_trig) * [lambda_big * P_2(w|h) + (1-lambda_big) * [lambda_unig * P(w) + (1-lambda_unig) * P(zerogram)]]

In all cases, the probability is calculated using the adding-delta
smoothing technique.

It is important to mention that in this equation there are global
lambda_trig, lambda_big and lambda_unig values (i.e. this is like having
just one bin, not as proposed by Jelinek where there is a different lambda
for different bins).

Previously, I had tried to use -count-lm with the following configuration
file:

order 3
vocabsize 1002
totalcount 74883
mixweights 0
 0.5 0.5 0.5
countmodulus 1
counts train.counts

and after applying the EM algorithm I obtained the following values:

order 3
mixweights 0
 0.932452 0.894774 0.994639
countmodulus 1
vocabsize 1002
totalcount 74883
counts train.counts

but my PPL results were not as good as using the SW we have.

Is something wrong with the configuration file? Or is the problem related
to using Good-Turing instead of adding-delta?

Thanks in advance,

Luis Fernando

From stolcke at speech.sri.com Thu Oct 18 10:29:41 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 18 Oct 2007 10:29:41 -0700
Subject: Help needed with SRILM
In-Reply-To: Your message of Fri, 12 Oct 2007 19:00:54 +0200. <71caf0730710121000i68f6c11s1b0f452a6e096086@mail.gmail.com>
Message-ID: <200710181729.l9IHTfd16651@speech.sri.com>

>
> Hi Andreas,
>
> First of all, thank you for the fast reply last time.
> I have read your answer to Roy Bar-Haim, and tried to follow. I found that
> there were duplicate parts in the training data, and I have erased them,
> and I have tried to create the language model from a corpus 10 times larger,
> but it did not help. I have managed to
> get rid of the warning only by changing the -gt1n(min/max) options.
> By doing this, I have discovered that the performance of the language model
> is greatly affected by the probability given to the <unk> token. I use
> ngram-count like this:
>
> ngram-count -text corp.out -lm ngram-count_output/lm_2iter.lm -unk -order 3 -gt1min 0 -gt1max 2
>
> So, as far as I understand, there should be no occurrence of unk in the
> corpus. But unk gets a high probability - higher even than words that did
> appear once in the corpus. Only when I disable discounting do I get a low
> probability for <unk>.

Here is the problem: you are estimating an LM with <unk> from data that
doesn't have any instance of <unk>. As a result, <unk> gets all the
unigram probability mass that is left after discounting the observed
unigrams, and that can be substantial. This is because all the discounted
unigram probability mass is distributed over all the zeroton words, and in
this case <unk> is the only zeroton word. (If there are no zeroton words,
then the discounted mass is added evenly to ALL the words.)

Incidentally, when you try this with -gt1max 1 (the default) on the
Switchboard counts under $SRILM/test/tests/ngram-count-gt you get

-5.503182 <unk>

a very small probability. Already, with -gt1max 2 you get

-2.558506 <unk>

which indeed is larger than many observed words. But that is not
unexpected. After all, <unk> is representing ALL unobserved words.

The proper remedy is to limit your LM vocabulary to something less than
the observed words, so that the remaining words can give you a
meaningful estimate for unobserved words.

> Is there an option to set a fixed probability for the
> <unk>?

No, there isn't. But there is a trick to achieve a similar effect.
Since your data doesn't contain any <unk> you can fake some.
Just make a count file that contains some fictitious occurrences for
<unk>, e.g.,

<unk> 100

and call this UNK.counts.
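
A minimal sketch of creating that file from the shell, assuming the usual
SRILM count-file layout of the N-gram's words followed by a tab and the
count; the value 100 is just the example figure used above.

```shell
# Write a one-line count file giving <unk> a fictitious unigram count of 100.
# The count is a free parameter; 100 is only the example value.
printf '<unk>\t100\n' > UNK.counts

# Show the result: the token, a tab, then the count.
cat UNK.counts
```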

Then add those counts to your real data, e.g.,

ngram-count -text corp.out -read UNK.counts -lm ngram-count_output/lm_2iter.lm -unk -order 3 -gt1min 0 -gt1max 2

And of course you can play with the fake count value to achieve a result
that is reasonable, or even optimal on some held-out data.

>
> BTW : I changed the LM.cc a bit, so when I call ngram -ppl it acts as a
> probability server - it listens on a port, and gets sequences of words and
> returns its probability.
> Do you want me to send you the code for it, so it could be added as a
> feature ?

Please do send the code. I wouldn't want to modify the existing meaning
of -ppl, but a new option with this functionality is something that several
people have asked about.

Andreas

>
> Regards,
> Elad Dinur
>
> On 9/23/07, Andreas Stolcke wrote:
> >
> > Elad Dinur wrote:
> > > Hello Andreas And/Or Jing,
> > >
> > > I am a graduate student in the Hebrew University of Jerusalem, guided
> > > by Ari Rappoport of The Hebrew University.
> > > I am working on Unsupervised segmentation of words, with emphasis on
> > > semitic languages, developing on Modern Hebrew.
> > > I am using SRILM to generate a trigram language model, and finding the
> > > probability of a sentence with the model.
> > > I am using ngram-count with the default setting, As far as I
> > > understand that means Good-Turing discounting with Katz Backoff.
> > > I get the following warning :
> > >
> > > warning: discount coeff 1 is out of range: 1.79427e-17
> > >
> > > I wonder if you can direct me to a document which elaborates on this
> > > warning.
> > > Thanks in advance,
> > > Elad Dinur.
> > >
> > You can find the answer to this and many other questions by going to
> >
> > http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/
> >
> > and searching for "discount coeff 1 is out of range".
> >
> > Andreas
> >
>
> --
> what ?!

From stolcke at speech.sri.com Thu Oct 18 12:11:48 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 18 Oct 2007 12:11:48 -0700
Subject: Adding-One smoothing
In-Reply-To: Your message of Thu, 18 Oct 2007 19:27:37 +0200. <20071018172737.GE1499@die.upm.es>
Message-ID: <200710181911.l9IJBmd24770@speech.sri.com>

--Andreas

In message <20071018172737.GE1499 at die.upm.es> you wrote:
>
> > Add-delta smoothing is implemented in the latest version of SRILM.
> > Try downloading the 1.5.4 (beta) version. The options are
> >
> > -addsmooth d
> > -addsmooth1 d
> > -addsmooth2 d
> > etc.
> >
> > where d is the constant to add to each count.
>
> Thanks Prof. for this new release and your quick answer. I will test it.
>
> > I'm not sure exactly what method you are asking about, but deleted
> > interpolation is implemented as the smoothing method used by the
> > ngram-count -count-lm option. ngram -count-lm is used to evaluate such
> > an LM.
>
> Currently the SW we have implements something like this:
>
> P(w|h) = lambda_trig * P_3(w|h) + (1-lambda_trig) * [lambda_big * P_2(w|h) + (1-lambda_big) * [lambda_unig * P(w) + (1-lambda_unig) * P(zerogram)]]
>
> In all cases, the probability is calculated using the adding-delta smoothing
> technique.

That is a combination of additive smoothing and deleted interpolation
that is not currently implemented in SRILM.

>
> It is important to mention that in this equation there are global
> lambda_trig, lambda_big and lambda_unig values (i.e. this is like having
> just one bin, not as proposed by Jelinek where there is a different lambda
> for different bins).
>
> Previously, I had tried to use -count-lm with the following configuration
> file:
>
> order 3
> vocabsize 1002
> totalcount 74883
> mixweights 0
>  0.5 0.5 0.5
> countmodulus 1
> counts train.counts
>
> and after applying the EM algorithm I obtained the following values:
>
> order 3
> mixweights 0
>  0.932452 0.894774 0.994639
> countmodulus 1
> vocabsize 1002
> totalcount 74883
> counts train.counts
>
> but my PPL results were not as good as using the SW we have.
>
> Is something wrong with the configuration file? Or is the problem related
> to using Good-Turing instead of adding-delta?

There is nothing wrong with it. The difference is that in SRILM
the underlying probability estimates (as in standard deleted interpolation)
are simple maximum likelihood estimates (without Good-Turing smoothing).

It would be very straightforward to include optional add-delta smoothing
in the -count-lm model, since all the quantities needed are readily
available. You just have to add some code to get the delta parameter from
the LM file (similar to what's already there for the other parameters) and
modify line 373 in NgramCountLM.cc to implement the add-delta formula.

If you do this please send me your changes!

Andreas

From stolcke at speech.sri.com Fri Oct 19 10:49:03 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 19 Oct 2007 10:49:03 PDT
Subject: Some SRILM test errors
In-Reply-To: Your message of Tue, 16 Oct 2007 21:26:34 -0700.
Message-ID: <200710191749.l9JHn4M18206@huge>

In message you wrote:
> Hi Brianna,
>
> > I have gawk 3.1.3.
> >
> > When running the tests I had DIFFERS for these files:
> > nbest-rover-acoustic stdout
> > ngram-class stdout
> > ngram-count-lm-limit-vocab stdout & stderr
>
> The nbest-rover-acoustic test is broken in SRILM 1.5.3.
> For more info on that see
> www.speech.sri.com/projects/srilm/mail-archive/srilm-user/2007-September/10.html

David is right, and if you download the beta version of SRILM 1.5.4
this problem is fixed. This version also fixes a number of other
locale-related issues.

Specifically, in the tests ngram-class and ngram-count-lm-limit-vocab
the problem is simply that different locale settings give different
"sort" output. You can fix this by putting

    LC_COLLATE=C
    export LC_COLLATE

at the top of

    $SRILM/test/tests/ngram-class/run-test
    $SRILM/test/tests/ngram-count-lm-limit-vocab/run-test

Other than that it should work regardless of the gawk version.

Andreas

>
> I can duplicate the output you got for the ngram-count-lm-limit-vocab
> test if I put gawk 3.1.5 in my PATH instead of 3.1.3.
>
> (It's possible that something else is the reason other than the gawk
> version. I changed the environments in a way that may have changed
> more than just the gawk version.)
>
> Are you sure you don't have 3.1.5 installed somewhere where SRILM
> scripts might be finding it? I believe some of the SRILM tools find
> gawk using your PATH, while others will use the value of GAWK set in
> common/Makefile.machine.whatever.
>
> Please let us know if you learn anything more.
>
> Regards,
> David
>
>
>

From stolcke at speech.sri.com Fri Oct 19 11:00:36 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 19 Oct 2007 11:00:36 PDT
Subject: SRILM and LC_ALL
In-Reply-To: Your message of Mon, 08 Oct 2007 22:24:49 -0700.
Message-ID: <200710191800.l9JI0a620166@huge>

David et al.,

there were several issues with add-pauses-to-pfsg and UTF-8 locales.

The regular expression /[x80-x8F]/ is not legal in UTF-8 locales because it
contains characters with the high bit set (UTF-8 uses the high bit to
encode multibyte characters). I fixed this recently by using a
different but equivalent regex instead.

The other problem is that pre-3.1.5 (actually pre-3.1.4) gawk was not
using ctype library functions for implementing character classes like
[:lower:].

So, the upshot is that if you

1) get the latest beta version (to fix the regex issue) AND
2) use gawk 3.1.5 or later

you should be able to use add-pauses-to-pfsg and pass the
"make-ngram-pfsg" test regardless of locale setting.

You CAN use gawk 3.1.3 (which is what seems to be pre-installed on many
Linux systems) but then you need to use LANG=C or LANG=en_US.

I added a note about this to various documentation files.

--Andreas

In message you wrote:
>
> > My default locale is en_US. With this locale, I do not see the error David
> > Brodbeck did, even if I use gawk 3.1.5. If I set LANG=en_US.UTF-8 and use
> > gawk 3.1.5, then I see the error:
> >
> > $ /usr/local/bin/gawk -f `which add-pauses-to-pfsg`
> > gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22: fatal:
> > Invalid collation character: /[[:lower:]-?]/
>
> A followup:
>
> At home, I'm running gawk 3.1.15 under Fedora Core 3 and my default
> locale is en_US.UTF-8:
>
> $ locale
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=
>
> If I use the default locale, I get the "Invalid collation character"
> error. If I set LANG=C, I get the same error.
>
> If I set LC_ALL=en_US, that error goes away but the make-ngram-pfsg
> test fails with the message "make-ngram-pfsg: stdout output DIFFERS".
> I think this is because when LC_ALL is set it overrides the other LC_*
> variables (http://opengroup.org/onlinepubs/007908799/xbd/envvar.html).
> This means that the line in test/tests/make-ngram-pfsg/run-test which
> sets LC_COLLATE=C has no effect when LC_ALL is set.
>
> If I set LANG=en_US and leave LC_ALL unset, then the
> "Invalid collation character" error goes away and the make-ngram-pfsg
> test passes.
>
> So it appears that the gawk locale tips in the SRILM INSTALL file may
> need to be updated to reflect gawk 3.1.15's behavior. Please let me
> know if there's anything else I could do to help with this.
>
> Regards,
> David
>

From dianaduraiz at gmail.com Tue Oct 23 09:08:21 2007
From: dianaduraiz at gmail.com (Diana Durán)
Date: Tue, 23 Oct 2007 18:08:21 +0200
Subject: Optimize the interpolation parameters
Message-ID:

Hello,

I am using SRILM to create a language model based on modified interpolated
Kneser-Ney smoothing. Is it possible to optimize the lambda values from a
held-out set with SRILM?

Thanks for your help

Diana

From deliverable at gmail.com Tue Oct 23 09:47:00 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Tue, 23 Oct 2007 20:47:00 +0400
Subject: incremental ngram counts
Message-ID: <9AF24308-326B-4BCE-9B10-CD9A3A258E75@gmail.com>

Greetings --

I want to count ngrams at certain fractions of my corpus by size,
e.g. for 10%, 20%, etc. Is there an alternative to concocting
separate lists of ad hoc subcorpora and running ngram-count
separately? What if I want to track exactly how many new ngrams each
file contributes, when going in a certain order?
Cheers, Alexy From save.climate at gmail.com Tue Oct 23 09:57:16 2007 From: save.climate at gmail.com (Kamadev Bhanuprasad) Date: Tue, 23 Oct 2007 18:57:16 +0200 Subject: Optimize the interpolation parameters In-Reply-To: References: Message-ID: <244d59a50710230957h704a1bf0j45076d517baaf86d@mail.gmail.com> Hi Diana, it was already discussed, see http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/2007-July/5.html Kamadev On 10/23/07, Diana Durán wrote: > > > Hello, > > I am using SRILM to create a language model based on modified interpolated > Kneser-Ney smoothing. Is it possible to optimize the lambdas values from a > held-out set with SRILM? > > Thanks for your help > > Diana From stolcke at speech.sri.com Tue Oct 23 09:54:38 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Tue, 23 Oct 2007 09:54:38 -0700 Subject: Optimize the interpolation parameters In-Reply-To: References: Message-ID: <471E274E.6040605@speech.sri.com> Diana Durán wrote: > > Hello, > > I am using SRILM to create a language model based on modified > interpolated Kneser-Ney smoothing. Is it possible to optimize the > lambdas values from a held-out set with SRILM? The interpolation weights used in smoothing (for combining higher and lower-order estimates) do not have to be estimated separately from data. They are given by formulae derived from the counts-of-counts, and built into the discounting methods. If you are asking about interpolating different LMs, use compute-best-mix, described in the ppl-scripts(1) man page. 
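[Illustration] To make the idea of tuning interpolation weights between several LMs on held-out data a bit more concrete, here is a minimal Python sketch of the usual EM update for mixture weights. It is not SRILM code, and it makes no claim about how compute-best-mix is implemented internally. The function name best_mix and the toy probabilities are made up for the example; each input row stands in for the probabilities that the component LMs assign to one held-out word.

```python
def best_mix(probs, iters=200, tol=1e-9):
    """Estimate mixture weights for several LMs by EM.

    probs: list of rows, one per held-out word; row[i] is the probability
    that component LM i assigns to that word (hypothetical inputs).
    Returns a list of weights that sums to 1.
    """
    n_models = len(probs[0])
    lambdas = [1.0 / n_models] * n_models      # start from uniform weights
    for _ in range(iters):
        post_sums = [0.0] * n_models
        for row in probs:
            mix = sum(l * p for l, p in zip(lambdas, row))
            if mix == 0.0:
                continue                       # no model covers this word
            for i in range(n_models):
                # posterior probability that model i "generated" the word
                post_sums[i] += lambdas[i] * row[i] / mix
        total = sum(post_sums)
        if total == 0.0:
            return lambdas                     # nothing to learn from
        new_lambdas = [s / total for s in post_sums]
        change = max(abs(a - b) for a, b in zip(new_lambdas, lambdas))
        lambdas = new_lambdas
        if change < tol:
            break
    return lambdas


if __name__ == "__main__":
    # Toy data: LM 0 is better on every word, so its weight should grow.
    held_out = [[0.5, 0.1], [0.4, 0.05], [0.3, 0.2]]
    print(best_mix(held_out))
```

With real data the rows would come from the per-word probabilities of each LM on the same held-out text, which is the kind of information ngram -debug 2 -ppl prints.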
Andreas From mlharville at yahoo.com Tue Oct 23 14:37:20 2007 From: mlharville at yahoo.com (Michael Harville) Date: Tue, 23 Oct 2007 14:37:20 -0700 (PDT) Subject: lattice-tool -ppl not working for me Message-ID: <380533.58323.qm@web60616.mail.yahoo.com> Hi, Please excuse the newbie question, but I have searched the archives and web for an answer, and have not been able to find one. I am running the following command: echo "HUGE WIN OVER RUTGERS" > sentence.txt lattice-tool -ppl sentence.txt -in-lattice footballPodcast.lat -read-htk -debug 2 -order 4 and am getting the folllowing results: p( HUGE | ) = 0 [ -inf ] p( WIN | HUGE ...) = 0 [ -inf ] p( OVER | WIN ...) = 0 [ -inf ] p( RUTGERS | OVER ...) = 0 [ -inf ] p( | RUTGERS ...) = 0 [ -inf ] Viterbi backtrace failed 1 sentences, 4 words, 0 OOVs 5 zeroprobs, logprob= 0 ppl= undefined ppl1= undefined Anyone know what might be going on? The original utterance from which the lattice was built is 2 minutes long, containing much more speech than just the four word sentence I am testing on. Is that the problem? Generally speaking, I am looking for a tool that can give me the highest probability location (along with the associated probability) of where a sequence of words was spoken in an audio file. I am using Sphinx 3.7 to generate lattices from the audio, and have been using various SRILM tools to examine these lattices. Is there a tool that does what I want, or will I need to make one? Much thanks in advance! Mike From stolcke at speech.sri.com Tue Oct 23 16:26:15 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Tue, 23 Oct 2007 16:26:15 PDT Subject: lattice-tool -ppl not working for me In-Reply-To: Your message of Tue, 23 Oct 2007 14:37:20 -0700. 
<380533.58323.qm@web60616.mail.yahoo.com> Message-ID: <200710232326.l9NNQFc09322@huge> In message <380533.58323.qm at web60616.mail.yahoo.com>you wrote: > Hi, > > Please excuse the newbie question, but I have searched the archives and web f > or an answer, and have not been able to find one. I am running the following > command: > > echo "HUGE WIN OVER RUTGERS" > sentence.txt > lattice-tool -ppl sentence.txt -in-lattice footballPodcast.lat -read-htk -deb > ug 2 -order 4 > > and am getting the folllowing results: > > p( HUGE | ) = 0 [ -inf ] > p( WIN | HUGE ...) = 0 [ -inf ] > p( OVER | WIN ...) = 0 [ -inf ] > p( RUTGERS | OVER ...) = 0 [ -inf ] > p( | RUTGERS ...) = 0 [ -inf ] > Viterbi backtrace failed > 1 sentences, 4 words, 0 OOVs > 5 zeroprobs, logprob= 0 ppl= undefined ppl1= undefined > > Anyone know what might be going on? The original utterance from which the lat > tice was built is 2 minutes long, containing much more speech than just the f > our word sentence I am testing on. Is that the problem? Yes, probably. lattice-tool -ppl only works for word sequences that exactly correspond to a path through the lattice between initial and final node. > Generally speaking, I am looking for a tool that can give me the highest prob > ability location (along with the associated probability) of where a sequence > of words was spoken in an audio file. I am using Sphinx 3.7 to generate latti > ces from the audio, and have been using various SRILM tools to examine these > lattices. Is there a tool that does what I want, or will I need to make one? What you are trying to do is a kind of word or phrase spotting. lattice-tool -order 4 -write-ngrams OUTPUT will write a list of all 4-grams occurring anywhere in the lattice, along with their posterior probabilities accumulated over all positions. You could use this to see if your string is SOMEWHERE in the lattice. 
lattice-tool -order 4 -write-ngram-index OUTPUT will generate an index of all 4-gram occurrences and their positions relative to the start of the utterance, durations, and posterior probabilities (without combining distinct instances that are separated in time). You might have to play with the -min-count option to limit output of very low-probability ngrams, or -posterior-prune to make the lattices smaller prior to processing (for speed/memory reasons). Andreas From svmats at yahoo.com Thu Oct 25 04:13:03 2007 From: svmats at yahoo.com (Mats Svenson) Date: Thu, 25 Oct 2007 04:13:03 -0700 (PDT) Subject: Saving option for ngram-class Message-ID: <159333.1460.qm@web31611.mail.mud.yahoo.com> Hi, I guess the -save option as implemented in ngram-class is not very useful. Typically, I'm not interested in testing the classes that appear at the beginning of the clustering process, but rather in the classes induced in the final steps. If the number of clustered words is high, the current option results in creating an enormous number of useless files. It'd be much more practical if the user could explicitly set which classes with different granularity should be saved, or, alternatively, to have some -startsave option that would start saving class files only close to the end of the clustering. Would that be easy to implement? One more thing, is there an easy way to find how many classes appear in a particular class file without writing a script? The number of iterations doesn't say that directly and I'm not sure whether it can be computed as NUMBER_OF_WORDS_IN_THE_VOCAB - NUMBER_OF_ITERATIONS - NUMBER_OF_WORDS_IN_THE_NO_CLASS_VOCAB Best, Mats From partha.lal at gmail.com Tue Oct 30 09:03:57 2007 From: partha.lal at gmail.com (Partha Lal) Date: Tue, 30 Oct 2007 16:03:57 +0000 Subject: format error in lattice file... 
Message-ID: Hello, I'm trying to get a lattice error rate from an htk format lattice file but keep getting a format error: > lattice-tool -read-htk -in-lattice results/globalphone-sp/lattices/mfcc13+dd/hmm4_2_512/SP002_21.lat -out-lattice results/globalphone-sp/lattices/mfcc13+dd/hmm4_2_512/SP002_21.pfsg > nbest-lattice -read results/globalphone-sp/lattices/mfcc13+dd/hmm4_2_512/SP002_21.pfsg -lattice-error -reference `cat data/globalphone-sp/wrdfile/rmn/SP002/SP002_21.wrd` results/globalphone-sp/lattices/mfcc13+dd/hmm4_2_512/SP002_21.pfsg: line 2: unknown keyword format error in lattice file I can't see what's wrong with my lattice files - I've made them available at http://homepages.inf.ed.ac.uk/s0565860/lattice_problem/ . Can anyone suggest what might be wrong? Thanks, Partha -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at speech.sri.com Tue Oct 30 21:23:22 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Tue, 30 Oct 2007 21:23:22 -0700 Subject: format error in lattice file... In-Reply-To: References: Message-ID: <4728033A.5020802@speech.sri.com> Partha Lal wrote: > Hello, > > I'm trying to get a lattice error rate from an htk format lattice file > but keep getting a format error: > > > lattice-tool -read-htk -in-lattice > results/globalphone-sp/lattices/mfcc13+dd/hmm4_2_512/SP002_21.lat > -out-lattice > results/globalphone-sp/lattices/mfcc13+dd/hmm4_2_512/SP002_21.pfsg The PFSG lattice format is NOT the one used by nbest-lattice. nbest-lattice understands two kinds of format designed to encode word posterior probabilities, both described in the wlat-format(5) man page. The first format, word posterior lattices, is produced by lattice-tool -write-posteriors. The second format, word confusion networks, aka sausages, is produced by lattice-tool -write-mesh . 
Andreas > > nbest-lattice -read > results/globalphone-sp/lattices/mfcc13+dd/hmm4_2_512/SP002_21.pfsg > -lattice-error -reference `cat > data/globalphone-sp/wrdfile/rmn/SP002/SP002_21.wrd` > results/globalphone-sp/lattices/mfcc13+dd/hmm4_2_512/SP002_21.pfsg: > line 2: unknown keyword > format error in lattice file > > I can't see what's wrong with my lattice files - I've made them > available at http://homepages.inf.ed.ac.uk/s0565860/lattice_problem/ . > Can anyone suggest what might be wrong? > > Thanks, > > Partha From dyuret at ku.edu.tr Wed Oct 31 03:02:33 2007 From: dyuret at ku.edu.tr (Deniz Yuret) Date: Wed, 31 Oct 2007 12:02:33 +0200 Subject: OOV calculations Message-ID: Hi, We are working on language models for agglutinative languages where the number of unique tokens is comparatively large, and dividing words into morphemes is useful. When such divisions are performed (e.g. represent each compound word as two tokens: stem+ and +suffix), number of unique tokens and the number of OOV tokens are reduced, however it becomes difficult to compare two such systems with different OOV counts. Thus I started looking carefully into the ngram output, and so far here is what I have understood, please correct me if I am wrong: 1. logprob is the log of the product of the probabilities for all non-oov tokens (including ). 2. ppl = 10^(-logprob / (ntokens - noov + nsentences)) 3. ppl1 = 10^(-logprob / (ntokens - noov)) 4. I am not quite sure what zeroprobs gives. My first question is about a slight inconsistency in the calculation of ppl1: the probabilities are included in logprob, however their count is not included in the denominator. Shouldn't we have a separate logprob total that excludes for the ppl1 calculation? My second question is what exactly does zeroprobs give? My final question is on how to fairly compare two models which divide the same data into different numbers of tokens and have different OOV counts. 
It seems like the change in the number of tokens can be dealt with comparing the probabilities assigned to the whole data set (logprob) rather than per token averages (ppl). However the current output totally ignores the penalty that should be incurred from OOV tokens. As an easy solution, one can designate a fixed penalty for each OOV token to be added to the logprob total. It is not clear how that fixed penalty should be determined. A better solution is to have a character-based model that assigns a non-zero probability to every word and maybe interpolate it with the token-based model. I am not quite sure how this is possible in the srilm framework. Any advice would be appreciated. best, deniz From stolcke at speech.sri.com Wed Oct 31 16:46:50 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Wed, 31 Oct 2007 16:46:50 PDT Subject: OOV calculations In-Reply-To: Your message of Wed, 31 Oct 2007 12:02:33 +0200. Message-ID: <200710312346.l9VNkoX07102@huge> In message you wro te: > Hi, > > We are working on language models for agglutinative languages where > the number of unique tokens is comparatively large, and dividing words > into morphemes is useful. When such divisions are performed (e.g. > represent each compound word as two tokens: stem+ and +suffix), number > of unique tokens and the number of OOV tokens are reduced, however it > becomes difficult to compare two such systems with different OOV > counts. > > Thus I started looking carefully into the ngram output, and so far > here is what I have understood, please correct me if I am wrong: > > 1. logprob is the log of the product of the probabilities for all > non-oov tokens (including ). correct. > 2. ppl = 10^(-logprob / (ntokens - noov + nsentences)) correct. > 3. ppl1 = 10^(-logprob / (ntokens - noov)) correct. > 4. I am not quite sure what zeroprobs gives. Words that are in the vocabulary but get probability 0 in the LM. 
They are treated the same as OOVs for the purpose of perplexity computation. > My first question is about a slight inconsistency in the calculation > of ppl1: the probabilities are included in logprob, however their > count is not included in the denominator. Shouldn't we have a > separate logprob total that excludes for the ppl1 calculation? No, because the idea is that sentence boundaries are arbitrary and only a construct used by the LM to assign probabilities to words. So to compare two LMs that use a different sentence segmentation you need to normalize by the number of words excluding the (which differ), but you need to include the probability assigned to because they are part of the total probability the LMs assign to the complete word sequence. e.g.: P(a b c) = P(a) P(b | a) P( | a b) P(c | a b ) if the LM happens to require a sentence boundary between b and c. Actually, that's an approximation because you really need to sum over all possible positions of sentence boundaries. To compute the full probability summing over all segmentations you need to run a "hidden event" N-gram model, implemented by ngram -hidden-vocab (see man page). > My second question is what exactly does zeroprobs give? See above. If prob = 0 the perplexity becomes undefined (or infinity), so you need to remove them from the computation somehow (like OOVs). > > My final question is on how to fairly compare two models which divide > the same data into different numbers of tokens and have different OOV > counts. It seems like the change in the number of tokens can be dealt > with comparing the probabilities assigned to the whole data set > (logprob) rather than per token averages (ppl). However the current > output totally ignores the penalty that should be incurred from OOV > tokens. As an easy solution, one can designate a fixed penalty for > each OOV token to be added to the logprob total. It is not clear how > that fixed penalty should be determined. 
A better solution is to have > a character-based model that assigns a non-zero probability to every > word and maybe interpolate it with the token-based model. I am not > quite sure how this is possible in the srilm framework. You cannot compare LMs with different OOV counts. You need to create a model that assigns a nonzero probability to every event. E.g., you could have a letter-probability model for OOVS. As for comparing LMs with different number of tokens, that's easy. You are really comparing the total probabilties assigned to the complete observation sequence, however the various LMs choose to split up that sequence. So look at the "logprob" output, not ppl. If you want to report ppls just choose one token sequence as your reference and use that number of tokens in the denominator of the ppl computation for ALL LMs (you have to compute ppl from logprob yourself). Andreas From dyuret at ku.edu.tr Thu Nov 1 00:49:51 2007 From: dyuret at ku.edu.tr (Deniz Yuret) Date: Thu, 1 Nov 2007 09:49:51 +0200 Subject: OOV calculations In-Reply-To: <200710312346.l9VNkoX07102@huge> References: <200710312346.l9VNkoX07102@huge> Message-ID: Thank you. > You cannot compare LMs with different OOV counts. You need to create a > model that assigns a nonzero probability to every event. E.g., you > could have a letter-probability model for OOVS. As for your suggestion of creating a letter-probability model for OOVs (and maybe interpolating it with the ngram model), are there any tools/documentation in the srilm package that could be helpful? If not I think we can (1) go into the source code and figure out how to create a new letter-probability LM, or (2) create an independent letter-probability LM outside srilm and manually interpolate its results with the -debug 2 output of ngram. 
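[Illustration] For reference, the ppl and ppl1 formulas confirmed earlier in this thread, and the suggestion to compute ppl from logprob yourself with one fixed reference token count, can be sketched in Python roughly as follows. Only the formulas come from the thread; the function names and all numbers are made up for the example. The zeroprobs argument reflects the statement that zero-probability words are treated like OOVs.

```python
def perplexity(logprob, denominator):
    """Perplexity from a base-10 log probability and a token count."""
    return 10.0 ** (-logprob / denominator)


def ngram_ppl(logprob, ntokens, noov, nsentences, zeroprobs=0):
    """ppl as confirmed in the thread: end-of-sentence tokens are counted."""
    return perplexity(logprob, ntokens - noov - zeroprobs + nsentences)


def ngram_ppl1(logprob, ntokens, noov, zeroprobs=0):
    """ppl1 as confirmed in the thread: end-of-sentence tokens not counted."""
    return perplexity(logprob, ntokens - noov - zeroprobs)


if __name__ == "__main__":
    # Hypothetical totals for two LMs that tokenize the same text differently.
    logprob_words = -1000.0    # word-based LM
    logprob_morphs = -1050.0   # morpheme-based LM
    reference_tokens = 400     # one reference token count used for BOTH LMs
    print(perplexity(logprob_words, reference_tokens))
    print(perplexity(logprob_morphs, reference_tokens))
```

Using the same denominator for both models means the comparison depends only on the total logprob each model assigns to the data, which only makes sense once both models assign a nonzero probability to every token.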
I am assuming here (maybe contrary to your suggestion) that we can create a model that assigns a nonzero probability to every event by interpolating a regular ngram model (with OOVs > 0) and a letter-probability model. deniz On 11/1/07, Andreas Stolcke wrote: > > In message you wro > te: > > Hi, > > > > We are working on language models for agglutinative languages where > > the number of unique tokens is comparatively large, and dividing words > > into morphemes is useful. When such divisions are performed (e.g. > > represent each compound word as two tokens: stem+ and +suffix), number > > of unique tokens and the number of OOV tokens are reduced, however it > > becomes difficult to compare two such systems with different OOV > > counts. > > > > Thus I started looking carefully into the ngram output, and so far > > here is what I have understood, please correct me if I am wrong: > > > > 1. logprob is the log of the product of the probabilities for all > > non-oov tokens (including ). > > correct. > > > 2. ppl = 10^(-logprob / (ntokens - noov + nsentences)) > > correct. > > > 3. ppl1 = 10^(-logprob / (ntokens - noov)) > > correct. > > > 4. I am not quite sure what zeroprobs gives. > > Words that are in the vocabulary but get probability 0 in the LM. > They are treated the same as OOVs for the purpose of perplexity computation. > > > My first question is about a slight inconsistency in the calculation > > of ppl1: the probabilities are included in logprob, however their > > count is not included in the denominator. Shouldn't we have a > > separate logprob total that excludes for the ppl1 calculation? > > No, because the idea is that sentence boundaries are arbitrary > and only a construct used by the LM to assign probabilities to words. 
> So to compare two LMs that use a different sentence segmentation you > need to normalize by the number of words excluding the (which differ), > but you need to include the probability assigned to because they > are part of the total probability the LMs assign to the complete word > sequence. e.g.: P(a b c) = P(a) P(b | a) P( | a b) P(c | a b ) > if the LM happens to require a sentence boundary between b and c. > Actually, that's an approximation because you really need to sum over > all possible positions of sentence boundaries. > > To compute the full probability summing over all segmentations > you need to run a "hidden event" N-gram model, implemented by > ngram -hidden-vocab (see man page). > > > My second question is what exactly does zeroprobs give? > > See above. If prob = 0 the perplexity becomes undefined (or infinity), > so you need to remove them from the computation somehow (like OOVs). > > > > > My final question is on how to fairly compare two models which divide > > the same data into different numbers of tokens and have different OOV > > counts. It seems like the change in the number of tokens can be dealt > > with comparing the probabilities assigned to the whole data set > > (logprob) rather than per token averages (ppl). However the current > > output totally ignores the penalty that should be incurred from OOV > > tokens. As an easy solution, one can designate a fixed penalty for > > each OOV token to be added to the logprob total. It is not clear how > > that fixed penalty should be determined. A better solution is to have > > a character-based model that assigns a non-zero probability to every > > word and maybe interpolate it with the token-based model. I am not > > quite sure how this is possible in the srilm framework. > > You cannot compare LMs with different OOV counts. You need to create a > model that assigns a nonzero probability to every event. E.g., you > could have a letter-probability model for OOVS. 
> > As for comparing LMs with different number of tokens, that's easy. > You are really comparing the total probabilties assigned to the complete > observation sequence, however the various LMs choose to split up that > sequence. So look at the "logprob" output, not ppl. If you want to > report ppls just choose one token sequence as your reference and use that > number of tokens in the denominator of the ppl computation for ALL LMs > (you have to compute ppl from logprob yourself). > > Andreas > > From stolcke at speech.sri.com Thu Nov 1 08:27:38 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Thu, 01 Nov 2007 08:27:38 PDT Subject: OOV calculations In-Reply-To: Your message of Thu, 01 Nov 2007 09:49:51 +0200. Message-ID: <200711011527.lA1FRcS18102@huge> In message you wro te: > Thank you. > > > You cannot compare LMs with different OOV counts. You need to create a > > model that assigns a nonzero probability to every event. E.g., you > > could have a letter-probability model for OOVS. > > As for your suggestion of creating a letter-probability model for OOVs > (and maybe interpolating it with the ngram model), are there any > tools/documentation in the srilm package that could be helpful? If > not I think we can (1) go into the source code and figure out how to > create a new letter-probability LM, or (2) create an independent > letter-probability LM outside srilm and manually interpolate its > results with the -debug 2 output of ngram. > > I am assuming here (maybe contrary to your suggestion) that we can > create a model that assigns a nonzero probability to every event by > interpolating a regular ngram model (with OOVs > 0) and a > letter-probability model. Actually, I wasn't thinking of covering all words with a letter probability model (which would be poor for non-OOV words) and interpolating. 
A more typical approach is to have a word LM with an OOV token, and when you are inside the OOV you assign a probability to the specific word by a letter LM. so the total probability of p(a b c) where "b" is an OOV would be p(a | ...) p(OOV | a) p(b| OOV) p(c | a OOV) and p(b|OOV) is given by a totally separate LM that operates in terms of letters. Obviously this isn't implemented in SRILM at this point, but you can compute total probabilities, perplexities, etc. by first running the word LM, then the letter LM just on the OOVs in your test set, and adding the log probabilities. Andreas From stolcke at speech.sri.com Thu Nov 1 11:33:09 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Thu, 01 Nov 2007 11:33:09 PDT Subject: Saving option for ngram-class In-Reply-To: Your message of Thu, 25 Oct 2007 04:13:03 -0700. <159333.1460.qm@web31611.mail.mud.yahoo.com> Message-ID: <200711011833.lA1IX9t17046@huge> In message <159333.1460.qm at web31611.mail.mud.yahoo.com>you wrote: > Hi, > I guess the -save options as implemented in ngram-class is not very useful. I agree. > Typically, I'm not interesting in testing classes as appearing on the beginni > ng of the clustering process, but rather in classes induced in final steps. I > f the number of clustered words is high, the current option results in creati > ng an enormous number of useless files. > > It'd be much more practical if the user could explicitly set which classes wi > th different granularity should be saved, or, alternatively, to have some -st > artsave option which'd allow to start saving class files close to the end of > the clustering. > > Would that be easy to implement? The next release (due out soon) will have a new option -save-maxclasses K Modifies the action of -save so as to only start saving once the number of classes reaches K. (The iteration numbers embedded in filenames will start at 0 from that point.) 
> > One more thing, is there an easy way how to find how many classes appear in p > articular class file without writing a script? The number of iterations doesn > 't say that directly and I'm not sure whether it can be computed as NUMBER_OF > _WORDS_IN_THE_VOCAB - NUMBER_OF_ITERATIONS - NUMBER_OF_WORDS_IN_THE_NO_CLASS_ > VOCAB You can get the number of classes from the class definition file with gawk '{ print $1 }' | uniq | wc -l This shouldn't be needed when using the -save-maxclasses option since you specific the number of classes directly (and then each new saved file has S fewer classes, where S is the argument to -save). Andreas From deliverable at gmail.com Thu Nov 1 12:54:10 2007 From: deliverable at gmail.com (Alexy Khrabrov) Date: Thu, 1 Nov 2007 22:54:10 +0300 Subject: x86-64 Message-ID: Which platform should we use for a x86-64 build under Linux, on an Intel Xenon 64-bit CPU? Cheers, Alexy From stolcke at speech.sri.com Thu Nov 1 13:05:12 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Thu, 01 Nov 2007 13:05:12 PDT Subject: x86-64 In-Reply-To: Your message of Thu, 01 Nov 2007 22:54:10 +0300. Message-ID: <200711012005.lA1K5Ct04066@huge> gnumake MACHINE_TYPE=i686-m64 --Andreas In message you wrote: > Which platform should we use for a x86-64 build under Linux, on an > Intel Xenon 64-bit CPU? > > Cheers, > Alexy From deliverable at gmail.com Thu Nov 1 13:09:46 2007 From: deliverable at gmail.com (Alexy Khrabrov) Date: Thu, 1 Nov 2007 23:09:46 +0300 Subject: x86-64 In-Reply-To: <200711012005.lA1K5Ct04066@huge> References: <200711012005.lA1K5Ct04066@huge> Message-ID: <2DB91188-6892-49D5-879F-697E7CB7D1AD@gmail.com> Found that too, was wondering whether -march=athlon64 is optimal for the Xenon? 
Gentoo docs recommend -mtune=nocona Alexy On Nov 1, 2007, at 11:05 PM, Andreas Stolcke wrote: > > gnumake MACHINE_TYPE=i686-m64 > > --Andreas > > In message you wrote: >> Which platform should we use for a x86-64 build under Linux, on an >> Intel Xenon 64-bit CPU? >> >> Cheers, >> Alexy > From stolcke at speech.sri.com Thu Nov 1 13:13:46 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Thu, 01 Nov 2007 12:13:46 -0800 Subject: x86-64 In-Reply-To: Your message of Thu, 01 Nov 2007 23:09:46 +0300. <2DB91188-6892-49D5-879F-697E7CB7D1AD@gmail.com> Message-ID: <200711012013.lA1KDkA16914@speech.sri.com> In message <2DB91188-6892-49D5-879F-697E7CB7D1AD at gmail.com>you wrote: > Found that too, was wondering whether -march=athlon64 is optimal for > the Xenon? Gentoo docs recommend > > -mtune=nocona Feel free to modify it whatever you think gives best results on your machines. -march=athlon64 was just something that made sense on ours. Andreas From deliverable at gmail.com Thu Nov 1 13:56:35 2007 From: deliverable at gmail.com (Alexy Khrabrov) Date: Thu, 1 Nov 2007 23:56:35 +0300 Subject: parallel ngram-count Message-ID: I see one quick way to parallelize ngram-count on a N-core box: -- split file list into N sublists -- launch N ngram-count instances, giving each its own sublist -- merge counts Is there any better way? Cheers, Alexy From stolcke at speech.sri.com Thu Nov 1 14:01:47 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Thu, 01 Nov 2007 13:01:47 -0800 Subject: parallel ngram-count In-Reply-To: Your message of Thu, 01 Nov 2007 23:56:35 +0300. Message-ID: <200711012101.lA1L1lA21308@speech.sri.com> In message you wrote: > I see one quick way to parallelize ngram-count on a N-core box: > > -- split file list into N sublists > -- launch N ngram-count instances, giving each its own sublist > -- merge counts > > Is there any better way? That's what I would do. 
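[Illustration] The split / count / merge scheme described above can be sketched in plain Python as below. This is only a toy model of the idea and not a replacement for ngram-count: it works on in-memory sentences instead of files, adds no sentence-boundary tags, and all function names are made up. In practice each sublist would go to its own ngram-count process and the resulting count files would be merged afterwards.

```python
from collections import Counter


def split_round_robin(items, n):
    """Split a list into n sublists of roughly equal size."""
    return [items[i::n] for i in range(n)]


def count_ngrams(sentences, order):
    """Count all n-grams up to the given order in a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for k in range(1, order + 1):
            for i in range(len(words) - k + 1):
                counts[tuple(words[i:i + k])] += 1
    return counts


def merge_counts(partials):
    """Add up partial counts; the result does not depend on the split."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    corpus = ["a b a", "b a", "a b c"]
    # Each sublist is counted independently (here sequentially).
    partials = [count_ngrams(part, order=2)
                for part in split_round_robin(corpus, 2)]
    merged = merge_counts(partials)
    print(merged == count_ngrams(corpus, order=2))
```

Because counts are simply added, the merged result is the same as counting the whole corpus in one pass, whichever way the file list is split.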
Make sure you are not i/o bound when running many ngram-count in parallel, and watch for memory usage. Andreas From deliverable at gmail.com Thu Nov 1 14:42:55 2007 From: deliverable at gmail.com (Alexy Khrabrov) Date: Fri, 2 Nov 2007 00:42:55 +0300 Subject: incremental ngram-count Message-ID: A separate task I do on a corpus is computing a "running ngram count": for each "tick" size of a subset of the corpus, e.g. 10%, 20%, etc., or every N files, or every file, show the *increase* in the number of ngrams. Obviously building sublists of files with a single file added and rerunning ngram-count on such lists is inefficient. Is it the case where I have to get into C++ library indeed, and which classes should I use? Basically, I want to know which *new* ngrams are contributed by a given file, in the sequence of processing. I may want to output them separately for look-see, too. Cheers, Alexy From deliverable at gmail.com Fri Nov 2 02:40:54 2007 From: deliverable at gmail.com (Alexy Khrabrov) Date: Fri, 2 Nov 2007 12:40:54 +0300 Subject: ngram-count progress Message-ID: Is there a way to make ngram-count report its progress, e.g. print a dot on stderr every N processed input tokens? (Or which C++ would one hack at?) Cheers, Alexy From deliverable at gmail.com Fri Nov 2 03:25:13 2007 From: deliverable at gmail.com (Alexy Khrabrov) Date: Fri, 2 Nov 2007 13:25:13 +0300 Subject: order of options to ngram-count Message-ID: My impression is that ngram-count is sensitive to the order of options (1.5.4b from yesterday). ngram-count -text - order 1 -write unigrams.count # takes up all memory, runs forever ngram-count -order 1 -text - write unigrams.count # finishes up quickly in a fraction of memory Does the former honor -order 1? What's the rule here -- trailing - write is honored anyways? I now stick other flags, such as -tolower, in front of -text - . 
Cheers, Alexy From svp at zuzino.net.ru Fri Nov 2 03:45:45 2007 From: svp at zuzino.net.ru (Sergey Protasov) Date: Fri, 2 Nov 2007 13:45:45 +0300 Subject: cross-entropy with OOV Message-ID: <150c31280711020345s4b76a1d4k1202b2cb4d9ce5da@mail.gmail.com> Dear experts, I need to compute entropy with OOV words... For example.. If we have dict_size diffrent words in training corpora then for test corpora (per word) entr2 = entr1 + stats.numOOVs*log2(dict_size_train_corpora)/num_words_test_corpora entr1 = log2(ppl1) But in C++ code TextStats.cc I don't know how to get Dict_size_train_corpora to compute this. Dict_size_train_corpora = number_unigrams_train_corpora Anybody help? Thanx in advance! From ioparin at yahoo.co.uk Fri Nov 2 05:22:20 2007 From: ioparin at yahoo.co.uk (ilya oparin) Date: Fri, 2 Nov 2007 12:22:20 +0000 (GMT) Subject: order of options to ngram-count In-Reply-To: Message-ID: <915733.57043.qm@web25403.mail.ukl.yahoo.com> Hi, It looks like in your first "run forever" line you forgot to put "-" right before "order" option, so ngram-count just skips this invalid option and build default trigrams instead of unigrams. In case you have large data, that would take long. --- Alexy Khrabrov wrote: > My impression is that ngram-count is sensitive to > the order of > options (1.5.4b from yesterday). > > ngram-count -text - order 1 -write unigrams.count # > takes up all > memory, runs forever > > ngram-count -order 1 -text - write unigrams.count # > finishes up > quickly in a fraction of memory > > Does the former honor -order 1? What's the rule > here -- trailing - > write is honored anyways? I now stick other flags, > such as -tolower, > in front of -text - . > > Cheers, > Alexy > best regards, Ilya ___________________________________________________________ Yahoo! Answers - Got a question? Someone out there knows the answer. Try it now. 
http://uk.answers.yahoo.com/ From deliverable at gmail.com Fri Nov 2 07:48:10 2007 From: deliverable at gmail.com (Alexy Khrabrov) Date: Fri, 2 Nov 2007 17:48:10 +0300 Subject: order of options to ngram-count In-Reply-To: <915733.57043.qm@web25403.mail.ukl.yahoo.com> References: <915733.57043.qm@web25403.mail.ukl.yahoo.com> Message-ID: <00FBEA2E-34F7-49DB-8266-114FA829A2B8@gmail.com> Ilya, thanks! Umm, I've typed these lines anew from what I've run before -- and there was a real -order 1 there. In any case, my control run shows OK now. I understand there's a tradition, so in case more GNU compliance is desired, options with long names may start with -- as an option. :) Cheers, Alexy On Nov 2, 2007, at 3:22 PM, ilya oparin wrote: > > It looks like in your first "run forever" line you > forgot to put "-" right before "order" option, so > ngram-count just skips this invalid option and build > default trigrams instead of unigrams. In case you have > large data, that would take long. From stolcke at speech.sri.com Fri Nov 2 08:05:31 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Fri, 02 Nov 2007 08:05:31 PDT Subject: order of options to ngram-count In-Reply-To: Your message of Fri, 02 Nov 2007 13:25:13 +0300. Message-ID: <200711021505.lA2F5Vg06495@huge> In message you wrote: > My impression is that ngram-count is sensitive to the order of > options (1.5.4b from yesterday). > > ngram-count -text - order 1 -write unigrams.count # takes up all > memory, runs forever You have a typo in the above: "order" instead of "-order". > > ngram-count -order 1 -text - write unigrams.count # finishes up > quickly in a fraction of memory > > Does the former honor -order 1? What's the rule here -- trailing - > write is honored anyways? I now stick other flags, such as -tolower, > in front of -text - . All SRILM programs are invariant to order of options. 

--Andreas

From stolcke at speech.sri.com Fri Nov 2 12:12:00 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 02 Nov 2007 12:12:00 PDT
Subject: SRILM 1.5.4 released
Message-ID: <200711021912.lA2JC0109493@huge>

The latest version of SRILM is downloadable from the usual place:

http://www.speech.sri.com/projects/srilm/download.html

A list of changes appears below. Enjoy!

Andreas

-------------------------------------------------------------------------------

1.5.4 2 November 2007

Functionality:

* New option ngram-count -addsmooth for additive smoothing. A corresponding new discounting subclass "AddSmooth" is defined in Discount.h.
* New option ngram -server-port to start a "probability server" (based on a contribution by Elad Dinur).
* WordLattice: print lattice name in warning messages.
* lattice-tool -keep-unk option to preserve labels of OOV words in LM rescoring (currently works only for HTK lattices).
* New options nbest-optimize -anti-refs and -anti-ref-weight to decorrelate errors with another set of hypotheses.
* New support in nbest-optimize for BLEU optimization and Powell search (from Jing Zheng).
* New option ngram-class -save-maxclasses to start the saving of intermediate results when a specified number of classes is reached (suggested by Shlomo Wavrow and Mats Svenson).

Bugs:

* Fixed incorrect reference output for test "nbest-rover-acoustic".
* Fixed a possible problem with tests "ngram-class" and "ngram-count-lm-limit-vocab" in non-C locales.
* nbest-lattice: Avoid aligning reference words with -dump-errors or -wer, which would cause crash because no lattice is being generated internally.
* make-batch-counts, merge-batch-counts: be more portable by dynamically finding the right options to use with xargs.
* add-pauses-to-pfsg: Avoid using a regular expression construct that causes a gawk error in UTF-8 locales. However, to ensure this works correctly a gawk version of 3.1.5 should be used. See note in doc/README.linux.
  If the test "make-ngram-pfsg" fails, a workaround is to set LANG=C or LANG=en_US and avoid UTF-8.
* Fixes an uninitialized member variable in the unary constructor for class File, which was causing garbage to be returned on the first getline().
* common/Makefile.machine.macos: Updated Tcl linking instructions (from Chuck Wooters).
* Makefile: exit immediately if any of the subdirectories result in build errors.

From stolcke at speech.sri.com Fri Nov 2 16:43:57 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 02 Nov 2007 16:43:57 -0700
Subject: ngram-count progress
In-Reply-To:
References:
Message-ID: <472BB63D.2050103@speech.sri.com>

Alexy Khrabrov wrote:
> Is there a way to make ngram-count report its progress, e.g. print a
> dot on stderr every N processed input tokens? (Or which C++ would one
> hack at?)
>

By modifying the source code, sure.

Andreas

> Cheers,
> Alexy

From deliverable at gmail.com Sat Nov 3 03:32:59 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Sat, 3 Nov 2007 13:32:59 +0300
Subject: 1.5.4: empty ngram* on Mac OSX
Message-ID:

I downloaded 1.5.4b a day before the release and it built fine. Now I was trying to build 1.5.4, and the ngram* binaries are size 0. I am investigating this -- apparently LIBRARY changed to LIBRARIES in some, but not all, places in the make files?

/s/src/srilm grep LIBR diff-b2-1.5.4
< $(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(LIBRARY)
> $(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(LIBRARIES)
> @make binaries depend on all $(LIBRARIES)
< $(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(LIBRARY)
> $(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(LIBRARIES)
> $(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(LIBRARY)

Cheers,
Alexy

From deliverable at gmail.com Sat Nov 3 04:55:55 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Sat, 3 Nov 2007 14:55:55 +0300
Subject: Various LM stats and building speed/performance tradeoffs
Message-ID:

Is there any table / review of the performance of various LMs implemented in SRILM, versus the time needed to build them? What are the general considerations on choosing from SRILM's vast number of LMs? The SRILM paper is from 2002 -- what about everything added after that?

Cheers,
Alexy

From deliverable at gmail.com Sat Nov 3 16:10:59 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Sun, 4 Nov 2007 02:10:59 +0300
Subject: 1.5.4: empty ngram* on Mac OSX
In-Reply-To: <200711032152.lA3LqFb03796@speech.sri.com>
References: <200711032152.lA3LqFb03796@speech.sri.com>
Message-ID: <0D6F25F1-5251-4BFF-AB73-3A2F121E6558@gmail.com>

Indeed, built executables, but some tests show DIFFERS, notably

google-ngrams: stdout output DIFFERS.
google-ngrams: stderr output DIFFERS.
make-big-lm: stdout output DIFFERS.
make-big-lm: stderr output DIFFERS.
make-big-lm-kn: stdout output DIFFERS.
make-big-lm-kn: stderr output DIFFERS.
make-ngram-pfsg: stdout output DIFFERS.
make-ngram-pfsg: stderr output DIFFERS.
make-unigram-pfsg: stdout output DIFFERS.
make-unigram-pfsg: stderr output DIFFERS.
nbest-optimize-bleu: stdout output IDENTICAL (IEEE version).
nbest-optimize-bleu: stderr output DIFFERS.
nbest-rescore: stdout output DIFFERS.
nbest-rescore: stderr output DIFFERS.
nbest-rover: stdout output DIFFERS.
nbest-rover: stderr output DIFFERS.
nbest-rover-acoustic: no reference stdout output found.
nbest-rover-acoustic: stderr output DIFFERS.
nbest-rover-posteriors: no reference stdout output found.
nbest-rover-posteriors: stderr output DIFFERS.
ngram-count-abs: stdout output DIFFERS.
ngram-count-abs: stderr output DIFFERS.
ngram-count-gt: stdout output IDENTICAL.
ngram-count-gt: stderr output DIFFERS.

Everything else shows IDENTICAL.

BTW, the 1.5.4 contains RCS subdirs not present in 1.5.4b -- so patch program checked out the file!

I surely can send full test output or anything else you might find interesting!

Cheers,
Alexy

On Nov 4, 2007, at 12:52 AM, Andreas Stolcke wrote:
> [...]
> See if the patch below fixes your problem.
[...]

From gelbart at icsi.berkeley.edu Sun Nov 4 12:39:57 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Sun, 4 Nov 2007 12:39:57 -0800 (PST)
Subject: SRILM 1.5.4 build problem
In-Reply-To: <200710191800.l9JI0a620166@huge>
References: <200710191800.l9JI0a620166@huge>
Message-ID:

Hello,

With SRILM 1.5.4 on Fedora 7, I am seeing many build errors caused by a link step being skipped. Is anyone else seeing this? The problem results in make output that looks like this:

/usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \
 -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. \
 -I../../include -c -g -O3 -o ../obj/i686/ngram-merge.o ngram-merge.cc
/root/srilm-1.5.4-build/sbin/decipher-install 0555 \
 ../bin/i686/ngram-merge ../../bin/i686
ERROR: File to be installed (../bin/i686/ngram-merge) does not exist.

In addition to ngram-merge this also happens for ngram-count, ngram-class, and many others. If I build SRILM 1.5.3 on the same machine, there is no error and the make output looks like this:

/usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \
 -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I.
\ -I../../include -c -g -O3 -o ../obj/i686/ngram-merge.o ngram-merge.cc /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \ -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. \ -I../../include -u matherr -L../../lib/i686 -g -O3 \ -o ../bin/i686/ngram-merge ../obj/i686/ngram-merge.o \ ../obj/i686/liboolm.a -lm -ldl ../../lib/i686/libflm.a \ ../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a -lm 2>&1 | c++filt /root/srilm-1.5.3-build/sbin/decipher-install 0555 \ ../bin/i686/ngram-merge ../../bin/i686 Regards, David From gelbart at icsi.berkeley.edu Sun Nov 4 12:41:34 2007 From: gelbart at icsi.berkeley.edu (David Gelbart) Date: Sun, 4 Nov 2007 12:41:34 -0800 (PST) Subject: SRILM 1.5.4 build problem In-Reply-To: References: <200710191800.l9JI0a620166@huge> Message-ID: By the way, I am using GNU Make 3.81. On Sun, 4 Nov 2007, David Gelbart wrote: > Hello, > > With SRILM 1.5.4 on Fedora 7, I am seeing many build errors caused by a link > step being skipped. Is anyone else seeing this? The problem results in make > output that looks like this: > > /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \ > -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. \ > -I../../include -c -g -O3 -o ../obj/i686/ngram-merge.o ngram-merge.cc > /root/srilm-1.5.4-build/sbin/decipher-install 0555 \ > ../bin/i686/ngram-merge ../../bin/i686 > ERROR: File to be installed (../bin/i686/ngram-merge) does not exist. > > In addition to ngram-merge this also happens for ngram-count, ngram-class, > and many others. If I build SRILM 1.5.3 on the same machine, there is no > error and the make output looks like this: > > /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \ > -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. \ > -I../../include -c -g -O3 -o ../obj/i686/ngram-merge.o ngram-merge.cc > /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \ > -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. 
\ > -I../../include -u matherr -L../../lib/i686 -g -O3 \ > -o ../bin/i686/ngram-merge ../obj/i686/ngram-merge.o \ > ../obj/i686/liboolm.a -lm -ldl ../../lib/i686/libflm.a \ > ../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a -lm 2>&1 | c++filt > /root/srilm-1.5.3-build/sbin/decipher-install 0555 \ > ../bin/i686/ngram-merge ../../bin/i686 > > Regards, > David > From barabbas at gmail.com Sun Nov 4 13:16:23 2007 From: barabbas at gmail.com (Tian-Jian "Barabbas" Jiang@Gmail) Date: Mon, 05 Nov 2007 05:16:23 +0800 Subject: SRILM 1.5.4 build problem In-Reply-To: References: <200710191800.l9JI0a620166@huge> Message-ID: <472E36A7.2040305@gmail.com> I encountered the same problem on Mac OS X. David Gelbart wrote: > By the way, I am using GNU Make 3.81. > > On Sun, 4 Nov 2007, David Gelbart wrote: > >> Hello, >> >> With SRILM 1.5.4 on Fedora 7, I am seeing many build errors caused by >> a link step being skipped. Is anyone else seeing this? The problem >> results in make output that looks like this: >> >> /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \ >> -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. \ >> -I../../include -c -g -O3 -o ../obj/i686/ngram-merge.o ngram-merge.cc >> /root/srilm-1.5.4-build/sbin/decipher-install 0555 \ >> ../bin/i686/ngram-merge ../../bin/i686 >> ERROR: File to be installed (../bin/i686/ngram-merge) does not exist. >> >> In addition to ngram-merge this also happens for ngram-count, >> ngram-class, and many others. If I build SRILM 1.5.3 on the same >> machine, there is no error and the make output looks like this: >> >> /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \ >> -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. \ >> -I../../include -c -g -O3 -o ../obj/i686/ngram-merge.o ngram-merge.cc >> /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \ >> -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. 
\ >> -I../../include -u matherr -L../../lib/i686 -g -O3 \ >> -o ../bin/i686/ngram-merge ../obj/i686/ngram-merge.o \ >> ../obj/i686/liboolm.a -lm -ldl ../../lib/i686/libflm.a \ >> ../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a -lm 2>&1 | >> c++filt >> /root/srilm-1.5.3-build/sbin/decipher-install 0555 \ >> ../bin/i686/ngram-merge ../../bin/i686 >> >> Regards, >> David >> > From gelbart at icsi.berkeley.edu Sun Nov 4 20:42:52 2007 From: gelbart at icsi.berkeley.edu (David Gelbart) Date: Sun, 4 Nov 2007 20:42:52 -0800 (PST) Subject: SRILM 1.5.4 build problem In-Reply-To: References: <200710191800.l9JI0a620166@huge> Message-ID: I haven't been able to see where the problem has come from. Between 1.5.3 and 1.5.4, the only change I see to the link rule under the heading "# Program linking" in Makefile.common.targets is that "$(LIBRARY)" in the first line of the rule changed to "$(LIBRARIES)" Here is the 1.5.4 make output again, but with make's --debug option in use: /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \ -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. \ -I../../include -c -g -O3 -o ../obj/i686/ngram.o ngram.cc Successfully remade target file `../obj/i686/ngram.o'. Must remake target `../bin/i686/ngram'. Successfully remade target file `../bin/i686/ngram'. Must remake target `../../bin/i686/ngram'. /root/srilm-1.5.4/sbin/decipher-install 0555 ../bin/i686/ngram \ ../../bin/i686 ERROR: File to be installed (../bin/i686/ngram) does not exist. Regards, David From stolcke at speech.sri.com Sun Nov 4 20:53:26 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Sun, 04 Nov 2007 20:53:26 PST Subject: SRILM 1.5.4 build problem In-Reply-To: Your message of Sun, 04 Nov 2007 20:42:52 -0800. Message-ID: <200711050453.lA54rQEW017409@dylan.speech.sri.com> In message you wrote: > I haven't been able to see where the problem has come from. 
Between > 1.5.3 and 1.5.4, the only change I see to the link rule under the > heading "# Program linking" in Makefile.common.targets is that > "$(LIBRARY)" in the first line of the rule changed to "$(LIBRARIES)" > > Here is the 1.5.4 make output again, but with make's --debug option in > use: > > /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \ > -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. \ > -I../../include -c -g -O3 -o ../obj/i686/ngram.o ngram.cc > Successfully remade target file `../obj/i686/ngram.o'. > Must remake target `../bin/i686/ngram'. > Successfully remade target file `../bin/i686/ngram'. > Must remake target `../../bin/i686/ngram'. > /root/srilm-1.5.4/sbin/decipher-install 0555 ../bin/i686/ngram \ > ../../bin/i686 > ERROR: File to be installed (../bin/i686/ngram) does not exist. Frankly, I don't understand how this change can lead to the observed problem, which I have not been able to duplicate on our machines. But try changing line 104 of common/Makefile.common.targets to $(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(filter-out -%, $(LIBRARIES)) IF that doesn't work, try $(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(LIBRARIES) (the original) and report results back to me. Be sure to "make cleanest" before each trial. Thanks, and sorry for this inconvenience. Andreas From barabbas at gmail.com Sun Nov 4 22:24:25 2007 From: barabbas at gmail.com (Barabbas Jiang@Gmail) Date: Mon, 05 Nov 2007 14:24:25 +0800 Subject: SRILM 1.5.4 build problem In-Reply-To: <200711050453.lA54rQEW017409@dylan.speech.sri.com> References: <200711050453.lA54rQEW017409@dylan.speech.sri.com> Message-ID: <472EB719.7030203@gmail.com> Hi all, Andreas Stolcke wrote: > In message you wrote: > >> I haven't been able to see where the problem has come from. 
Between
>> 1.5.3 and 1.5.4, the only change I see to the link rule under the
>> heading "# Program linking" in Makefile.common.targets is that
>> "$(LIBRARY)" in the first line of the rule changed to "$(LIBRARIES)"
>>
>> Here is the 1.5.4 make output again, but with make's --debug option in
>> use:
>>
>> /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \
>> -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. \
>> -I../../include -c -g -O3 -o ../obj/i686/ngram.o ngram.cc
>> Successfully remade target file `../obj/i686/ngram.o'.
>> Must remake target `../bin/i686/ngram'.
>> Successfully remade target file `../bin/i686/ngram'.
>> Must remake target `../../bin/i686/ngram'.
>> /root/srilm-1.5.4/sbin/decipher-install 0555 ../bin/i686/ngram \
>> ../../bin/i686
>> ERROR: File to be installed (../bin/i686/ngram) does not exist.
>>
>
> Frankly, I don't understand how this change can lead to the observed problem,
> which I have not been able to duplicate on our machines.
> But try changing line 104 of common/Makefile.common.targets to
>
> $(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(filter-out -%, $(LIBRARIES))

This patch works for me on Mac OS X now!

Cheers,
/Mike/

From gelbart at icsi.berkeley.edu Mon Nov 5 11:52:40 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Mon, 5 Nov 2007 11:52:40 -0800 (PST)
Subject: SRILM 1.5.4 build problem
In-Reply-To: <472EB719.7030203@gmail.com>
References: <200711050453.lA54rQEW017409@dylan.speech.sri.com> <472EB719.7030203@gmail.com>
Message-ID:

>> Frankly, I don't understand how this change can lead to the observed problem,
>> which I have not been able to duplicate on our machines.
>> But try changing line 104 of common/Makefile.common.targets to
>>
>> $(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(filter-out -%, $(LIBRARIES))
>
> This patch works for me on Mac OS X now!

The patch works for me on Fedora 7 as well.
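
As a rough illustration of what the suggested $(filter-out -%, $(LIBRARIES)) does, here is a small Python emulation (not make itself, and not SRILM code): it drops every word beginning with "-" from a whitespace-separated list, so that only file names remain in the prerequisite list of the link rule. The sample list is an assumption modeled on the 1.5.3 link line quoted earlier in this thread; the actual value of $(LIBRARIES) is not shown in these messages.

```python
def filter_out_dash_words(text):
    """Emulate make's $(filter-out -%, text) for this single pattern:
    keep only the words that do not start with '-'."""
    return " ".join(word for word in text.split() if not word.startswith("-"))


# Hypothetical stand-in for $(LIBRARIES), taken from the quoted link command.
libraries = (
    "../obj/i686/liboolm.a -lm -ldl ../../lib/i686/libflm.a "
    "../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a -lm"
)

prerequisites = filter_out_dash_words(libraries)
print(prerequisites)
```

Linker flags such as -lm and -ldl are removed, while the .a archive paths are kept in their original order.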

Regards,
David

From deliverable at gmail.com Mon Nov 5 12:59:43 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Mon, 5 Nov 2007 23:59:43 +0300
Subject: running time estimate of -lm
Message-ID: <83AACD84-A6ED-4DA3-93B9-89A275C67C92@gmail.com>

I launched ngram-count -order 2 -lm with a 1 billion word corpus a few days ago, and it's still going, after 4,600 minutes of CPU time (2.66 GHz Xeon 64-bit). Originally it took about 8 GB of RAM, then decreased by about 25%, and now it is climbing back. What is the overall running time estimate of -lm without any other options? Simple runs for about 15 million words finished in about 15 minutes.

Cheers,
Alexy

From ioparin at yahoo.co.uk Tue Nov 6 00:43:36 2007
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Tue, 6 Nov 2007 08:43:36 +0000 (GMT)
Subject: running time estimate of -lm
In-Reply-To: <83AACD84-A6ED-4DA3-93B9-89A275C67C92@gmail.com>
Message-ID: <102682.35796.qm@web25401.mail.ukl.yahoo.com>

Hi,

It's really worth using the make-big-lm script (documented in the training-scripts section of the manual) for training such huge models.

Ilya

--- Alexy Khrabrov wrote:

> I've launched ngram-count -order 2 -lm with a 1
> billion word corpus a
> few days ago, and it's still going, after 4,600
> minutes of CPU time
> (2.66 GHz Xeon 64-bit). Originally it took about 8
> GB of RAM, then
> decreased by about 25%, now is climbing back. What
> is the overall
> running time estimate of -lm without any other
> options? Simple runs
> for about 15 million words finished in about 15
> minutes.
>
> Cheers,
> Alexy
>

best regards,
Ilya

From stolcke at speech.sri.com Tue Nov 6 00:54:50 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 06 Nov 2007 00:54:50 -0800
Subject: running time estimate of -lm
In-Reply-To: Your message of Tue, 06 Nov 2007 08:43:36 +0000.
 <102682.35796.qm@web25401.mail.ukl.yahoo.com>
Message-ID: <200711060854.lA68soW10632@speech.sri.com>

Also, it isn't clear from the original message if counts were produced beforehand, or if ngram-count is in fact invoked directly on the billion-word corpus. In that case it's no wonder it takes forever, since it is probably paging itself to death.

Use make-batch-counts/merge-batch-counts, and make-big-lm as explained in the training-scripts(1) man page.

--Andreas

In message <102682.35796.qm at web25401.mail.ukl.yahoo.com> you wrote:
> Hi,
>
> It's really worth using make-big-lm script (documented
> in training-scripts section of the manual) for
> training such huge models.
>
> Ilya
>
> --- Alexy Khrabrov wrote:
>
> > I've launched ngram-count -order 2 -lm with a 1
> > billion word corpus a
> > few days ago, and it's still going, after 4,600
> > minutes of CPU time
> > (2.66 GHz Xeon 64-bit). Originally it took about 8
> > GB of RAM, then
> > decreased by about 25%, now is climbing back. What
> > is the overall
> > running time estimate of -lm without any other
> > options? Simple runs
> > for about 15 million words finished in about 15
> > minutes.
> >
> > Cheers,
> > Alexy
> >
>
>
> best regards,
> Ilya

From deliverable at gmail.com Tue Nov 6 04:34:31 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Tue, 6 Nov 2007 15:34:31 +0300
Subject: running time estimate of -lm
In-Reply-To: <200711060854.lA68soW10632@speech.sri.com>
References: <200711060854.lA68soW10632@speech.sri.com>
Message-ID:

Indeed, the counts were not precomputed. However, there's enough memory, and ngram-count has never used even half of the RAM so far with the bigrams of a billion-word corpus. No paging at all... Is there hope that it'll end after a few days, or will I have to redo it following training-scripts(1)?

Cheers,
Alexy

On Nov 6, 2007, at 11:54 AM, Andreas Stolcke wrote:

>
> Also, it isn't clear from the original message if counts were produced
> beforehand, or if ngram-count is in fact invoked directly on the
> billion-word corpus. In that case it's no wonder it takes forever,
> since it is probably paging itself to death.
>
> Use make-batch-counts/merge-batch-counts, and make-big-lm as explained
> in the training-scripts(1) man page.
>
> --Andreas
>
> In message <102682.35796.qm at web25401.mail.ukl.yahoo.com> you wrote:
>> Hi,
>>
>> It's really worth using make-big-lm script (documented
>> in training-scripts section of the manual) for
>> training such huge models.
>>
>> Ilya
>>
>> --- Alexy Khrabrov wrote:
>>
>>> I've launched ngram-count -order 2 -lm with a 1
>>> billion word corpus a
>>> few days ago, and it's still going, after 4,600
>>> minutes of CPU time
>>> (2.66 GHz Xeon 64-bit). Originally it took about 8
>>> GB of RAM, then
>>> decreased by about 25%, now is climbing back. What
>>> is the overall
>>> running time estimate of -lm without any other
>>> options? Simple runs
>>> for about 15 million words finished in about 15
>>> minutes.
>>>
>>> Cheers,
>>> Alexy
>>>
>>
>>
>> best regards,
>> Ilya
>

From deliverable at gmail.com Tue Nov 6 09:20:44 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Tue, 6 Nov 2007 20:20:44 +0300
Subject: billion word -lm finished
Message-ID: <8EFD533C-A2E3-43A2-BAB1-6B3BC5804E0E@gmail.com>

I'm glad to report that the full -lm model of -order 2 over a billion words builds from scratch in about 100 CPU hours!

Cheers,
Alexy

From deliverable at gmail.com Tue Nov 6 09:58:20 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Tue, 6 Nov 2007 20:58:20 +0300
Subject: billion word -lm finished
In-Reply-To: <200711061728.lA6HSxO13767@huge>
References: <200711061728.lA6HSxO13767@huge>
Message-ID: <684FD65B-DF5A-492E-BF9A-2FCAB039AAF2@gmail.com>

CPU-optimized (right after make World) -- these are 1.5.4b binaries from one day before the 1.5.4 release. Compiled with -march=nocona -mtune=nocona for the Xeons. I did run it under time:

% time cat list | xargs cat | ngram-count -text - -order 2 -lm model-1
warning: discount coeff 1 is out of range: 0
cat list  0,00s user 0,00s system 0% cpu 6:41,85 total
xargs cat  0,66s user 15,31s system 2% cpu 11:23,54 total
ngram-count -text - -order 2 -lm model-1  350025,89s user 91,27s system 100% cpu 96:52:30,83 total

BTW, is the warning expected? I am always getting it with simple -lm from scratch.

Cheers,
Alexy

On Nov 6, 2007, at 8:28 PM, Andreas Stolcke wrote:

>
> What version of the binaries did you use ?
> Cpu or space-optimized (_c) ?
>
> It would have been good to run this with the unix "time" command
> to get real and cpu time statistics.
>
> --Andreas
>
> In message <8EFD533C-A2E3-43A2-BAB1-6B3BC5804E0E at gmail.com> you wrote:
>> I'm glad to report that the full -lm model of -order 2 over a billion
>> words builds from scratch in about 100 CPU hours!
>>
>> Cheers,
>> Alexy
>

From deliverable at gmail.com Tue Nov 6 10:19:39 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Tue, 6 Nov 2007 21:19:39 +0300
Subject: ngram -server-port for an 8-bit encoding
Message-ID:

How do I use ngram -server-port with an 8-bit encoding? Telnetting to the port cuts off the 8th bit...

Cheers,
Alexy

From stolcke at speech.sri.com Tue Nov 6 10:28:03 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 06 Nov 2007 10:28:03 PST
Subject: ngram -server-port for an 8-bit encoding
In-Reply-To: Your message of Tue, 06 Nov 2007 21:19:39 +0300.
Message-ID: <200711061828.lA6IS3R20504@huge>

The following telnet options might be of interest:

     -8      Specifies an 8-bit data path. This causes an attempt to negotiate
             the TELNET BINARY option on both input and output.

     -E      Stops any character from being recognized as an escape character.

     -L      Specifies an 8-bit data path on output. This causes the BINARY
             option to be negotiated on output.

--Andreas

In message you wrote:
> How do I use ngram -server-port with an 8-bit encoding? Telnetting
> to the port cuts off the 8th bit...
>
> Cheers,
> Alexy

From stolcke at speech.sri.com Tue Nov 6 11:12:55 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 06 Nov 2007 11:12:55 PST
Subject: SRILM 1.5.5 released
Message-ID: <200711061912.lA6JCtK25573@huge>

Seeing as the last release was beset by some portability issues I created a new release. Hopefully this one will cause less trouble.

--Andreas

1.5.5 6 November 2007

Bug fixes:

* Fixed Makefile problem in binaries depending on libraries that was preventing executables being generated on some platforms.
* Fixed a compilation problem with MSVC for nbest-optimize.
* Use MSVC _getpid() in ngram -generate random seed initialization.

From jadell at gps.tsc.upc.edu Fri Nov 9 02:12:16 2007
From: jadell at gps.tsc.upc.edu (Jordi Adell)
Date: Fri, 09 Nov 2007 11:12:16 +0100
Subject: Including srilm *.a inside a .so
Message-ID: <47343280.1000900@gps.tsc.upc.edu>

Dear Andreas,

I have recently started using the SRILM toolkit, which I think is a very useful and very well made tool. Congratulations.

First, a note: the documentation of the LM library does not explain that the order has to be specified in the constructor or by using the setorder() function. Therefore, when you read an LM file using the LM::read() function without taking this into account, the maximum order is always three.

OK, now my question.
I'm using SRILM inside a shared object, so I included it like this:

g++ -shared -Wl,-z,muldefs,-whole-archive,-lflm,-llattice,-lmisc,-ldstruct,-loolm,-no-whole-archive -o lib.so

This means that ALL symbols are included in lib.so whether needed or not. In particular, I have a problem with Tcl_AppInit. If you compile the libraries with the TCL option on, then tclmain.cc is included inside the library libmisc.a:

$> nm libmisc.a | grep tclmain
tclmain.o:

And this object has two undefined symbols:

tclmain.o:
         U Tcl_AppInit
         U Tcl_Main

These symbols should be defined in the Tcl library; however, I'm using tcl8.4 and the Tcl_AppInit symbol is not defined there. tcl.h says this:

/*
 * Convenience declaration of Tcl_AppInit for backwards compatibility.
 * This function is not *implemented* by the tcl library, so the storage
 * class is neither DLLEXPORT nor DLLIMPORT
 */
#undef TCL_STORAGE_CLASS
#define TCL_STORAGE_CLASS

EXTERN int Tcl_AppInit _ANSI_ARGS_((Tcl_Interp *interp));

Therefore, if I try to use libmisc compiled with TCL inside the previously mentioned shared object lib.so, this error is given:

ldd -d lib.so
undefined symbol: Tcl_AppInit (/home/lib.so)

I noticed that if I compile the libraries with TCL off, then this problem disappears because tclmain.o is not included in the library.

I wonder whether this is how it should work or if it is a bug that could be fixed in the next SRILM version.

I hope this is useful for somebody; it took me a while to understand why I couldn't include the libraries inside my .so. The key point to do so with tcl8.4 is to compile SRILM without the TCL option. This means setting NO_TCL = X in the appropriate makefile in srilm/common/

Best Regards. Good job!

--
_______________________________________________________________________
Jordi Adell Mercado
TALP Research Center
Signal and Communication Theory Dpt.
Universitat Politècnica de Catalunya (UPC)
c/Jordi Girona 1-3      e-mail: jadell at gps.tsc.upc.es
Campus Nord D5-120      web: http://gps-tsc.upc.es/veu/personal/jadell
08034 - Barcelona       phone: 93-401.16.27
________________________________________________________________________

From stolcke at speech.sri.com Fri Nov 9 14:21:18 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 09 Nov 2007 14:21:18 PST
Subject: Including srilm *.a inside a .so
In-Reply-To: Your message of Fri, 09 Nov 2007 11:12:16 +0100. <47343280.1000900@gps.tsc.upc.edu>
Message-ID: <200711092221.lA9MLIF00143@huge>

>
> Dear Andreas,
>
> I'm recently using SRILM toolkit, which I think is a very useful
> tool and very well done. Congratulations.

Thanks, that's nice to hear !

>
> Just a previous note: in the documentation of the LM library there
> is no explanation that the order has to be specified in the constructor
> or by using the setorder() function. En therefore, when you read a LM
> file using the LM::read() function if one do not take this into account
> the maximum order is always three.

The library is not well documented, as you know. The fact that the ngram order defaults to 3 is documented in the ngram-count man page. I certainly don't want to make any claims for the quality of the documentation in general. BTW, if anyone feels like improving the documentation (fixing or expanding it) I'd be more than happy to accept submissions ...

> OK, now my question. I'm using SRILM inside a shared object,
> therefore I included it like this:
>
> g++ -shared
> -Wl,-z,muldefs,-whole-archive,-lflm,-llattice,-lmisc,-ldstruct,-loolm,-no-whole-archive
> -o lib.so
> ...
>
> I noticed that if I compile the libraries wit TCL OFF then this
> problem disappears because tclmain.o is not included in the library.
>
>
> I wonder whether this is how it should work or if it is a bug that
> could be arranged for next SRILM version.
>
> I hope this is useful for somebody it took my a while to understand
> why I couldn't include the libraries inside my .so. They key point to do
> so with tcl8.4 is to compile SRILM without TCL option. This means to set
> NO_TCL = X in the appropriate makefile in srilm/common/

I would recommend just disabling the Tcl stuff to avoid the problem. It's not important enough to track down this dynamic linker issue.

Andreas

From deliverable at gmail.com Sat Nov 10 10:05:05 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Sat, 10 Nov 2007 21:05:05 +0300
Subject: 7z as a much better archiver than gz/bz2
Message-ID: <56634EE6-9BC9-476C-B7FA-457584199559@gmail.com>

Greetings -- I've switched to 7z for most of my corpora compression, as it gives results that are better than gz by a whole-number factor, and 1.1-1.5 times better than bz2. It would be nice to see it used more, especially for the huge kind of things we do here. E.g., a 4.0 GB lm file was compressed by 7za (a command line version for linux) to 642 MB. 7za is multi-core CPU aware and knows all about locales and encodings as well.

http://www.7-zip.org/

Cheers,
Alexy

From save.climate at gmail.com Sat Nov 10 12:59:18 2007
From: save.climate at gmail.com (Kamadev Bhanuprasad)
Date: Sat, 10 Nov 2007 21:59:18 +0100
Subject: 7z as a much better archiver than gz/bz2
In-Reply-To: <56634EE6-9BC9-476C-B7FA-457584199559@gmail.com>
References: <56634EE6-9BC9-476C-B7FA-457584199559@gmail.com>
Message-ID: <244d59a50711101259v60acb8e8tf3743520f2d92aa6@mail.gmail.com>

Alexy,
I strongly believe this mailing list is not suitable for spam like this.
If you want to tell the public which compression utilities you use or how much CPU time your particular computation took, please use your personal blog or something like that. I'm pretty sure that the vast majority of people on this list are not interested in receiving messages of this kind.

Best,
Kamadev

On Nov 10, 2007 7:05 PM, Alexy Khrabrov wrote:
> Greetings -- I've switched to 7z for most of corpora compression, as
> it gives results which are whole number of times better than gz, and
> 1.1-1.5 better than bz2. Would be nice to see it used more,
> especially for the huge kind of things we do here. E.g., a 4.0 GB lm
> file was compressed by 7za (a command line version for linux) to 642
> MB. 7za is multi-core CPU aware and knows all about locales and
> encodings as well.
>
> http://www.7-zip.org/
>
> Cheers,
> Alexy
>

From deliverable at gmail.com Sat Nov 10 13:10:46 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Sun, 11 Nov 2007 00:10:46 +0300
Subject: 7z as a much better archiver than gz/bz2
In-Reply-To: <244d59a50711101259v60acb8e8tf3743520f2d92aa6@mail.gmail.com>
References: <56634EE6-9BC9-476C-B7FA-457584199559@gmail.com> <244d59a50711101259v60acb8e8tf3743520f2d92aa6@mail.gmail.com>
Message-ID: <0530AD1E-A6FB-4120-AEC1-4054731008D6@gmail.com>

(Kamadev -- I think you misunderstood my message.)

I was wondering whether folks manage to use 7z to speed up access to their LMs. By default, ngram can read gzipped files as well as the originals. Yet gzipped versions are still much larger than the 7z-compressed ones. 7z is an Open Source package with which I have no affiliation...

Looking over the 7z options, I found that one can extract a file to stdout with it too, e.g.

7z e archive.7z -so

It would be possible to do that for a huge LM and feed it to ngram by piping to ngram -lm - -- yet the problem is, I already use ngram -ppl - to serve perplexities. I would appreciate other folks' experiences with speeding up the loading of huge LMs.
Same could be applied to bz2 as well, and any other archiver better than gz.

On Nov 10, 2007, at 11:59 PM, Kamadev Bhanuprasad wrote:
[...]

From runxin.li at gmail.com Sat Nov 10 21:05:19 2007
From: runxin.li at gmail.com (Runxin Li)
Date: Sun, 11 Nov 2007 13:05:19 +0800
Subject: Re: 7z as a much better archiver than gz/bz2
In-Reply-To: <244d59a50711101259v60acb8e8tf3743520f2d92aa6@mail.gmail.com>
References: <56634EE6-9BC9-476C-B7FA-457584199559@gmail.com> <244d59a50711101259v60acb8e8tf3743520f2d92aa6@mail.gmail.com>
Message-ID: <47368d88.25bb720a.5e6c.0e1e@mx.google.com>

Totally agree with you

-----Original Message-----
From: owner-srilm-user at speech.sri.com [mailto:owner-srilm-user at speech.sri.com] On Behalf Of Kamadev Bhanuprasad
Sent: November 11, 2007 4:59
To: Alexy Khrabrov
Cc: srilm-user at speech.sri.com
Subject: Re: 7z as a much better archiver than gz/bz2

Alexy,
I strongly believe this mailing list is not suitable for spams like this. If you want to present to public which compression utilities you use or how much of cpu time your particular computation took, please, use your personal blog or something like that. I'm pretty much sure that vast majority of people in this list is not interested in receiving messages of this kind.

Best,
Kamadev

On Nov 10, 2007 7:05 PM, Alexy Khrabrov wrote:
> Greetings -- I've switched to 7z for most of corpora compression, as
> it gives results which are whole number of times better than gz, and
> 1.1-1.5 better than bz2. Would be nice to see it used more,
> especially for the huge kind of things we do here. E.g., a 4.0 GB lm
> file was compressed by 7za (a command line version for linux) to 642
> MB. 7za is multi-core CPU aware and knows all about locales and
> encodings as well.
>
> http://www.7-zip.org/
>
> Cheers,
> Alexy
>

From stolcke at speech.sri.com Sun Nov 11 08:27:29 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 11 Nov 2007 08:27:29 PST
Subject: 7z as a much better archiver than gz/bz2
In-Reply-To: Your message of Sat, 10 Nov 2007 21:59:18 +0100. <244d59a50711101259v60acb8e8tf3743520f2d92aa6@mail.gmail.com>
Message-ID: <200711111627.lABGRTE29698@huge>

In message <244d59a50711101259v60acb8e8tf3743520f2d92aa6 at mail.gmail.com> you wrote:
> Alexy,
> I strongly believe this mailing list is not suitable for spams like
> this. If you want to present to public which compression utilities you
> use or how much of cpu time your particular computation took, please,
> use your personal blog or something like that. I'm pretty much sure
> that vast majority of people in this list is not interested in
> receiving messages of this kind.
>
> Best,
> Kamadev
>
> On Nov 10, 2007 7:05 PM, Alexy Khrabrov wrote:
> > Greetings -- I've switched to 7z for most of corpora compression, as
> > it gives results which are whole number of times better than gz, and
> > 1.1-1.5 better than bz2. Would be nice to see it used more,
> > especially for the huge kind of things we do here. E.g., a 4.0 GB lm
> > file was compressed by 7za (a command line version for linux) to 642
> > MB. 7za is multi-core CPU aware and knows all about locales and
> > encodings as well.
> >
> > http://www.7-zip.org/
> >
> > Cheers,
> > Alexy
> >

I actually think that Alexy's message is relevant to this list, since managing large LMs is a nontrivial problem.

I had not heard of 7-zip before, and took a look. It does seem to produce slightly smaller files than bzip2, so it is definitely of interest for LM compression. One drawback is longer compression times (the software even uses multithreading on multi-CPU machines to speed that up).

But, in any case, it was easy enough to add reading/writing of 7z files to the relevant library code.
You simply have to replace the attached two files in $SRILM/misc/src. BTW, I tested this with the Unix port of 7z found at http://p7zip.sourceforge.net/ . I have NOT tested it on Windows using the original 7-zip software. Also, BTW, if you are concerned with LM reading/writing speed (and decent "compression" compared to text format), I would recommend the binary LM format. Andreas -------------- next part -------------- /* File: zio.h Author: Andreas Stolcke Date: Wed Feb 15 15:19:44 PST 1995 Description: Copyright (c) 1994-2007, SRI International. All Rights Reserved. RCS ID: $Id: zio.h,v 1.13 2007/11/11 16:06:53 stolcke Exp $ */ /* * $Log: zio.h,v $ * Revision 1.13 2007/11/11 16:06:53 stolcke * 7zip compression support * * Revision 1.12 2006/08/04 23:59:09 stolcke * MSVC portability * * Revision 1.11 2006/03/28 01:15:10 stolcke * include sys/signal.h to check for SIGPIPE * * Revision 1.10 2006/03/06 05:46:43 stolcke * define NO_ZIO in zio.h instead of zio.c * * Revision 1.9 2006/03/01 00:45:45 stolcke * allow disabling of zio for windows environment (NO_ZIO) * * Revision 1.8 2005/12/16 23:30:09 stolcke * added support for bzip2-compressed files * * Revision 1.7 2003/02/21 20:18:53 stolcke * avoid conflict if zopen is already defined in library * * Revision 1.6 1999/10/13 09:07:13 stolcke * make filename checking functions public * * Revision 1.5 1995/06/22 19:58:26 stolcke * ansi-fied * * Revision 1.4 1995/06/12 22:56:37 tmk * Added ifdef around the redefinitions of fopen() and fclose(). * */ /******************************************************************* Copyright 1994 SRI International. All rights reserved. This is an unpublished work of SRI International and is not to be used or disclosed except as provided in a license agreement or nondisclosure agreement with SRI International. ********************************************************************/ #ifndef _ZIO_H #define _ZIO_H #ifdef __cplusplus extern "C" { #endif /* Include declarations files. 
*/ #include #include // to check for SIGPIPE /* Avoid conflict with library function */ #ifdef HAVE_ZOPEN #define zopen my_zopen #endif /* Constants */ #if !defined(SIGPIPE) #define NO_ZIO #endif #ifdef NO_ZIO # define COMPRESS_SUFFIX "" # define GZIP_SUFFIX "" # define OLD_GZIP_SUFFIX "" # define BZIP2_SUFFIX "" # define SEVENZIP_SUFFIX "" #else # define COMPRESS_SUFFIX ".Z" # define GZIP_SUFFIX ".gz" # define OLD_GZIP_SUFFIX ".z" # define BZIP2_SUFFIX ".bz2" # define SEVENZIP_SUFFIX ".7z" #endif /* NO_ZIO */ /* Define function prototypes. */ int stdio_filename_p (const char *name); int compressed_filename_p (const char *name); int gzipped_filename_p (const char *name); int bzipped_filename_p (const char *name); int sevenzipped_filename_p (const char *name); FILE * zopen (const char *name, const char *mode); int zclose (FILE *stream); /* Users of this header implicitly always use zopen/zclose in stdio */ #ifdef ZIO_HACK #define fopen(name,mode) zopen(name,mode) #define fclose(stream) zclose(stream) #endif #ifdef __cplusplus } #endif #endif /* _ZIO_H */ -------------- next part -------------- /* File: zio.c Author: Andreas Stolcke Date: Wed Feb 15 15:19:44 PST 1995 Description: Compressed file stdio extension */ #ifndef lint static char Copyright[] = "Copyright (c) 1995-2007 SRI International. 
All Rights Reserved."; static char RcsId[] = "@(#)$Header: /home/srilm/devel/misc/src/RCS/zio.c,v 1.25 2007/11/11 16:06:53 stolcke Exp $"; #endif /* * $Log: zio.c,v $ * Revision 1.25 2007/11/11 16:06:53 stolcke * 7zip compression support * * Revision 1.24 2006/03/06 05:46:43 stolcke * define NO_ZIO in zio.h instead of zio.c * * Revision 1.23 2006/03/01 00:45:45 stolcke * allow disabling of zio for windows environment (NO_ZIO) * * Revision 1.22 2006/01/09 17:39:03 stolcke * MSVC port * * Revision 1.21 2006/01/05 19:32:42 stolcke * ms visual c portability * * Revision 1.20 2005/12/16 23:30:09 stolcke * added support for bzip2-compressed files * * Revision 1.19 2005/07/28 21:08:15 stolcke * include signal.h for portability * * Revision 1.18 2005/07/28 18:37:47 stolcke * portability for systems w/o pipes * * Revision 1.17 2004/01/31 01:17:51 stolcke * don't declare errno, get it from errno.h * * Revision 1.16 2003/11/09 21:09:11 stolcke * use gunzip -f to allow uncompressed files ending in .gz * * Revision 1.15 2003/11/01 06:18:30 stolcke * issue stdin/stdout warning only once * * Revision 1.14 1999/10/13 09:07:13 stolcke * make filename checking functions public * * Revision 1.13 1997/06/07 15:58:47 stolcke * fixed some gcc warnings * * Revision 1.13 1997/06/07 15:56:24 stolcke * fixed some gcc warnings * * Revision 1.12 1997/01/23 20:38:35 stolcke * *** empty log message *** * * Revision 1.11 1997/01/23 20:02:59 stolcke * handle SIGPIPE termination * * Revision 1.10 1997/01/22 07:52:08 stolcke * warn about multiple uses of - * * Revision 1.9 1996/11/30 21:08:59 stolcke * use exec in compress commands * * Revision 1.8 1995/07/19 16:51:31 stolcke * remove PATH assignment to account for local setup * * Revision 1.7 1995/06/22 20:47:16 stolcke * dup stdio descriptors so fclose won't disturb them * * Revision 1.6 1995/06/22 20:44:39 stolcke * return more error info * * Revision 1.5 1995/06/22 19:58:11 stolcke * ansi-fied * * Revision 1.4 1995/06/12 22:57:12 tmk * Added 
ifdef around the redefinitions of fopen() and fclose(). * */ /******************************************************************* Copyright 1994,1997 SRI International. All rights reserved. This is an unpublished work of SRI International and is not to be used or disclosed except as provided in a license agreement or nondisclosure agreement with SRI International. ********************************************************************/ #include #include #ifndef _MSC_VER #include #include #endif #include #include #include #include #include #ifndef MAXPATHLEN #define MAXPATHLEN 1024 #endif #include "zio.h" #ifdef ZIO_HACK #undef fopen #undef fclose #endif #define STDIO_NAME "-" #define STD_PATH ":" /* "PATH=/usr/bin:/usr/ucb:/usr/bsd:/usr/local/bin" */ #define COMPRESS_CMD "exec compress -c" #define UNCOMPRESS_CMD "exec uncompress -c" #define GZIP_CMD "exec gzip -c" #define GUNZIP_CMD "exec gunzip -cf" #define BZIP2_CMD "exec bzip2" #define BUNZIP2_CMD "exec bunzip2 -c" #define SEVENZIP_CMD "exec 7z a -si" #define SEVENUNZIP_CMD "exec 7z x -so" /* * Does the filename refer to stdin/stdout ? */ int stdio_filename_p (const char *name) { return (strcmp(name, STDIO_NAME) == 0); } /* * Does the filename refer to a compressed file ? */ int compressed_filename_p (const char *name) { unsigned len = strlen(name); return (sizeof(COMPRESS_SUFFIX) > 1) && (len > sizeof(COMPRESS_SUFFIX)-1) && (strcmp(name + len - (sizeof(COMPRESS_SUFFIX)-1), COMPRESS_SUFFIX) == 0); } /* * Does the filename refer to a gzipped file ? */ int gzipped_filename_p (const char *name) { unsigned len = strlen(name); return (sizeof(GZIP_SUFFIX) > 1) && (len > sizeof(GZIP_SUFFIX)-1) && (strcmp(name + len - (sizeof(GZIP_SUFFIX)-1), GZIP_SUFFIX) == 0) || (sizeof(OLD_GZIP_SUFFIX) > 1) && (len > sizeof(OLD_GZIP_SUFFIX)-1) && (strcmp(name + len - (sizeof(OLD_GZIP_SUFFIX)-1), OLD_GZIP_SUFFIX) == 0); } /* * Does the filename refer to a bzipped file ? 
*/ int bzipped_filename_p (const char *name) { unsigned len = strlen(name); return (sizeof(BZIP2_SUFFIX) > 1) && (len > sizeof(BZIP2_SUFFIX)-1) && (strcmp(name + len - (sizeof(BZIP2_SUFFIX)-1), BZIP2_SUFFIX) == 0); } /* * Does the filename refer to a 7-zip file ? */ int sevenzipped_filename_p (const char *name) { unsigned len = strlen(name); return (sizeof(SEVENZIP_SUFFIX) > 1) && (len > sizeof(SEVENZIP_SUFFIX)-1) && (strcmp(name + len - (sizeof(SEVENZIP_SUFFIX)-1), SEVENZIP_SUFFIX) == 0); } /* * Check file readability */ static int readable_p (const char *name) { int fd = open(name, O_RDONLY); if (fd < 0) return 0; else { close(fd); return 1; } } /* * Check file writability */ static int writable_p (const char *name) { int fd = open(name, O_WRONLY|O_CREAT, 0666); if (fd < 0) return 0; else { close(fd); return 1; } } /* * Open a stdio stream, handling special filenames */ FILE *zopen(const char *name, const char *mode) { char command[MAXPATHLEN + 100]; if (stdio_filename_p(name)) { /* * Return stream to stdin or stdout */ if (*mode == 'r') { static int stdin_used = 0; static int stdin_warning = 0; int fd; if (stdin_used) { if (!stdin_warning) { fprintf(stderr, "warning: '-' used multiple times for input\n"); stdin_warning = 1; } } else { stdin_used = 1; } fd = dup(0); return fd < 0 ? NULL : fdopen(fd, mode); } else if (*mode == 'w' || *mode == 'a') { static int stdout_used = 0; static int stdout_warning = 0; int fd; if (stdout_used) { if (!stdout_warning) { fprintf(stderr, "warning: '-' used multiple times for output\n"); stdout_warning = 1; } } else { stdout_used = 1; } fd = dup(1); return fd < 0 ? 
NULL : fdopen(fd, mode); } else { return NULL; } } else { char *compress_cmd = NULL; char *uncompress_cmd = NULL; int zip_to_stdout = 1; if (compressed_filename_p(name)) { compress_cmd = COMPRESS_CMD; uncompress_cmd = UNCOMPRESS_CMD; } else if (gzipped_filename_p(name)) { compress_cmd = GZIP_CMD; uncompress_cmd = GUNZIP_CMD; } else if (bzipped_filename_p(name)) { compress_cmd = BZIP2_CMD; uncompress_cmd = BUNZIP2_CMD; } else if (sevenzipped_filename_p(name)) { compress_cmd = SEVENZIP_CMD; uncompress_cmd = SEVENUNZIP_CMD; zip_to_stdout = 0; } if (compress_cmd != NULL) { #ifdef NO_ZIO fprintf(stderr, "Sorry, compressed I/O not available on this machine\n"); errno = EINVAL; return NULL; #else /* !NO_ZIO */ /* * Return stream to compress pipe */ if (*mode == 'r') { if (!readable_p(name)) return NULL; sprintf(command, "%s;%s %s", STD_PATH, uncompress_cmd, name); return popen(command, mode); } else if (*mode == 'w') { if (!writable_p(name)) return NULL; if (zip_to_stdout) { sprintf(command, "%s;%s >%s", STD_PATH, compress_cmd, name); } else { /* * This is necessary because the compression program might * complain if a zero-length file already exists. * However, it means that existing file owner & permission * attributes are not preserved. */ unlink(name); sprintf(command, "%s;%s %s", STD_PATH, compress_cmd, name); } return popen(command, mode); } else { return NULL; } #endif /* !NO_ZIO */ } else { return fopen(name, mode); } } } /* * Close a stream created by zopen() */ int zclose(FILE *stream) { #ifdef NO_ZIO return fclose(stream); #else /* !NO_ZIO */ int status; struct stat statb; /* * pclose(), according to the man page, should diagnose streams not * created by popen() and return -1. however, on SGIs, it core dumps * in that case. So we better be careful and try to figure out * what type of stream it is. */ if (fstat(fileno(stream), &statb) < 0) return -1; /* * First try pclose(). 
     * It will tell us if stream is not a pipe
     */
    if ((statb.st_mode & S_IFMT) != S_IFIFO ||
        fileno(stream) == 0 || fileno(stream) == 1)
    {
        return fclose(stream);
    } else {
        status = pclose(stream);
        if (status == -1) {
            /*
             * stream was not created by popen(), but pclose() does fclose
             * for us in this case.
             */
            return ferror(stream);
        } else if (status == SIGPIPE) {
            /*
             * It's normal for the uncompressor to terminate by SIGPIPE,
             * i.e., if the user program closed the file before reaching
             * EOF.
             */
            return 0;
        } else {
            /*
             * The compressor program terminated with an error, and supposedly
             * has printed a message to stderr.
             * Set errno to a generic error code if it hasn't been set already.
             */
            if (errno == 0) {
                errno = EIO;
            }
            return status;
        }
    }
#endif /* NO_ZIO */
}

#ifdef STAND
int
main (int argc, char **argv)
{
    int dowrite = 0;
    char buffer[BUFSIZ];
    int nread;
    FILE *stream;

    if (argc < 3) {
        printf("usage: %s file {r|w}\n", argv[0]);
        exit(2);
    }

    if (*argv[2] == 'r') {
        /* read mode: copy the (possibly compressed) file to stdout */
        stream = zopen(argv[1], argv[2]);

        if (!stream) {
            perror(argv[1]);
            exit(1);
        }

        while (!ferror(stream) && !feof(stream) && !ferror(stdout)) {
            nread = fread(buffer, 1, sizeof(buffer), stream);
            (void)fwrite(buffer, 1, nread, stdout);
        }
    } else {
        /* write mode: copy stdin to the (possibly compressed) file */
        stream = zopen(argv[1], argv[2]);

        if (!stream) {
            perror(argv[1]);
            exit(1);
        }

        while (!ferror(stdin) && !feof(stdin) && !ferror(stream)) {
            nread = fread(buffer, 1, sizeof(buffer), stdin);
            (void)fwrite(buffer, 1, nread, stream);
        }
    }

    if (ferror(stdin)) {
        perror("stdin");
    } else if (ferror(stdout)) {
        perror("stdout");
    } else if (ferror(stream)) {
        perror(argv[1]);
    }
    zclose(stream);
    return 0;
}
#endif /* STAND */

From barabbas at gmail.com Sun Nov 11 09:12:52 2007
From: barabbas at gmail.com (Tian-Jian "Barabbas" Jiang@Gmail)
Date: Mon, 12 Nov 2007 01:12:52 +0800
Subject: 7z as a much better archiver than gz/bz2
In-Reply-To: <200711111627.lABGRTE29698@huge>
References: <200711111627.lABGRTE29698@huge>
Message-ID: <47373814.3000507@gmail.com>

Hi all,

Andreas Stolcke wrote:
> In message
<244d59a50711101259v60acb8e8tf3743520f2d92aa6 at mail.gmail.com> you wrote:
>
> I actually think that Alexy's message is relevant to this list, since
> managing large LMs is a nontrivial problem.
>

Sorry for going off-topic, but I would like to say that a new interface to read/write data with SQLite might be a good alternative.

/Mike/

From dyuret at ku.edu.tr Sat Nov 17 14:43:20 2007
From: dyuret at ku.edu.tr (Deniz Yuret)
Date: Sun, 18 Nov 2007 00:43:20 +0200
Subject: OOV calculations
In-Reply-To: <200711011527.lA1FRcS18102@huge>
References: <200711011527.lA1FRcS18102@huge>
Message-ID:

Hi,

I had some interesting observations while trying to build a letter based model. My text file contains a word on each line with letters separated by spaces.

1. kndiscount gives an error for this data file even though ukndiscount seems to work. Is this a bug?

    ngram-count -kndiscount -order 3 -lm foo3.lm -text turkish.train.oov.split
    one of modified KneserNey discounts is negative
    error in discount estimator for order 1

2. ukndiscount accepts the -interpolate option and in fact does better with it. According to the documentation only wbdiscount, cdiscount, and kndiscount are supposed to work with interpolate. I checked the output with -debug 3 and all probabilities seem to add up to 1. Is the documentation out of date?

3. Training with -order k and then testing with -order n does not give the same results as training with -order n and testing with -order n. Is this normal? Which discounting methods should give equal results?

deniz

On Nov 1, 2007 5:27 PM, Andreas Stolcke wrote:
>
> In message you wrote:
> > Thank you.
> >
> > > You cannot compare LMs with different OOV counts. You need to create a
> > > model that assigns a nonzero probability to every event. E.g., you
> > > could have a letter-probability model for OOVS.
> > > > As for your suggestion of creating a letter-probability model for OOVs > > (and maybe interpolating it with the ngram model), are there any > > tools/documentation in the srilm package that could be helpful? If > > not I think we can (1) go into the source code and figure out how to > > create a new letter-probability LM, or (2) create an independent > > letter-probability LM outside srilm and manually interpolate its > > results with the -debug 2 output of ngram. > > > > I am assuming here (maybe contrary to your suggestion) that we can > > create a model that assigns a nonzero probability to every event by > > interpolating a regular ngram model (with OOVs > 0) and a > > letter-probability model. > > Actually, I wasn't thinking of covering all words with a letter > probability model (which would be poor for non-OOV words) and > interpolating. A more typical approach is to have a word LM with an > OOV token, and when you are inside the OOV you assign a probability to > the specific word by a letter LM. so the total probability of > > p(a b c) where "b" is an OOV would be > > > p(a | ...) p(OOV | a) p(b| OOV) p(c | a OOV) and > > p(b|OOV) is given by a totally separate LM that operates in terms of letters. > > Obviously this isn't implemented in SRILM at this point, but you can compute > total probabilities, perplexities, etc. by first running the word LM, then > the letter LM just on the OOVs in your test set, and adding the log > probabilities. > > Andreas > From stolcke at speech.sri.com Sun Nov 18 08:29:04 2007 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Sun, 18 Nov 2007 08:29:04 PST Subject: Entropy Pruning In-Reply-To: Your message of Sun, 18 Nov 2007 01:37:37 +0900. <473F18D1.5050201@cslab.kecl.ntt.co.jp> Message-ID: <200711181629.lAIGT4D23492@huge> In message <473F18D1.5050201 at cslab.kecl.ntt.co.jp>you wrote: > Dear Dr. Stolcke, > > Hello, I'm Daichi Mochihashi, a researcher in NTT Communication Science > Labs, Japan. 
> Until recently, I was involved in the language modeling team
> at ATR Spoken Language Communication Research Laboratories, perhaps you may know.
>
> Lately, I developed a novel pruning method for variable-order ngrams
> and want to compare it with your entropy pruning as the Gold standard.
> However, in spite of the description in the SRILM paper, I found that
> the entropy pruning method is not implemented but replaced by
> a heuristic algorithm in the current SRILM distribution.

That is not correct. The exact algorithm described in the paper is implemented in Ngram::pruneProbs() in NgramLM.cc. It is activated by the ngram-count and ngram -prune options.

>
> Is there any previous version of SRILM that supports entropy pruning?
> or could you kindly send me a version of VarNgram.cc or any code
> that you have used in the experiment?

VarNgram.cc was a research effort that performs pruning during the estimation step to eliminate redundant N-grams from the start, using a Hoeffding bound criterion, which happens not to work very well. Note that the standard Ngram class supports "variable" N-gram models already, since any mix of N-grams of different lengths is allowed.

So stick to the Ngram class, and do not use the ngram-count -varprune option, which triggers the use of the VarNgram class. I'm sorry that the naming of the classes has been confusing.

Andreas

From stolcke at speech.sri.com Sun Nov 18 21:25:54 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 18 Nov 2007 21:25:54 PST
Subject: OOV calculations
In-Reply-To: Your message of Sun, 18 Nov 2007 00:43:20 +0200.
Message-ID: <200711190525.lAJ5PsM15005@huge>

In message you wrote:
> Hi,
>
> I had some interesting observations while trying to build a letter
> based model. My text file contains a word on each line with letters
> separated by spaces.
>
> 1. kndiscount gives an error for this data file even though
> ukndiscount seems to work. Is this a bug?
>
> ngram-count -kndiscount -order 3 -lm foo3.lm -text turkish.train.oov.split
> one of modified KneserNey discounts is negative
> error in discount estimator for order 1

This is quite possible as the formulae used by the two methods differ. Also, the count-of-count statistics used may be quite atypical given that the number of distinct unigram types is limited and small for letter-based models.

>
> 2. ukndiscount accepts the -interpolate option and in fact does better
> with it. According to the documentation only wbdiscount, cdiscount,
> and kndiscount are supposed to work with interpolate. I checked the
> output with -debug 3 and all probabilities seem to add up to 1. Is
> the documentation out of date?

ukndiscount was added later, and the man page was not fully updated, it seems. ukndiscount is certainly supposed to support -interpolate.

> 3. Training with -order k and then testing with -order n does not give
> the same results as training with -order n and testing with -order n.
> Is this normal? Which discounting methods should give equal results?

This is a known (and desired) feature of KN (original and modified) discounting. KN treats the highest-order N-grams differently from the lower-order ones, and the lower-order N-grams are not supposed to be used by themselves. The reason is that the lower-order estimates are specifically chosen to work well as backoffs, not as standalone estimates.

Andreas

From stolcke at speech.sri.com Tue Nov 27 15:14:52 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 27 Nov 2007 15:14:52 PST
Subject: SRILM and LM network servers
Message-ID: <200711272314.lARNErT07252@huge>

FYI, there is now much enhanced support for network-based "LM servers" in SRILM. The main changes are:

* New ngram -use-server option to run the client side of a network LM server as implemented by ngram -server-port. Optionally, probabilities may be cached in the client (option -cache-served-ngrams).
* New LMClient class to implement the above (a stub LM subclass that queries a server for LM probabilities).
* ngram -server-port now behaves like a true server daemon: it handles multiple simultaneous or sequential clients, and never exits (unless killed). The number of simultaneous clients may be limited with the -server-maxclients option.

This is still somewhat experimental, so I welcome any feedback. If you want to give it a try, download the 1.5.6 (beta) version from the SRILM download page. An example and test of the functionality is in $SRILM/test/tests/ngram-server.

Andreas

From schwa717 at umn.edu Wed Nov 28 15:15:00 2007
From: schwa717 at umn.edu (Lane Schwartz)
Date: Wed, 28 Nov 2007 17:15:00 -0600
Subject: Using and understanding LM file (with modified Kneser-Ney smoothing)
Message-ID:

Hi,

I'm working on some machine translation code in which I'd like to incorporate a language model. I'm trying to replicate the system described in David Chiang's 2005 ACL paper; in that paper, his language model is a trigram model which uses modified Kneser-Ney smoothing.

My goal is to train the LM using the SRILM toolkit, then use the generated LM file in my own code.

I've looked over Chen & Goodman (1998), and I think I understand the ideas, but I'm having some trouble understanding how to make sense of the numbers in the LM file (produced by ngram-count).

Any help would be greatly appreciated.

My training corpus is the first 10000 words of the English side of the de-en Europarl training corpus (http://www.cs.umn.edu/research/nlp/mt/wmt06/europarl.de-en.en.gz), which I have lowercased and converted to UTF-8.
Again, my goal is a trigram language model which uses modified Kneser-Ney smoothing, and I want to use interpolation - here's what I did to get the LM file:

    $ zcat europarl.de-en.en.gz | head -n 10000 | ngram-count -text - -order 3 -kndiscount -interpolate -lm sample.srilm

Since I'm trying to understand how to apply the ngram probabilities and backoff-weights, I'm testing using a very simple test phrase:

    echo "the man in" > sample.txt

Here are the (I think) relevant lines from the LM file:

    unigrams:
    -2.987062   </s>
    -99         <s>      -1.142606
    -1.73375    in       -0.660575
    -3.960678   man      -0.1932579
    -1.781734   the      -0.5241315

    bigrams:
    -0.8540089  <s> the  -0.3293318
    -1.516293   man in
    -3.496579   the man  -0.09554159

    trigrams:
    -0.6538057  the man in

I then ran the ngram tool to see what it does with this phrase:

    $ ngram -lm sample.srilm -ppl sample.txt -debug 3
    reading 10209 1-grams
    reading 78195 2-grams
    reading 20317 3-grams
    the man in
        p( the | <s> )      = [2gram] 0.139956 [ -0.854009 ] / 1
        p( man | the ...)   = [2gram] 0.00014931 [ -3.82591 ] / 1
        p( in | man ...)    = [3gram] 0.221919 [ -0.653806 ] / 1
        p( </s> | in ...)   = [1gram] 0.000225094 [ -3.64764 ] / 1
    1 sentences, 3 words, 0 OOVs
    0 zeroprobs, logprob= -8.98136 ppl= 175.93 ppl1= 985.797

    file sample.txt: 1 sentences, 3 words, 0 OOVs
    0 zeroprobs, logprob= -8.98136 ppl= 175.93 ppl1= 985.797

I'd like to make sense of the above numbers.

The first line, p( the | <s> ), makes sense, since the bigram log prob for "<s> the" in sample.srilm is -0.8540089.

I'm getting stuck figuring out where -3.82591 comes from in p( man | the ...). It seems that the formula should be:

    interpolated P( man | the ) = lambda_man*P(man) + (1 - lambda_man)*(lambda_man|the * p(man|the))

If the weights listed above are the lambdas in the above equation, that gives us the following (converting from log domain to regular domain as we go):

    lambda_man     = 10**(-0.1932579)
    P(man)         = 10**(-3.960678)
    lambda_man|the = 10**(-0.09554159)
    P(man|the)     = 10**(-3.496579)

So my interpolated P( man | the ) calculation gives 0.000162027.
The ngram util gave 0.00014931.

If anyone could help point out where I'm screwing up, it would be very much appreciated. Am I running with the appropriate parameters to ngram-count and ngram, given that I want an interpolated LM with modified Kneser-Ney smoothing (as used by Chiang(2005))? Does my equation above look right? I know this is a long email - thanks for your time and thoughts.

Thanks,
Lane Schwartz
University of Minnesota

From stolcke at speech.sri.com Fri Nov 30 19:48:59 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 30 Nov 2007 19:48:59 -0800
Subject: Using and understanding LM file (with modified Kneser-Ney smoothing)
In-Reply-To:
References:
Message-ID: <4750D9AB.4050900@speech.sri.com>

Lane,

there is a key misunderstanding here. The interpolation of higher- and lower-order probability estimates (triggered by ngram-count -interpolate) happens at training time, and the final probability estimates are then stored in the LM file. Hence, no interpolation is required at test time. In fact, all LMs in ARPA backoff format are handled exactly the same in testing. The different smoothing methods only come in during training.

I hope this answers your question.

Andreas

Lane Schwartz wrote:
[...]

From dyuret at ku.edu.tr Mon Dec 3 02:38:29 2007
From: dyuret at ku.edu.tr (Deniz Yuret)
Date: Mon, 3 Dec 2007 12:38:29 +0200
Subject: Understanding lm-files and discounting
Message-ID:

I spent last weekend trying to figure out the discrepancies between the SRILM kn-discounting implementations and my earlier implementations. Basically I am trying to go from the text file to the count file to the model file to the probabilities assigned to the words in the test file. This took me on a journey from man pages to debug outputs to the source code. I figured a lot of it out but it turned out to be nontrivial to go from paper descriptions to the numbers in the ARPA ngram format to the final probability calculations. If you help me with a couple of things I promise I'll write a man page detailing all discounting calculations in SRILM.

1. Sometimes the model seems to use smaller ngrams even when longer ones are in the training file.
An example from a letter model:

E i s e n h o w e r
p( E | <s> ) = [2gram] 0.0122983 [ -1.91016 ] / 1
p( i | E ...) = [3gram] 0.0143471 [ -1.84324 ] / 1
p( s | i ...) = [4gram] 0.308413 [ -0.510867 ] / 1
p( e | s ...) = [5gram] 0.412852 [ -0.384206 ] / 1
p( n | e ...) = [6gram] 0.759049 [ -0.11973 ] / 1
p( h | n ...) = [7gram] 0.397406 [ -0.400766 ] / 1
p( o | h ...) = [4gram] 0.212227 [ -0.6732 ] / 1
p( w | o ...) = [3gram] 0.0199764 [ -1.69948 ] / 1
p( e | w ...) = [4gram] 0.165049 [ -0.782387 ] / 1
p( r | e ...) = [4gram] 0.222122 [ -0.653408 ] / 1
p( </s> | r ...) = [5gram] 0.492478 [ -0.307613 ] / 1
1 sentences, 10 words, 0 OOVs
0 zeroprobs, logprob= -9.28505 ppl= 6.98386 ppl1= 8.48213

This is an -order 7 model and the training file does have the word
Eisenhower. So I don't understand why it goes back to using lower
order ngrams after the letter 'h'.

2. Not all (n-1)-grams have backoff weights in the model file, why?

3. What exactly does srilm do with google ngrams? Can you give an
example usage? Does it do things like extract a small subset useful
for evaluating a test file?

4. Since google-ngrams have all ngrams below count=40 missing, the kn
discount constants that rely on the number of ngrams with low counts
will fail. Also I found that empirically the best highest order
discount constant is close to 40, not in the [0,1] range. How does
srilm handle this?

5. Do I need to understand what the following messages mean to
understand the calculations:

warning: 7.65818e-10 backoff probability mass left for "" -- incrementing denominator
warning: distributing 0.000254455 left-over probability mass over all 124 words
discarded 254764 7-gram probs discounted to zero
inserted 2766 redundant 3-gram probs

best,
deniz

From stolcke at speech.sri.com Mon Dec 3 22:07:11 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 03 Dec 2007 22:07:11 PST
Subject: Understanding lm-files and discounting
In-Reply-To: Your message of Mon, 03 Dec 2007 12:38:29 +0200.
Message-ID: <200712040607.lB467Bt14601@huge>

In message you wrote:
> I spent last weekend trying to figure out the discrepancies between the
> SRILM kn-discounting implementations and my earlier implementations.
> Basically I am trying to go from the text file to the count file to
> the model file to the probabilities assigned to the words in the test
> file. This took me on a journey from man pages to debug outputs to the
> source code. I figured a lot of it out but it turned out to be
> nontrivial to go from paper descriptions to the numbers in the ARPA
> ngram format to the final probability calculations. If you help me
> with a couple of things I promise I'll write a man page detailing all
> discounting calculations in SRILM.

A tutorial or FAQ including the information below would be most useful!

>
> 1. Sometimes the model seems to use smaller ngrams even when longer
> ones are in the training file. An example from a letter model:
>
> E i s e n h o w e r
> p( E | <s> ) = [2gram] 0.0122983 [ -1.91016 ] / 1
> p( i | E ...) = [3gram] 0.0143471 [ -1.84324 ] / 1
> p( s | i ...) = [4gram] 0.308413 [ -0.510867 ] / 1
> p( e | s ...) = [5gram] 0.412852 [ -0.384206 ] / 1
> p( n | e ...) = [6gram] 0.759049 [ -0.11973 ] / 1
> p( h | n ...) = [7gram] 0.397406 [ -0.400766 ] / 1
> p( o | h ...) = [4gram] 0.212227 [ -0.6732 ] / 1
> p( w | o ...) = [3gram] 0.0199764 [ -1.69948 ] / 1
> p( e | w ...) = [4gram] 0.165049 [ -0.782387 ] / 1
> p( r | e ...) = [4gram] 0.222122 [ -0.653408 ] / 1
> p( </s> | r ...) = [5gram] 0.492478 [ -0.307613 ] / 1
> 1 sentences, 10 words, 0 OOVs
> 0 zeroprobs, logprob= -9.28505 ppl= 6.98386 ppl1= 8.48213
>
> This is an -order 7 model and the training file does have the word
> Eisenhower. So I don't understand why it goes back to using lower
> order ngrams after the letter 'h'.

This is because the default "mincount" for N-grams longer than 2 words is 2,
meaning that a trigram, 4gram, etc. has to occur at least twice to be included
in the LM.
You can change this with the options

	-gt3min 1
	-gt4min 1
	etc.

>
> 2. Not all (n-1)-grams have backoff weights in the model file, why?

Backoff weights are only recorded for N-grams that appear as the prefix
of a longer N-gram. For all others the backoff weight is implicitly 1
(or 0, in log representation). This convention saves a lot of space.

>
> 3. What exactly does srilm do with google ngrams? Can you give an
> example usage? Does it do things like extract a small subset useful
> for evaluating a test file?

Google n-grams are not an LM format; they are a way to store N-gram counts
on disk, and the classes that implement N-gram counts know how to read them.
This is exercised by the ngram-count -read-google option.
However, due to their typical size it is not advisable to try to build
backoff LMs of the standard sort, which would require reading all N-grams
into memory (someone working at Google might actually be able to do this
if their hardware budget is as phenomenal as it must be).

Instead, I recommend estimating a deleted-interpolation-smoothed
"count LM", i.e., an LM that consists of only a small number of
interpolation weights (for smoothing) as well as the raw N-gram counts
themselves. This way we can in fact load into memory only the portion of
the counts that impinges on a given test set (triggered by the
ngram -limit-vocab option).

There is no full example of this, but it is basically what you see in
$SRILM/test/tests/ngram-count-lm-limit-vocab . The only change would be
that instead of a countlm file with the keyword "counts" you would
use the keyword "google-counts" followed by the path to the google count
directory root. Read the man page sections for ngram-count -count-lm and
ngram -count-lm for more information, and follow the example under the test
directory.

>
> 4. Since google-ngrams have all ngrams below count=40 missing, the kn
> discount constants that rely on the number of ngrams with low counts
> will fail.
> Also I found that empirically the best highest order
> discount constant is close to 40, not in the [0,1] range. How does
> srilm handle this?

The deleted interpolation method of smoothing I am recommending above does
not have a problem with the missing ngrams.

There is also a way to extrapolate from the available counts-of-counts above
some threshold to those below the threshold, due to an empirical law that
we found to hold for a range of corpora. For details see the paper

W. Wang, A. Stolcke, & J. Zheng (2007), Reranking Machine Translation
Hypotheses With Structured and Web-based Language Models. To appear in
Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto.
http://www.speech.sri.com/cgi-bin/run-distill?papers/asru2007-mt-lm.ps.gz

The extrapolation method is implemented in the script
$SRILM/utils/src/make-kn-discounts.gawk and is automatically invoked if you use
make-big-lm to build your LM. Again, it is not feasible to do this on
the ngrams distributed by Google.

> 5. Do I need to understand what the following messages mean to
> understand the calculations:

Not really, they are for information only.

> warning: 7.65818e-10 backoff probability mass left for "" -- incrementing denominator

This means your unigram probabilities even after discounting sum to (almost) 1.
As a crude fallback, the denominator in the estimator is incremented to yield
usable backoff probability mass.

> warning: distributing 0.000254455 left-over probability mass over all 124 words

Here the backoff mass is 0.000254455 and is spread out over the 124 words that
don't have any observed occurrences.

> discarded 254764 7-gram probs discounted to zero

Due to discounting cutoff (mincounts, see above) some 7-grams were not
included in the model.

> inserted 2766 redundant 3-gram probs

The ARPA format requires all prefixes of ngrams with probabilities to
also have probabilities.
E.g., if "a b c" is in the model, so must "a b",
even if "a b" was not in the input ngram counts. In such cases SRILM will
insert the "a b" probability but make it equal to what the backoff computation
would yield.

Andreas

From wangc at csail.mit.edu Tue Dec 4 04:49:13 2007
From: wangc at csail.mit.edu (Chao Wang)
Date: Tue, 04 Dec 2007 07:49:13 -0500
Subject: unsubscribe
Message-ID: <20071204074913.smcpvncpqe8k8k0g@imap.csail.mit.edu>

unsubscribe

From dyuret at ku.edu.tr Mon Dec 10 07:41:16 2007
From: dyuret at ku.edu.tr (Deniz Yuret)
Date: Mon, 10 Dec 2007 17:41:16 +0200
Subject: Understanding lm-files and discounting
In-Reply-To: <200712040607.lB467Bt14601@huge>
References: <200712040607.lB467Bt14601@huge>
Message-ID:

Working on that documentation as promised. Small question about the
mincounts: I was able to verify what you said with the default (gt)
discount, but with kn or ukndiscount some long ngrams with cnt=1 are
included in the model. Since the counts are modified I thought maybe
it is looking at unmodified counts, but then there are some ngrams
excluded with regular count > 1 and kncount = 1. So I couldn't quite
figure out what subset is included in the model with kndiscounting.

deniz
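
The "backoff computation" referred to in the reply above can be sketched in a few lines of Python. This is only an illustration of the standard ARPA backoff recursion, not SRILM's code, and the name backoff_logprob is made up for this example. The numbers are the log10 values from the LM excerpt quoted earlier in the thread, on the assumption that the two tokens missing from that excerpt are <s> and </s> and that any n-gram not listed there is absent from the model.

```python
# Log10 probabilities from the quoted LM excerpt (assumed tokens: <s>, </s>).
LOGPROB = {
    ("</s>",): -2.987062,
    ("<s>",): -99.0,
    ("in",): -1.73375,
    ("man",): -3.960678,
    ("the",): -1.781734,
    ("<s>", "the"): -0.8540089,
    ("man", "in"): -1.516293,
    ("the", "man"): -3.496579,
    ("the", "man", "in"): -0.6538057,
}

# Log10 backoff weights from the same excerpt; entries without a weight
# are simply missing here, which means log10(1) = 0.
BACKOFF = {
    ("<s>",): -1.142606,
    ("in",): -0.660575,
    ("man",): -0.1932579,
    ("the",): -0.5241315,
    ("<s>", "the"): -0.3293318,
    ("the", "man"): -0.09554159,
}


def backoff_logprob(word, context):
    """Return log10 p(word | context) using the ARPA backoff recursion."""
    ngram = context + (word,)
    if ngram in LOGPROB:
        # The n-gram is in the model: use its stored probability directly.
        return LOGPROB[ngram]
    if not context:
        # Not even a unigram entry: treat as out of vocabulary.
        raise KeyError(word)
    # Otherwise add the context's backoff weight (0 if none is recorded)
    # and retry with the context shortened by its oldest word.
    return BACKOFF.get(context, 0.0) + backoff_logprob(word, context[1:])


sentence = ["<s>", "the", "man", "in", "</s>"]
total = 0.0
for i in range(1, len(sentence)):
    context = tuple(sentence[max(0, i - 2):i])  # trigram model: 2 words of history
    lp = backoff_logprob(sentence[i], context)
    total += lp
    print(sentence[i], round(lp, 5))
print("logprob", round(total, 5))
```

Under those assumptions the sketch reproduces the numbers in the quoted ngram output: -3.82591 for "man" is the backoff weight of "<s> the" (-0.3293318) plus the log probability of "the man" (-3.496579), and -3.64764 for the sentence end is the backoff weight of "in" (-0.660575) plus the unigram log probability of </s> (-2.987062).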
From stolcke at speech.sri.com Mon Dec 10 13:51:51 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 11 Dec 2007 06:51:51 +0900
Subject: Understanding lm-files and discounting
In-Reply-To:
References: <200712040607.lB467Bt14601@huge>
Message-ID: <475DB4F7.2000509@speech.sri.com>

Deniz Yuret wrote:
> Working on that documentation as promised. Small question about the
> mincounts: I was able to verify what you said with the default (gt)
> discount, but with kn or ukndiscount some long ngrams with cnt=1 are
> included in the model. Since the counts are modified I thought maybe
> it is looking at unmodified counts, but then there are some ngrams
> excluded with regular count > 1 and kncount = 1. So I couldn't quite
> figure out what subset is included in the model with kndiscounting.
>

I think what you're seeing can be explained by the following two facts:

1 - with KN discounting the mincounts are indeed applied to the modified
lower-order counts.

2 - However (and this is true for all smoothing methods), if an ngram
"a b c d" is included in the model based on its counts, then all
prefixes of that ngram also need to be included (otherwise you'd have an
empty first column in the lm file at those prefix ngrams, which is
illegal).

So, if mincount is 2 for 4grams and 3grams, and a b c d occurs twice,
but (after count modification) a b c occurs only once, then a b c would
still be included in the LM.

See if the above is in agreement with your observations.

Andreas

From stolcke at speech.sri.com Wed Dec 19 13:00:28 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 19 Dec 2007 13:00:28 PST
Subject: SRILM FAQ online
Message-ID: <200712192100.lBJL0Sk17257@huge>

A first cut at a Frequently Asked Question document for SRILM is now
available at

http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.html

This is very much work in progress. I would especially appreciate it
if people sent me contributions to cover additional topics.

Enjoy,

Andreas
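
The inclusion rule described in the reply on mincounts earlier in this thread (apply the mincount for each order, then add every prefix of a retained n-gram) can be sketched as follows. This is an illustration only, not SRILM's implementation: select_ngrams and the toy counts are invented for the example, and the counts are assumed to be the ones the mincounts apply to (for KN discounting, the modified lower-order counts).

```python
def select_ngrams(counts, mincount):
    """Pick the n-grams that end up in the model.

    counts   -- dict mapping n-gram tuples to their (possibly modified) counts
    mincount -- dict mapping n-gram order to its minimum count (default 1)
    """
    # Step 1: keep every n-gram whose count reaches the mincount of its order.
    kept = {ng for ng, c in counts.items() if c >= mincount.get(len(ng), 1)}
    # Step 2: add all proper prefixes of the kept n-grams, whatever their counts.
    for ng in list(kept):
        for k in range(1, len(ng)):
            kept.add(ng[:k])
    return kept


# Toy counts mirroring the example in the reply: "a b c d" occurs twice,
# "a b c" only once (after count modification).
counts = {
    ("a",): 5,
    ("a", "b"): 3,
    ("a", "b", "c"): 1,
    ("a", "b", "c", "d"): 2,
    ("b", "c", "d"): 1,
}
kept = select_ngrams(counts, {3: 2, 4: 2})
print(sorted(kept, key=len))
```

With a mincount of 2 for 3-grams and 4-grams, "a b c" fails the cutoff on its own but is kept as a prefix of "a b c d", while "b c d" is dropped because nothing longer depends on it.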