From gelbart at icsi.berkeley.edu  Mon Oct  8 18:32:33 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Mon, 8 Oct 2007 18:32:33 -0700 (PDT)
Subject: SRILM and LC_ALL
In-Reply-To: <Pine.LNX.4.63.0709141557290.6942@lamb.ICSI.Berkeley.EDU>
References: <Pine.LNX.4.63.0709101410060.31387@lamb.ICSI.Berkeley.EDU>
 <Pine.LNX.4.63.0709111813150.8294@lamb.ICSI.Berkeley.EDU>
 <46E81B83.7050603@speech.sri.com> <Pine.LNX.4.63.0709141536290.31434@lamb.ICSI.Berkeley.EDU>
 <Pine.LNX.4.63.0709141557290.6942@lamb.ICSI.Berkeley.EDU>
Message-ID: <Pine.LNX.4.63.0710081805540.12736@lamb.ICSI.Berkeley.EDU>

On July 19 2007, Andreas Stolcke wrote:
> David Brodbeck wrote:
> > I'm trying to build SRILM 1.5.2 on Redhat Enterprise Linux Server 5.
> > The machine type is i686_m64.  Everything builds all right, but 
> > the tests fail for make-ngram-pfsg, ngram-class, and
> > ngram-count-lm-limit-vocab.
> >
> > make-ngram-pfsg is the most obvious one, so I'll tackle that one
> > first.  I get the following in the stderr file:
> > gawk: /opt/srilm/bin/i686-m64/add-pauses-to-pfsg:22: fatal: Invalid
> > collation character: /[[:lower:]-?]/
>
> > Has anyone else run into this?  I'm using GNU Awk 3.1.5, and the
> > locale is set to en_US.UTF-8.
>
> This is odd since we're also using gawk 3.1.5 and I cannot replicate 
> the problem even when setting LANG to en_US.UTF-8. It seems that the 
> interpretation of gawk regular expressions should not depend on the 
> OS release version, but of course there may always be bugs.

Hi Andreas,

Are you sure you used gawk 3.1.5 when you tried to replicate this? 
The reason I ask is that at ICSI, the SRILM tools seem to invoke gawk 
3.1.3, not gawk 3.1.5:

$ head -1 `which add-pauses-to-pfsg`
#!/usr/bin/gawk -f
$ /usr/bin/gawk --version | head -1
GNU Awk 3.1.3
$ which gawk
/usr/local/bin/gawk
$ /usr/local/bin/gawk --version | head -1
GNU Awk 3.1.5
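
The same check can be wrapped in a small helper. This is only a sketch, assuming a POSIX shell; the function name shebang_interp is made up for illustration and is not part of SRILM:

```shell
# Print the interpreter named on a script's "#!" line (first word only).
shebang_interp() {
    head -n 1 "$1" | sed -n 's/^#![[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# Compare the interpreter a script asks for with the gawk found on PATH.
# Runs only if add-pauses-to-pfsg is actually installed.
script=$(command -v add-pauses-to-pfsg 2>/dev/null || true)
if [ -n "$script" ]; then
    echo "script uses: $(shebang_interp "$script")"
    echo "PATH gawk:   $(command -v gawk || echo 'not found')"
fi
```

If the two paths differ, the version reported by "gawk --version" on the command line is not necessarily the one the script runs under.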

My default locale is en_US.  With this locale, I do not see the error 
David Brodbeck did, even if I use gawk 3.1.5.  If I set 
LANG=en_US.UTF-8 and use gawk 3.1.5, then I see the error:

$ /usr/local/bin/gawk  -f `which add-pauses-to-pfsg`
gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22: 
fatal: Invalid collation character: /[[:lower:]-?]/

Setting LC_ALL=C as suggested in the SRILM INSTALL file does not solve 
the problem:

tmp$ export LC_ALL=C
tmp$ /usr/local/bin/gawk  -f `which add-pauses-to-pfsg`
gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22: 
fatal: Invalid collation character: /[[:lower:]-?]/

The compute-oov-rate script gives a similar error.

David Brodbeck, if you're reading this, did setting LC_ALL=C solve 
your problem with add-pauses-to-pfsg?  This was not clear to me from 
reading your July 23 email to Andreas.

Thanks,
David

From gelbart at icsi.berkeley.edu  Mon Oct  8 22:24:49 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Mon, 8 Oct 2007 22:24:49 -0700 (PDT)
Subject: SRILM and LC_ALL
In-Reply-To: <Pine.LNX.4.63.0710081805540.12736@lamb.ICSI.Berkeley.EDU>
References: <Pine.LNX.4.63.0709101410060.31387@lamb.ICSI.Berkeley.EDU>
 <Pine.LNX.4.63.0709111813150.8294@lamb.ICSI.Berkeley.EDU>
 <46E81B83.7050603@speech.sri.com> <Pine.LNX.4.63.0709141536290.31434@lamb.ICSI.Berkeley.EDU>
 <Pine.LNX.4.63.0709141557290.6942@lamb.ICSI.Berkeley.EDU>
 <Pine.LNX.4.63.0710081805540.12736@lamb.ICSI.Berkeley.EDU>
Message-ID: <Pine.LNX.4.63.0710082207370.17151@lamb.ICSI.Berkeley.EDU>


> My default locale is en_US.  With this locale, I do not see the error David 
> Brodbeck did, even if I use gawk 3.1.5.  If I set LANG=en_US.UTF-8 and use 
> gawk 3.1.5, then I see the error:
>
> $ /usr/local/bin/gawk  -f `which add-pauses-to-pfsg`
> gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22: fatal: 
> Invalid collation character: /[[:lower:]-?]/

A followup:

At home, I'm running gawk 3.1.15 under Fedora Core 3 and my default 
locale is en_US.UTF-8:

  $ locale
  LANG=en_US.UTF-8
  LC_CTYPE="en_US.UTF-8"
  LC_NUMERIC="en_US.UTF-8"
  LC_TIME="en_US.UTF-8"
  LC_COLLATE="en_US.UTF-8"
  LC_MONETARY="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_PAPER="en_US.UTF-8"
  LC_NAME="en_US.UTF-8"
  LC_ADDRESS="en_US.UTF-8"
  LC_TELEPHONE="en_US.UTF-8"
  LC_MEASUREMENT="en_US.UTF-8"
  LC_IDENTIFICATION="en_US.UTF-8"
  LC_ALL=

If I use the default locale, I get the "Invalid collation character" 
error.  If I set LANG=C, I get the same error.

If I set LC_ALL=en_US, that error goes away but the make-ngram-pfsg 
test fails with the message "make-ngram-pfsg: stdout output DIFFERS". 
I think this is because when LC_ALL is set it overrides the other LC_* 
variables (http://opengroup.org/onlinepubs/007908799/xbd/envvar.html). 
This means that the line in test/tests/make-ngram-pfsg/run-test which 
sets LC_COLLATE=C has no effect when LC_ALL is set.
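
The precedence described above can be sketched as a small shell function. This is an illustration only; the name effective_collate is hypothetical, and the fallback to "C" stands in for the implementation default:

```shell
# Resolve which locale governs collation, following the POSIX precedence:
# LC_ALL (if non-empty), then LC_COLLATE, then LANG, else "C".
effective_collate() {
    lc_all=${1:-}; lc_collate=${2:-}; lang=${3:-}
    echo "${lc_all:-${lc_collate:-${lang:-C}}}"
}

# Report the value for the current environment.
effective_collate "${LC_ALL:-}" "${LC_COLLATE:-}" "${LANG:-}"

# The situation above: LC_ALL=en_US wins over the test's LC_COLLATE=C.
effective_collate en_US C en_US.UTF-8   # prints en_US
# With LC_ALL unset, LC_COLLATE=C takes effect.
effective_collate "" C en_US            # prints C
```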

If I set LANG=en_US and leave LC_ALL unset, then the
"Invalid collation character" error goes away and the make-ngram-pfsg
test passes.

So it appears that the gawk locale tips in the SRILM INSTALL file may 
need to be updated to reflect gawk 3.1.15's behavior.  Please let me 
know if there's anything else I could do to help with this.

Regards,
David


From briannalaugher at toggletext.com  Mon Oct 15 01:52:26 2007
From: briannalaugher at toggletext.com (Brianna Laugher)
Date: 15 Oct 2007 18:52:26 +1000
Subject: Some SRILM test errors
Message-ID: <1192438345.4943.33.camel@lilah>

Hello,

After a day of experimenting I finally managed to get SRILM to compile.
I thought I would share my settings to help others.

I tried in vain on a RedHat 9 machine with gcc 3.2.2. Eventually I gave
up and tried a different machine.

On CentOS 4 with machine type i686 and gcc 3.4.6 I made these changes:
in common/Makefile.machine.i686:
- fix CC and CXX paths
- remove -mtune=pentium3 from GCC_FLAGS
- add NO_TCL=X and blank the other TCL things

I have gawk 3.1.3.

When running the tests I had DIFFERS for these files:
nbest-rover-acoustic stdout
ngram-class stdout
ngram-count-lm-limit-vocab stdout & stderr

I read in the archives that ngram-class is "very fickle", so I shouldn't worry
about it... Below is a diff of the last one's stdout against the reference.

I'm really just trying to have a play with Moses. It would be nice to
know if these tests are all unimportant and thus I can ignore their
failings. :)

thanks,
Brianna


[brianna at riley test]$ diff output/ngram-count-lm-limit-vocab.unknown.stdout reference/ngram-count-lm-limit-vocab.stdout
1,4c1,4
< file ../ngram-count-gt/eval97.text: 5290 sentences, 38238 words, 37429 OOVs
< 0 zeroprobs, logprob= -7154.91 ppl= 14.898 ppl1= 6.98462e+08
< file ../ngram-count-gt/eval97.text: 5290 sentences, 38238 words, 37429 OOVs
< 0 zeroprobs, logprob= -6929.8 ppl= 13.6842 ppl1= 3.68034e+08
---
> file ../ngram-count-gt/eval97.text: 5290 sentences, 38238 words, 681 OOVs
> 0 zeroprobs, logprob= -86868 ppl= 106.512 ppl1= 205.572
> file ../ngram-count-gt/eval97.text: 5290 sentences, 38238 words, 681 OOVs
> 0 zeroprobs, logprob= -85654.5 ppl= 99.788 ppl1= 190.833


From gelbart at icsi.berkeley.edu  Tue Oct 16 21:26:34 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Tue, 16 Oct 2007 21:26:34 -0700 (PDT)
Subject: Some SRILM test errors
In-Reply-To: <1192438345.4943.33.camel@lilah>
References: <1192438345.4943.33.camel@lilah>
Message-ID: <Pine.LNX.4.63.0710162114520.7817@lamb.ICSI.Berkeley.EDU>

Hi Brianna,

> I have gawk 3.1.3.
>
> When running the tests I had DIFFERS for these files:
> nbest-rover-acoustic stdout
> ngram-class stdout
> ngram-count-lm-limit-vocab stdout & stderr

The nbest-rover-acoustic test is broken in SRILM 1.5.3. For more info 
on that see 
www.speech.sri.com/projects/srilm/mail-archive/srilm-user/2007-September/10.html

I can duplicate the output you got for the ngram-count-lm-limit-vocab 
test if I put gawk 3.1.5 in my PATH instead of 3.1.3.

(It's possible that something else is the reason other than the gawk 
version.  I changed the environments in a way that may have changed 
more than just the gawk version.)

Are you sure you don't have 3.1.5 installed somewhere where SRILM 
scripts might be finding it?  I believe some of the SRILM tools find 
gawk using your PATH, while others will use the value of GAWK set in 
common/Makefile.machine.whatever.

Please let us know if you learn anything more.

Regards,
David


From gelbart at icsi.berkeley.edu  Tue Oct 16 21:47:51 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Tue, 16 Oct 2007 21:47:51 -0700 (PDT)
Subject: Some SRILM test errors
In-Reply-To: <Pine.LNX.4.63.0710162114520.7817@lamb.ICSI.Berkeley.EDU>
References: <1192438345.4943.33.camel@lilah> <Pine.LNX.4.63.0710162114520.7817@lamb.ICSI.Berkeley.EDU>
Message-ID: <Pine.LNX.4.63.0710162144130.7963@lamb.ICSI.Berkeley.EDU>

On Tue, 16 Oct 2007, David Gelbart wrote:

> I can duplicate the output you got for the ngram-count-lm-limit-vocab test if 
> I put gawk 3.1.5 in my PATH instead of 3.1.3.
>
> (It's possible that something else is the reason other than the gawk version. 
> I changed the environments in a way that may have changed more than just the 
> gawk version.)

I can also make the ngram-class test start failing by switching to 
3.1.5 (the caveat in parentheses above still applies).

I have no experience with Moses, but if you just want to play with it 
my guess is that you can ignore these test failures.

Regards,
David


From briannalaugher at toggletext.com  Tue Oct 16 22:55:10 2007
From: briannalaugher at toggletext.com (Brianna Laugher)
Date: 17 Oct 2007 15:55:10 +1000
Subject: Some SRILM test errors
In-Reply-To: <Pine.LNX.4.63.0710162114520.7817@lamb.ICSI.Berkeley.EDU>
References: <1192438345.4943.33.camel@lilah>
	 <Pine.LNX.4.63.0710162114520.7817@lamb.ICSI.Berkeley.EDU>
Message-ID: <1192600510.6600.4.camel@lilah>

On Wed, 2007-10-17 at 14:26, David Gelbart wrote:
> I can duplicate the output you got for the ngram-count-lm-limit-vocab 
> test if I put gawk 3.1.5 in my PATH instead of 3.1.3.
> 
> (It's possible that something else is the reason other than the gawk 
> version.  I changed the environments in a way that may have changed 
> more than just the gawk version.)
> 
> Are you sure you don't have 3.1.5 installed somewhere where SRILM 
> scripts might be finding it?  I believe some of the SRILM tools find 
> gawk using your PATH, while others will use the value of GAWK set in 
> common/Makefile.machine.whatever.

Hi David, thanks for your reply.

I double-checked and both the gawk in my config file and the gawk in my
path are 3.1.3. No evidence that I could find of a 3.1.5 lurking...

For kicks I tried running it again on a gawk 3.1.1, and EVERYTHING
broke. :)

Oh well, that's life.

regards,
Brianna


From lfdharo at die.upm.es  Wed Oct 17 01:56:11 2007
From: lfdharo at die.upm.es (Luis Fernando D'Haro)
Date: Wed, 17 Oct 2007 10:56:11 +0200
Subject: Adding-One smoothing
Message-ID: <20071017085610.GC26232@die.upm.es>

Hello everyone:

I would like to ask whether the SRILM toolkit allows the creation of an LM using Lidstone's smoothing technique (i.e. adding-one or adding-delta). I want to compare SRILM's results with those of a proprietary SW that uses this smoothing. I know that this technique is not the best one, but unfortunately we have a small corpus (around 5K sentences) and, at the moment, the performance of the other techniques has not been very good compared with Lidstone's (at least using this SW). 

BTW: In our SW we use deleted interpolation, and I know that SRILM only accepts backoff models. In a previous email on the user's list I saw an explanation of how to use it, but it was not entirely clear to me. Could you (Prof. Stolcke) expand a little on the example you wrote? Or could anyone with experience of this explain it to me again? 

Thanks in advance.

Sincerely,


Luis Fernando D'Haro


From gelbart at icsi.berkeley.edu  Wed Oct 17 20:14:03 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Wed, 17 Oct 2007 20:14:03 -0700 (PDT)
Subject: Some SRILM test errors
In-Reply-To: <Pine.LNX.4.63.0710162114520.7817@lamb.ICSI.Berkeley.EDU>
References: <1192438345.4943.33.camel@lilah> <Pine.LNX.4.63.0710162114520.7817@lamb.ICSI.Berkeley.EDU>
Message-ID: <Pine.LNX.4.63.0710172008060.13753@lamb.ICSI.Berkeley.EDU>

Hi Brianna,

> I can duplicate the output you got for the ngram-count-lm-limit-vocab test if 
> I put gawk 3.1.5 in my PATH instead of 3.1.3.
>
> (It's possible that something else is the reason other than the gawk version. 
> I changed the environments in a way that may have changed more than just the 
> gawk version.)

I just noticed that another aspect of the environment change I made is 
that I no longer had LC_ALL=C as recommended in the INSTALL file.

If I set LC_ALL=C, the ngram-class and ngram-count-lm-limit-vocab 
tests pass for me regardless of whether the gawk version in my PATH is 
3.1.3 or 3.1.5.

Does that fix your problem?

Regards,
David


From briannalaugher at toggletext.com  Wed Oct 17 20:57:56 2007
From: briannalaugher at toggletext.com (Brianna Laugher)
Date: 18 Oct 2007 13:57:56 +1000
Subject: Some SRILM test errors
In-Reply-To: <Pine.LNX.4.63.0710172008060.13753@lamb.ICSI.Berkeley.EDU>
References: <1192438345.4943.33.camel@lilah>
	 <Pine.LNX.4.63.0710162114520.7817@lamb.ICSI.Berkeley.EDU>
	 <Pine.LNX.4.63.0710172008060.13753@lamb.ICSI.Berkeley.EDU>
Message-ID: <1192679875.6773.0.camel@lilah>

On Thu, 2007-10-18 at 13:14, David Gelbart wrote:
> I just noticed that another aspect of the environment change I made is 
> that I no longer had LC_ALL=C as recommended in the INSTALL file.
> 
> If I set LC_ALL=C, the ngram-class and ngram-count-lm-limit-vocab 
> tests pass for me regardless of whether the gawk version in my PATH is 
> 3.1.3 or 3.1.5.
> 
> Does that fix your problem?

Aha! Thank you. When in doubt, read the instructions... then read them
again... and again. :)

cheers,
Brianna


From stolcke at speech.sri.com  Thu Oct 18 09:35:01 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 18 Oct 2007 09:35:01 -0700
Subject: Adding-One smoothing 
In-Reply-To: Your message of Wed, 17 Oct 2007 10:56:11 +0200.
             <20071017085610.GC26232@die.upm.es> 
Message-ID: <200710181635.l9IGZ1d11264@speech.sri.com>


In message <20071017085610.GC26232 at die.upm.es>you wrote:
> Hello everyone:
> 
> I just want to ask if the SRILM toolkit allows the creation a LM using the
> Lidstone's smoothing technique (i.e. adding-one or adding-delta). I want to
> compare the results obtained with a proprietary SW that works with this
> smoothing and the SRILM. I know that this technique is not the best one, but
> unfortunately we have a small corpus (around 5K sentences) and, at the moment,
> the performance of the other techniques have not been really good when
> compared with Lidstone's (at least using this SW). 

Add-delta smoothing is implemented in the latest version of SRILM.
Try downloading the 1.5.4 (beta) version.  The options are 

	-addsmooth d
	-addsmooth1 d
	-addsmooth2 d
	etc.

where d is the constant to add to each count.
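
For reference, the textbook form of add-delta (Lidstone) smoothing is given below. It is a sketch of the general technique; the exact details of the SRILM implementation may differ, and the symbols are introduced here for illustration only:

```latex
% c(h,w): count of word w after history h;  c(h): count of h;
% |V|: vocabulary size;  d: the constant given to -addsmooth
P_{\mathrm{add}}(w \mid h) \;=\; \frac{c(h,w) + d}{c(h) + d\,|V|}
```

With d = 1 this is add-one smoothing; as d approaches 0 it approaches the unsmoothed maximum likelihood estimate.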

> 
> BTW: In our SW we use deleted interpolation, I know that SRILM just accept
> Backoff models. In a previous email in the user's list, I saw an explanation
> about how to use it, but it was not totally clear for me. Could you (prof.
> Stolcke) expand a little more the example you wrote? Or if anyone has
> experience with that to explain me it again? 

I'm not sure exactly what method you are asking about, but deleted
interpolation is implemented as the smoothing method used by the
ngram-count -count-lm option.  ngram -count-lm is used to evaluate such
an LM.  Read the ngram man page to find a description of the file format.
You prepare a descriptor file for -count-lm, estimate the interpolation
weights with ngram-count, and then give the resulting file to ngram.
An example of all this is in $SRILM/test/tests/ngram-count-lm/run-test .

Andreas

> 
> Thanks in advance.
> 
> Sincerely,
> 
> 
> Luis Fernando D'Haro


From lfdharo at die.upm.es  Thu Oct 18 10:27:37 2007
From: lfdharo at die.upm.es (Luis Fernando D'Haro)
Date: Thu, 18 Oct 2007 19:27:37 +0200
Subject: Adding-One smoothing
In-Reply-To: <200710181635.l9IGZ1d11264@speech.sri.com>
References: <20071017085610.GC26232@die.upm.es> <200710181635.l9IGZ1d11264@speech.sri.com>
Message-ID: <20071018172737.GE1499@die.upm.es>

 
> Add-delta smoothing is implemented in the latest version of SRILM.
> Try downloading the 1.5.4 (beta) version.  The options are 
> 
> 	-addsmooth d
> 	-addsmooth1 d
> 	-addsmooth2 d
> 	etc.
> 
> where d is the constant to add to each count.

Thanks Prof. for this new release and your quick answer. I will test it.

> I'm not sure exactly what method you are asking about, but deleted
> interpolation is implemented as the smoothing method used by the
> ngram-count -count-lm option.  ngram -count-lm is used to evaluate such
> an LM.  

currently the SW we have implements something like this:

P(w|h) = lambda_trig * P_3(w|h) + (1-lambda_trig) * [lambda_big * P_2(w|h) + (1-lambda_big) * [lambda_unig * P(w) + (1-lambda_unig) * P(zerogram)]]

In all cases, the probability is calculated using the adding-delta smoothing technique. 
 
It is important to mention that in this equation lambda_trig, lambda_big and lambda_unig are single global values (i.e. this is like having just one bin, rather than a different lambda for each bin as proposed by Jelinek). 
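
As a numeric illustration of the nested interpolation with one global lambda per order, here is a minimal sketch using awk from a POSIX shell. The function name interp and the sample probabilities are made up for the example; they do not come from the SW or from SRILM:

```shell
# interp LT LB LU P3 P2 P1 P0
# LT, LB, LU: global trigram, bigram and unigram lambdas
# P3, P2, P1: trigram, bigram and unigram estimates; P0: zerogram probability
interp() {
    awk -v lt="$1" -v lb="$2" -v lu="$3" \
        -v p3="$4" -v p2="$5" -v p1="$6" -v p0="$7" 'BEGIN {
        uni = lu * p1 + (1 - lu) * p0     # innermost bracket
        bi  = lb * p2 + (1 - lb) * uni    # middle bracket
        tri = lt * p3 + (1 - lt) * bi     # full expression
        printf "%.6f\n", tri
    }'
}

# All lambdas 0.5, zerogram = 1/1000:
interp 0.5 0.5 0.5 0.4 0.2 0.1 0.001   # prints 0.262625
```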

Previously, I had tried the -count-lm option with the following configuration file:

order 3 
vocabsize 1002 
totalcount 74883 
mixweights 0 
0.5 0.5 0.5 
countmodulus 1 
counts train.counts

and after applying the EM algorithm I obtained the following values:

order 3
mixweights 0
 0.932452 0.894774 0.994639
countmodulus 1
vocabsize 1002
totalcount 74883
counts train.counts

but my PPL results were not as good as using the SW we have. 
 
Is there something wrong with the configuration file? Or is the problem related to using Good-Turing instead of Adding-delta?

Thanks in advance,


Luis Fernando


From stolcke at speech.sri.com  Thu Oct 18 10:29:41 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 18 Oct 2007 10:29:41 -0700
Subject: Help needed with SRILM 
In-Reply-To: Your message of Fri, 12 Oct 2007 19:00:54 +0200.
             <71caf0730710121000i68f6c11s1b0f452a6e096086@mail.gmail.com> 
Message-ID: <200710181729.l9IHTfd16651@speech.sri.com>


> 
> Hi Andreas,
> 
> First of all, thank you for the fast reply last time.
> I have read your answer to Roy Bar-Haim, and tried to follow it. I found that
> there were duplicate parts in the training data, and I have erased them,
> and I have tried to create the language model from a corpus 10 times larger,
> but it did not help. I have managed to
> get rid of the warning only by changing the -gt1n(min/max) options.
> By doing this, I have discovered that the performance of the language model
> is greatly affected by the probability given to the <unk> token. I use
> ngram-count like this :
> 
> ngram-count -text corp.out -lm ngram-count_output/lm_2iter.lm -unk -order 3
> -gt1min 0 -gt1max 2
> 
> So, as far as I understand, there should be no occurrence of unk in the
> corpus. But unk gets a high probability - higher even than words that did
> appear once in the corpus. Only when I disable discounting do I get a low
> probability for <unk>.

Here is the problem:  you are estimating an LM with <unk> from data that 
doesn't have any instance of <unk>.  As a result, <unk> gets all the 
unigram probability mass that is left after discounting the observed 
unigrams, and that can be substantial.
This is because all the discounted unigram probability mass is distributed
over all the zeroton words, and in this case <unk> is the only zeroton word.
(If there are no zeroton words, then the discounted mass is added evenly to
ALL the words.)

Incidentally, when you try this with -gt1max 1 (the default) on the
Switchboard counts under $SRILM/test/tests/ngram-count-gt you get

-5.503182	<unk>

a very small probability. Already, with -gt1max 2 you get 

-2.558506       <unk>

which indeed is larger than many observed words. But that is not unexpected.
After all, <unk> is representing ALL unobserved words.
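
Written out, the case described above (where <unk> is the only zeroton word) looks roughly as follows. This is a sketch of the argument, not a transcription of the SRILM code, and the symbols are introduced here for illustration:

```latex
% N: total number of unigram tokens;  c(w): count of word w;
% d_{c(w)}: discount factor applied to words seen c(w) times
P(\texttt{<unk>}) \;=\; 1 - \sum_{w \,:\, c(w) > 0} d_{c(w)} \, \frac{c(w)}{N}

% The two values quoted above are base-10 logarithms:
10^{-5.503182} \approx 3.1 \times 10^{-6}, \qquad
10^{-2.558506} \approx 2.8 \times 10^{-3}
```

Discounting more counts (a larger -gt1max) makes the d factors bite on more words, so the left-over mass, and with it the probability of <unk>, grows.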

The proper remedy is to limit your LM vocabulary to something less than the 
observed words, so that the remaining words can give you a meaningful 
estimate for unobserved words.

> Is there an option to set a fixed probability for the
> <unk>?

No, there isn't.  But there is a trick to achieve a similar effect.
Since your data doesn't contain any <unk> you can fake some.
Just make a count file that contains some fictitious occurrences for <unk>,
e.g.,

<unk>		100

and call this UNK.counts.
Then add those counts to your real data, e.g., 

ngram-count -text corp.out -read UNK.counts -lm ngram-count_output/lm_2iter.lm -unk -order 3 -gt1min 0 -gt1max 2

And of course you can play with the fake count value to achieve a result 
that is reasonable, or even optimal on some held-out data.
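
The two steps can be put together as below. This is a sketch assuming a POSIX shell; the count of 100 is just the example value from above, and the ngram-count call is skipped unless SRILM and the corpus file are actually present:

```shell
# Write a count file with a made-up number of <unk> occurrences.
# Tune unk_count on held-out data; 100 is only the example value.
unk_count=100
printf '<unk>\t%d\n' "$unk_count" > UNK.counts

# Merge the fake counts with the real training text.
if command -v ngram-count >/dev/null 2>&1 && [ -f corp.out ]; then
    mkdir -p ngram-count_output
    ngram-count -text corp.out -read UNK.counts \
        -lm ngram-count_output/lm_2iter.lm -unk -order 3 -gt1min 0 -gt1max 2
fi
```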

> 
> BTW : I changed the LM.cc a bit, so when I call ngram -ppl it acts as a
> probability server - it listens on a port, and gets sequences of words and
> returns its probability.
> Do you want me to send you the code for it, so it could be added as a
> feature ?

Please do send the code.  I wouldn't want to modify the existing meaning
of -ppl, but a new option with this functionality is something that
several people have asked about.

Andreas

> 
> Regards,
> Elad Dinur
> 
> On 9/23/07, Andreas Stolcke <stolcke at speech.sri.com> wrote:
> >
> > Elad Dinur wrote:
> > > Hello Andreas And/Or Jing,
> > >
> > > I am a graduate student in the Hebrew University of Jerusalem, guided
> > > by Ari Rappoport of The Hebrew University.
> > > I am working on Unsupervised segmentation of words, with emphasis on
> > > semitic languages, developing on Modern Hebrew.
> > > I am using SRILM to generate a trigram language model, and finding the
> > > probability of a sentence with the model.
> > > I am using ngram-count with the default setting, As far as I
> > > understand that means Good-Turing discounting with Katz Backoff.
> > > I get the following warning :
> > >
> > > warning: discount coeff 1 is out of range: 1.79427e-17
> > >
> > > I wonder if you can direct me to a document which elaborates on this
> > warning.
> > > Thanks in advance,
> > > Elad Dinur.
> > >
> > You can find the answer to this and many other questions by going to
> >
> > http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/
> >
> > and searching for "discount coeff 1 is out of range".
> >
> > Andreas
> >
> >
> >
> >
> 
> 
> -- 
> what ?!
> 


From stolcke at speech.sri.com  Thu Oct 18 12:11:48 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 18 Oct 2007 12:11:48 -0700
Subject: Adding-One smoothing 
In-Reply-To: Your message of Thu, 18 Oct 2007 19:27:37 +0200.
             <20071018172737.GE1499@die.upm.es> 
Message-ID: <200710181911.l9IJBmd24770@speech.sri.com>


--Andreas

In message <20071018172737.GE1499 at die.upm.es>you wrote:
>  
> > Add-delta smoothing is implemented in the latest version of SRILM.
> > Try downloading the 1.5.4 (beta) version.  The options are 
> > 
> > 	-addsmooth d
> > 	-addsmooth1 d
> > 	-addsmooth2 d
> > 	etc.
> > 
> > where d is the constant to add to each count.
> 
> Thanks Prof. for this new release and your quick answer. I will test it.
> 
> > I'm not sure exactly what method you are asking about, but deleted
> > interpolation is implemented as the smoothing method used by the
> > ngram-count -count-lm option.  ngram -count-lm is used to evaluate such
> > an LM.  
> 
> currently the SW we have implements something like this:
> 
> P(w|h) = lambda_trig * P_3(w|h) + (1-lambda_trig) * [lambda_big * P_2(w|h)
>          + (1-lambda_big) * [lambda_unig * P(w) + (1-lambda_unig) * P(zerogram)]]
> 
> In all cases, the probability is calculated using the adding-delta smoothing 
> technique. 

That is a combination of additive smoothing and deleted interpolation
that is not currently implemented in SRILM.

>  
> It is important to mention that in this equation, there is a global
> lambda_trig, lambda_big and lambda_unig values (i.e. this is like having just
> one bin, not as proposed by Jelinek where there is a different lambda for
> different bins). 
> 
> Previously, I had tried to use the -count-lm using the following
> configuration file:
> 
> order 3 
> vocabsize 1002 
> totalcount 74883 
> mixweights 0 
> 0.5 0.5 0.5 
> countmodulus 1 
> counts train.counts
> 
> and after applying the EM algorithm I obtained the following values:
> 
> order 3
> mixweights 0
>  0.932452 0.894774 0.994639
> countmodulus 1
> vocabsize 1002
> totalcount 74883
> counts train.counts
> 
> but my PPL results were not as good as using the SW we have. 
>  
> Is it something wrong with the configuration file? or the problem is related 
> with using Good-Turing instead of Adding-delta?

There is nothing wrong with it.  The difference is that in SRILM the
underlying probability estimates (as in standard deleted interpolation)
are simple maximum likelihood estimates (without Good Turing smoothing).

It would be very straightforward to add optional add-delta smoothing
to the -count-lm model, since all the quantities needed are readily available.
You just have to add some code to get the delta parameter from the LM
file (similar to what's already there for the other parameters) and modify
line 373 in NgramCountLM.cc to implement the add-delta formula.
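
To make the intended change concrete, here is a sketch of the estimate that would replace the maximum likelihood ratio, written with awk from a POSIX shell. It is not the code in NgramCountLM.cc; the function name add_delta and its argument order are assumptions made for this example:

```shell
# add_delta C_HW C_H DELTA VOCABSIZE
# Returns (C_HW + DELTA) / (C_H + DELTA * VOCABSIZE).
# With DELTA = 0 this reduces to the maximum likelihood estimate C_HW / C_H.
add_delta() {
    awk -v c="$1" -v ch="$2" -v d="$3" -v v="$4" 'BEGIN {
        denom = ch + d * v
        if (denom == 0) { print "undefined"; exit 1 }  # unseen history, no smoothing
        printf "%.6f\n", (c + d) / denom
    }'
}

add_delta 3 10 0 1002   # prints 0.300000 (plain ML estimate)
add_delta 0 10 1 1002   # prints 0.000988 (unseen word, add-one)
```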

If you do this please send me your changes!

Andreas 


From stolcke at speech.sri.com  Fri Oct 19 10:49:03 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 19 Oct 2007 10:49:03 PDT
Subject: Some SRILM test errors 
In-Reply-To: Your message of Tue, 16 Oct 2007 21:26:34 -0700.
             <Pine.LNX.4.63.0710162114520.7817@lamb.ICSI.Berkeley.EDU> 
Message-ID: <200710191749.l9JHn4M18206@huge>


In message <Pine.LNX.4.63.0710162114520.7817 at lamb.ICSI.Berkeley.EDU>you wrote:
> Hi Brianna,
> 
> > I have gawk 3.1.3.
> >
> > When running the tests I had DIFFERS for these files:
> > nbest-rover-acoustic stdout
> > ngram-class stdout
> > ngram-count-lm-limit-vocab stdout & stderr
> 
> The nbest-rover-acoustic test is broken in SRILM 1.5.3. For more info 
> on that see 
> www.speech.sri.com/projects/srilm/mail-archive/srilm-user/2007-September/10.html

David is right, and if you download the beta version of SRILM 1.5.4
this problem is fixed.
This version also fixes a number of other locale-related issues.

Specifically, in the tests ngram-class and ngram-count-lm-limit-vocab
the problem is simply that different locale settings give different "sort"
output.  You can fix this by putting 

LC_COLLATE=C
export LC_COLLATE

at the top of 

$SRILM/test/tests/ngram-class/run-test
$SRILM/test/tests/ngram-count-lm-limit-vocab/run-test

Other than that it should work regardless of the gawk version.
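
A small illustration of why collation matters for these tests, assuming a POSIX sort. LC_ALL=C is used here, rather than LC_COLLATE=C, only so that the demo is not affected by an LC_ALL already set in the caller's environment:

```shell
# In the C locale, sort compares bytes, so uppercase comes before lowercase.
printf 'b\nA\na\nB\n' | LC_ALL=C sort
# Output:
#   A
#   B
#   a
#   b
# Under a locale such as en_US.UTF-8 the order typically interleaves
# upper and lower case instead, which is enough to make a diff against
# the reference output fail.
```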

Andreas 

> 
> I can duplicate the output you got for the ngram-count-lm-limit-vocab 
> test if I put gawk 3.1.5 in my PATH instead of 3.1.3.
> 
> (It's possible that something else is the reason other than the gawk 
> version.  I changed the environments in a way that may have changed 
> more than just the gawk version.)
> 
> Are you sure you don't have 3.1.5 installed somewhere where SRILM 
> scripts might be finding it?  I believe some of the SRILM tools find 
> gawk using your PATH, while others will use the value of GAWK set in 
> common/Makefile.machine.whatever.
> 
> Please let us know if you learn anything more.
> 
> Regards,
> David
> 
> 
> 


From stolcke at speech.sri.com  Fri Oct 19 11:00:36 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 19 Oct 2007 11:00:36 PDT
Subject: SRILM and LC_ALL 
In-Reply-To: Your message of Mon, 08 Oct 2007 22:24:49 -0700.
             <Pine.LNX.4.63.0710082207370.17151@lamb.ICSI.Berkeley.EDU> 
Message-ID: <200710191800.l9JI0a620166@huge>


David et al.,

there were several issues with add-pauses-to-pfsg and UTF-8 locales.
The regular expression /[x80-x8F]/ is not legal in UTF-8 locales because
it contains characters with the high bit set (UTF-8 uses the high bit to
encode multibyte characters).
I fixed this recently by using a different but equivalent regex instead.

The other problem is that pre-3.1.5 (actually pre-3.1.4) gawk
was not using ctype library functions for implementing character classes 
like [:lower:].

So, the upshot is that if you 

1) get the latest beta version (to fix the regex issue) AND
2) use gawk 3.1.5 or later

you should be able to use add-pauses-to-pfsg and pass the "make-ngram-pfsg"
test regardless of locale setting.  You CAN use gawk 3.1.3 (which is 
what seems to be pre-installed on many Linux systems) but then you need
to use LANG=C or LANG=en_US.
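
Keep in mind that LANG is only the lowest-priority setting: under the POSIX
rules a non-empty LC_ALL overrides every LC_* variable, and a category
variable such as LC_COLLATE overrides LANG.  A minimal Python sketch of that
lookup order (the function name is made up for illustration, and the final
fallback to "C" is an assumption, since the real default is
implementation-defined):

```python
def effective_locale(category, env):
    """Return the locale a POSIX program would use for one category.

    Lookup order: LC_ALL, then the category's own variable, then LANG.
    Unset or empty variables are skipped.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return "C"  # assumed fallback; the real default is implementation-defined


# LC_COLLATE=C wins over LANG as long as LC_ALL is not set.
print(effective_locale("LC_COLLATE", {"LANG": "en_US.UTF-8", "LC_COLLATE": "C"}))
# Once LC_ALL is set, it overrides LC_COLLATE.
print(effective_locale("LC_COLLATE", {"LC_ALL": "en_US", "LC_COLLATE": "C"}))
```

So if LC_ALL is set in your environment, changing LANG (or the LC_COLLATE=C
exported by a run-test script) will not change what gawk sees.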

I added a note about this to various documentation files.

--Andreas

In message <Pine.LNX.4.63.0710082207370.17151 at lamb.ICSI.Berkeley.EDU>you wrote:
> 
> > My default locale is en_US.  With this locale, I do not see the error David
>  
> > Brodbeck did, even if I use gawk 3.1.5.  If I set LANG=en_US.UTF-8 and use 
> > gawk 3.1.5, then I see the error:
> >
> > $ /usr/local/bin/gawk  -f `which add-pauses-to-pfsg`
> > gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22: fatal: 
> > Invalid collation character: /[[:lower:]-?]/
> 
> A followup:
> 
> At home, I'm running gawk 3.1.15 under Fedora Core 3 and my default 
> locale is en_US.UTF-8:
> 
>   $ locale
>   LANG=en_US.UTF-8
>   LC_CTYPE="en_US.UTF-8"
>   LC_NUMERIC="en_US.UTF-8"
>   LC_TIME="en_US.UTF-8"
>   LC_COLLATE="en_US.UTF-8"
>   LC_MONETARY="en_US.UTF-8"
>   LC_MESSAGES="en_US.UTF-8"
>   LC_PAPER="en_US.UTF-8"
>   LC_NAME="en_US.UTF-8"
>   LC_ADDRESS="en_US.UTF-8"
>   LC_TELEPHONE="en_US.UTF-8"
>   LC_MEASUREMENT="en_US.UTF-8"
>   LC_IDENTIFICATION="en_US.UTF-8"
>   LC_ALL=
> 
> If I use the default locale, I get the "Invalid collation character" 
> error.  If I set LANG=C, I get the same error.
> 
> If I set LC_ALL=en_US, that error goes away but the make-ngram-pfsg 
> test fails with the message "make-ngram-pfsg: stdout output DIFFERS". 
> I think this is because when LC_ALL is set it overrides the other LC_* 
> variables (http://opengroup.org/onlinepubs/007908799/xbd/envvar.html). 
> This means that the line in test/tests/make-ngram-pfsg/run-test which 
> sets LC_COLLATE=C has no effect when LC_ALL is set.
> 
> If I set LANG=en_US and leave LC_ALL unset, then the
> "Invalid collation character" error goes away and the make-ngram-pfsg
> test passes.
> 
> So it appears that the gawk locale tips in the SRILM INSTALL file may 
> need to be updated to reflect gawk 3.1.15's behavior.  Please let me 
> know if there's anything else I could do to help with this.
> 
> Regards,
> David
> 
> 
> 
> 
> 


From dianaduraiz at gmail.com  Tue Oct 23 09:08:21 2007
From: dianaduraiz at gmail.com (Diana Durán)
Date: Tue, 23 Oct 2007 18:08:21 +0200
Subject: Optimize the interpolation parameters
Message-ID: <da7851910710230908m6eb0c418gb1aaf56f7c50a2b5@mail.gmail.com>

Hello,

I am using SRILM to create a language model based on modified interpolated
Kneser-Ney smoothing. Is it possible to optimize the lambda values from a
held-out set with SRILM?

Thanks for your help

Diana

From deliverable at gmail.com  Tue Oct 23 09:47:00 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Tue, 23 Oct 2007 20:47:00 +0400
Subject: incremental ngram counts
Message-ID: <9AF24308-326B-4BCE-9B10-CD9A3A258E75@gmail.com>

Greetings -- I want to count ngrams at certain fractions of my corpus  
by size, e.g. for 10%, 20%, etc.  Is there an alternative to  
concocting separate lists of ad hoc subcorpora and running  
ngram-count separately?  What if I want to track exactly how many new  
ngrams each file contributes, when going in a certain order?

Cheers,
Alexy


From save.climate at gmail.com  Tue Oct 23 09:57:16 2007
From: save.climate at gmail.com (Kamadev Bhanuprasad)
Date: Tue, 23 Oct 2007 18:57:16 +0200
Subject: Optimize the interpolation parameters
In-Reply-To: <da7851910710230908m6eb0c418gb1aaf56f7c50a2b5@mail.gmail.com>
References: <da7851910710230908m6eb0c418gb1aaf56f7c50a2b5@mail.gmail.com>
Message-ID: <244d59a50710230957h704a1bf0j45076d517baaf86d@mail.gmail.com>

Hi Diana,

it was already discussed, see
http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/2007-July/5.html

Kamadev

On 10/23/07, Diana Durán <dianaduraiz at gmail.com> wrote:
>
>
> Hello,
>
> I am using SRILM to create a language model based on modified interpolated
> Kneser-Ney smoothing. Is it possible to optimize the lambdas values from a
> held-out set with SRILM?
>
> Thanks for your help
>
> Diana
>
>
>

From stolcke at speech.sri.com  Tue Oct 23 09:54:38 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 23 Oct 2007 09:54:38 -0700
Subject: Optimize the interpolation parameters
In-Reply-To: <da7851910710230908m6eb0c418gb1aaf56f7c50a2b5@mail.gmail.com>
References: <da7851910710230908m6eb0c418gb1aaf56f7c50a2b5@mail.gmail.com>
Message-ID: <471E274E.6040605@speech.sri.com>

Diana Durán wrote:
>
> Hello,
>
> I am using SRILM to create a language model based on modified 
> interpolated Kneser-Ney smoothing. Is it possible to optimize the 
> lambdas values from a held-out set with SRILM?
The interpolation weights used in smoothing (for combining higher and 
lower-order estimates) do not have to be estimated separately from data.
They are given by formulae derived from the counts-of-counts, and 
built into the discounting methods.

If you are asking about interpolating different LMs, use 
compute-best-mix, described in the ppl-scripts(1) man page.
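
For intuition, mixture weights of this kind are commonly estimated with EM on
held-out data.  The Python sketch below is a generic two-model version, not
code taken from SRILM; the function name is made up, and it assumes you
already have the per-word probabilities that each LM assigns to the same
held-out words, all strictly positive:

```python
def best_mix_weight(probs_a, probs_b, lam=0.5, iterations=50):
    """EM estimate of lam in p(w) = lam * p_a(w) + (1 - lam) * p_b(w).

    probs_a and probs_b hold the probabilities two LMs assign to the same
    held-out words, in the same order.  All values must be > 0.
    """
    for _ in range(iterations):
        # E-step: posterior probability that model A "generated" each word.
        posteriors = [
            lam * pa / (lam * pa + (1.0 - lam) * pb)
            for pa, pb in zip(probs_a, probs_b)
        ]
        # M-step: the new weight is the average posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam


# Made-up probabilities: each model is better on half of the words.
print(best_mix_weight([0.4, 0.1], [0.1, 0.4]))
```

Each iteration cannot decrease the held-out likelihood, so a few dozen
iterations are normally enough for a single weight.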

Andreas


From mlharville at yahoo.com  Tue Oct 23 14:37:20 2007
From: mlharville at yahoo.com (Michael Harville)
Date: Tue, 23 Oct 2007 14:37:20 -0700 (PDT)
Subject: lattice-tool -ppl not working for me
Message-ID: <380533.58323.qm@web60616.mail.yahoo.com>

Hi,

Please excuse the newbie question, but I have searched the archives and web for an answer, and have not been able to find one. I am running the following command:

echo "HUGE WIN OVER RUTGERS" > sentence.txt
lattice-tool -ppl sentence.txt -in-lattice footballPodcast.lat -read-htk -debug 2 -order 4

and am getting the following results:

        p( HUGE | <s> )         =  0 [ -inf ]
        p( WIN | HUGE ...)      =  0 [ -inf ]
        p( OVER | WIN ...)      =  0 [ -inf ]
        p( RUTGERS | OVER ...)  =  0 [ -inf ]
        p( </s> | RUTGERS ...)  =  0 [ -inf ]
Viterbi backtrace failed
1 sentences, 4 words, 0 OOVs
5 zeroprobs, logprob= 0 ppl= undefined ppl1= undefined

Anyone know what might be going on? The original utterance from which the lattice was built is 2 minutes long, containing much more speech than just the four word sentence I am testing on. Is that the problem?

Generally speaking, I am looking for a tool that can give me the highest probability location (along with the associated probability) of where a sequence of words was spoken in an audio file. I am using Sphinx 3.7 to generate lattices from the audio, and have been using various SRILM tools to examine these lattices. Is there a tool that does what I want, or will I need to make one?

Much thanks in advance!
Mike


From stolcke at speech.sri.com  Tue Oct 23 16:26:15 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 23 Oct 2007 16:26:15 PDT
Subject: lattice-tool -ppl not working for me 
In-Reply-To: Your message of Tue, 23 Oct 2007 14:37:20 -0700.
             <380533.58323.qm@web60616.mail.yahoo.com> 
Message-ID: <200710232326.l9NNQFc09322@huge>


In message <380533.58323.qm at web60616.mail.yahoo.com>you wrote:
> Hi,
> 
> Please excuse the newbie question, but I have searched the archives and web f
> or an answer, and have not been able to find one. I am running the following 
> command:
> 
> echo "HUGE WIN OVER RUTGERS" > sentence.txt
> lattice-tool -ppl sentence.txt -in-lattice footballPodcast.lat -read-htk -deb
> ug 2 -order 4
> 
> and am getting the folllowing results:
> 
>         p( HUGE | <s> )         =  0 [ -inf ]
>         p( WIN | HUGE ...)      =  0 [ -inf ]
>         p( OVER | WIN ...)      =  0 [ -inf ]
>         p( RUTGERS | OVER ...)  =  0 [ -inf ]
>         p( </s> | RUTGERS ...)  =  0 [ -inf ]
> Viterbi backtrace failed
> 1 sentences, 4 words, 0 OOVs
> 5 zeroprobs, logprob= 0 ppl= undefined ppl1= undefined
> 
> Anyone know what might be going on? The original utterance from which the lat
> tice was built is 2 minutes long, containing much more speech than just the f
> our word sentence I am testing on. Is that the problem?

Yes, probably.  lattice-tool -ppl only works for word sequences that exactly
correspond to a path through the lattice between initial and final node.
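
To illustrate the constraint, here is a toy Python sketch (not lattice-tool's
actual algorithm) that checks whether a word string labels a complete path
from the initial to the final node.  The lattice, its node numbers and the
function name are invented for the example, and words are placed on links
for simplicity:

```python
def is_full_path(edges, start, final, words):
    """True if `words` labels some path from `start` all the way to `final`.

    edges maps a node to a list of (word, next_node) links.
    """
    frontier = {start}
    for word in words:
        # Follow every link labelled with the current word.
        frontier = {
            nxt
            for node in frontier
            for label, nxt in edges.get(node, [])
            if label == word
        }
        if not frontier:
            return False
    return final in frontier


# Invented lattice for a longer utterance that contains the phrase.
edges = {
    0: [("GOOD", 1)],
    1: [("MORNING", 2)],
    2: [("HUGE", 3), ("HUGH", 3)],
    3: [("WIN", 4)],
    4: [("OVER", 5)],
    5: [("RUTGERS", 6)],
}
print(is_full_path(edges, 0, 6, "HUGE WIN OVER RUTGERS".split()))
print(is_full_path(edges, 0, 6, "GOOD MORNING HUGE WIN OVER RUTGERS".split()))
```

The four-word phrase is in the lattice but does not start at the initial
node, so it is not a complete path; only the full utterance is.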

> Generally speaking, I am looking for a tool that can give me the highest prob
> ability location (along with the associated probability) of where a sequence 
> of words was spoken in an audio file. I am using Sphinx 3.7 to generate latti
> ces from the audio, and have been using various SRILM tools to examine these 
> lattices. Is there a tool that does what I want, or will I need to make one?

What you are trying to do is a kind of word or phrase spotting.

lattice-tool -order 4 -write-ngrams OUTPUT

will write a list of all 4-grams occurring anywhere in the lattice, along
with their posterior probabilities accumulated over all positions.
You could use this to see if your string is SOMEWHERE in the lattice.

lattice-tool -order 4 -write-ngram-index OUTPUT

will generate an index of all 4-gram occurrences and their positions relative
to the start of the utterance, durations, and posterior probabilities
(without combining distinct instances that are separated in time).

You might have to play with the -min-count option to limit output of 
very low-probability ngrams, or -posterior-prune to make the lattices
smaller prior to processing (for speed/memory reasons).

Andreas 


From svmats at yahoo.com  Thu Oct 25 04:13:03 2007
From: svmats at yahoo.com (Mats Svenson)
Date: Thu, 25 Oct 2007 04:13:03 -0700 (PDT)
Subject: Saving option for ngram-class
Message-ID: <159333.1460.qm@web31611.mail.mud.yahoo.com>

Hi,
 I guess the -save option as implemented in ngram-class is not very useful. Typically, I'm not interested in testing the classes that appear at the beginning of the clustering process, but rather in the classes induced in the final steps. If the number of clustered words is high, the current option results in creating an enormous number of useless files.

It would be much more practical if the user could explicitly choose which class granularities are saved or, alternatively, if there were some -startsave option that starts saving class files only close to the end of the clustering.

Would that be easy to implement?

One more thing: is there an easy way to find out how many classes appear in a particular class file without writing a script? The number of iterations doesn't say that directly, and I'm not sure whether it can be computed as NUMBER_OF_WORDS_IN_THE_VOCAB - NUMBER_OF_ITERATIONS - NUMBER_OF_WORDS_IN_THE_NO_CLASS_VOCAB.

Best,
 Mats




From partha.lal at gmail.com  Tue Oct 30 09:03:57 2007
From: partha.lal at gmail.com (Partha Lal)
Date: Tue, 30 Oct 2007 16:03:57 +0000
Subject: format error in lattice file...
Message-ID: <b2d43c860710300903l57fbbd69sc920bcc63e394b1@mail.gmail.com>

Hello,

I'm trying to get a lattice error rate from an htk format lattice file but
keep getting a format error:

> lattice-tool -read-htk -in-lattice
results/globalphone-sp/lattices/mfcc13+dd/hmm4_2_512/SP002_21.lat
-out-lattice
results/globalphone-sp/lattices/mfcc13+dd/hmm4_2_512/SP002_21.pfsg
> nbest-lattice -read
results/globalphone-sp/lattices/mfcc13+dd/hmm4_2_512/SP002_21.pfsg
-lattice-error -reference `cat
data/globalphone-sp/wrdfile/rmn/SP002/SP002_21.wrd`
results/globalphone-sp/lattices/mfcc13+dd/hmm4_2_512/SP002_21.pfsg: line 2:
unknown keyword
format error in lattice file

I can't see what's wrong with my lattice files - I've made them available at
http://homepages.inf.ed.ac.uk/s0565860/lattice_problem/ .
Can anyone suggest what might be wrong?

Thanks,

Partha

From stolcke at speech.sri.com  Tue Oct 30 21:23:22 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 30 Oct 2007 21:23:22 -0700
Subject: format error in lattice file...
In-Reply-To: <b2d43c860710300903l57fbbd69sc920bcc63e394b1@mail.gmail.com>
References: <b2d43c860710300903l57fbbd69sc920bcc63e394b1@mail.gmail.com>
Message-ID: <4728033A.5020802@speech.sri.com>

Partha Lal wrote:
> Hello,
>
> I'm trying to get a lattice error rate from an htk format lattice file 
> but keep getting a format error:
>
> > lattice-tool -read-htk -in-lattice 
> results/globalphone-sp/lattices/mfcc13+dd/hmm4_2_512/SP002_21.lat 
> -out-lattice 
> results/globalphone-sp/lattices/mfcc13+dd/hmm4_2_512/SP002_21.pfsg
The PFSG lattice format is NOT the one used by nbest-lattice.  
nbest-lattice understands two kinds of format designed to encode
word posterior probabilities, both described in the wlat-format(5) man page.

The first format, word posterior lattices, is produced by lattice-tool 
-write-posteriors.
The second format, word confusion networks, aka sausages, is produced by 
lattice-tool -write-mesh .

Andreas

> > nbest-lattice -read 
> results/globalphone-sp/lattices/mfcc13+dd/hmm4_2_512/SP002_21.pfsg 
> -lattice-error -reference `cat 
> data/globalphone-sp/wrdfile/rmn/SP002/SP002_21.wrd`
> results/globalphone-sp/lattices/mfcc13+dd/hmm4_2_512/SP002_21.pfsg: 
> line 2: unknown keyword
> format error in lattice file
>
> I can't see what's wrong with my lattice files - I've made them 
> available at http://homepages.inf.ed.ac.uk/s0565860/lattice_problem/ .
> Can anyone suggest what might be wrong?
>
> Thanks,
>
> Partha


From dyuret at ku.edu.tr  Wed Oct 31 03:02:33 2007
From: dyuret at ku.edu.tr (Deniz Yuret)
Date: Wed, 31 Oct 2007 12:02:33 +0200
Subject: OOV calculations
Message-ID: <cea871f80710310302s53235d17x16ee2278d2170451@mail.gmail.com>

Hi,

We are working on language models for agglutinative languages where
the number of unique tokens is comparatively large, and dividing words
into morphemes is useful.  When such divisions are performed (e.g.
representing each compound word as two tokens: stem+ and +suffix), the
number of unique tokens and the number of OOV tokens are reduced; however, it
becomes difficult to compare two such systems with different OOV
counts.

Thus I started looking carefully into the ngram output, and so far
here is what I have understood, please correct me if I am wrong:

1. logprob is the log of the product of the probabilities for all
non-oov tokens (including </s>).
2. ppl = 10^(-logprob / (ntokens - noov + nsentences))
3. ppl1 = 10^(-logprob / (ntokens - noov))
4. I am not quite sure what zeroprobs gives.

My first question is about a slight inconsistency in the calculation
of ppl1: the </s> probabilities are included in logprob, however their
count is not included in the denominator.  Shouldn't we have a
separate logprob total that excludes </s> for the ppl1 calculation?

My second question is what exactly does zeroprobs give?

My final question is on how to fairly compare two models which divide
the same data into different numbers of tokens and have different OOV
counts.  It seems like the change in the number of tokens can be dealt
with comparing the probabilities assigned to the whole data set
(logprob) rather than per token averages (ppl).  However the current
output totally ignores the penalty that should be incurred from OOV
tokens.  As an easy solution, one can designate a fixed penalty for
each OOV token to be added to the logprob total.  It is not clear how
that fixed penalty should be determined.  A better solution is to have
a character-based model that assigns a non-zero probability to every
word and maybe interpolate it with the token-based model.  I am not
quite sure how this is possible in the srilm framework.

Any advice would be appreciated.

best,
deniz


From stolcke at speech.sri.com  Wed Oct 31 16:46:50 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 31 Oct 2007 16:46:50 PDT
Subject: OOV calculations 
In-Reply-To: Your message of Wed, 31 Oct 2007 12:02:33 +0200.
             <cea871f80710310302s53235d17x16ee2278d2170451@mail.gmail.com> 
Message-ID: <200710312346.l9VNkoX07102@huge>


In message <cea871f80710310302s53235d17x16ee2278d2170451 at mail.gmail.com>you wrote:
> Hi,
> 
> We are working on language models for agglutinative languages where
> the number of unique tokens is comparatively large, and dividing words
> into morphemes is useful.  When such divisions are performed (e.g.
> represent each compound word as two tokens: stem+ and +suffix), number
> of unique tokens and the number of OOV tokens are reduced, however it
> becomes difficult to compare two such systems with different OOV
> counts.
> 
> Thus I started looking carefully into the ngram output, and so far
> here is what I have understood, please correct me if I am wrong:
> 
> 1. logprob is the log of the product of the probabilities for all
> non-oov tokens (including </s>).

correct.

> 2. ppl = 10^(-logprob / (ntokens - noov + nsentences))

correct.

> 3. ppl1 = 10^(-logprob / (ntokens - noov))

correct.

> 4. I am not quite sure what zeroprobs gives.

Words that are in the vocabulary but get probability 0 in the LM.
They are treated the same as OOVs for the purpose of perplexity computation.
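
Putting the formulas above together, a small Python sketch for recomputing
both perplexities from the summary line of the ngram output (the function
name and the sample numbers are made up; logprob is base 10, and zeroprobs
are subtracted along with OOVs as just described):

```python
def perplexities(logprob, ntokens, nsentences, noov, nzeroprobs=0):
    """Recompute ppl and ppl1 from the counts in the ngram summary line.

    OOVs and zeroprobs contribute nothing to logprob, so they are also
    left out of the denominators.
    """
    scored = ntokens - noov - nzeroprobs
    ppl = 10 ** (-logprob / (scored + nsentences))  # counts one </s> per sentence
    ppl1 = 10 ** (-logprob / scored)                # words only, no </s>
    return ppl, ppl1


# Hypothetical counts, chosen only to keep the arithmetic simple.
ppl, ppl1 = perplexities(logprob=-20.0, ntokens=9, nsentences=1, noov=0)
print(ppl, ppl1)
```

Note that ppl1 is always at least as large as ppl, since the same logprob is
divided by a smaller count.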

> My first question is about a slight inconsistency in the calculation
> of ppl1: the </s> probabilities are included in logprob, however their
> count is not included in the denominator.  Shouldn't we have a
> separate logprob total that excludes </s> for the ppl1 calculation?

No, because the idea is that sentence boundaries are arbitrary 
and only a construct used by the LM to assign probabilities to words.
So to compare two LMs that use a different sentence segmentation you
need to normalize by the number of words excluding the </s> (which differ),
but you need to include the probability assigned to </s> because they 
are part of the total probability the LMs assign to the complete word 
sequence.  e.g.:  P(a b c) = P(a) P(b | a) P(</s> | a b) P(c | a b <s>)
if the LM happens to require a sentence boundary between b and c.
Actually, that's an approximation because you really need to sum over 
all possible positions of sentence boundaries.

To compute the full probability summing over all segmentations
you need to run a "hidden event" N-gram model, implemented by
ngram -hidden-vocab (see man page).

> My second question is what exactly does zeroprobs give?

See above.  If prob = 0 the perplexity becomes undefined (or infinity),
so you need to remove them from the computation somehow (like OOVs).

> 
> My final question is on how to fairly compare two models which divide
> the same data into different numbers of tokens and have different OOV
> counts.  It seems like the change in the number of tokens can be dealt
> with comparing the probabilities assigned to the whole data set
> (logprob) rather than per token averages (ppl).  However the current
> output totally ignores the penalty that should be incurred from OOV
> tokens.  As an easy solution, one can designate a fixed penalty for
> each OOV token to be added to the logprob total.  It is not clear how
> that fixed penalty should be determined.  A better solution is to have
> a character-based model that assigns a non-zero probability to every
> word and maybe interpolate it with the token-based model.  I am not
> quite sure how this is possible in the srilm framework.

You cannot compare LMs with different OOV counts.  You need to create a 
model that assigns a nonzero probability to every event.  E.g., you 
could have a letter-probability model for OOVs.

As for comparing LMs with different numbers of tokens, that's easy.
You are really comparing the total probabilities assigned to the complete
observation sequence, however the various LMs choose to split up that 
sequence.  So look at the "logprob" output, not ppl.   If you want to 
report ppls just choose one token sequence as your reference and use that
number of tokens in the denominator of the ppl computation for ALL LMs
(you have to compute ppl from logprob yourself).
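
As a rough Python sketch of that last step (the function name and all
numbers are invented; it assumes both LMs scored exactly the same text with
no OOVs, and that logprob is base 10):

```python
def normalized_ppl(logprob, reference_tokens):
    """Perplexity computed against a fixed reference token count."""
    return 10 ** (-logprob / reference_tokens)


# Hypothetical totals for the same test set under two tokenizations.
reference_tokens = 1000        # e.g. the word count of the test set
word_lm_logprob = -2400.0      # LM over whole words
morph_lm_logprob = -2300.0     # LM over stem+ / +suffix tokens

print(normalized_ppl(word_lm_logprob, reference_tokens))
print(normalized_ppl(morph_lm_logprob, reference_tokens))
```

Because the denominator is the same for every LM, the ranking by this ppl is
the same as the ranking by logprob.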

Andreas 


From dyuret at ku.edu.tr  Thu Nov  1 00:49:51 2007
From: dyuret at ku.edu.tr (Deniz Yuret)
Date: Thu, 1 Nov 2007 09:49:51 +0200
Subject: OOV calculations
In-Reply-To: <200710312346.l9VNkoX07102@huge>
References: <cea871f80710310302s53235d17x16ee2278d2170451@mail.gmail.com>
	 <200710312346.l9VNkoX07102@huge>
Message-ID: <cea871f80711010049p5563bce5ib575ec42ab432dcd@mail.gmail.com>

Thank you.

> You cannot compare LMs with different OOV counts.  You need to create a
> model that assigns a nonzero probability to every event.  E.g., you
> could have a letter-probability model for OOVS.

As for your suggestion of creating a letter-probability model for OOVs
(and maybe interpolating it with the ngram model), are there any
tools/documentation in the srilm package that could be helpful?  If
not I think we can (1) go into the source code and figure out how to
create a new letter-probability LM, or (2) create an independent
letter-probability LM outside srilm and manually interpolate its
results with the -debug 2 output of ngram.

I am assuming here (maybe contrary to your suggestion) that we can
create a model that assigns a nonzero probability to every event by
interpolating a regular ngram model (with OOVs > 0) and a
letter-probability model.

deniz


On 11/1/07, Andreas Stolcke <stolcke at speech.sri.com> wrote:
>
> In message <cea871f80710310302s53235d17x16ee2278d2170451 at mail.gmail.com>you wro
> te:
> > Hi,
> >
> > We are working on language models for agglutinative languages where
> > the number of unique tokens is comparatively large, and dividing words
> > into morphemes is useful.  When such divisions are performed (e.g.
> > represent each compound word as two tokens: stem+ and +suffix), number
> > of unique tokens and the number of OOV tokens are reduced, however it
> > becomes difficult to compare two such systems with different OOV
> > counts.
> >
> > Thus I started looking carefully into the ngram output, and so far
> > here is what I have understood, please correct me if I am wrong:
> >
> > 1. logprob is the log of the product of the probabilities for all
> > non-oov tokens (including </s>).
>
> correct.
>
> > 2. ppl = 10^(-logprob / (ntokens - noov + nsentences))
>
> correct.
>
> > 3. ppl1 = 10^(-logprob / (ntokens - noov))
>
> correct.
>
> > 4. I am not quite sure what zeroprobs gives.
>
> Words that are in the vocabulary but get probability 0 in the LM.
> They are treated the same as OOVs for the purpose of perplexity computation.
>
> > My first question is about a slight inconsistency in the calculation
> > of ppl1: the </s> probabilities are included in logprob, however their
> > count is not included in the denominator.  Shouldn't we have a
> > separate logprob total that excludes </s> for the ppl1 calculation?
>
> No, because the idea is that sentence boundaries are arbitrary
> and only a construct used by the LM to assign probabilities to words.
> So to compare two LMs that use a different sentence segmentation you
> need to normalize by the number of words excluding the </s> (which differ),
> but you need to include the probability assigned to </s> because they
> are part of the total probability the LMs assign to the complete word
> sequence.  e.g.:  P(a b c) = P(a) P(b | a) P(</s> | a b) P(c | a b <s>)
> if the LM happens to require a sentence boundary between b and c.
> Actually, that's an approximation because you really need to sum over
> all possible positions of sentence boundaries.
>
> To compute the full probability summing over all segmentations
> you need to run a "hidden event" N-gram model, implemented by
> ngram -hidden-vocab (see man page).
>
> > My second question is what exactly does zeroprobs give?
>
> See above.  If prob = 0 the perplexity becomes undefined (or infinity),
> so you need to remove them from the computation somehow (like OOVs).
>
> >
> > My final question is on how to fairly compare two models which divide
> > the same data into different numbers of tokens and have different OOV
> > counts.  It seems like the change in the number of tokens can be dealt
> > with comparing the probabilities assigned to the whole data set
> > (logprob) rather than per token averages (ppl).  However the current
> > output totally ignores the penalty that should be incurred from OOV
> > tokens.  As an easy solution, one can designate a fixed penalty for
> > each OOV token to be added to the logprob total.  It is not clear how
> > that fixed penalty should be determined.  A better solution is to have
> > a character-based model that assigns a non-zero probability to every
> > word and maybe interpolate it with the token-based model.  I am not
> > quite sure how this is possible in the srilm framework.
>
> You cannot compare LMs with different OOV counts.  You need to create a
> model that assigns a nonzero probability to every event.  E.g., you
> could have a letter-probability model for OOVS.
>
> As for comparing LMs with different number of tokens, that's easy.
> You are really comparing the total probabilties assigned to the complete
> observation sequence, however the various LMs choose to split up that
> sequence.  So look at the "logprob" output, not ppl.   If you want to
> report ppls just choose one token sequence as your reference and use that
> number of tokens in the denominator of the ppl computation for ALL LMs
> (you have to compute ppl from logprob yourself).
>
> Andreas
>
>


From stolcke at speech.sri.com  Thu Nov  1 08:27:38 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 01 Nov 2007 08:27:38 PDT
Subject: OOV calculations 
In-Reply-To: Your message of Thu, 01 Nov 2007 09:49:51 +0200.
             <cea871f80711010049p5563bce5ib575ec42ab432dcd@mail.gmail.com> 
Message-ID: <200711011527.lA1FRcS18102@huge>


In message <cea871f80711010049p5563bce5ib575ec42ab432dcd at mail.gmail.com>you wrote:
> Thank you.
> 
> > You cannot compare LMs with different OOV counts.  You need to create a
> > model that assigns a nonzero probability to every event.  E.g., you
> > could have a letter-probability model for OOVS.
> 
> As for your suggestion of creating a letter-probability model for OOVs
> (and maybe interpolating it with the ngram model), are there any
> tools/documentation in the srilm package that could be helpful?  If
> not I think we can (1) go into the source code and figure out how to
> create a new letter-probability LM, or (2) create an independent
> letter-probability LM outside srilm and manually interpolate its
> results with the -debug 2 output of ngram.
> 
> I am assuming here (maybe contrary to your suggestion) that we can
> create a model that assigns a nonzero probability to every event by
> interpolating a regular ngram model (with OOVs > 0) and a
> letter-probability model.

Actually, I wasn't thinking of covering all words with a letter
probability model (which would be poor for non-OOV words) and
interpolating.  A more typical approach is to have a word LM with an
OOV token, and when you are inside the OOV you assign a probability to
the specific word by a letter LM.  So the total probability of
p(a b c), where "b" is an OOV, would be

	p(a | ...) p(OOV | a) p(b | OOV) p(c | a OOV)

and p(b | OOV) is given by a totally separate LM that operates in terms of letters.

Obviously this isn't implemented in SRILM at this point, but you can compute
total probabilities, perplexities, etc. by first running the word LM, then
the letter LM just on the OOVs in your test set, and adding the log
probabilities.
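
A minimal Python sketch of that bookkeeping (the function name and the
numbers are made up; it assumes the word LM has an explicit OOV token, so
that its logprob already contains p(OOV | ...) for each OOV position, and
that all values are base-10 logs):

```python
def total_logprob(word_lm_logprob, oov_letter_logprobs):
    """Add the letter-LM scores of the OOV words to the word-LM total.

    word_lm_logprob: log10 probability from the word LM, with every OOV
        scored as the OOV token.
    oov_letter_logprobs: one log10 p(word | OOV) per OOV occurrence,
        taken from the separate letter LM.
    """
    return word_lm_logprob + sum(oov_letter_logprobs)


# p(a | ...) = 0.1, p(OOV | a) = 0.01, p(c | a OOV) = 0.1  ->  log10 sum = -4
word_part = -1.0 + -2.0 + -1.0
# p(b | OOV) = 0.001 from the letter LM  ->  log10 = -3
letter_part = [-3.0]

print(total_logprob(word_part, letter_part))
```

Perplexities can then be computed from this total with the usual formula,
counting the OOV words in the denominator since they now have a probability.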

Andreas


From stolcke at speech.sri.com  Thu Nov  1 11:33:09 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 01 Nov 2007 11:33:09 PDT
Subject: Saving option for ngram-class 
In-Reply-To: Your message of Thu, 25 Oct 2007 04:13:03 -0700.
             <159333.1460.qm@web31611.mail.mud.yahoo.com> 
Message-ID: <200711011833.lA1IX9t17046@huge>


In message <159333.1460.qm at web31611.mail.mud.yahoo.com>you wrote:
> Hi,
>  I guess the -save options as implemented in ngram-class is not very useful. 

I agree.

> Typically, I'm not interesting in testing classes as appearing on the beginni
> ng of the clustering process, but rather in classes induced in final steps. I
> f the number of clustered words is high, the current option results in creati
> ng an enormous number of useless files.
> 
> It'd be much more practical if the user could explicitly set which classes wi
> th different granularity should be saved, or, alternatively, to have some -st
> artsave option which'd allow to start saving class files close to the end of 
> the clustering.
> 
> Would that be easy to implement?

The next release (due out soon) will have a new option 

       -save-maxclasses K
              Modifies  the  action  of -save so as to only start
              saving once the number of classes reaches K.   (The
              iteration  numbers embedded in filenames will start
              at 0 from that point.)

> 
> One more thing, is there an easy way how to find how many classes appear in p
> articular class file without writing a script? The number of iterations doesn
> 't say that directly and I'm not sure whether it can be computed as NUMBER_OF
> _WORDS_IN_THE_VOCAB - NUMBER_OF_ITERATIONS - NUMBER_OF_WORDS_IN_THE_NO_CLASS_
> VOCAB

You can get the number of classes from the class definition file with

gawk '{ print $1 }' | uniq | wc -l

This shouldn't be needed when using the -save-maxclasses option since you
specify the number of classes directly (and then each new saved file has
S fewer classes, where S is the argument to -save).
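
If you would rather not depend on the lines of one class being adjacent
(which uniq requires), the same count can be sketched in Python.  This
assumes, like the gawk command, that the class name is the first field on
each line; the class names in the sample are invented:

```python
def count_classes(lines):
    """Count distinct class names, taken as the first field of each line."""
    return len({line.split()[0] for line in lines if line.strip()})


# Invented sample in "class probability word" layout.
sample = [
    "CLASS-00001 0.5 monday",
    "CLASS-00002 1 red",
    "CLASS-00001 0.5 tuesday",
]
print(count_classes(sample))
```

For a real file you would pass the open file object, e.g.
count_classes(open(path)), where path is whatever class file you saved.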

Andreas 


From deliverable at gmail.com  Thu Nov  1 12:54:10 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Thu, 1 Nov 2007 22:54:10 +0300
Subject: x86-64
Message-ID: <D3A69B7F-C88C-4728-9B10-50303763BF1D@gmail.com>

Which platform should we use for an x86-64 build under Linux, on an  
Intel Xeon 64-bit CPU?

Cheers,
Alexy


From stolcke at speech.sri.com  Thu Nov  1 13:05:12 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 01 Nov 2007 13:05:12 PDT
Subject: x86-64 
In-Reply-To: Your message of Thu, 01 Nov 2007 22:54:10 +0300.
             <D3A69B7F-C88C-4728-9B10-50303763BF1D@gmail.com> 
Message-ID: <200711012005.lA1K5Ct04066@huge>


gnumake MACHINE_TYPE=i686-m64 

--Andreas

In message <D3A69B7F-C88C-4728-9B10-50303763BF1D at gmail.com>you wrote:
> Which platform should we use for a x86-64 build under Linux, on an  
> Intel Xenon 64-bit CPU?
> 
> Cheers,
> Alexy


From deliverable at gmail.com  Thu Nov  1 13:09:46 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Thu, 1 Nov 2007 23:09:46 +0300
Subject: x86-64 
In-Reply-To: <200711012005.lA1K5Ct04066@huge>
References: <200711012005.lA1K5Ct04066@huge>
Message-ID: <2DB91188-6892-49D5-879F-697E7CB7D1AD@gmail.com>

Found that too, was wondering whether -march=athlon64 is optimal for  
the Xeon?  Gentoo docs recommend

-mtune=nocona

Alexy

On Nov 1, 2007, at 11:05 PM, Andreas Stolcke wrote:

>
> gnumake MACHINE_TYPE=i686-m64
>
> --Andreas
>
> In message <D3A69B7F-C88C-4728-9B10-50303763BF1D at gmail.com>you wrote:
>> Which platform should we use for a x86-64 build under Linux, on an
>> Intel Xenon 64-bit CPU?
>>
>> Cheers,
>> Alexy
>


From stolcke at speech.sri.com  Thu Nov  1 13:13:46 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 01 Nov 2007 12:13:46 -0800
Subject: x86-64 
In-Reply-To: Your message of Thu, 01 Nov 2007 23:09:46 +0300.
             <2DB91188-6892-49D5-879F-697E7CB7D1AD@gmail.com> 
Message-ID: <200711012013.lA1KDkA16914@speech.sri.com>


In message <2DB91188-6892-49D5-879F-697E7CB7D1AD at gmail.com>you wrote:
> Found that too, was wondering whether -march=athlon64 is optimal for  
> the Xeon?  Gentoo docs recommend
> 
> -mtune=nocona

Feel free to change it to whatever you think gives the best results on your
machines.  -march=athlon64 was just something that made sense on ours.

Andreas


From deliverable at gmail.com  Thu Nov  1 13:56:35 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Thu, 1 Nov 2007 23:56:35 +0300
Subject: parallel ngram-count
Message-ID: <E9F47D5E-8138-49AA-9A7A-1101C276347F@gmail.com>

I see one quick way to parallelize ngram-count on an N-core box:

-- split file list into N sublists
-- launch N ngram-count instances, giving each its own sublist
-- merge counts

Is there any better way?
Cheers,
Alexy


From stolcke at speech.sri.com  Thu Nov  1 14:01:47 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 01 Nov 2007 13:01:47 -0800
Subject: parallel ngram-count 
In-Reply-To: Your message of Thu, 01 Nov 2007 23:56:35 +0300.
             <E9F47D5E-8138-49AA-9A7A-1101C276347F@gmail.com> 
Message-ID: <200711012101.lA1L1lA21308@speech.sri.com>


In message <E9F47D5E-8138-49AA-9A7A-1101C276347F at gmail.com>you wrote:
> I see one quick way to parallelize ngram-count on an N-core box:
> 
> -- split file list into N sublists
> -- launch N ngram-count instances, giving each its own sublist
> -- merge counts
> 
> Is there any better way?

That's what I would do.  Make sure you are not I/O bound when running
many ngram-count instances in parallel, and watch the memory usage.

Andreas 
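
A minimal sketch of the split / launch / merge scheme discussed above, in
Python for illustration only.  The function names are hypothetical and are
not part of SRILM; in practice each sublist would be handed to its own
ngram-count process, and the per-process count files would be merged
afterwards.  Here the per-worker counts are stood in for by in-memory
Counter objects.

```python
from collections import Counter
from typing import Iterable, List


def split_file_list(files: List[str], n: int) -> List[List[str]]:
    """Deal the files round-robin into n sublists.

    Some sublists are empty when n exceeds the number of files.
    """
    return [files[i::n] for i in range(n)]


def merge_counts(partials: Iterable[Counter]) -> Counter:
    """Sum the per-worker n-gram counts into a single table."""
    total: Counter = Counter()
    for part in partials:
        total.update(part)  # Counter.update adds counts, it does not overwrite
    return total


if __name__ == "__main__":
    # Hypothetical file names, split for a 2-core box.
    print(split_file_list(["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"], 2))

    # Two toy partial count tables, as two workers might produce them.
    worker1 = Counter({("the",): 2, ("the", "cat"): 1})
    worker2 = Counter({("the",): 1, ("cat",): 1})
    print(merge_counts([worker1, worker2]))
```

Round-robin dealing keeps the sublists within one file of each other in
length; if file sizes vary widely, balancing by total bytes may work better.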


From deliverable at gmail.com  Thu Nov  1 14:42:55 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Fri, 2 Nov 2007 00:42:55 +0300
Subject: incremental ngram-count
Message-ID: <B45A7F02-030F-4475-A6A4-83C1E65799F2@gmail.com>

A separate task I do on a corpus is computing a "running ngram  
count": for each "tick" size of a subset of the corpus, e.g. 10%,  
20%, etc., or every N files, or every file, show the *increase* in  
the number of ngrams.

Obviously, building sublists of files with a single file added each time and
rerunning ngram-count on each list is inefficient.  Do I have to get into
the C++ library for this, and if so, which classes should I use?  Basically,
I want to know which *new* ngrams are contributed by a given file, in
processing order.  I may also want to output them separately for inspection.

Cheers,
Alexy
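
To make the request concrete, here is a minimal sketch in plain Python (not
the SRILM C++ library) of the bookkeeping being asked for: process the files
in sequence and report which ngrams each one contributes for the first time.
All names are hypothetical.  It assumes each file has already been tokenized
into one list of words, and it ignores sentence boundaries and
sentence-start/end tags.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Ngram = Tuple[str, ...]


def ngrams(tokens: List[str], order: int) -> Iterator[Ngram]:
    """Yield every n-gram of length 1..order found in the token list."""
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def new_ngrams_per_file(
    files: Iterable[Tuple[str, List[str]]], order: int
) -> Iterator[Tuple[str, Set[Ngram]]]:
    """For each (name, tokens) pair, yield the n-grams not seen in earlier files."""
    seen: Set[Ngram] = set()
    for name, tokens in files:
        fresh = set(ngrams(tokens, order)) - seen
        seen |= fresh
        yield name, fresh


if __name__ == "__main__":
    corpus = [
        ("file1", ["a", "b"]),
        ("file2", ["a", "c"]),
    ]
    for name, fresh in new_ngrams_per_file(corpus, order=2):
        print(name, len(fresh), sorted(fresh))
```

The set of seen ngrams grows with the corpus, so for very large inputs
memory use is the main constraint of this approach.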


From deliverable at gmail.com  Fri Nov  2 02:40:54 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Fri, 2 Nov 2007 12:40:54 +0300
Subject: ngram-count progress
Message-ID: <DE81D4CD-10D9-4870-B0A9-B5C58CC26AC0@gmail.com>

Is there a way to make ngram-count report its progress, e.g. print a
dot on stderr every N processed input tokens?  (Or which C++ source would
one hack at?)

Cheers,
Alexy


From deliverable at gmail.com  Fri Nov  2 03:25:13 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Fri, 2 Nov 2007 13:25:13 +0300
Subject: order of options to ngram-count
Message-ID: <B31D0E4E-EB44-4F81-AAC2-17E616D427AF@gmail.com>

My impression is that ngram-count is sensitive to the order of  
options (1.5.4b from yesterday).

ngram-count -text - order 1 -write unigrams.count # takes up all  
memory, runs forever

ngram-count -order 1 -text - write unigrams.count # finishes up  
quickly in a fraction of memory

Does the former honor -order 1?  What's the rule here -- trailing
-write is honored anyways?  I now stick other flags, such as -tolower,
in front of -text - .

Cheers,
Alexy


From svp at zuzino.net.ru  Fri Nov  2 03:45:45 2007
From: svp at zuzino.net.ru (Sergey Protasov)
Date: Fri, 2 Nov 2007 13:45:45 +0300
Subject: cross-entropy with OOV
Message-ID: <150c31280711020345s4b76a1d4k1202b2cb4d9ce5da@mail.gmail.com>

Dear experts,

I need to compute entropy in the presence of OOV words.

For example, if we have dict_size different words in the training corpus,
then for the test corpus (per word):

entr2  = entr1 +
stats.numOOVs*log2(dict_size_train_corpora)/num_words_test_corpora

where

entr1 = log2(ppl1)

But in the C++ code (TextStats.cc) I don't know how to get
dict_size_train_corpora to compute this, where

dict_size_train_corpora = number_unigrams_train_corpora

Can anybody help?

Thanks in advance!
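
The formula above can be written out as a small Python sketch.  This is not
SRILM code; the function and parameter names are hypothetical and simply
mirror the symbols in the message (ppl1, numOOVs, the training dictionary
size, and the number of test words).

```python
import math


def entropy_with_oov_penalty(
    ppl1: float, num_oovs: int, dict_size_train: int, num_words_test: int
) -> float:
    """Per-word entropy with a uniform log2(dict_size) penalty for each OOV.

    entr1 = log2(ppl1)
    entr2 = entr1 + num_oovs * log2(dict_size_train) / num_words_test
    """
    entr1 = math.log2(ppl1)
    return entr1 + num_oovs * math.log2(dict_size_train) / num_words_test


if __name__ == "__main__":
    # Toy numbers: perplexity 8 (3 bits), 10 OOVs among 100 test words,
    # and a training dictionary of 1024 words (10 bits per OOV).
    print(entropy_with_oov_penalty(8.0, 10, 1024, 100))
```

With these toy numbers the penalty term is 10 * 10 / 100 = 1 bit, added to
the 3 bits from the perplexity.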


From ioparin at yahoo.co.uk  Fri Nov  2 05:22:20 2007
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Fri, 2 Nov 2007 12:22:20 +0000 (GMT)
Subject: order of options to ngram-count
In-Reply-To: <B31D0E4E-EB44-4F81-AAC2-17E616D427AF@gmail.com>
Message-ID: <915733.57043.qm@web25403.mail.ukl.yahoo.com>

Hi,

It looks like in your first ("runs forever") line you
forgot to put "-" right before the "order" option, so
ngram-count just skips this invalid option and builds
the default trigrams instead of unigrams. With
large data, that would take a long time.

--- Alexy Khrabrov <deliverable at gmail.com> wrote:

> My impression is that ngram-count is sensitive to the order of
> options (1.5.4b from yesterday).
> 
> ngram-count -text - order 1 -write unigrams.count # takes up all
> memory, runs forever
> 
> ngram-count -order 1 -text - write unigrams.count # finishes up
> quickly in a fraction of memory
> 
> Does the former honor -order 1?  What's the rule here -- trailing
> -write is honored anyways?  I now stick other flags, such as -tolower,
> in front of -text - .
> 
> Cheers,
> Alexy
> 


best regards,
Ilya




From deliverable at gmail.com  Fri Nov  2 07:48:10 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Fri, 2 Nov 2007 17:48:10 +0300
Subject: order of options to ngram-count
In-Reply-To: <915733.57043.qm@web25403.mail.ukl.yahoo.com>
References: <915733.57043.qm@web25403.mail.ukl.yahoo.com>
Message-ID: <00FBEA2E-34F7-49DB-8266-114FA829A2B8@gmail.com>

Ilya, thanks!  Umm, I retyped these lines from what I had run
before -- and there was a real -order 1 in the actual command.  In any
case, my control run shows OK now.

I understand there's a tradition here, but in case more GNU compliance is
desired, options with long names could optionally start with --.  :)

Cheers,
Alexy

On Nov 2, 2007, at 3:22 PM, ilya oparin wrote:

>
> It looks like in your first ("runs forever") line you
> forgot to put "-" right before the "order" option, so
> ngram-count just skips this invalid option and builds
> the default trigrams instead of unigrams. With
> large data, that would take a long time.


From stolcke at speech.sri.com  Fri Nov  2 08:05:31 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 02 Nov 2007 08:05:31 PDT
Subject: order of options to ngram-count 
In-Reply-To: Your message of Fri, 02 Nov 2007 13:25:13 +0300.
             <B31D0E4E-EB44-4F81-AAC2-17E616D427AF@gmail.com> 
Message-ID: <200711021505.lA2F5Vg06495@huge>


In message <B31D0E4E-EB44-4F81-AAC2-17E616D427AF at gmail.com>you wrote:
> My impression is that ngram-count is sensitive to the order of  
> options (1.5.4b from yesterday).
> 
> ngram-count -text - order 1 -write unigrams.count # takes up all  
> memory, runs forever

You have a typo in the above: "order" instead of "-order".

> 
> ngram-count -order 1 -text - write unigrams.count # finishes up  
> quickly in a fraction of memory
> 
> Does the former honor -order 1?  What's the rule here -- trailing
> -write is honored anyways?  I now stick other flags, such as -tolower,
> in front of -text - .

All SRILM programs are invariant to order of options.

--Andreas


From stolcke at speech.sri.com  Fri Nov  2 12:12:00 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 02 Nov 2007 12:12:00 PDT
Subject: SRILM 1.5.4 released
Message-ID: <200711021912.lA2JC0109493@huge>


The latest version of SRILM is downloadable from the usual place:
http://www.speech.sri.com/projects/srilm/download.html
A list of changes appears below.

Enjoy!

Andreas

-------------------------------------------------------------------------------

1.5.4   2 November 2007

        Functionality:

        * New option ngram-count -addsmooth for additive smoothing.
        A corresponding new discounting subclass "AddSmooth" is defined in 
        Discount.h.

        * New option ngram -server-port to start a "probability server"
        (based on a contribution by Elad Dinur).

        * WordLattice: print lattice name in warning messages.

        * lattice-tool -keep-unk option to preserve labels of OOV words in
        LM rescoring (currently works only for HTK lattices).

        * New option nbest-optimize -anti-refs and -anti-ref-weight to 
        decorrelate errors with another set of hypotheses.

        * New support in nbest-optimize for BLEU optimization and Powell search
        (from Jing Zheng).

        * New option ngram-class -save-maxclasses to start the saving of 
        intermediate results when a specified number of classes is reached
        (suggested by Shlomo Wavrow and Mats Svenson).

        Bugs:

        * Fixed incorrect reference output for test "nbest-rover-acoustic".

        * Fixed a possible problem with tests "ngram-class" and
        "ngram-count-lm-limit-vocab" in non-C locales.

        * nbest-lattice: Avoid aligning reference words with -dump-errors or
        -wer, which would cause crash because no lattice is being generated
        internally.

        * make-batch-counts, merge-batch-counts: be more portable by dynamically
        finding the right options to use with xargs.

        * add-pauses-to-pfsg: Avoid using a regular expression construct that
        causes a gawk error in UTF-8 locales.  However, to ensure this works
        correctly a gawk version of 3.1.5 should be used. See note in
        doc/README.linux.  If the test "make-ngram-pfsg" fails a workaround is
        to set LANG=C or LANG=en_US and avoid UTF-8.

        * Fixed an uninitialized member variable in the unary constructor for
        class File, which was causing garbage to be returned on the first
        getline().

        * common/Makefile.machine.macos: Updated Tcl linking instructions
        (from Chuck Wooters).
        
        * Makefile: exit immediately if any of the subdirectories result in
        build errors.


From stolcke at speech.sri.com  Fri Nov  2 16:43:57 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 02 Nov 2007 16:43:57 -0700
Subject: ngram-count progress
In-Reply-To: <DE81D4CD-10D9-4870-B0A9-B5C58CC26AC0@gmail.com>
References: <DE81D4CD-10D9-4870-B0A9-B5C58CC26AC0@gmail.com>
Message-ID: <472BB63D.2050103@speech.sri.com>

Alexy Khrabrov wrote:
> Is there a way to make ngram-count report its progress, e.g. print a 
> dot on stderr every N processed input tokens?  (Or which C++ would one 
> hack at?)
>
By modifying the source code, sure.

Andreas

> Cheers,
> Alexy


From deliverable at gmail.com  Sat Nov  3 03:32:59 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Sat, 3 Nov 2007 13:32:59 +0300
Subject: 1.5.4: empty ngram* on Mac OSX
Message-ID: <F4480F79-7681-4A01-9D21-8F0F6D21FC5E@gmail.com>

I downloaded 1.5.4b a day before the release and it built fine.  Now I
was trying to build 1.5.4 and the ngram* binaries have size 0.  I am
investigating this -- apparently LIBRARY changed to LIBRARIES in some,
but not all, places in the make files?

/s/src/srilm grep LIBR diff-b2-1.5.4
< $(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(LIBRARY)
 > $(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(LIBRARIES)
 > @make binaries depend on all $(LIBRARIES)
< $(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(LIBRARY)
 > $(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(LIBRARIES)
 > $(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(LIBRARY)

Cheers,
Alexy


From deliverable at gmail.com  Sat Nov  3 04:55:55 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Sat, 3 Nov 2007 14:55:55 +0300
Subject: Various LM stats and building speed/performance tradeoffs
Message-ID: <E2D9F38A-E7C2-4B4E-A954-8EB668580688@gmail.com>

Is there any table / review of the performance of various LMs
implemented in SRILM, versus the time needed to build them?  What are
the general considerations in choosing from SRILM's vast number of
LMs?  The SRILM paper is from 2002 -- what about everything added
after that?

Cheers,
Alexy 


From deliverable at gmail.com  Sat Nov  3 16:10:59 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Sun, 4 Nov 2007 02:10:59 +0300
Subject: 1.5.4: empty ngram* on Mac OSX 
In-Reply-To: <200711032152.lA3LqFb03796@speech.sri.com>
References: <200711032152.lA3LqFb03796@speech.sri.com>
Message-ID: <0D6F25F1-5251-4BFF-AB73-3A2F121E6558@gmail.com>

Indeed, that built the executables, but some tests show DIFFERS, notably

google-ngrams: stdout output DIFFERS.
google-ngrams: stderr output DIFFERS.

make-big-lm: stdout output DIFFERS.
make-big-lm: stderr output DIFFERS.

make-big-lm-kn: stdout output DIFFERS.
make-big-lm-kn: stderr output DIFFERS.

make-ngram-pfsg: stdout output DIFFERS.
make-ngram-pfsg: stderr output DIFFERS.

make-unigram-pfsg: stdout output DIFFERS.
make-unigram-pfsg: stderr output DIFFERS.

nbest-optimize-bleu: stdout output IDENTICAL (IEEE version).
nbest-optimize-bleu: stderr output DIFFERS.

nbest-rescore: stdout output DIFFERS.
nbest-rescore: stderr output DIFFERS.

nbest-rover: stdout output DIFFERS.
nbest-rover: stderr output DIFFERS.

nbest-rover-acoustic: no reference stdout output found.
nbest-rover-acoustic: stderr output DIFFERS.

nbest-rover-posteriors: no reference stdout output found.
nbest-rover-posteriors: stderr output DIFFERS.

ngram-count-abs: stdout output DIFFERS.
ngram-count-abs: stderr output DIFFERS.

ngram-count-gt: stdout output IDENTICAL.
ngram-count-gt: stderr output DIFFERS.

Everything else shows IDENTICAL.

BTW, 1.5.4 contains RCS subdirs not present in 1.5.4b -- so the patch
program checked out the file!

I can certainly send the full test output or anything else you might find
interesting!
Cheers,
Alexy

On Nov 4, 2007, at 12:52 AM, Andreas Stolcke wrote:

> [...]
> See if the patch below fixes your problem. [...]


From gelbart at icsi.berkeley.edu  Sun Nov  4 12:39:57 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Sun, 4 Nov 2007 12:39:57 -0800 (PST)
Subject: SRILM 1.5.4 build problem
In-Reply-To: <200710191800.l9JI0a620166@huge>
References: <200710191800.l9JI0a620166@huge>
Message-ID: <Pine.LNX.4.63.0711041224330.9886@lamb.ICSI.Berkeley.EDU>

Hello,

With SRILM 1.5.4 on Fedora 7, I am seeing many build errors caused by 
a link step being skipped.  Is anyone else seeing this?  The problem 
results in make output that looks like this:

/usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \
   -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64    -I. \
   -I../../include -c -g -O3 -o ../obj/i686/ngram-merge.o ngram-merge.cc
/root/srilm-1.5.4-build/sbin/decipher-install 0555 \
   ../bin/i686/ngram-merge ../../bin/i686
ERROR:  File to be installed (../bin/i686/ngram-merge) does not exist.

In addition to ngram-merge this also happens for ngram-count, 
ngram-class, and many others.  If I build SRILM 1.5.3 on the same 
machine, there is no error and the make output looks like this:

/usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \
   -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64    -I. \
   -I../../include   -c -g -O3 -o ../obj/i686/ngram-merge.o ngram-merge.cc
/usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \
   -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64    -I. \
   -I../../include   -u matherr -L../../lib/i686  -g -O3 \
   -o ../bin/i686/ngram-merge ../obj/i686/ngram-merge.o \
   ../obj/i686/liboolm.a -lm -ldl ../../lib/i686/libflm.a  \
   ../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a  -lm 2>&1 | c++filt
/root/srilm-1.5.3-build/sbin/decipher-install 0555 \
   ../bin/i686/ngram-merge ../../bin/i686

Regards,
David


From gelbart at icsi.berkeley.edu  Sun Nov  4 12:41:34 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Sun, 4 Nov 2007 12:41:34 -0800 (PST)
Subject: SRILM 1.5.4 build problem
In-Reply-To: <Pine.LNX.4.63.0711041224330.9886@lamb.ICSI.Berkeley.EDU>
References: <200710191800.l9JI0a620166@huge> <Pine.LNX.4.63.0711041224330.9886@lamb.ICSI.Berkeley.EDU>
Message-ID: <Pine.LNX.4.63.0711041240520.9886@lamb.ICSI.Berkeley.EDU>

By the way, I am using GNU Make 3.81.

On Sun, 4 Nov 2007, David Gelbart wrote:

> Hello,
>
> With SRILM 1.5.4 on Fedora 7, I am seeing many build errors caused by a link 
> step being skipped.  Is anyone else seeing this?  The problem results in make 
> output that looks like this:
>
> /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \
>  -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64    -I. \
>  -I../../include -c -g -O3 -o ../obj/i686/ngram-merge.o ngram-merge.cc
> /root/srilm-1.5.4-build/sbin/decipher-install 0555 \
>  ../bin/i686/ngram-merge ../../bin/i686
> ERROR:  File to be installed (../bin/i686/ngram-merge) does not exist.
>
> In addition to ngram-merge this also happens for ngram-count, ngram-class, 
> and many others.  If I build SRILM 1.5.3 on the same machine, there is no 
> error and the make output looks like this:
>
> /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \
>  -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64    -I. \
>  -I../../include   -c -g -O3 -o ../obj/i686/ngram-merge.o ngram-merge.cc
> /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \
>  -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64    -I. \
>  -I../../include   -u matherr -L../../lib/i686  -g -O3 \
>  -o ../bin/i686/ngram-merge ../obj/i686/ngram-merge.o \
>  ../obj/i686/liboolm.a -lm -ldl ../../lib/i686/libflm.a  \
>  ../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a  -lm 2>&1 | c++filt
> /root/srilm-1.5.3-build/sbin/decipher-install 0555 \
>  ../bin/i686/ngram-merge ../../bin/i686
>
> Regards,
> David
>


From barabbas at gmail.com  Sun Nov  4 13:16:23 2007
From: barabbas at gmail.com (Tian-Jian "Barabbas" Jiang@Gmail)
Date: Mon, 05 Nov 2007 05:16:23 +0800
Subject: SRILM 1.5.4 build problem
In-Reply-To: <Pine.LNX.4.63.0711041240520.9886@lamb.ICSI.Berkeley.EDU>
References: <200710191800.l9JI0a620166@huge> <Pine.LNX.4.63.0711041224330.9886@lamb.ICSI.Berkeley.EDU> <Pine.LNX.4.63.0711041240520.9886@lamb.ICSI.Berkeley.EDU>
Message-ID: <472E36A7.2040305@gmail.com>

I encountered the same problem on Mac OS X.

David Gelbart wrote:
> By the way, I am using GNU Make 3.81.
>
> On Sun, 4 Nov 2007, David Gelbart wrote:
>
>> Hello,
>>
>> With SRILM 1.5.4 on Fedora 7, I am seeing many build errors caused by 
>> a link step being skipped.  Is anyone else seeing this?  The problem 
>> results in make output that looks like this:
>>
>> /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \
>>  -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64    -I. \
>>  -I../../include -c -g -O3 -o ../obj/i686/ngram-merge.o ngram-merge.cc
>> /root/srilm-1.5.4-build/sbin/decipher-install 0555 \
>>  ../bin/i686/ngram-merge ../../bin/i686
>> ERROR:  File to be installed (../bin/i686/ngram-merge) does not exist.
>>
>> In addition to ngram-merge this also happens for ngram-count, 
>> ngram-class, and many others.  If I build SRILM 1.5.3 on the same 
>> machine, there is no error and the make output looks like this:
>>
>> /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \
>>  -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64    -I. \
>>  -I../../include   -c -g -O3 -o ../obj/i686/ngram-merge.o ngram-merge.cc
>> /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \
>>  -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64    -I. \
>>  -I../../include   -u matherr -L../../lib/i686  -g -O3 \
>>  -o ../bin/i686/ngram-merge ../obj/i686/ngram-merge.o \
>>  ../obj/i686/liboolm.a -lm -ldl ../../lib/i686/libflm.a  \
>>  ../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a  -lm 2>&1 | 
>> c++filt
>> /root/srilm-1.5.3-build/sbin/decipher-install 0555 \
>>  ../bin/i686/ngram-merge ../../bin/i686
>>
>> Regards,
>> David
>>
>


From gelbart at icsi.berkeley.edu  Sun Nov  4 20:42:52 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Sun, 4 Nov 2007 20:42:52 -0800 (PST)
Subject: SRILM 1.5.4 build problem
In-Reply-To: <Pine.LNX.4.63.0711041240520.9886@lamb.ICSI.Berkeley.EDU>
References: <200710191800.l9JI0a620166@huge> <Pine.LNX.4.63.0711041224330.9886@lamb.ICSI.Berkeley.EDU>
 <Pine.LNX.4.63.0711041240520.9886@lamb.ICSI.Berkeley.EDU>
Message-ID: <Pine.LNX.4.63.0711042023090.16635@lamb.ICSI.Berkeley.EDU>

I haven't been able to see where the problem has come from.  Between 
1.5.3 and 1.5.4, the only change I see to the link rule under the 
heading "# Program linking" in Makefile.common.targets is that 
"$(LIBRARY)" in the first line of the rule changed to "$(LIBRARIES)"

Here is the 1.5.4 make output again, but with make's --debug option in 
use:

  /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \
    -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64    -I. \
    -I../../include -c -g -O3 -o ../obj/i686/ngram.o ngram.cc
     Successfully remade target file `../obj/i686/ngram.o'.
     Must remake target `../bin/i686/ngram'.
     Successfully remade target file `../bin/i686/ngram'.
   Must remake target `../../bin/i686/ngram'.
   /root/srilm-1.5.4/sbin/decipher-install 0555 ../bin/i686/ngram  \
     ../../bin/i686
   ERROR:  File to be installed (../bin/i686/ngram) does not exist.

Regards,
David


From stolcke at speech.sri.com  Sun Nov  4 20:53:26 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 04 Nov 2007 20:53:26 PST
Subject: SRILM 1.5.4 build problem 
In-Reply-To: Your message of Sun, 04 Nov 2007 20:42:52 -0800.
             <Pine.LNX.4.63.0711042023090.16635@lamb.ICSI.Berkeley.EDU> 
Message-ID: <200711050453.lA54rQEW017409@dylan.speech.sri.com>

In message <Pine.LNX.4.63.0711042023090.16635 at lamb.ICSI.Berkeley.EDU>you wrote:
> I haven't been able to see where the problem has come from.  Between 
> 1.5.3 and 1.5.4, the only change I see to the link rule under the 
> heading "# Program linking" in Makefile.common.targets is that 
> "$(LIBRARY)" in the first line of the rule changed to "$(LIBRARIES)"
> 
> Here is the 1.5.4 make output again, but with make's --debug option in 
> use:
> 
>   /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \
>     -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64    -I. \
>     -I../../include -c -g -O3 -o ../obj/i686/ngram.o ngram.cc
>      Successfully remade target file `../obj/i686/ngram.o'.
>      Must remake target `../bin/i686/ngram'.
>      Successfully remade target file `../bin/i686/ngram'.
>    Must remake target `../../bin/i686/ngram'.
>    /root/srilm-1.5.4/sbin/decipher-install 0555 ../bin/i686/ngram  \
>      ../../bin/i686
>    ERROR:  File to be installed (../bin/i686/ngram) does not exist.

Frankly, I don't understand how this change can lead to the observed problem,
which I have not been able to duplicate on our machines.
But try changing line 104 of common/Makefile.common.targets to

$(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(filter-out -%, $(LIBRARIES))

If that doesn't work, try

$(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(LIBRARIES)

(the original) and report results back to me.
Be sure to "make cleanest" before each trial.

Thanks, and sorry for this inconvenience.

Andreas 
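
For readers unfamiliar with GNU make's text functions: the expression
$(filter-out -%, $(LIBRARIES)) drops every whitespace-separated word that
starts with "-", so linker flags such as -lm no longer appear among the
prerequisites of the pattern rule and only file paths remain.  The Python
sketch below merely emulates that filtering; it makes no claim about why the
link step was skipped.  The sample value of LIBRARIES is an assumption
pieced together from the 1.5.3 link line quoted earlier in this thread, not
the actual Makefile variable, and the function name is hypothetical.

```python
def filter_out_dash_words(words: str) -> str:
    """Emulate $(filter-out -%, WORDS): drop words that begin with '-'."""
    return " ".join(w for w in words.split() if not w.startswith("-"))


if __name__ == "__main__":
    # Assumed, illustrative value of $(LIBRARIES).
    libraries = (
        "../obj/i686/liboolm.a -lm -ldl ../../lib/i686/libflm.a "
        "../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a -lm"
    )
    print(filter_out_dash_words(libraries))
```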


From barabbas at gmail.com  Sun Nov  4 22:24:25 2007
From: barabbas at gmail.com (Barabbas Jiang@Gmail)
Date: Mon, 05 Nov 2007 14:24:25 +0800
Subject: SRILM 1.5.4 build problem
In-Reply-To: <200711050453.lA54rQEW017409@dylan.speech.sri.com>
References: <200711050453.lA54rQEW017409@dylan.speech.sri.com>
Message-ID: <472EB719.7030203@gmail.com>

Hi all,

Andreas Stolcke wrote:
> In message <Pine.LNX.4.63.0711042023090.16635 at lamb.ICSI.Berkeley.EDU>you wrote:
>   
>> I haven't been able to see where the problem has come from.  Between 
>> 1.5.3 and 1.5.4, the only change I see to the link rule under the 
>> heading "# Program linking" in Makefile.common.targets is that 
>> "$(LIBRARY)" in the first line of the rule changed to "$(LIBRARIES)"
>>
>> Here is the 1.5.4 make output again, but with make's --debug option in 
>> use:
>>
>>   /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit \
>>     -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64    -I. \
>>     -I../../include -c -g -O3 -o ../obj/i686/ngram.o ngram.cc
>>      Successfully remade target file `../obj/i686/ngram.o'.
>>      Must remake target `../bin/i686/ngram'.
>>      Successfully remade target file `../bin/i686/ngram'.
>>    Must remake target `../../bin/i686/ngram'.
>>    /root/srilm-1.5.4/sbin/decipher-install 0555 ../bin/i686/ngram  \
>>      ../../bin/i686
>>    ERROR:  File to be installed (../bin/i686/ngram) does not exist.
>>     
>
> Frankly, I don't understand how this change can lead to the observed problem,
> which I have not been able to duplicate on our machines.
> But try changing line 104 of common/Makefile.common.targets to
>
> $(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(filter-out -%, $(LIBRARIES))

This patch works for me on Mac OS X now!

Cheers,
/Mike/

From gelbart at icsi.berkeley.edu  Mon Nov  5 11:52:40 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Mon, 5 Nov 2007 11:52:40 -0800 (PST)
Subject: SRILM 1.5.4 build problem
In-Reply-To: <472EB719.7030203@gmail.com>
References: <200711050453.lA54rQEW017409@dylan.speech.sri.com>
 <472EB719.7030203@gmail.com>
Message-ID: <Pine.LNX.4.63.0711051152030.20256@lamb.ICSI.Berkeley.EDU>


>> Frankly, I don't understand how this change can lead to the observed problem,
>> which I have not been able to duplicate on our machines.
>> But try changing line 104 of common/Makefile.common.targets to
>>
>> $(BINDIR)/%$(EXE_SUFFIX): $(OBJDIR)/%$(OBJ_SUFFIX) $(filter-out -%, $(LIBRARIES))
>
> This patch works for me on Mac OS X now!

The patch works for me on Fedora 7 as well.

Regards,
David


From deliverable at gmail.com  Mon Nov  5 12:59:43 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Mon, 5 Nov 2007 23:59:43 +0300
Subject: running time estimate of -lm
Message-ID: <83AACD84-A6ED-4DA3-93B9-89A275C67C92@gmail.com>

I've launched ngram-count -order 2 -lm with a 1 billion word corpus a  
few days ago, and it's still going, after 4,600 minutes of CPU time  
(2.66 GHz Xeon 64-bit).  Originally it took about 8 GB of RAM, then  
decreased by about 25%, now is climbing back.  What is the overall  
running time estimate of -lm without any other options?  Simple runs  
for about 15 million words finished in about 15 minutes.

Cheers,
Alexy


From ioparin at yahoo.co.uk  Tue Nov  6 00:43:36 2007
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Tue, 6 Nov 2007 08:43:36 +0000 (GMT)
Subject: running time estimate of -lm
In-Reply-To: <83AACD84-A6ED-4DA3-93B9-89A275C67C92@gmail.com>
Message-ID: <102682.35796.qm@web25401.mail.ukl.yahoo.com>

Hi,

It's really worth using make-big-lm script (documented
in training-scripts section of the manual) for
training such huge models.

Ilya

--- Alexy Khrabrov <deliverable at gmail.com> wrote:

> I've launched ngram-count -order 2 -lm with a 1 billion word corpus a
> few days ago, and it's still going, after 4,600 minutes of CPU time
> (2.66 GHz Xeon 64-bit).  Originally it took about 8 GB of RAM, then
> decreased by about 25%, now is climbing back.  What is the overall
> running time estimate of -lm without any other options?  Simple runs
> for about 15 million words finished in about 15 minutes.
> 
> Cheers,
> Alexy
> 


best regards,
Ilya




From stolcke at speech.sri.com  Tue Nov  6 00:54:50 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 06 Nov 2007 00:54:50 -0800
Subject: running time estimate of -lm 
In-Reply-To: Your message of Tue, 06 Nov 2007 08:43:36 +0000.
             <102682.35796.qm@web25401.mail.ukl.yahoo.com> 
Message-ID: <200711060854.lA68soW10632@speech.sri.com>


Also, it isn't clear from the original message if counts were produced 
beforehand, or if ngram-count is in fact invoked directly on the 
billion-word corpus.  In the latter case it's no wonder it takes forever,
since it is probably paging itself to death.

Use make-batch-counts/merge-batch-counts, and make-big-lm as explained 
in the training-scripts(1) man page.

--Andreas

In message <102682.35796.qm at web25401.mail.ukl.yahoo.com>you wrote:
> Hi,
> 
> It's really worth using make-big-lm script (documented
> in training-scripts section of the manual) for
> training such huge models.
> 
> Ilya
> 
> --- Alexy Khrabrov <deliverable at gmail.com> wrote:
> 
> > I've launched ngram-count -order 2 -lm with a 1
> > billion word corpus a  
> > few days ago, and it's still going, after 4,600
> > minutes of CPU time  
> > (2.66 GHz Xeon 64-bit).  Originally it took about 8
> > GB of RAM, then  
> > decreased by about 25%, now is climbing back.  What
> > is the overall  
> > running time estimate of -lm without any other
> > options?  Simple runs  
> > for about 15 million words finished in about 15
> > minutes.
> > 
> > Cheers,
> > Alexy
> > 
> 
> 
> best regards,
> Ilya
> 
> 


From deliverable at gmail.com  Tue Nov  6 04:34:31 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Tue, 6 Nov 2007 15:34:31 +0300
Subject: running time estimate of -lm 
In-Reply-To: <200711060854.lA68soW10632@speech.sri.com>
References: <200711060854.lA68soW10632@speech.sri.com>
Message-ID: <A3102482-4958-4E43-86AD-629307EDDFF3@gmail.com>

Indeed, the counts were not precomputed.  However, there's enough
memory, and so far ngram-count has never used even half of the RAM for the
bigrams of a billion-word corpus.  No paging at all...  Is there any
hope it'll finish in a few days, or will I have to redo it following
training-scripts(1)?

Cheers,
Alexy

On Nov 6, 2007, at 11:54 AM, Andreas Stolcke wrote:

>
> Also, it isn't clear from the original message if counts were produced
> beforehand, or if ngram-count is in fact invoked directly on the
> billion-word corpus.  In that case it's no wonder it takes forever,
> since it is probably paging itself to death.
>
> Use make-batch-counts/merge-batch-counts, and make-big-lm as explained
> in the training-scripts(1) man page.
>
> --Andreas
>
> In message <102682.35796.qm at web25401.mail.ukl.yahoo.com>you wrote:
>> Hi,
>>
>> It's really worth using make-big-lm script (documented
>> in training-scripts section of the manual) for
>> training such huge models.
>>
>> Ilya
>>
>> --- Alexy Khrabrov <deliverable at gmail.com> wrote:
>>
>>> I've launched ngram-count -order 2 -lm with a 1
>>> billion word corpus a
>>> few days ago, and it's still going, after 4,600
>>> minutes of CPU time
>>> (2.66 GHz Xeon 64-bit).  Originally it took about 8
>>> GB of RAM, then
>>> decreased by about 25%, now is climbing back.  What
>>> is the overall
>>> running time estimate of -lm without any other
>>> options?  Simple runs
>>> for about 15 million words finished in about 15
>>> minutes.
>>>
>>> Cheers,
>>> Alexy
>>>
>>
>>
>> best regards,
>> Ilya
>>
>>
>


From deliverable at gmail.com  Tue Nov  6 09:20:44 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Tue, 6 Nov 2007 20:20:44 +0300
Subject: billion word -lm finished
Message-ID: <8EFD533C-A2E3-43A2-BAB1-6B3BC5804E0E@gmail.com>

I'm glad to report that the full -lm model of -order 2 over a billion  
words builds from scratch in about 100 CPU hours!

Cheers,
Alexy


From deliverable at gmail.com  Tue Nov  6 09:58:20 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Tue, 6 Nov 2007 20:58:20 +0300
Subject: billion word -lm finished 
In-Reply-To: <200711061728.lA6HSxO13767@huge>
References: <200711061728.lA6HSxO13767@huge>
Message-ID: <684FD65B-DF5A-492E-BF9A-2FCAB039AAF2@gmail.com>

CPU-optimized (right after make World) -- these are 1.5.4b binaries
from one day before the 1.5.4 release.  Compiled with -march=nocona
-mtune=nocona for the Xeons.  I ran it under time:

% time cat list | xargs cat | ngram-count -text - -order 2 -lm model-1
warning: discount coeff 1 is out of range: 0
cat list  0,00s user 0,00s system 0% cpu 6:41,85 total
xargs cat  0,66s user 15,31s system 2% cpu 11:23,54 total
ngram-count -text - -order 2 -lm model-1  350025,89s user 91,27s system 100% cpu 96:52:30,83 total

BTW, is the warning expected?  I always get it with a plain -lm
build from scratch.
Cheers,
Alexy

On Nov 6, 2007, at 8:28 PM, Andreas Stolcke wrote:

>
> What version of the binaries did you use ?
> Cpu or space-optimized (_c) ?
>
> It would have been good to run this with the unix "time" command
> to get real and cpu time statistics.
>
> --Andreas
>
> In message <8EFD533C-A2E3-43A2-BAB1-6B3BC5804E0E at gmail.com>you wrote:
>> I'm glad to report that the full -lm model of -order 2 over a billion
>> words builds from scratch in about 100 CPU hours!
>>
>> Cheers,
>> Alexy
>


From deliverable at gmail.com  Tue Nov  6 10:19:39 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Tue, 6 Nov 2007 21:19:39 +0300
Subject: ngram -server-port for an 8-bit encoding
Message-ID: <B84005D2-702B-439B-85B9-3E396462A898@gmail.com>

How do I use ngram -server-port with an 8-bit encoding?  Telnetting  
to the port cuts off the 8th bit...

Cheers,
Alexy


From stolcke at speech.sri.com  Tue Nov  6 10:28:03 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 06 Nov 2007 10:28:03 PST
Subject: ngram -server-port for an 8-bit encoding 
In-Reply-To: Your message of Tue, 06 Nov 2007 21:19:39 +0300.
             <B84005D2-702B-439B-85B9-3E396462A898@gmail.com> 
Message-ID: <200711061828.lA6IS3R20504@huge>


The following telnet options might be of interest:

     -8      Specifies an 8-bit data path.  This causes an attempt to negoti-
             ate the TELNET BINARY option on both input and output.

     -E      Stops any character from being recognized as an escape character.

     -L      Specifies an 8-bit data path on output.  This causes the BINARY
             option to be negotiated on output.

--Andreas

In message <B84005D2-702B-439B-85B9-3E396462A898 at gmail.com>you wrote:
> How do I use ngram -server-port with an 8-bit encoding?  Telnetting  
> to the port cuts off the 8th bit...
> 
> Cheers,
> Alexy


From stolcke at speech.sri.com  Tue Nov  6 11:12:55 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 06 Nov 2007 11:12:55 PST
Subject: SRILM 1.5.5 released
Message-ID: <200711061912.lA6JCtK25573@huge>


Seeing as the last release was beset by some portability issues I created
a new release.  Hopefully this one will cause less trouble.

--Andreas

1.5.5   6 November 2007

        Bug fixes:

        * Fixed Makefile problem in binaries depending on libraries that was
        preventing executables being generated on some platforms.

        * Fixed a compilation problem with MSVC for nbest-optimize.

        * Use MSVC _getpid() in ngram -generate random seed initialization.


From jadell at gps.tsc.upc.edu  Fri Nov  9 02:12:16 2007
From: jadell at gps.tsc.upc.edu (Jordi Adell)
Date: Fri, 09 Nov 2007 11:12:16 +0100
Subject: Including srilm *.a inside a .so
Message-ID: <47343280.1000900@gps.tsc.upc.edu>


Dear Andreas,

    I have recently started using the SRILM toolkit, which I think is a very 
useful and very well made tool. Congratulations.

    First, a side note: the documentation of the LM library does not 
explain that the order has to be specified in the constructor or by 
using the setorder() function. Therefore, if one does not take this into 
account when reading an LM file with the LM::read() function, the 
maximum order is always three.

    OK, now my question. I'm using SRILM inside a shared object, 
therefore I included it like this:

    g++ -shared 
-Wl,-z,muldefs,-whole-archive,-lflm,-llattice,-lmisc,-ldstruct,-loolm,-no-whole-archive 
-o lib.so

    This means that ALL symbols are included in the lib.so whether 
needed or not.

    In particular I have a problem with the Tcl_AppInit. If you compile 
the libraries with TCL option on, then tclmain.cc is included inside 
library libmisc.a

    $> nm libmisc.a | grep tclmain
       tclmain.o:

    And this object has two undefined symbols;
    tclmain.o:
         U Tcl_AppInit
         U Tcl_Main

    These symbols should be defined in the Tcl library; however, I'm using 
tcl8.4 and the Tcl_AppInit symbol is not defined there. tcl.h says 
this:
    /*
     * Convenience declaration of Tcl_AppInit for backwards compatibility.
     * This function is not *implemented* by the tcl library, so the storage
     * class is neither DLLEXPORT nor DLLIMPORT
     */
    #undef TCL_STORAGE_CLASS
    #define TCL_STORAGE_CLASS

    EXTERN int              Tcl_AppInit _ANSI_ARGS_((Tcl_Interp *interp));

    Therefore, if I try to use libmisc compiled with TCL inside the 
previously mentioned shared object lib.so, this error is given:
    ldd -d lib.so
    undefined symbol: Tcl_AppInit   (/home/lib.so)

    I noticed that if I compile the libraries with TCL OFF then this 
problem disappears because tclmain.o is not included in the library.
   

    I wonder whether this is how it should work or if it is a bug that 
could be fixed in the next SRILM version.

    I hope this is useful for somebody; it took me a while to understand 
why I couldn't include the libraries inside my .so. The key point with 
tcl8.4 is to compile SRILM without the TCL option. This means setting 
NO_TCL = X in the appropriate makefile in srilm/common/


    Best Regards.
    Good job!
   

-- 
_______________________________________________________________________
Jordi Adell Mercado

TALP Research Center
Signal and Communication Theory Dpt.
Universitat Politècnica de Catalunya (UPC)


c/Jordi Girona 1-3	e-mail: jadell at gps.tsc.upc.es	
Campus Nord D5-120 	web: http://gps-tsc.upc.es/veu/personal/jadell
08034 - Barcelona	phone: 93-401.16.27
________________________________________________________________________



From stolcke at speech.sri.com  Fri Nov  9 14:21:18 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 09 Nov 2007 14:21:18 PST
Subject: Including srilm *.a inside a .so 
In-Reply-To: Your message of Fri, 09 Nov 2007 11:12:16 +0100.
             <47343280.1000900@gps.tsc.upc.edu> 
Message-ID: <200711092221.lA9MLIF00143@huge>


> 
> Dear Andreas,
> 
>     I have recently started using the SRILM toolkit, which I think is a very
> useful and very well made tool. Congratulations.

Thanks, that's nice to hear !

> 
>     First, a side note: the documentation of the LM library does not
> explain that the order has to be specified in the constructor or by
> using the setorder() function. Therefore, if one does not take this into
> account when reading an LM file with the LM::read() function, the
> maximum order is always three.

The library is not well documented, as you know.
The fact that the ngram order defaults to 3 is documented in the 
ngram-count man page.
I certainly don't want to make any claims for the quality of the documentation
in general.

BTW, if anyone feels like improving the documentation (fixing or expanding it)
I'd be more than happy to accept submissions ...

>     OK, now my question. I'm using SRILM inside a shared object,
> therefore I included it like this:
> 
>     g++ -shared
> -Wl,-z,muldefs,-whole-archive,-lflm,-llattice,-lmisc,-ldstruct,-loolm,-no-whole-archive
> -o lib.so
> 
...
> 
>     I noticed that if I compile the libraries with TCL OFF then this
> problem disappears because tclmain.o is not included in the library.
> 
>     I wonder whether this is how it should work or if it is a bug that
> could be fixed in the next SRILM version.
> 
>     I hope this is useful for somebody; it took me a while to understand
> why I couldn't include the libraries inside my .so. The key point with
> tcl8.4 is to compile SRILM without the TCL option. This means setting
> NO_TCL = X in the appropriate makefile in srilm/common/

I would recommend just disabling the Tcl stuff to avoid the problem.
It's not important enough to track down this dynamic linker issue.

Andreas 


From deliverable at gmail.com  Sat Nov 10 10:05:05 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Sat, 10 Nov 2007 21:05:05 +0300
Subject: 7z as a much better archiver than gz/bz2
Message-ID: <56634EE6-9BC9-476C-B7FA-457584199559@gmail.com>

Greetings -- I've switched to 7z for most of my corpus compression, as
its output is several times smaller than gz's and 1.1-1.5 times
smaller than bz2's.  It would be nice to see it used more,
especially for the huge kind of things we do here.  E.g., a 4.0 GB LM
file was compressed by 7za (a command-line version for Linux) to 642
MB.  7za is multi-core CPU aware and knows all about locales and
encodings as well.

http://www.7-zip.org/

Cheers,
Alexy


From save.climate at gmail.com  Sat Nov 10 12:59:18 2007
From: save.climate at gmail.com (Kamadev Bhanuprasad)
Date: Sat, 10 Nov 2007 21:59:18 +0100
Subject: 7z as a much better archiver than gz/bz2
In-Reply-To: <56634EE6-9BC9-476C-B7FA-457584199559@gmail.com>
References: <56634EE6-9BC9-476C-B7FA-457584199559@gmail.com>
Message-ID: <244d59a50711101259v60acb8e8tf3743520f2d92aa6@mail.gmail.com>

Alexy,
 I strongly believe this mailing list is not suitable for spams like
this. If you want to present to public which compression utilities you
use or how much of cpu time your particular computation took, please,
use your personal blog or something like that. I'm pretty much sure
that vast majority of people in this list is not interested in
receiving messages of this kind.

Best,
 Kamadev

On Nov 10, 2007 7:05 PM, Alexy Khrabrov <deliverable at gmail.com> wrote:
> Greetings -- I've switched to 7z for most of corpora compression, as
> it gives results which are whole number of times better than gz, and
> 1.1-1.5 better than bz2.  Would be nice to see it used more,
> especially for the huge kind of things we do here.  E.g., a 4.0 GB lm
> file was compressed by 7za (a command line version for linux) to 642
> MB.  7za is multi-core CPU aware and knows all about locales and
> encodings as well.
>
> http://www.7-zip.org/
>
> Cheers,
> Alexy
>


From deliverable at gmail.com  Sat Nov 10 13:10:46 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Sun, 11 Nov 2007 00:10:46 +0300
Subject: 7z as a much better archiver than gz/bz2
In-Reply-To: <244d59a50711101259v60acb8e8tf3743520f2d92aa6@mail.gmail.com>
References: <56634EE6-9BC9-476C-B7FA-457584199559@gmail.com> <244d59a50711101259v60acb8e8tf3743520f2d92aa6@mail.gmail.com>
Message-ID: <0530AD1E-A6FB-4120-AEC1-4054731008D6@gmail.com>

(Kamadev -- I think you misunderstood my message.)  I was wondering
whether folks manage to use 7z to speed up access to their LMs.
By default, ngram reads gzipped files as well as uncompressed
ones.  Yet gzipped versions are still much larger than the
7-zipped ones.  7z is an open-source package with which I have no
affiliation...

Looking over the 7z options, I found that it can also extract a file to
stdout, e.g.

7z e archive.7z -so

It would be possible to do that for a huge LM and feed it to ngram by
piping to

ngram -lm -

-- yet the problem is, I already use

ngram -ppl -

to serve perplexities.

I would appreciate hearing other folks' experiences with speeding up the
loading of huge LMs.  The same could be applied to bz2 as well, and any
other archiver better than gz.

On Nov 10, 2007, at 11:59 PM, Kamadev Bhanuprasad wrote:
[...]


From runxin.li at gmail.com  Sat Nov 10 21:05:19 2007
From: runxin.li at gmail.com (Runxin Li)
Date: Sun, 11 Nov 2007 13:05:19 +0800
Subject: Re: 7z as a much better archiver than gz/bz2
In-Reply-To: <244d59a50711101259v60acb8e8tf3743520f2d92aa6@mail.gmail.com>
References: <56634EE6-9BC9-476C-B7FA-457584199559@gmail.com> <244d59a50711101259v60acb8e8tf3743520f2d92aa6@mail.gmail.com>
Message-ID: <47368d88.25bb720a.5e6c.0e1e@mx.google.com>

Totally agree with you

-----Original Message-----
From: owner-srilm-user at speech.sri.com [mailto:owner-srilm-user at speech.sri.com] On Behalf Of Kamadev Bhanuprasad
Sent: November 11, 2007 4:59
To: Alexy Khrabrov
Cc: srilm-user at speech.sri.com
Subject: Re: 7z as a much better archiver than gz/bz2

Alexy,
 I strongly believe this mailing list is not suitable for spams like
this. If you want to present to public which compression utilities you
use or how much of cpu time your particular computation took, please,
use your personal blog or something like that. I'm pretty much sure
that vast majority of people in this list is not interested in
receiving messages of this kind.

Best,
 Kamadev

On Nov 10, 2007 7:05 PM, Alexy Khrabrov <deliverable at gmail.com> wrote:
> Greetings -- I've switched to 7z for most of corpora compression, as
> it gives results which are whole number of times better than gz, and
> 1.1-1.5 better than bz2.  Would be nice to see it used more,
> especially for the huge kind of things we do here.  E.g., a 4.0 GB lm
> file was compressed by 7za (a command line version for linux) to 642
> MB.  7za is multi-core CPU aware and knows all about locales and
> encodings as well.
>
> http://www.7-zip.org/
>
> Cheers,
> Alexy
>


From stolcke at speech.sri.com  Sun Nov 11 08:27:29 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 11 Nov 2007 08:27:29 PST
Subject: 7z as a much better archiver than gz/bz2 
In-Reply-To: Your message of Sat, 10 Nov 2007 21:59:18 +0100.
             <244d59a50711101259v60acb8e8tf3743520f2d92aa6@mail.gmail.com> 
Message-ID: <200711111627.lABGRTE29698@huge>

In message <244d59a50711101259v60acb8e8tf3743520f2d92aa6 at mail.gmail.com> you wrote:
> Alexy,
>  I strongly believe this mailing list is not suitable for spams like
> this. If you want to present to public which compression utilities you
> use or how much of cpu time your particular computation took, please,
> use your personal blog or something like that. I'm pretty much sure
> that vast majority of people in this list is not interested in
> receiving messages of this kind.
> 
> Best,
>  Kamadev
> 
> On Nov 10, 2007 7:05 PM, Alexy Khrabrov <deliverable at gmail.com> wrote:
> > Greetings -- I've switched to 7z for most of corpora compression, as
> > it gives results which are whole number of times better than gz, and
> > 1.1-1.5 better than bz2.  Would be nice to see it used more,
> > especially for the huge kind of things we do here.  E.g., a 4.0 GB lm
> > file was compressed by 7za (a command line version for linux) to 642
> > MB.  7za is multi-core CPU aware and knows all about locales and
> > encodings as well.
> >
> > http://www.7-zip.org/
> >
> > Cheers,
> > Alexy
> >

I actually think that Alexy's message is relevant to this list, since 
managing large LMs is a nontrivial problem.

I had not heard of 7-zip before, and took a look.  It does seem to produce
slightly smaller files than bzip2, so it is definitely of interest for 
LM compression.  One drawback is longer compression times (the software even
uses multithreading on multi-cpu machines to speed that up).
But, in any case, it was easy enough to add reading/writing of 7z files
to the relevant library code.  You simply have to replace the attached two
files in $SRILM/misc/src.  BTW, I tested this with the Unix port of 7z found at
http://p7zip.sourceforge.net/ .  I have NOT tested it on Windows using the 
original 7-zip software.

Also, BTW, if you are concerned with LM reading/writing speed (and decent
"compression" compared to text format), I would recommend the binary LM format.

Andreas 

-------------- next part --------------
/*
    File:   zio.h
    Author: Andreas Stolcke
    Date:   Wed Feb 15 15:19:44 PST 1995
   
    Description:

    Copyright (c) 1994-2007, SRI International.  All Rights Reserved.

    RCS ID: $Id: zio.h,v 1.13 2007/11/11 16:06:53 stolcke Exp $
*/

/*
 *  $Log: zio.h,v $
 *  Revision 1.13  2007/11/11 16:06:53  stolcke
 *  7zip compression support
 *
 *  Revision 1.12  2006/08/04 23:59:09  stolcke
 *  MSVC portability
 *
 *  Revision 1.11  2006/03/28 01:15:10  stolcke
 *  include sys/signal.h to check for SIGPIPE
 *
 *  Revision 1.10  2006/03/06 05:46:43  stolcke
 *  define NO_ZIO in zio.h instead of zio.c
 *
 *  Revision 1.9  2006/03/01 00:45:45  stolcke
 *  allow disabling of zio for windows environment (NO_ZIO)
 *
 *  Revision 1.8  2005/12/16 23:30:09  stolcke
 *  added support for bzip2-compressed files
 *
 *  Revision 1.7  2003/02/21 20:18:53  stolcke
 *  avoid conflict if zopen is already defined in library
 *
 *  Revision 1.6  1999/10/13 09:07:13  stolcke
 *  make filename checking functions public
 *
 *  Revision 1.5  1995/06/22 19:58:26  stolcke
 *  ansi-fied
 *
 *  Revision 1.4  1995/06/12 22:56:37  tmk
 *  Added ifdef around the redefinitions of fopen() and fclose().
 *
 */

/*******************************************************************
   Copyright 1994 SRI International.  All rights reserved.
   This is an unpublished work of SRI International and is not to be
   used or disclosed except as provided in a license agreement or
   nondisclosure agreement with SRI International.
 ********************************************************************/


#ifndef _ZIO_H
#define _ZIO_H

#ifdef __cplusplus
extern "C" {
#endif

/* Include declarations files. */

#include <stdio.h>
#include <signal.h>		// to check for SIGPIPE

/* Avoid conflict with library function */
#ifdef HAVE_ZOPEN
#define zopen my_zopen
#endif

/* Constants */
#if !defined(SIGPIPE)
#define NO_ZIO
#endif

#ifdef NO_ZIO
# define COMPRESS_SUFFIX  ""
# define GZIP_SUFFIX	  ""
# define OLD_GZIP_SUFFIX  ""
# define BZIP2_SUFFIX	  ""
# define SEVENZIP_SUFFIX  ""
#else
# define COMPRESS_SUFFIX  ".Z"
# define GZIP_SUFFIX	  ".gz"
# define OLD_GZIP_SUFFIX  ".z"
# define BZIP2_SUFFIX	  ".bz2"
# define SEVENZIP_SUFFIX  ".7z"
#endif /* NO_ZIO */

/* Define function prototypes. */

int	stdio_filename_p (const char *name);
int	compressed_filename_p (const char *name);
int 	gzipped_filename_p (const char *name);
int 	bzipped_filename_p (const char *name);
int 	sevenzipped_filename_p (const char *name);

FILE *	zopen (const char *name, const char *mode);
int	zclose (FILE *stream);

/* Users of this header implicitly always use zopen/zclose in stdio */

#ifdef ZIO_HACK
#define fopen(name,mode)	zopen(name,mode)
#define fclose(stream)		zclose(stream)
#endif

#ifdef __cplusplus
}
#endif

#endif /* _ZIO_H */
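
The *_filename_p functions declared above all apply the same test: the
name must end in the suffix and be strictly longer than it.  A minimal
sketch of that shared check follows, assuming a helper named has_suffix,
which is hypothetical and not part of zio.h or zio.c:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not part of SRILM): returns 1 if name ends in
 * suffix and is strictly longer than it, 0 otherwise.  An empty suffix
 * never matches, which mirrors the NO_ZIO case where all suffixes are "".
 */
int
has_suffix (const char *name, const char *suffix)
{
    size_t len, slen;

    assert(name != NULL && suffix != NULL);

    len = strlen(name);
    slen = strlen(suffix);

    return slen > 0 && len > slen &&
           strcmp(name + len - slen, suffix) == 0;
}
```

With the suffix constants from zio.h, has_suffix(name, SEVENZIP_SUFFIX)
would be expected to give the same answer as sevenzipped_filename_p(name),
including under NO_ZIO, where every suffix is the empty string and
nothing matches.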

-------------- next part --------------
/*
    File:   zio.c
    Author: Andreas Stolcke
    Date:   Wed Feb 15 15:19:44 PST 1995
   
    Description:
                 Compressed file stdio extension
*/

#ifndef lint
static char Copyright[] = "Copyright (c) 1995-2007 SRI International.  All Rights Reserved.";
static char RcsId[] = "@(#)$Header: /home/srilm/devel/misc/src/RCS/zio.c,v 1.25 2007/11/11 16:06:53 stolcke Exp $";
#endif

/*
 * $Log: zio.c,v $
 * Revision 1.25  2007/11/11 16:06:53  stolcke
 * 7zip compression support
 *
 * Revision 1.24  2006/03/06 05:46:43  stolcke
 * define NO_ZIO in zio.h instead of zio.c
 *
 * Revision 1.23  2006/03/01 00:45:45  stolcke
 * allow disabling of zio for windows environment (NO_ZIO)
 *
 * Revision 1.22  2006/01/09 17:39:03  stolcke
 * MSVC port
 *
 * Revision 1.21  2006/01/05 19:32:42  stolcke
 * ms visual c portability
 *
 * Revision 1.20  2005/12/16 23:30:09  stolcke
 * added support for bzip2-compressed files
 *
 * Revision 1.19  2005/07/28 21:08:15  stolcke
 * include signal.h for portability
 *
 * Revision 1.18  2005/07/28 18:37:47  stolcke
 * portability for systems w/o pipes
 *
 * Revision 1.17  2004/01/31 01:17:51  stolcke
 * don't declare errno, get it from errno.h
 *
 * Revision 1.16  2003/11/09 21:09:11  stolcke
 * use gunzip -f to allow uncompressed files ending in .gz
 *
 * Revision 1.15  2003/11/01 06:18:30  stolcke
 * issue stdin/stdout warning only once
 *
 * Revision 1.14  1999/10/13 09:07:13  stolcke
 * make filename checking functions public
 *
 * Revision 1.13  1997/06/07 15:58:47  stolcke
 * fixed some gcc warnings
 *
 * Revision 1.13  1997/06/07 15:56:24  stolcke
 * fixed some gcc warnings
 *
 * Revision 1.12  1997/01/23 20:38:35  stolcke
 * *** empty log message ***
 *
 * Revision 1.11  1997/01/23 20:02:59  stolcke
 * handle SIGPIPE termination
 *
 * Revision 1.10  1997/01/22 07:52:08  stolcke
 * warn about multiple uses of -
 *
 * Revision 1.9  1996/11/30 21:08:59  stolcke
 * use exec in compress commands
 *
 * Revision 1.8  1995/07/19 16:51:31  stolcke
 * remove PATH assignment to account for local setup
 *
 * Revision 1.7  1995/06/22 20:47:16  stolcke
 * dup stdio descriptors so fclose won't disturb them
 *
 * Revision 1.6  1995/06/22 20:44:39  stolcke
 * return more error info
 *
 * Revision 1.5  1995/06/22 19:58:11  stolcke
 * ansi-fied
 *
 * Revision 1.4  1995/06/12 22:57:12  tmk
 * Added ifdef around the redefinitions of fopen() and fclose().
 *
 */

/*******************************************************************
   Copyright 1994,1997 SRI International.  All rights reserved.
   This is an unpublished work of SRI International and is not to be
   used or disclosed except as provided in a license agreement or
   nondisclosure agreement with SRI International.
 ********************************************************************/

#include <stdio.h>
#include <string.h>
#ifndef _MSC_VER
#include <unistd.h>
#include <sys/param.h>
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <signal.h>
#include <errno.h>

#ifndef MAXPATHLEN
#define MAXPATHLEN 1024
#endif

#include "zio.h"

#ifdef ZIO_HACK
#undef fopen
#undef fclose
#endif

#define STDIO_NAME	  "-"

#define STD_PATH    ":"   /* "PATH=/usr/bin:/usr/ucb:/usr/bsd:/usr/local/bin" */

#define COMPRESS_CMD	  "exec compress -c"
#define UNCOMPRESS_CMD	  "exec uncompress -c"

#define GZIP_CMD	  "exec gzip -c"
#define GUNZIP_CMD	  "exec gunzip -cf"

#define BZIP2_CMD	  "exec bzip2"
#define BUNZIP2_CMD	  "exec bunzip2 -c"

#define SEVENZIP_CMD	  "exec 7z a -si"
#define SEVENUNZIP_CMD	  "exec 7z x -so"

/*
 * Does the filename refer to stdin/stdout ?
 */
int
stdio_filename_p (const char *name)
{
    return (strcmp(name, STDIO_NAME) == 0);
}

/*
 * Does the filename refer to a compressed file ?
 */
int
compressed_filename_p (const char *name)
{
    unsigned len = strlen(name);

    return
	(sizeof(COMPRESS_SUFFIX) > 1) &&
	    (len > sizeof(COMPRESS_SUFFIX)-1) &&
		(strcmp(name + len - (sizeof(COMPRESS_SUFFIX)-1),
			COMPRESS_SUFFIX) == 0);
}

/*
 * Does the filename refer to a gzipped file ?
 */
int
gzipped_filename_p (const char *name)
{
    unsigned len = strlen(name);

    return 
	(sizeof(GZIP_SUFFIX) > 1) &&
	    (len > sizeof(GZIP_SUFFIX)-1) &&
		(strcmp(name + len - (sizeof(GZIP_SUFFIX)-1),
			GZIP_SUFFIX) == 0) ||
	(sizeof(OLD_GZIP_SUFFIX) > 1) &&
	    (len > sizeof(OLD_GZIP_SUFFIX)-1) &&
		(strcmp(name + len - (sizeof(OLD_GZIP_SUFFIX)-1),
			OLD_GZIP_SUFFIX) == 0);
}

/*
 * Does the filename refer to a bzipped file ?
 */
int
bzipped_filename_p (const char *name)
{
    unsigned len = strlen(name);

    return 
	(sizeof(BZIP2_SUFFIX) > 1) &&
	    (len > sizeof(BZIP2_SUFFIX)-1) &&
		(strcmp(name + len - (sizeof(BZIP2_SUFFIX)-1),
			BZIP2_SUFFIX) == 0);
}

/*
 * Does the filename refer to a 7-zip file ?
 */
int
sevenzipped_filename_p (const char *name)
{
    unsigned len = strlen(name);

    return 
	(sizeof(SEVENZIP_SUFFIX) > 1) &&
	    (len > sizeof(SEVENZIP_SUFFIX)-1) &&
		(strcmp(name + len - (sizeof(SEVENZIP_SUFFIX)-1),
			SEVENZIP_SUFFIX) == 0);
}

/*
 * Check file readability
 */
static int
readable_p (const char *name)
{
    int fd = open(name, O_RDONLY);

    if (fd < 0)
        return 0;
    else {
        close(fd);
	return 1;
    }
}

/*
 * Check file writability
 */
static int
writable_p (const char *name)
{
    int fd = open(name, O_WRONLY|O_CREAT, 0666);

    if (fd < 0)
        return 0;
    else {
        close(fd);
	return 1;
    }
}

/*
 * Open a stdio stream, handling special filenames
 */
FILE *zopen(const char *name, const char *mode)
{
    char command[MAXPATHLEN + 100];

    if (stdio_filename_p(name)) {
	/*
	 * Return stream to stdin or stdout
	 */
	if (*mode == 'r') {
		static int stdin_used = 0;
		static int stdin_warning = 0;
		int fd;

		if (stdin_used) {
		    if (!stdin_warning) {
			fprintf(stderr,
				"warning: '-' used multiple times for input\n");
			stdin_warning = 1;
		    }
		} else {
		    stdin_used = 1;
		}

		fd = dup(0);
		return fd < 0 ? NULL : fdopen(fd, mode);
	} else if (*mode == 'w' || *mode == 'a') {
		static int stdout_used = 0;
		static int stdout_warning = 0;
		int fd;

		if (stdout_used) {
		    if (!stdout_warning) {
			fprintf(stderr,
				"warning: '-' used multiple times for output\n");
			stdout_warning = 1;
		    }
		} else {
		    stdout_used = 1;
		}

		fd = dup(1);
		return fd < 0 ? NULL : fdopen(fd, mode);
	} else {
		return NULL;
	}
    } else {
	char *compress_cmd = NULL;
	char *uncompress_cmd = NULL;
	int zip_to_stdout = 1;
	
	if (compressed_filename_p(name)) {
	    compress_cmd = COMPRESS_CMD;
	    uncompress_cmd = UNCOMPRESS_CMD;
	} else if (gzipped_filename_p(name)) {
	    compress_cmd = GZIP_CMD;
	    uncompress_cmd = GUNZIP_CMD;
	} else if (bzipped_filename_p(name)) {
	    compress_cmd = BZIP2_CMD;
	    uncompress_cmd = BUNZIP2_CMD;
	} else if (sevenzipped_filename_p(name)) {
	    compress_cmd = SEVENZIP_CMD;
	    uncompress_cmd = SEVENUNZIP_CMD;
	    zip_to_stdout = 0;
	}

	if (compress_cmd != NULL) {
#ifdef NO_ZIO
	    fprintf(stderr, "Sorry, compressed I/O not available on this machine\n");
	    errno = EINVAL;
	    return NULL;
#else /* !NO_ZIO */
	    /*
	     * Return stream to compress pipe
	     */
	    if (*mode == 'r') {
		if (!readable_p(name))
		    return NULL;
		sprintf(command, "%s;%s %s", STD_PATH, uncompress_cmd, name);
		return popen(command, mode);
	    } else if (*mode == 'w') {
		if (!writable_p(name))
		    return NULL;
		if (zip_to_stdout) {
		    sprintf(command, "%s;%s >%s", STD_PATH, compress_cmd, name);
		} else {
		    /*
		     * This is necessary because the compression program might
		     * complain if a zero-length file already exists.
		     * However, it means that existing file owner & permission
		     * attributes are not preserved.
		     */
		    unlink(name);
		    sprintf(command, "%s;%s %s", STD_PATH, compress_cmd, name);
		}
		return popen(command, mode);
	    } else {
		return NULL;
	    }
#endif /* !NO_ZIO */
	} else {
	    return fopen(name, mode);
	}
    }
}

/*
 * Close a stream created by zopen()
 */
int
zclose(FILE *stream)
{
#ifdef NO_ZIO
     return fclose(stream);
#else /* !NO_ZIO */

    int status;
    struct stat statb;

    /*
     * pclose(), according to the man page, should diagnose streams not 
     * created by popen() and return -1.  however, on SGIs, it core dumps
     * in that case.  So we better be careful and try to figure out
     * what type of stream it is.
     */
    if (fstat(fileno(stream), &statb) < 0)
	return -1;

    /*
     * First try pclose().  It will tell us if stream is not a pipe
     */
    if ((statb.st_mode & S_IFMT) != S_IFIFO ||
        fileno(stream) == 0 || fileno(stream) == 1)
    {
        return fclose(stream);
    } else {
	status = pclose(stream);
	if (status == -1) {
	    /*
	     * stream was not created by popen(), but pclose() does fclose
	     * for us in this case.
	     */
	    return ferror(stream);
	} else if (status == SIGPIPE) {
	    /*
	     * It's normal for the uncompressor to terminate by SIGPIPE,
	     * i.e., if the user program closed the file before reaching
	     * EOF. 
	     */
	     return 0;
	} else {
	    /*
	     * The compressor program terminated with an error, and supposedly
	     * has printed a message to stderr.
	     * Set errno to a generic error code if it hasn't been set already.
	     */
	    if (errno == 0) {
		errno = EIO;
	    }
	    return status;
	}
    }
#endif /* NO_ZIO */
}

#ifdef STAND
#include <stdlib.h>	/* exit() */

int
main (int argc, char **argv)
{
    int dowrite = 0;
    char buffer[BUFSIZ];
    int nread;
    FILE *stream;

    if (argc < 3) {
	printf("usage: %s file {r|w}\n", argv[0]);
 	exit(2);
    }

    if (*argv[2] == 'r') {
	stream = zopen(argv[1], argv[2]);

	if (!stream) {
		perror(argv[1]);
		exit(1);
	}

	while (!ferror(stream) && !feof(stream) &&!ferror(stdout)) {
		nread = fread(buffer, 1, sizeof(buffer), stream);
		(void)fwrite(buffer, 1, nread, stdout);
	}
    } else {
	stream = zopen(argv[1], argv[2]);

	if (!stream) {
		perror(argv[1]);
		exit(1);
	}

	while (!ferror(stdin) && !feof(stdin) && !ferror(stream)) {
		nread = fread(buffer, 1, sizeof(buffer), stdin);
		(void)fwrite(buffer, 1, nread, stream);
	}
   }
   if (ferror(stdin)) {
	perror("stdin");
   } else if (ferror(stdout)) {
	perror("stdout");
   } else if (ferror(stream)) {
	perror(argv[1]);
   }
   zclose(stream);

   return 0;
}
#endif /* STAND */
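
zopen() above formats the shell command with sprintf() into a buffer of
MAXPATHLEN + 100 bytes and does not check the length of the file name
first.  A minimal sketch of a bounded alternative follows.  build_command
is a hypothetical helper, not part of SRILM, and it leaves out the
STD_PATH prefix and the output redirection that zopen() adds.  It also
does nothing about shell metacharacters in the name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of SRILM): formats "<cmd> <name>" into
 * buf.  Returns 0 on success, or -1 if the result would not fit, so the
 * caller can refuse to run a truncated command.
 */
int
build_command (char *buf, size_t bufsize, const char *cmd, const char *name)
{
    assert(buf != NULL && cmd != NULL && name != NULL);

    /* command + one space + name + terminating NUL must fit */
    if (strlen(cmd) + 1 + strlen(name) + 1 > bufsize)
        return -1;

    snprintf(buf, bufsize, "%s %s", cmd, name);
    return 0;
}
```

A caller could return NULL (with errno set to, say, ENAMETOOLONG) when
build_command() reports -1, instead of passing an incomplete command to
popen().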

From barabbas at gmail.com  Sun Nov 11 09:12:52 2007
From: barabbas at gmail.com (Tian-Jian "Barabbas" Jiang@Gmail)
Date: Mon, 12 Nov 2007 01:12:52 +0800
Subject: 7z as a much better archiver than gz/bz2
In-Reply-To: <200711111627.lABGRTE29698@huge>
References: <200711111627.lABGRTE29698@huge>
Message-ID: <47373814.3000507@gmail.com>

Hi all,

Andreas Stolcke wrote:
> In message <244d59a50711101259v60acb8e8tf3743520f2d92aa6 at mail.gmail.com> you wrote:
>   
> I actually think that Alexy's message is relevent to this list, since 
> managing large LMs is a nontrivial problem.
>   
Sorry for going off-topic, but I would like to suggest that a new 
interface for reading/writing data with SQLite might be a good alternative.

/Mike/


From dyuret at ku.edu.tr  Sat Nov 17 14:43:20 2007
From: dyuret at ku.edu.tr (Deniz Yuret)
Date: Sun, 18 Nov 2007 00:43:20 +0200
Subject: OOV calculations
In-Reply-To: <200711011527.lA1FRcS18102@huge>
References: <cea871f80711010049p5563bce5ib575ec42ab432dcd@mail.gmail.com>
	 <200711011527.lA1FRcS18102@huge>
Message-ID: <cea871f80711171443x5861dae8w3cc9d25388e75806@mail.gmail.com>

Hi,

I had some interesting observations while trying to build a letter
based model.  My text file contains a word on each line with letters
separated by spaces.

1. kndiscount gives an error for this data file even though
ukndiscount seems to work.  Is this a bug?

ngram-count -kndiscount -order 3 -lm foo3.lm -text turkish.train.oov.split
one of modified KneserNey discounts is negative
error in discount estimator for order 1

2. ukndiscount accepts the -interpolate option and in fact does better
with it.  According to the documentation only wbdiscount, cdiscount,
and kndiscount are supposed to work with interpolate.  I checked the
output with -debug 3 and all probabilities seem to add up to 1.  Is
the documentation out of date?

3. Training with -order k and then testing with -order n does not give
the same results as training with -order n and testing with -order n.
Is this normal?  Which discounting methods should give equal results?

deniz

On Nov 1, 2007 5:27 PM, Andreas Stolcke <stolcke at speech.sri.com> wrote:
>
> In message <cea871f80711010049p5563bce5ib575ec42ab432dcd at mail.gmail.com> you wrote:
> > Thank you.
> >
> > > You cannot compare LMs with different OOV counts.  You need to create a
> > > model that assigns a nonzero probability to every event.  E.g., you
> > > could have a letter-probability model for OOVS.
> >
> > As for your suggestion of creating a letter-probability model for OOVs
> > (and maybe interpolating it with the ngram model), are there any
> > tools/documentation in the srilm package that could be helpful?  If
> > not I think we can (1) go into the source code and figure out how to
> > create a new letter-probability LM, or (2) create an independent
> > letter-probability LM outside srilm and manually interpolate its
> > results with the -debug 2 output of ngram.
> >
> > I am assuming here (maybe contrary to your suggestion) that we can
> > create a model that assigns a nonzero probability to every event by
> > interpolating a regular ngram model (with OOVs > 0) and a
> > letter-probability model.
>
> Actually, I wasn't thinking of covering all words with a letter
> probability model (which would be poor for non-OOV words) and
> interpolating.  A more typical approach is to have a word LM with an
> OOV token, and when you are inside the OOV you assign a probability to
> the specific word by a letter LM.  So the total probability of
>
>         p(a b c) where "b" is an OOV would be
>
>         p(a | ...) p(OOV | a) p(b | OOV) p(c | a OOV)
>
> and p(b | OOV) is given by a totally separate LM that operates in terms of letters.
>
> Obviously this isn't implemented in SRILM at this point, but you can compute
> total probabilities, perplexities, etc. by first running the word LM, then
> the letter LM just on the OOVs in your test set, and adding the log
> probabilities.
>
> Andreas
>
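
The bookkeeping described in the quoted message can be sketched in a few
lines of Python.  This is only an illustration under assumptions: the
function name and arguments are hypothetical, the word-LM log probability
is assumed to already contain the p(OOV | ...) terms, and the perplexity
denominator (words plus sentence ends, with OOV words counted as scored
words) is one reasonable normalization, not something SRILM computes for
you.

```python
def combined_perplexity(word_lm_logprob, letter_lm_logprobs,
                        num_words, num_sentences):
    """Combine a word LM score with letter LM scores for the OOV words.

    word_lm_logprob    -- total log10 prob from the word LM (OOVs scored
                          as a single OOV token); hypothetical input
    letter_lm_logprobs -- one log10 prob per OOV word, from a separate
                          letter LM; hypothetical input
    """
    # Log probabilities add, since the probabilities multiply.
    total = word_lm_logprob + sum(letter_lm_logprobs)
    # Every word (including former OOVs) and every </s> is now scored.
    ppl = 10 ** (-total / (num_words + num_sentences))
    return total, ppl


# Made-up numbers, purely to show the call.
total, ppl = combined_perplexity(-20.0, [-3.0, -2.0],
                                 num_words=9, num_sentences=1)
print(total, ppl)
```

The two inputs would come from separate runs, one over the test set with
the word LM and one over just the OOV words with the letter LM.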


From stolcke at speech.sri.com  Sun Nov 18 08:29:04 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 18 Nov 2007 08:29:04 PST
Subject: Entropy Pruning 
In-Reply-To: Your message of Sun, 18 Nov 2007 01:37:37 +0900.
             <473F18D1.5050201@cslab.kecl.ntt.co.jp> 
Message-ID: <200711181629.lAIGT4D23492@huge>


In message <473F18D1.5050201 at cslab.kecl.ntt.co.jp>you wrote:
> Dear Dr. Stolcke,
> 
> Hello, I'm Daichi Mochihashi, a researcher in NTT Communication Science
> Labs, Japan.
> Until recently, I was involved in the language modeling team
> at ATR Spoken Language Communication Research Laboratories, perhaps you may
> know.
> 
> Lately, I developed a novel pruning method for variable-order ngrams
> and want to compare it with your entropy pruning as the Gold standard.
> However, in spite of the description in the SRILM paper, I found that
> the entropy pruning method is not implemented but replaced by
> a heuristic algorithm in the current SRILM distribution.

That is not correct.  The exact algorithm described in the paper is
implemented in Ngram::pruneProbs() in NgramLM.cc.  It is activated by
the ngram-count and ngram -prune options.

> 
> Is there any previous version of SRILM that supports entropy pruning?
> or could you kindly send me a version of VarNgram.cc or any code
> that you have used in the experiment?

VarNgram.cc was a research effort that performs pruning during the 
estimation step to eliminate redundant Ngrams from the start, using a
Hoeffding bound criterion, which happens not to work very well.

Note that the standard Ngram class supports "variable" N-gram models
already, since any mix of N-grams of different lengths is allowed.
So stick to the Ngram class, and do not use the ngram-count -varprune
option, which triggers the use of the VarNgram class.

I'm sorry that the naming of classes must have been confusing.

Andreas 


From stolcke at speech.sri.com  Sun Nov 18 21:25:54 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 18 Nov 2007 21:25:54 PST
Subject: OOV calculations 
In-Reply-To: Your message of Sun, 18 Nov 2007 00:43:20 +0200.
             <cea871f80711171443x5861dae8w3cc9d25388e75806@mail.gmail.com> 
Message-ID: <200711190525.lAJ5PsM15005@huge>


In message <cea871f80711171443x5861dae8w3cc9d25388e75806 at mail.gmail.com> you wrote:
> Hi,
> 
> I had some interesting observations while trying to build a letter
> based model.  My text file contains a word on each line with letters
> separated by spaces.
> 
> 1. kndiscount gives an error for this data file even though
> ukndiscount seems to work.  Is this a bug?
> 
> ngram-count -kndiscount -order 3 -lm foo3.lm -text turkish.train.oov.split
> one of modified KneserNey discounts is negative
> error in discount estimator for order 1

This is quite possible as the formulae used by the two methods differ.
Also, the count-of-count statistics used may be quite atypical given
that the number of distinct unigram types is limited and small for letter-based
models.

> 
> 2. ukndiscount accepts the -interpolate option and in fact does better
> with it.  According to the documentation only wbdiscount, cdiscount,
> and kndiscount are supposed to work with interpolate.  I checked the
> output with -debug 3 and all probabilities seem to add up to 1.  Is
> the documentation out of date?

ukndiscount was added later, and the man page was not fully updated, it seems.
ukndiscount is certainly supposed to support -interpolate .

> 3. Training with -order k and then testing with -order n does not give
> the same results as training with -order n and testing with -order n.
> Is this normal?  Which discounting methods should give equal results?

This is a known (and desired) feature of KN (original and modified)
discounting.  KN treats the highest-order N-grams differently from the
lower-order ones, and the lower-order N-grams are not supposed to be 
used by themselves.  The reason is that the lower-order estimates are
specifically chosen to work well as backoffs, not as standalone estimates.
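
To illustrate the point: in KN smoothing the lower-order estimates are
typically built from the number of distinct contexts a word follows,
rather than from its raw frequency.  A minimal Python sketch of that idea
follows; the function name and the toy data are made up, and this is not
SRILM's code.

```python
from collections import Counter


def continuation_counts(bigrams):
    """For each word, count the distinct left contexts it appears after."""
    return Counter(w for (_, w) in set(bigrams))


# Toy data: "francisco" is frequent but follows only one context.
bigrams = [("san", "francisco")] * 5 + [("the", "cat"), ("a", "cat")]
raw = Counter(w for (_, w) in bigrams)
cont = continuation_counts(bigrams)
print(raw["francisco"], cont["francisco"])
print(raw["cat"], cont["cat"])
```

A word that is frequent only after one context gets a small
continuation count, which is sensible for a backoff distribution but
misleading if the lower-order model is used on its own.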

Andreas 


From stolcke at speech.sri.com  Tue Nov 27 15:14:52 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 27 Nov 2007 15:14:52 PST
Subject: SRILM and LM network servers
Message-ID: <200711272314.lARNErT07252@huge>


FYI, there is now much enhanced support for network-based "LM servers"
in SRILM.  The main changes are:

        * New ngram -use-server option to run the client side of a network LM
        server as implemented by ngram -server-port.  Optionally, probabilities
        may be cached in the client (option -cache-served-ngrams).

        * New LMClient class to implement the above (a stub LM subclass that
        queries a server for LM probabilities).

        * ngram -server-port now behaves like a true server daemon: it handles
        multiple simultaneous or sequential clients, and never exits (unless
        killed).  The number of simultaneous clients may be limited with the
        -server-maxclients option.

This is still somewhat experimental, so I welcome any feedback.
If you want to give it a try, download the 1.5.6 (beta) version from the
SRILM download page.

An example and test of the functionality is in $SRILM/test/tests/ngram-server .

Andreas


From schwa717 at umn.edu  Wed Nov 28 15:15:00 2007
From: schwa717 at umn.edu (Lane Schwartz)
Date: Wed, 28 Nov 2007 17:15:00 -0600
Subject: Using and understanding LM file (with modified Kneser-Ney smoothing)
Message-ID: <D70681C5-A6FA-43A4-A4B8-93EACD91892B@umn.edu>

Hi,

I'm working on some machine translation code in which I'd like to
incorporate a language model. I'm trying to replicate the system  
described in David Chiang's 2005 ACL paper; in that paper, his  
language model is a trigram model which uses modified Kneser-Ney  
smoothing.

My goal is to train the LM using the SRILM toolkit, then use the  
generated LM file in my own code.

I've looked over Chen & Goodman (1998), and I think I understand the  
ideas, but I'm having some trouble understanding how to make sense of  
the numbers in the LM file (produced by ngram-count).

Any help would be greatly appreciated.

My training corpus is the first 10000 words of the English side of the  
de-en Europarl training corpus
(http://www.cs.umn.edu/research/nlp/mt/wmt06/europarl.de-en.en.gz),
which I have lowercased and converted to UTF-8. Again, my goal is a  
trigram language model which uses modified Kneser-Ney smoothing, and I  
want to use interpolation - here's what I did to get the LM file:

$ zcat europarl.de-en.en.gz | head -n 10000 | ngram-count -text - \
    -order 3 -kndiscount -interpolate -lm sample.srilm

Since I'm trying to understand how to apply the ngram probabilities  
and backoff-weights, I'm testing using a very simple test phrase:

echo "the man in" > sample.txt

Here are the (I think) relevant lines from the LM file:

unigrams:
-2.987062	</s>
-99	<s>	-1.142606
-1.73375	in	-0.660575
-3.960678	man	-0.1932579
-1.781734	the	-0.5241315

bigrams:
-0.8540089	<s> the	-0.3293318
-1.516293	man in
-3.496579	the man	-0.09554159

trigrams:
-0.6538057	the man in


I then ran the ngram tool to see what it does with this phrase:

$ ngram -lm sample.srilm -ppl sample.txt -debug 3
reading 10209 1-grams
reading 78195 2-grams
reading 20317 3-grams
the man in
         p( the | <s> )  = [2gram] 0.139956 [ -0.854009 ] / 1
         p( man | the ...)       = [2gram] 0.00014931 [ -3.82591 ] / 1
         p( in | man ...)        = [3gram] 0.221919 [ -0.653806 ] / 1
         p( </s> | in ...)       = [1gram] 0.000225094 [ -3.64764 ] / 1
1 sentences, 3 words, 0 OOVs
0 zeroprobs, logprob= -8.98136 ppl= 175.93 ppl1= 985.797

file sample.txt: 1 sentences, 3 words, 0 OOVs
0 zeroprobs, logprob= -8.98136 ppl= 175.93 ppl1= 985.797


I'd like to make sense of the above numbers.

The first line, p( the | <s> ), makes sense, since the bigram log prob  
for "<s> the" in sample.srilm is -0.8540089.

I'm getting stuck figuring out where -3.82591 comes from in p( man |  
the ...). It seems that the formula should be:
interpolated P( man | the ) = lambda_man*P(man) + (1 -  
lambda_man)*(lambda_man|the * P(man|the))

If the weights listed above are the lambdas in the above equation, that  
gives us the following (converting from log domain to regular domain  
as we go):

lambda_man = 10**(-0.1932579)
P(man) = 10**(-3.960678)
lambda_man|the = 10**(-0.09554159)
P(man|the) = 10**(-3.496579)

So my interpolated P( man | the ) calculation gives 0.000162027. The  
ngram util gave 0.00014931.


If anyone could help point out where I'm screwing up, it would be very  
much appreciated. Am I running with the appropriate parameters to  
ngram-count and ngram, given that I want an interpolated LM with  
modified Kneser-Ney smoothing (as used by Chiang(2005))? Does my  
equation above look right? I know this is a long email - thanks for  
your time and thoughts.

Thanks,
Lane Schwartz

University of Minnesota


From stolcke at speech.sri.com  Fri Nov 30 19:48:59 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 30 Nov 2007 19:48:59 -0800
Subject: Using and understanding LM file (with modified Kneser-Ney smoothing)
In-Reply-To: <D70681C5-A6FA-43A4-A4B8-93EACD91892B@umn.edu>
References: <D70681C5-A6FA-43A4-A4B8-93EACD91892B@umn.edu>
Message-ID: <4750D9AB.4050900@speech.sri.com>


Lane,

there is a key misunderstanding here.  The interpolation of higher- and 
lower-order probability estimates (triggered by ngram-count 
-interpolate) happens at training time, and the final probability 
estimates are then stored in the LM file.  Hence, no interpolation is 
required at test time.
In fact, all LMs in ARPA backoff format are handled exactly the same in 
testing.  The different smoothing methods only come in during training.
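
As an illustration of the test-time procedure, here is a rough Python
sketch of a plain backoff lookup applied to the LM excerpt quoted in
this thread.  The function names are made up, and the sketch assumes
that any N-gram not listed in the excerpt is absent from the model for
this sentence; it is not the SRILM implementation.

```python
# Log10 probabilities and backoff weights copied from the LM excerpt.
LOGPROB = {
    ("</s>",): -2.987062,
    ("<s>",): -99.0,
    ("in",): -1.73375,
    ("man",): -3.960678,
    ("the",): -1.781734,
    ("<s>", "the"): -0.8540089,
    ("man", "in"): -1.516293,
    ("the", "man"): -3.496579,
    ("the", "man", "in"): -0.6538057,
}
BACKOFF = {
    ("<s>",): -1.142606,
    ("in",): -0.660575,
    ("man",): -0.1932579,
    ("the",): -0.5241315,
    ("<s>", "the"): -0.3293318,
    ("the", "man"): -0.09554159,
}


def logprob(word, context):
    """Use the longest matching N-gram; otherwise back off."""
    ngram = context + (word,)
    if ngram in LOGPROB:
        return LOGPROB[ngram]
    if not context:
        raise KeyError(f"no unigram for {word!r}")
    # A missing backoff weight counts as 0 in the log domain.
    return BACKOFF.get(context, 0.0) + logprob(word, context[1:])


def score(words, order=3):
    """Return the log10 probability of each word plus the final </s>."""
    history = ("<s>",)
    per_word = []
    for w in list(words) + ["</s>"]:
        context = history[-(order - 1):] if order > 1 else ()
        per_word.append(logprob(w, context))
        history = history + (w,)
    return per_word


per_word = score("the man in".split())
total = sum(per_word)
ppl = 10 ** (-total / len(per_word))  # 3 words plus </s>
print([round(p, 5) for p in per_word], round(total, 5), round(ppl, 2))
```

With these numbers, p( man | the ...) is the backoff weight of "<s> the"
plus the bigram log probability of "the man", i.e. -0.3293318 +
-3.496579, which agrees with the -3.82591 printed by ngram.  Likewise
p( </s> | in ...) is the backoff weight of "in" plus the unigram log
probability of "</s>", about -3.64764.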

I hope this answers your question.

Andreas



From dyuret at ku.edu.tr  Mon Dec  3 02:38:29 2007
From: dyuret at ku.edu.tr (Deniz Yuret)
Date: Mon, 3 Dec 2007 12:38:29 +0200
Subject: Understanding lm-files and discounting
Message-ID: <cea871f80712030238r148279bdsf4664161e710a2a2@mail.gmail.com>

I spent last weekend trying to figure out the discrepancies between the
SRILM kn-discounting implementations and my earlier implementations.
Basically I am trying to go from the text file to the count file to
the model file to the probabilities assigned to the words in the test
file.  This took me on a journey from man pages to debug outputs to the
source code.  I figured
a lot of it out but it turned out to be nontrivial to go from paper
descriptions to the numbers in the ARPA ngram format to the final
probability calculations.  If you help me with a couple of things I
promise I'll write a man page detailing all discounting calculations
in SRILM.

1. Sometimes the model seems to use smaller ngrams even when longer
ones are in the training file.  An example from a letter model:

E i s e n h o w e r
       p( E | <s> )    = [2gram] 0.0122983 [ -1.91016 ] / 1
       p( i | E ...)   = [3gram] 0.0143471 [ -1.84324 ] / 1
       p( s | i ...)   = [4gram] 0.308413 [ -0.510867 ] / 1
       p( e | s ...)   = [5gram] 0.412852 [ -0.384206 ] / 1
       p( n | e ...)   = [6gram] 0.759049 [ -0.11973 ] / 1
       p( h | n ...)   = [7gram] 0.397406 [ -0.400766 ] / 1
       p( o | h ...)   = [4gram] 0.212227 [ -0.6732 ] / 1
       p( w | o ...)   = [3gram] 0.0199764 [ -1.69948 ] / 1
       p( e | w ...)   = [4gram] 0.165049 [ -0.782387 ] / 1
       p( r | e ...)   = [4gram] 0.222122 [ -0.653408 ] / 1
       p( </s> | r ...)        = [5gram] 0.492478 [ -0.307613 ] / 1
1 sentences, 10 words, 0 OOVs
0 zeroprobs, logprob= -9.28505 ppl= 6.98386 ppl1= 8.48213

This is an -order 7 model and the training file does have the word
Eisenhower.  So I don't understand why it goes back to using lower
order ngrams after the letter 'h'.

2. Not all (n-1)-grams have backoff weights in the model file, why?

3. What exactly does srilm do with google ngrams?  Can you give an
example usage?  Does it do things like extract a small subset useful
for evaluating a test file?

4. Since google-ngrams have all ngrams below count=40 missing, the kn
discount constants that rely on the number of ngrams with low counts
will fail.  Also I found that empirically the best highest order
discount constant is close to 40, not in the [0,1] range.  How does
srilm handle this?

5. Do I need to understand what the following messages mean to
understand the calculations:
warning: 7.65818e-10 backoff probability mass left for "" -- incrementing denominator
warning: distributing 0.000254455 left-over probability mass over all 124 words
discarded 254764 7-gram probs discounted to zero
inserted 2766 redundant 3-gram probs

best,
deniz


From stolcke at speech.sri.com  Mon Dec  3 22:07:11 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 03 Dec 2007 22:07:11 PST
Subject: Understanding lm-files and discounting 
In-Reply-To: Your message of Mon, 03 Dec 2007 12:38:29 +0200.
             <cea871f80712030238r148279bdsf4664161e710a2a2@mail.gmail.com> 
Message-ID: <200712040607.lB467Bt14601@huge>


In message <cea871f80712030238r148279bdsf4664161e710a2a2 at mail.gmail.com> you wrote:
> I spent last weekend trying to figure out the discrepancies between the
> SRILM kn-discounting implementations and my earlier implementations.
> Basically I am trying to go from the text file to the count file to
> the model file to the probabilities assigned to the words in the test
> file.  This took me on a journey from man pages to debug outputs to the
> source code.  I figured
> a lot of it out but it turned out to be nontrivial to go from paper
> descriptions to the numbers in the ARPA ngram format to the final
> probability calculations.  If you help me with a couple of things I
> promise I'll write a man page detailing all discounting calculations
> in SRILM.

A tutorial or FAQ including the information below would be most useful!

> 
> 1. Sometimes the model seems to use smaller ngrams even when longer
> ones are in the training file.  An example from a letter model:
> 
> E i s e n h o w e r
>        p( E | <s> )    = [2gram] 0.0122983 [ -1.91016 ] / 1
>        p( i | E ...)   = [3gram] 0.0143471 [ -1.84324 ] / 1
>        p( s | i ...)   = [4gram] 0.308413 [ -0.510867 ] / 1
>        p( e | s ...)   = [5gram] 0.412852 [ -0.384206 ] / 1
>        p( n | e ...)   = [6gram] 0.759049 [ -0.11973 ] / 1
>        p( h | n ...)   = [7gram] 0.397406 [ -0.400766 ] / 1
>        p( o | h ...)   = [4gram] 0.212227 [ -0.6732 ] / 1
>        p( w | o ...)   = [3gram] 0.0199764 [ -1.69948 ] / 1
>        p( e | w ...)   = [4gram] 0.165049 [ -0.782387 ] / 1
>        p( r | e ...)   = [4gram] 0.222122 [ -0.653408 ] / 1
>        p( </s> | r ...)        = [5gram] 0.492478 [ -0.307613 ] / 1
> 1 sentences, 10 words, 0 OOVs
> 0 zeroprobs, logprob= -9.28505 ppl= 6.98386 ppl1= 8.48213
> 
> This is an -order 7 model and the training file does have the word
> Eisenhower.  So I don't understand why it goes back to using lower
> order ngrams after the letter 'h'.

This is because the default "mincount" for N-grams longer than 2 words is 2,
meaning that a trigram, 4gram, etc. has to occur at least twice to be included
in the LM.
You can change this with the options

	-gt3min 1
	-gt4min 1
	etc.
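
As a rough illustration of what the mincount cutoff does, here is a
small Python sketch.  The function name and the toy counts are made up,
the default of 1 for unigrams and bigrams is an assumption, and the
sketch ignores any other rule that may cause an N-gram to be kept.

```python
def apply_mincounts(counts, mincounts=None, default_high=2):
    """Keep only N-grams whose count reaches the minimum for their order.

    counts    -- dict mapping N-gram tuples to counts
    mincounts -- dict mapping order to minimum count; orders not listed
                 fall back to default_high
    """
    if mincounts is None:
        mincounts = {1: 1, 2: 1}  # assumed defaults for short N-grams
    return {
        ngram: c
        for ngram, c in counts.items()
        if c >= mincounts.get(len(ngram), default_high)
    }


counts = {("h", "o"): 1, ("n", "h", "o"): 1, ("e", "n", "h"): 2}
print(sorted(apply_mincounts(counts)))
# Analogous to passing -gt3min 1: singleton trigrams are kept as well.
print(sorted(apply_mincounts(counts, {1: 1, 2: 1, 3: 1})))
```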


> 
> 2. Not all (n-1)-grams have backoff weights in the model file, why?

Backoff weights are only recorded for N-grams that appear as the prefix 
of a longer N-gram.  For all others the backoff weight is implicitly 1
(or 0, in log representation).  This convention saves a lot of space.
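
For anyone reading the LM file in their own code, a minimal Python
sketch of this convention follows.  It assumes tab-separated fields
(log probability, N-gram words, optional backoff weight) as in the
excerpts posted to this list; the function name is made up.

```python
def parse_arpa_line(line):
    """Parse one N-gram line: logprob <TAB> words [<TAB> backoff]."""
    fields = line.rstrip("\n").split("\t")
    logprob = float(fields[0])
    ngram = tuple(fields[1].split())
    # A missing backoff weight means log10(1) = 0.
    backoff = float(fields[2]) if len(fields) > 2 and fields[2] else 0.0
    return ngram, logprob, backoff


print(parse_arpa_line("-1.516293\tman in"))
print(parse_arpa_line("-3.496579\tthe man\t-0.09554159"))
```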

> 
> 3. What exactly does srilm do with google ngrams?  Can you give an
> example usage?  Does it do things like extract a small subset useful
> for evaluating a test file?

Google n-grams are not an LM format, they are a way to store N-gram counts
on disk, and the classes that implement N-gram counts know how to read them.
This is exercised by the ngram-count -read-google option.
However, due to their typical size it is not advisable to try to build 
backoff LMs of the standard sort, which would require reading all N-grams 
into memory (someone working at Google might actually be able to do this
if their hardware budget is as phenomenal as it must be).

Instead, I recommend estimating a deleted-interpolation-smoothed
"count LM", i.e., an LM that consists of only a small number of 
interpolation weights (for smoothing) as well as the raw N-gram counts
themselves.  This way we can in fact load only the portion of the counts
into memory that impinge on a given test set (triggered by the 
ngram -limit-vocab option).

There is no full example of this, but it is basically what you see in 
$SRILM/test/tests/ngram-count-lm-limit-vocab .  The only change would be 
that instead of a countlm file with the keyword "counts" you would
use the keyword "google-counts" followed by the path to the google count
directory root.  Read the man page sections for ngram-count -count-lm and 
ngram -count-lm  for more information, and follow the example under the test
directory.

> 
> 4. Since google-ngrams have all ngrams below count=40 missing, the kn
> discount constants that rely on the number of ngrams with low counts
> will fail.  Also I found that empirically the best highest order
> discount constant is close to 40, not in the [0,1] range.  How does
> srilm handle this?

The deleted interpolation method of smoothing I am recommending above does
not have a problem with the missing ngrams.

There is also a way to extrapolate from the available counts-of-counts above
some threshold to those below the threshold, due to an empirical law that
we found to hold for a range of corpora.  For details see the paper

W. Wang, A. Stolcke, & J. Zheng (2007), Reranking Machine Translation Hypotheses With Structured and Web-based Language Models. To appear in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto. 
http://www.speech.sri.com/cgi-bin/run-distill?papers/asru2007-mt-lm.ps.gz

The extrapolation method is implemented in the script
$SRILM/utils/src/make-kn-discounts.gawk and is automatically invoked if you use 
make-big-lm to build your LM.   Again, it is not feasible to do this on
the ngrams distributed by Google.

> 5. Do I need to understand what the following messages mean to
> understand the calculations:

Not really, they are for information only.

> warning: 7.65818e-10 backoff probability mass left for "" -- incrementing denominator

This means your unigram probabilities even after discounting sum to (almost) 1.
As a crude fallback, the denominator in the estimator is incremented to yield
usable backoff probability mass.

> warning: distributing 0.000254455 left-over probability mass over all 124 words

Here the backoff mass is 0.000254455 and is spread out over the 124 words that 
don't have any observed occurrences.

> discarded 254764 7-gram probs discounted to zero

Due to discounting cutoff (mincounts, see above) some 7-grams were not
included in the model.

> inserted 2766 redundant 3-gram probs

The ARPA format requires all prefixes of ngrams with probabilities to 
also have probabilities.  E.g., if "a b c" is in the model, so must "a b",
even if "a b" was not in the input ngram counts.  In such cases SRILM will
insert the "a b" probability but make it equal to what the backoff computation
would yield.
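
The prefix requirement can be checked with a few lines of Python.  This
is just a sketch with a made-up function name; it only finds the
prefixes that would have to be inserted and does not compute their
probabilities.

```python
def missing_prefixes(ngrams):
    """Return the proper prefixes of the given N-grams that are absent."""
    present = set(ngrams)
    missing = set()
    for ng in present:
        for k in range(1, len(ng)):
            if ng[:k] not in present:
                missing.add(ng[:k])
    return missing


model = {("a",), ("b",), ("c",), ("a", "b", "c")}
print(missing_prefixes(model))
```

For the toy model above the only missing prefix is "a b", matching the
example in the text.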

Andreas 


From wangc at csail.mit.edu  Tue Dec  4 04:49:13 2007
From: wangc at csail.mit.edu (Chao Wang)
Date: Tue, 04 Dec 2007 07:49:13 -0500
Subject: unsubscribe
Message-ID: <20071204074913.smcpvncpqe8k8k0g@imap.csail.mit.edu>

unsubscribe


From dyuret at ku.edu.tr  Mon Dec 10 07:41:16 2007
From: dyuret at ku.edu.tr (Deniz Yuret)
Date: Mon, 10 Dec 2007 17:41:16 +0200
Subject: Understanding lm-files and discounting
In-Reply-To: <200712040607.lB467Bt14601@huge>
References: <cea871f80712030238r148279bdsf4664161e710a2a2@mail.gmail.com>
	 <200712040607.lB467Bt14601@huge>
Message-ID: <cea871f80712100741y1bf9f839wf0f981dbfb1d9c38@mail.gmail.com>

Working on that documentation as promised.  Small question about the
mincounts: I was able to verify what you said with the default (gt)
discount, but with kn or ukndiscount some long ngrams with cnt=1 are
included in the model.  Since the counts are modified I thought maybe
it is looking at unmodified counts, but then there are some ngrams
excluded with regular count > 1 and kncount = 1.  So I couldn't quite
figure out what subset is included in the model with kndiscounting.

deniz




From stolcke at speech.sri.com  Mon Dec 10 13:51:51 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 11 Dec 2007 06:51:51 +0900
Subject: Understanding lm-files and discounting
In-Reply-To: <cea871f80712100741y1bf9f839wf0f981dbfb1d9c38@mail.gmail.com>
References: <cea871f80712030238r148279bdsf4664161e710a2a2@mail.gmail.com>	 <200712040607.lB467Bt14601@huge> <cea871f80712100741y1bf9f839wf0f981dbfb1d9c38@mail.gmail.com>
Message-ID: <475DB4F7.2000509@speech.sri.com>

Deniz Yuret wrote:
> Working on that documentation as promised.  Small question about the
> mincounts: I was able to verify what you said with the default (gt)
> discount, but with kn or ukndiscount some long ngrams with cnt=1 are
> included in the model.  Since the counts are modified I thought maybe
> it is looking at unmodified counts, but then there are some ngrams
> excluded with regular count > 1 and kncount = 1.  So I couldn't quite
> figure out what subset is included in the model with kndiscounting.
>   
I think what you're seeing can be explained by the following two facts:

1 - with KN discounting the mincounts are indeed applied to the modified 
lower-order counts.
2 - However (and this is true for all smoothing methods), if an ngram "a 
b c d" is included in the model based on its counts, then all prefixes 
of that ngram also need to be included (otherwise you'd have an empty 
first column in the lm file at those prefix ngrams, which is illegal).
So, if mincount is 2 for 4grams and 3grams, and "a b c d" occurs twice,
but (after count modification) "a b c" occurs only once, then "a b c" would
still be included in the LM.
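
The two facts above can be illustrated with a minimal Python sketch. This is not SRILM's code: the function name select_ngrams, the toy counts, and the mincounts table are all made up for illustration, and the counts are assumed to be the already-modified KN counts.

```python
def select_ngrams(counts, mincounts):
    """Keep n-grams that meet the mincount for their order, then add all prefixes.

    counts:    dict mapping word tuples to (hypothetical, already modified) counts
    mincounts: dict mapping n-gram order to its minimum count (default 1)
    """
    # Fact 1: apply the per-order mincount to the (modified) counts.
    kept = {ng for ng, c in counts.items() if c >= mincounts.get(len(ng), 1)}
    # Fact 2: every prefix of a kept n-gram must also be in the model.
    for ng in list(kept):
        for k in range(1, len(ng)):
            kept.add(ng[:k])
    return kept


# Toy data, invented for this example.
counts = {
    ("a",): 5,
    ("a", "b"): 3,
    ("a", "b", "c"): 1,        # below mincount, but a prefix of "a b c d"
    ("a", "b", "c", "d"): 2,   # meets the 4gram mincount
    ("x", "y", "z"): 1,        # below mincount and not needed as a prefix
}
mincounts = {3: 2, 4: 2}

model = select_ngrams(counts, mincounts)
print(sorted(model, key=len))
```

In this toy case "a b c" ends up in the model despite its count of 1, while "x y z" is dropped, which is the pattern described above.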

See if the above is in agreement with your observations.

Andreas

> deniz
>
>
>
> On Dec 4, 2007 8:07 AM, Andreas Stolcke <stolcke at speech.sri.com> wrote:
>   
>> In message <cea871f80712030238r148279bdsf4664161e710a2a2 at mail.gmail.com> you wrote:
>>     
>>> I spent last weekend trying to figure out the discrepancies between the
>>> SRILM kn-discounting implementations and my earlier implementations.
>>> Basically I am trying to go from the text file to the count file to
>>> the model file
>>> to the probabilities assigned to the words in the test file.  This took me on a
>>> journey from man pages to debug outputs to the source code.  I figured
>>> a lot of it out but it turned out to be nontrivial to go from paper
>>> descriptions to the numbers in the ARPA ngram format to the final
>>> probability calculations.  If you help me with a couple of things I
>>> promise I'll write a man page detailing all discounting calculations
>>> in SRILM.
>>>       
>> A tutorial or FAQ including the information below would be most useful!
>>
>>     
>>> 1. Sometimes the model seems to use smaller ngrams even when longer
>>> ones are in the training file.  An example from a letter model:
>>>
>>> E i s e n h o w e r
>>>        p( E | <s> )    = [2gram] 0.0122983 [ -1.91016 ] / 1
>>>        p( i | E ...)   = [3gram] 0.0143471 [ -1.84324 ] / 1
>>>        p( s | i ...)   = [4gram] 0.308413 [ -0.510867 ] / 1
>>>        p( e | s ...)   = [5gram] 0.412852 [ -0.384206 ] / 1
>>>        p( n | e ...)   = [6gram] 0.759049 [ -0.11973 ] / 1
>>>        p( h | n ...)   = [7gram] 0.397406 [ -0.400766 ] / 1
>>>        p( o | h ...)   = [4gram] 0.212227 [ -0.6732 ] / 1
>>>        p( w | o ...)   = [3gram] 0.0199764 [ -1.69948 ] / 1
>>>        p( e | w ...)   = [4gram] 0.165049 [ -0.782387 ] / 1
>>>        p( r | e ...)   = [4gram] 0.222122 [ -0.653408 ] / 1
>>>        p( </s> | r ...)        = [5gram] 0.492478 [ -0.307613 ] / 1
>>> 1 sentences, 10 words, 0 OOVs
>>> 0 zeroprobs, logprob= -9.28505 ppl= 6.98386 ppl1= 8.48213
>>>
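
As a cross-check on the summary line above, the totals can be recomputed from the per-token log10 values in the output. The sketch below assumes the usual convention that ppl normalizes by the number of scored words plus end-of-sentence tokens, while ppl1 normalizes by scored words only; the variable names are my own.

```python
# log10 probabilities copied from the bracketed values in the output above
log10_probs = [
    -1.91016, -1.84324, -0.510867, -0.384206, -0.11973, -0.400766,
    -0.6732, -1.69948, -0.782387, -0.653408, -0.307613,
]
sentences, words, oovs, zeroprobs = 1, 10, 0, 0

logprob = sum(log10_probs)
scored_words = words - oovs - zeroprobs

# ppl counts the </s> token of each sentence in the denominator; ppl1 does not.
ppl = 10 ** (-logprob / (scored_words + sentences))
ppl1 = 10 ** (-logprob / scored_words)

print(f"logprob={logprob:.5f} ppl={ppl:.5f} ppl1={ppl1:.5f}")
```

Under that assumption the recomputed values agree with the reported logprob, ppl, and ppl1 up to rounding of the printed per-token values.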
>>> This is an -order 7 model and the training file does have the word
>>> Eisenhower.  So I don't understand why it goes back to using lower
>>> order ngrams after the letter 'h'.
>>>       
>> This is because the default "mincount" for N-grams longer than 2 words is 2,
>> meaning that a trigram, 4gram, etc. has to occur at least twice to be included
>> in the LM.
>> You can change this with the options
>>
>>         -gt3min 1
>>         -gt4min 1
>>         etc.
>>
>>
>>     
>>> 2. Not all (n-1)-grams have backoff weights in the model file, why?
>>>       
>> Backoff weights are only recorded for N-grams that appear as the prefix
>> of a longer N-gram.  For all others the backoff weight is implicitly 1
>> (or 0, in log representation).  This convention saves a lot of space.
>>
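
To make the implicit-weight convention concrete, here is a minimal sketch of a standard backoff lookup in log10 space. It is not SRILM's implementation; the function name log10_prob and the toy model numbers are invented for illustration and are not normalized.

```python
def log10_prob(word, context, probs, bows):
    """Backoff lookup: probs and bows map word tuples to log10 values."""
    ngram = context + (word,)
    if ngram in probs:
        return probs[ngram]
    if not context:
        return float("-inf")  # word unknown even as a unigram
    # A context with no recorded backoff weight has weight 1, i.e. 0.0 in log10.
    bow = bows.get(context, 0.0)
    # Drop the oldest context word and retry with the shorter history.
    return bow + log10_prob(word, context[1:], probs, bows)


# Toy model with made-up values.
probs = {("a",): -1.0, ("b",): -0.5, ("c",): -0.7, ("a", "b"): -0.3}
bows = {("a",): -0.2}  # "a" is a prefix of "a b", so it carries a weight

with_bow = log10_prob("c", ("a",), probs, bows)     # uses the recorded weight
without_bow = log10_prob("c", ("b",), probs, bows)  # "b" has no weight: adds 0.0
print(with_bow, without_bow)
```

Only "a" needs a stored weight here because it is the only unigram that appears as the prefix of a longer N-gram; the lookup for context "b" falls through to the unigram probability unchanged.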
>>     
>>> 3. What exactly does srilm do with google ngrams?  Can you give an
>>> example usage?  Does it do things like extract a small subset useful
>>> for evaluating a test file?
>>>       
>> Google n-grams are not an LM format, they are a way to store N-gram counts
>> on disk, and the classes that implement N-gram counts know how to read them.
>> This is exercised by the ngram-count -read-google option.
>> However, due to their typical size it is not advisable to try to build
>> backoff LMs of the standard sort, which would require reading all N-grams
>> into memory (someone working at Google might actually be able to do this
>> if their hardware budget is as phenomenal as it must be).
>>
>> Instead, I recommend estimating a deleted-interpolation-smoothed
>> "count LM", i.e., an LM that consists of only a small number of
>> interpolation weights (for smoothing) as well as the raw N-gram counts
>> themselves.  This way we can in fact load only the portion of the counts
>> into memory that impinge on a given test set (triggered by the
>> ngram -limit-vocab option).
>>
>> There is no full example of this, but it is basically what you see in
>> $SRILM/test/tests/ngram-count-lm-limit-vocab .  The only change would be
>> that instead of a countlm file with the keyword "counts" you would
>> use the keyword "google-counts" followed by the path to the google count
>> directory root.  Read the man page sections for ngram-count -count-lm and
>> ngram -count-lm  for more information, and follow the example under the test
>> directory.
>>
>>     
>>> 4. Since google-ngrams have all ngrams below count=40 missing, the kn
>>> discount constants that rely on the number of ngrams with low counts
>>> will fail.  Also I found that empirically the best highest order
>>> discount constant is close to 40, not in the [0,1] range.  How does
>>> srilm handle this?
>>>       
>> The deleted interpolation method of smoothing I am recommending above does
>> not have a problem with the missing ngrams.
>>
>> There is also a way to extrapolate from the available counts-of-counts above
>> some threshold to those below the threshold, due to an empirical law that
>> we found to hold for a range of corpora.  For details see the paper
>>
>> W. Wang, A. Stolcke, & J. Zheng (2007), Reranking Machine Translation Hypotheses With Structured and Web-based Language Models. To appear in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto.
>> http://www.speech.sri.com/cgi-bin/run-distill?papers/asru2007-mt-lm.ps.gz
>>
>> The extrapolation method is implemented in the script
>> $SRILM/utils/src/make-kn-discounts.gawk and is automatically invoked if you use
>> make-big-lm to build your LM.   Again, it is not feasible to do this on
>> the ngrams distributed by Google.
>>
>>     
>>> 5. Do I need to understand what the following messages mean to
>>> understand the calculations:
>>>       
>> Not really, they are for information only.
>>
>>     
>>> warning: 7.65818e-10 backoff probability mass left for "" -- incrementing denominator
>>>       
>> This means your unigram probabilities, even after discounting, sum to (almost) 1.
>> As a crude fallback, the denominator in the estimator is incremented to yield
>> usable backoff probability mass.
>>
>>     
>>> warning: distributing 0.000254455 left-over probability mass over all 124 words
>>>       
>> Here the backoff mass is 0.000254455 and is spread out over the 124 words that
>> don't have any observed occurrences.
>>
>>     
>>> discarded 254764 7-gram probs discounted to zero
>>>       
>> Due to discounting cutoff (mincounts, see above) some 7-grams were not
>> included in the model.
>>
>>     
>>> inserted 2766 redundant 3-gram probs
>>>       
>> The ARPA format requires all prefixes of ngrams with probabilities to
>> also have probabilities.  E.g., if "a b c" is in the model, so must "a b",
>> even if "a b" was not in the input ngram counts.  In such cases SRILM will
>> insert the "a b" probability but make it equal to what the backoff computation
>> would yield.
>>
>> Andreas
>>
>>
>>     


From stolcke at speech.sri.com  Wed Dec 19 13:00:28 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 19 Dec 2007 13:00:28 PST
Subject: SRILM FAQ online
Message-ID: <200712192100.lBJL0Sk17257@huge>


A first cut at a Frequently Asked Question document for SRILM
is now available at

http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.html

This is very much a work in progress.  I would especially appreciate it 
if people sent me contributions to cover additional topics.

Enjoy,

Andreas