From nayeem at cs.iitm.ernet.in  Tue Jan  4 05:07:12 2005
From: nayeem at cs.iitm.ernet.in (Nayeem)
Date: Tue, 4 Jan 2005 18:37:12 +0530 (IST)
Subject: Rescoring query
In-Reply-To: <200412312046.MAA06884@tonga>
Message-ID: <Pine.LNX.4.44.0501041823120.32198-100000@mail-speech.cs.iitm.ernet.in>

Sir,

I have a language model (a long-span LM) that gives a sequence of 
probabilities, one for each word in a test sentence. I cannot write this 
language model in ARPA format. My question is: can I integrate these 
probabilities into a lattice or N-best lists for rescoring?

In detail, the problem is this: for a test sentence/utterance I can get a 
lattice or N-best list generated using HTK. For the same sentence 
(assuming I have the transcription) I can get the probabilities for each 
word in the sentence using my long-span LM. How can I integrate these 
scores and rescore the lattice or N-best list using the tools in the 
SRILM toolkit? What tools and options should I use?

Kindly help me in this regard. Any suggestions are welcome.

Any pointers to important papers on rescoring are also requested.


A. Nayeemulla Khan
Research scholar 
IIT Madras
India


From solen.quiniou at irisa.fr  Mon Jan 31 09:46:43 2005
From: solen.quiniou at irisa.fr (Solen Quiniou)
Date: Mon, 31 Jan 2005 18:46:43 +0100
Subject: once occuring trigram discarded
Message-ID: <41FE6F03.5040103@irisa.fr>

Hi,
I made a trigram model using modified Kneser-Ney smoothing and 
interpolation, and I don't understand why there are only 5828 trigrams in 
the model whereas there are 102520 trigrams in the corpus. I think that 
the discarded trigrams are the ones that occur just once, because there 
are 96692 trigrams occurring once, which is exactly the difference between 
the trigrams in the corpus and the trigrams in the model. I tried other 
smoothing methods and even no smoothing, but the trigrams are discarded 
every time.
I don't understand why, since the bigrams occurring once (there are 58764 
of them) are not discarded in the bigram model I built using 
modified Kneser-Ney smoothing and interpolation.

Thanks a lot for your answer.
Solen.

-- 
Solen Quiniou (Solen.Quiniou at irisa.fr)
Doctorante, Équipe IMADOC - bureau C303
IRISA-INRIA, Campus de Beaulieu
35042 Rennes cedex, France
Tél: +33 (0) 2 99 84 22 35
Fax: +33 (0) 2 99 84 71 71


From stolcke  Mon Jan 31 10:01:16 2005
From: stolcke (Andreas Stolcke)
Date: Mon, 31 Jan 2005 10:01:16 PST
Subject: once occuring trigram discarded 
Message-ID: <200501311801.j0VI1GH22094@speech.sri.com>


In message <41FE6F03.5040103 at irisa.fr>you wrote:
> Hi,
> I made a trigram model using modified Kneser-Ney smoothing and 
> interpolation, and I don't understand why there are only 5828 trigrams in 
> the model whereas there are 102520 trigrams in the corpus. I think that 
> the discarded trigrams are the ones that occur just once, because there 
> are 96692 trigrams occurring once, which is exactly the difference between 
> the trigrams in the corpus and the trigrams in the model. I tried other 
> smoothing methods and even no smoothing, but the trigrams are discarded 
> every time.
> I don't understand why, since the bigrams occurring once (there are 58764 
> of them) are not discarded in the bigram model I built using 
> modified Kneser-Ney smoothing and interpolation.

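The default minimum count for trigrams (and higher-order N-grams) is 2,
so trigrams that occur only once are left out of the model.
The default minimum count for unigrams and bigrams is 1, which is why
your singleton bigrams are kept.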

Use ngram-count -gt3min 1 to include all trigrams.

ngram-count -help displays the default values for all the options.
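
As a rough illustration of what a minimum-count cutoff does, here is a
toy Python sketch. It is not SRILM code: the two-sentence corpus and the
helper names extract_ngrams and apply_cutoff are made up for this
example, and it only mimics the filtering step, not the smoothing.

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Return all n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def apply_cutoff(counts, min_count):
    """Keep only n-grams whose count is at least min_count."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Hypothetical toy corpus; sentence boundary tags are added by hand here.
corpus = ["the cat sat on the mat", "the cat sat on the sofa"]

trigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    trigram_counts.update(extract_ngrams(tokens, 3))

kept_default = apply_cutoff(trigram_counts, 2)  # mimics the default trigram cutoff
kept_all = apply_cutoff(trigram_counts, 1)      # mimics -gt3min 1

print(len(trigram_counts), len(kept_default), len(kept_all))
```

In this toy corpus four of the eight distinct trigrams occur only once,
so a minimum count of 2 keeps four trigrams and a minimum count of 1
keeps all eight.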

--Andreas 


From stolcke at speech.sri.com  Thu Jan  6 23:25:16 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 06 Jan 2005 23:25:16 PST
Subject: Rescoring query 
In-Reply-To: Your message of Tue, 04 Jan 2005 18:37:12 +0530.
             <Pine.LNX.4.44.0501041823120.32198-100000@mail-speech.cs.iitm.ernet.in> 
Message-ID: <200501070725.XAA09190@huge>


In message <Pine.LNX.4.44.0501041823120.32198-100000 at mail-speech.cs.iitm.ernet.in> you wrote:
> Sir,
> 
> I have a language model (a long-span LM) that gives a sequence of 
> probabilities, one for each word in a test sentence. I cannot write this 
> language model in ARPA format. My question is: can I integrate these 
> probabilities into a lattice or N-best lists for rescoring?
> 
> In detail, the problem is this: for a test sentence/utterance I can get a 
> lattice or N-best list generated using HTK. For the same sentence 
> (assuming I have the transcription) I can get the probabilities for each 
> word in the sentence using my long-span LM. How can I integrate these 
> scores and rescore the lattice or N-best list using the tools in the 
> SRILM toolkit? What tools and options should I use?

The following approaches use the SRILM tools at a high level to
integrate LM scores that you compute externally.

For N-best lists:

1. Generate N-best lists using HTK, then convert them into the 3rd format
   described in nbest-format(5).  

2. Generate your own LM scores by whatever means, and store them into
   a separate directory.   For example, if one of your waveform names is
   abcde.wav, then format the corresponding N-best LM scores into a single
   column of numbers and put them in a file called DIR/abcde or
   DIR/abcde.gz (compressed).

3. Use the rescore-reweight script (see the nbest-scripts(1) man page) to
   combine the standard scores with your own and extract the 1-best
   hypotheses.

4. You can use the nbest-optimize(1) tool to tune the score combination
   weights on a held-out set.
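
The score combination in steps 3 and 4 can be pictured with a small
Python sketch. This is not what rescore-reweight actually runs; it is an
illustration under assumptions: each hypothesis line is taken to hold an
acoustic score, an LM score, a word count and then the words, the
external scores are one number per hypothesis, and the function names
and weights are hypothetical.

```python
def parse_hypothesis(line):
    """Split 'ascore lscore nwords w1 w2 ...' into its parts."""
    fields = line.split()
    return float(fields[0]), float(fields[1]), int(fields[2]), fields[3:]


def pick_best(nbest_lines, extra_scores, lm_weight=1.0, wt_weight=0.0,
              extra_weight=1.0):
    """Return the words of the hypothesis with the highest combined score."""
    best_score, best_words = None, None
    for line, extra in zip(nbest_lines, extra_scores):
        ascore, lscore, nwords, words = parse_hypothesis(line)
        # Weighted linear combination of all (log-scale) scores.
        total = (ascore + lm_weight * lscore + wt_weight * nwords
                 + extra_weight * extra)
        if best_score is None or total > best_score:
            best_score, best_words = total, words
    return best_words


# Hypothetical two-entry N-best list and made-up external LM scores.
nbest = ["-100.0 -20.0 3 the cat sat", "-98.0 -21.0 3 the cat sad"]
external = [-18.0, -30.0]

print(" ".join(pick_best(nbest, external, extra_weight=0.0)))
print(" ".join(pick_best(nbest, external, extra_weight=1.0)))
```

With the external score switched off the second hypothesis wins (-119
against -120); with a weight of 1.0 the first one wins (-138 against
-149). Tuning such weights on held-out data is the job of
nbest-optimize(1).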

For lattices:

1. Generate lattices using HTK.

2. Generate your own LM scores by whatever means and insert them into the 
   lattices.  You can either replace the original LM scores (by modifying
   the "l=" fields), or add them as a separate set of scores.
   In SRILM you can add the "x1=", "x2=", etc. fields to add arbitrary
   additional scores to lattice nodes or links.
   Note this assumes you can somehow compute LM scores on a word-by-word
   basis.  This might not be simple, especially if your LM is "long-span",
   and might require expanding the lattice etc.

3. Use lattice-tool(1) to combine the old and new scores in a weighted
   fashion and extract the 1-best hypotheses.
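
Inserting the extra scores in step 2 is mostly text processing. A
minimal Python sketch, assuming link lines in the HTK lattice start with
"J=" and that you already have one external score per link (the lattice
fragment, the scores and the function name add_extra_scores are all
hypothetical):

```python
def add_extra_scores(slf_lines, link_scores, field="x1"):
    """Append field=score to each link line (those starting with 'J=')."""
    out = []
    for line in slf_lines:
        if line.startswith("J="):
            link_id = int(line.split()[0][2:])  # "J=12" -> 12
            if link_id in link_scores:
                line = f"{line} {field}={link_scores[link_id]:.4f}"
        out.append(line)
    return out


# Hypothetical fragment of an HTK lattice and a made-up external score.
lattice = [
    "I=0 t=0.00",
    "I=1 t=0.35",
    "J=0 S=0 E=1 W=hello a=-210.5 l=-2.30",
]
scores = {0: -1.75}

for line in add_extra_scores(lattice, scores):
    print(line)
```

Node lines pass through unchanged; only the link line gains an "x1="
field. Computing a meaningful per-link score from a long-span LM is the
hard part and is not addressed by this sketch.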

The other possibility is to implement your LM in C++ as a LM class in
the SRILM framework.  This is a fair amount of work, would require some
study of the existing code, etc., but would ultimately allow you to use
your LM seamlessly in all the SRILM tools, for perplexity computation,
nbest rescoring, lattice expansion, etc.  (I'm assuming you probably do
NOT want to attempt this for now.)

> Kindly help me in this regard. Any suggestions are welcome.
> 
> Any pointers to important papers on rescoring are also requested.

The classic paper on rescoring is 

M. Ostendorf, A. Kannan, S. Austin, O. Kimball, R. Schwartz, J. R. Rohlicek,
Integration of diverse recognition methodologies through reevaluation of
N-best sentence hypotheses, Proceedings DARPA Workshop on Speech and
Natural Language, Pacific Grove, California, pp. 83-87,
Morgan Kaufmann Publishers Inc., San Francisco, CA, 1991.

The framework has been extremely popular since, and there have probably
been hundreds of other papers on it.

--Andreas 


From dtwitchell at cmi.arizona.edu  Wed Jan 26 09:33:10 2005
From: dtwitchell at cmi.arizona.edu (Twitchell, Doug)
Date: Wed, 26 Jan 2005 10:33:10 -0700
Subject: problems compiling on alpha
Message-ID: <270593C43CEE6E42A84D7F860469E9120290C7@grande.CMI.arizona.edu>

I am attempting to compile SRILM on the following machine:

Machine: HP/Compaq Alpha GS1280
OS: Tru64 Unix
Compiler: gcc 3.4.3
Make:  GNU make 3.80

Everything compiles cleanly except the "ngram" executable (which, of
course, is one that I need to use).  

This is the error it returns during the make:

g++ -mieee-with-inexact -Wreturn-type -Wimplicit -DINSTANTIATE_TEMPLATES
-I~/tcl/generic -I. -I/home4/u12/dtwitche/srilm/include   -u matherr
-L/home4/u12/dtwitche/srilm/lib/alpha  -g3 -O2 -o ../bin/alpha/ngram
../obj/alpha/ngram.o -L/home4/u12/dtwitche/srilm/lib/alpha
../obj/alpha/liboolm.a -lm -lflm -ldstruct -lmisc -L~/tcl/unix -ltcl -lm
2>&1 | c++filt
/usr/bin/ld:
../obj/alpha/liboolm.a(CacheLM.o): LHash<unsigned int,
double>::removedData: multiply defined
../obj/alpha/liboolm.a(CacheLM.o): global constructors keyed to
_ZN5LHashIjdE11removedDataE: multiply defined
../obj/alpha/liboolm.a(CacheLM.o): global destructors keyed to
_ZN5LHashIjdE11removedDataE: multiply defined
../obj/alpha/liboolm.a(CacheLM.o):
_GLOBAL__F__ZN5LHashIjdE11removedDataE: multiply defined
collect2: ld returned 1 exit status

Any ideas on how to resolve this?

Thanks,

Doug


From stolcke at speech.sri.com  Fri Feb  4 11:59:33 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 04 Feb 2005 11:59:33 PST
Subject: problems compiling on alpha 
In-Reply-To: Your message of Wed, 26 Jan 2005 10:33:10 -0700.
             <270593C43CEE6E42A84D7F860469E9120290C7@grande.CMI.arizona.edu> 
Message-ID: <200502041959.j14JxXC19256@huge>


I don't have access to an Alpha system anymore.

Your linker might require a flag to instruct it to merge multiple
definitions of the same symbol.  On Solaris that is ld -z muldefs,
and you would invoke the compiler with

	-Wl,-z,muldefs

Check your ld man page to find something similar.

--Andreas



From stolcke at speech.sri.com  Fri Jan 28 20:12:59 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 28 Jan 2005 20:12:59 PST
Subject: SRILM windows binaries sought
Message-ID: <200501290412.UAA16574@huge>


If someone on this list can kindly help with the request below, please 
email Fatma directly.

thanks

--Andreas 


------- Forwarded Message

Subject: Your help is appreciated!
Date: Sat, 29 Jan 2005 08:06:28 +0400
Message-ID: <32FAD195D3CB674DA423D79021F1C82BB34309 at exbe1.sharjah.uos.edu>
From: "Fatima Al Shamsi" <fshamsi at sharjah.ac.ae>
To: <stolcke at speech.sri.com>

Dear Dr. Stolcke,

I'm at the final stages of my thesis on Arabic information extraction.
I ran into a lot of errors while building the SRILM source files (I'm
working on the Windows platform and using the Cygwin package).  On the
other hand, I was able to build a lot of other software using the Cygwin
package without any problems.  Could you please send me just the
executable files so that I can run them to create and train a trigram
model?

I appreciate your help.

Thank you,

Yours,

Fatma al shamsi

University of Sharjah / UAE

fshamsi at sharjah.ac.ae


From mlebeau at stanford.edu  Tue Feb 22 18:15:45 2005
From: mlebeau at stanford.edu (Mike LeBeau)
Date: Tue, 22 Feb 2005 18:15:45 -0800
Subject: "format error in lattice file"?
Message-ID: <ae8e0a183c36947961abcf4e19368c59@stanford.edu>

Hi folks,

I've used lattice-tool to take a lattice file in HTK SLF format and 
convert it to a file in PFSG format. This seems to have worked: the 
newly created file appears to correspond to the format described in the 
pfsg-format man page. I used the following syntax to create the file:

lattice-tool -read-htk -in-lattice htklattice.lat -out-lattice pfsglattice.lat

However, I'm now trying to use this new lattice file as the input to 
nbest-lattice, to create an N-best list. Here's the syntax I'm trying 
to use:

nbest-lattice -read pfsglattice.lat -write-nbest nbest.txt

This gives me the error:

pfsglattice.lat: line 2: unknown keyword
format error in lattice file

Looking over the file, it seems fine to me, and since the LM tools 
themselves created this file, I assumed it would work. So what am I 
missing? Thanks for any help you can provide.

-mike


From stolcke at speech.sri.com  Tue Feb 22 18:53:11 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 22 Feb 2005 18:53:11 PST
Subject: "format error in lattice file"? 
In-Reply-To: Your message of Tue, 22 Feb 2005 18:15:45 -0800.
             <ae8e0a183c36947961abcf4e19368c59@stanford.edu> 
Message-ID: <200502230253.j1N2rB705117@huge>


In message <ae8e0a183c36947961abcf4e19368c59 at stanford.edu>you wrote:
> Hi folks,
> 
> I've used lattice-tool to take a lattice file in HTK SLF format and 
> convert it to a file in PFSG format. This seems to have worked: the 
> newly created file appears to correspond to the format described in the 
> pfsg-format man page. I used the following syntax to create the file:
> 
> lattice-tool -read-htk -in-lattice htklattice.lat -out-lattice pfsglattice.lat
> 
> However, I'm now trying to use this new lattice file as the input to 
> nbest-lattice, to create an N-best list. Here's the syntax I'm trying 
> to use:
> 
> nbest-lattice -read pfsglattice.lat -write-nbest nbest.txt

nbest-lattice is the wrong tool.  It makes lattices from nbest lists,
not the other way around ;-)  Also, it deals with yet another lattice 
format, which is described in wlat-format(5).

What you want is the lattice-tool -nbest-decode function.  This generates
nbest lists from lattices, and you don't even have to convert them first.
In fact, it is preferable to generate nbest directly from HTK lattices.

Make sure you have SRILM 1.4.3.

--Andreas 


From tanel.alumae at aqris.com  Wed Feb 23 00:11:00 2005
From: tanel.alumae at aqris.com (Tanel =?ISO-8859-1?Q?Alum=E4e?=)
Date: Wed, 23 Feb 2005 10:11:00 +0200
Subject: Interpolating with -lambda 1.0
Message-ID: <1109146260.28101.1.camel@markov>

Hello,

I'm a bit confused about interpolation.
I want to calculate the perplexity of a test text using different
interpolation weights (lambdas). Everything is OK until I set lambda to
1.0. Shouldn't I then get the same perplexity as when using only the base
language model? This doesn't seem to be the case:

$ ngram -lm trigram.arpa -ppl <testtxt> 
file <testtxt>: 2394 sentences, 29475 words, 1224 OOVs
0 zeroprobs, logprob= -86274.9 ppl= 653.583 ppl1= 1132.06

$ ngram -lm trigram.arpa -ppl <testtxt> -classes <classdefs>  -mix-lm
class-trigram.arpa -lambda 1.0
file <testtxt>: 2394 sentences, 29475 words, 1224 OOVs
0 zeroprobs, logprob= -85554.4 ppl= 619.144 ppl1= 1067.5

As shown, the perplexity is 653.583 when using the standalone trigram, and
619.144 when interpolating the trigram with the class trigram, using
lambda 1.0. Why are they not equal?

Both the word trigram and the class trigram are closed-vocabulary LMs, if
it matters.

Regards,

Tanel A.


From stolcke at speech.sri.com  Wed Feb 23 10:23:55 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 23 Feb 2005 10:23:55 PST
Subject: Interpolating with -lambda 1.0 
In-Reply-To: Your message of Wed, 23 Feb 2005 10:11:00 +0200.
             <1109146260.28101.1.camel@markov> 
Message-ID: <200502231823.KAA12013@tonga>


Try using -bayes 0 when running the interpolated model.
Without it, ngram will construct a merged ngram model in memory,
which does not work well when combining word and class-based models.

--Andreas



From stolcke at speech.sri.com  Wed Feb 23 16:34:29 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 23 Feb 2005 16:34:29 PST
Subject: "format error in lattice file"? 
In-Reply-To: Your message of Wed, 23 Feb 2005 15:36:36 -0800.
             <efce2afe95f32130966ac61bcc973bc6@stanford.edu> 
Message-ID: <200502240034.j1O0YTn18520@huge>


In message <efce2afe95f32130966ac61bcc973bc6 at stanford.edu>you wrote:
> Hi Andreas,
> 
> If I have lattices, which I have converted into n-best lists using 
> lattice-tool -nbest-decode, and I want to then compare the scoring 
> results of the original nbest from the lattice with an nbest that has 
> been rescored in a particular way, how would you recommend I go about 
> inputting each of them into compute-sclite?
> 
> Since the nbest list is not in one of the input forms for 
> compute-sclite, I'm not sure how to do this, yet I need the lattices in 
> n-best form in order to be able to rescore them in the manner I'm 
> planning. Does that make any sense?

You don't score the entire nbest lists.  You extract the 1best 
according to some linear weighting of the different model scores,
then score the 1best hypotheses you get.

Check the nbest-scripts(1) man page for a description of 
"rescore-reweight".

There is also a tool to optimize the weights on a held-out set.
Check the nbest-optimize(1) command.

--Andreas 


From tanel.alumae at aqris.com  Mon Feb 28 02:13:57 2005
From: tanel.alumae at aqris.com (Tanel =?ISO-8859-1?Q?Alum=E4e?=)
Date: Mon, 28 Feb 2005 12:13:57 +0200
Subject: Interpolating with -lambda 1.0
In-Reply-To: <200502231823.KAA12013@tonga>
References: <200502231823.KAA12013@tonga>
Message-ID: <1109585637.7158.7.camel@localhost.localdomain>

Hello,

The -bayes 0 switch didn't help, although it did change the calculated
perplexity values for lambda < 1.0.

However, I discovered that I had the following in my classdefs:
<s> 1 <s>
</s> 1 </s>

After removing those lines, the interpolated perplexity with -lambda 1.0
is equal to the perplexity of the pure word trigram, as expected.
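
For reference, here is a small Python sketch of the two relations
involved. It is my own illustration rather than SRILM code, and the
function names and the probabilities 0.02 and 0.5 are made up. It
assumes linear interpolation with lambda weighting the main (-lm) model,
and perplexity computed from the base-10 logprob over the scored tokens,
with (ppl) and without (ppl1) the end-of-sentence tokens.

```python
def interpolate(p_main, p_mix, lam):
    """Linear interpolation: lam is the weight of the main (-lm) model."""
    return lam * p_main + (1.0 - lam) * p_mix


def perplexities(logprob, sentences, words, oovs, zeroprobs=0):
    """Return (ppl, ppl1) from a base-10 log probability and token counts."""
    scored = words - oovs - zeroprobs
    ppl = 10.0 ** (-logprob / (scored + sentences))  # counts </s> tokens
    ppl1 = 10.0 ** (-logprob / scored)               # excludes </s> tokens
    return ppl, ppl1


# With lambda = 1.0 the mixture model contributes nothing.
print(interpolate(0.02, 0.5, 1.0))

# Counts and logprob taken from the word-trigram run quoted in this thread.
ppl, ppl1 = perplexities(-86274.9, sentences=2394, words=29475, oovs=1224)
print(round(ppl, 2), round(ppl1, 2))
```

With lambda = 1.0 the interpolated probability is just the main model's
probability, which is why the two perplexities should agree. Plugging in
the counts from the word-trigram run gives values that match the
reported ppl= 653.583 and ppl1= 1132.06 up to rounding.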

Regards,

Tanel A.




From vancrusoe at hotmail.com  Tue Mar 29 09:09:27 2005
From: vancrusoe at hotmail.com (zhou hao)
Date: Wed, 30 Mar 2005 01:09:27 +0800
Subject: about the ngram -hmm option
Message-ID: <BAY22-F15F2B1C4DDA07016D78AFAAB450@phx.gbl>

Hey,

I have a question about the ngram command. It comes with an option 
-hmm, which takes an HMM file as input. How can I create this file when 
I train the language model? Or should I write some code myself to 
generate it?

thanks
Crusoe



From stolcke at speech.sri.com  Wed Mar 30 00:01:45 2005
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 30 Mar 2005 00:01:45 PST
Subject: about the ngram -hmm option 
In-Reply-To: Your message of Wed, 30 Mar 2005 01:09:27 +0800.
             <BAY22-F15F2B1C4DDA07016D78AFAAB450@phx.gbl> 
Message-ID: <200503300801.j2U81l805125@huge>


In message <BAY22-F15F2B1C4DDA07016D78AFAAB450 at phx.gbl>you wrote:
> Hey,
> 
> I have a question about the ngram command. It comes with an option 
> -hmm, which takes an HMM file as input. How can I create this file when 
> I train the language model? Or should I write some code myself to 
> generate it?

You typically create the file by hand, thus SRILM comes with no
special tools for this.  However, if you are building a large HMM
structure it is best done by a program or script.

I hope you don't expect SRILM to include some kind of "mini HMM toolkit".
It doesn't.  ngram -hmm is meant for building simple models that
switch ngram distributions as they generate a sentence.

--Andreas