From stolcke at speech.sri.com  Mon Apr  5 12:01:00 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 05 Apr 2004 12:01:00 PDT
Subject: linear interpolation process 
In-Reply-To: Your message of Fri, 02 Apr 2004 14:05:01 +0100.
             <003c01c418b3$1ca4ee00$0800000a@speechasus> 
Message-ID: <200404051901.MAA02078@huge>


In message <003c01c418b3$1ca4ee00$0800000a at speechasus>you wrote:
> 
> Hi!
> 
> I have a question about linear interpolation process executed by ngram
> command in SRILM.
> What's the main difference between dynamic interpolation (using -bayes) and
> static interpolation?
> I tried both but I'm getting a big difference in perplexity values: for
> instance, 314 against 246.
> If we do static interpolation one can use -write-lm to produce a file with
> the interpolated model. However, with the dynamic process this is not
> possible. Why? Are the process differences so big?
> 
> Just an observation: the big differences in perplexity values result in the
> case we are doing interpolation of word and class models. For interpolation
> of word models the difference is quite insignificant.

That's the problem.  You cannot do "static" interpolation of a word-based
and a class-based N-gram LM.   This is only supported for two word-based or
two class-based LMs.

--Andreas 


From stolcke at speech.sri.com  Mon Apr  5 22:13:26 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 05 Apr 2004 22:13:26 PDT
Subject: positive backoff weight 
In-Reply-To: Your message of Thu, 18 Mar 2004 11:52:02 +0100.
             <40597F52.4050803@irisa.fr> 
Message-ID: <200404060513.WAA01373@huge>


In message <40597F52.4050803 at irisa.fr>you wrote:
> Thank you for the past answers to my questions.
> 
> I've got another question. Sometimes, when I use Good-Turing 
> discounting, some of the backoff weights of the unigrams (I compute a 
> bigram model) are positive log-probabilities. How is this possible? Is it 

Backoff weights are not probabilities.  They are normalizing factors.
Backoff weight for a history h is defined as

	BOW(h) =  [ 1- \sum_(w,h) p(w|h) ] / [ 1- \sum_(w,h) p'(w|h) ]

where both sums run over the words w for which the N-gram (h,w) is listed
explicitly in the model, and p'(w|h) is the lower-order probability estimate
(e.g., a bigram estimate in a trigram model).
So, if the trigram probability estimates sum to a lower value than the
corresponding bigram estimates for a given history, then BOW(h) will be > 1
and its log positive.

> because Good-Turing discounting is disabled on unigrams since there are 
> no unigrams whose frequency is 1? And, more generally, how are 
> backoff weights computed for unigrams, in the case of a bigram model?

Backoff weights for unigrams are computed by exactly the same method
(in the formula above, p(w|h) are bigram probabilities and p'(w|h) are
unigram probabilities).
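
To illustrate the formula, here is a minimal Python sketch with made-up
numbers (the words, probabilities, and the function name backoff_weight are
hypothetical, not taken from any real model or from SRILM itself). It computes
BOW(h) for one bigram history backing off to unigrams and shows a case where
the weight exceeds 1:

```python
import math

def backoff_weight(higher, lower):
    """BOW(h) = (1 - sum p(w|h)) / (1 - sum p'(w|h)), where both sums run
    over the words w that have an explicit higher-order entry for h."""
    seen = list(higher)
    numerator = 1.0 - sum(higher[w] for w in seen)
    denominator = 1.0 - sum(lower[w] for w in seen)
    return numerator / denominator

# Hypothetical numbers: explicit bigram estimates p(w|h) for one history h,
# and unigram estimates p'(w) over a three-word vocabulary.
bigram_given_h = {"a": 0.3, "b": 0.2}
unigram = {"a": 0.5, "b": 0.3, "c": 0.2}

bow = backoff_weight(bigram_given_h, unigram)
print(bow, math.log10(bow))

# Sanity check: explicit mass plus backed-off mass for the unseen word "c"
# should add up to 1.
total = sum(bigram_given_h.values()) + bow * unigram["c"]
print(total)
```

Here the explicit bigrams carry less mass (0.5) than the unigram estimates of
the same words (0.8), so the weight has to be larger than 1 for the
distribution to normalize.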

--Andreas 


From dpico at dsic.upv.es  Tue Apr  6 01:24:55 2004
From: dpico at dsic.upv.es (David Picó)
Date: Tue, 06 Apr 2004 10:24:55 +0200
Subject: A simple question about SRILM
Message-ID: <40726957.3070101@dsic.upv.es>

Hello,

I also have a little question about SRILM. How can I infer a trigram (or 
bigram, or tetragram...) with no smoothing at all? I need to do some 
experiments to check the effect of n-gram smoothing in my models and I 
need a pure trigram with no probability mass diverted to lower levels. Is 
this possible in SRILM? I need to be sure that I really get a trigram 
(with the whole trigram probabilities).

Thank you very much in advance for your help and attention!
David

-- 
David Picó-Vila
Universitat Politècnica de València
Departament de Sistemes Informàtics i Computació
València, Spain


From stolcke at speech.sri.com  Tue Apr  6 09:34:12 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 06 Apr 2004 09:34:12 PDT
Subject: A simple question about SRILM 
In-Reply-To: Your message of Tue, 06 Apr 2004 10:24:55 +0200.
             <40726957.3070101@dsic.upv.es> 
Message-ID: <200404061634.JAA01494@huge>


The ngram-count man page says

       -gtnmax count
              where  n  is 1, 2, 3, 4, 5, 6, 7, 8, or 9.  Set the
              maximal count of N-grams of order n that  are  dis-
              counted  under  Good-Turing.  All N-grams more fre-
              quent than that  will  receive  maximum  likelihood
              estimates.  Discounting can be effectively disabled
              by setting this to 0.

Therefore, you can disable smoothing with 

	ngram-count -gt1max 0 -gt2max 0 -gt3max 0 ...

--Andreas



From julyjune03 at yahoo.com  Sat Apr 10 22:30:12 2004
From: julyjune03 at yahoo.com (June July)
Date: Sat, 10 Apr 2004 22:30:12 -0700 (PDT)
Subject: can a cache LM be loaded in disambig?
Message-ID: <20040411053012.75383.qmail@web41601.mail.yahoo.com>

I'm trying to load a cache LM in disambig tool by adding several lines of code according to ngram.cc.  Everything is fine except the linking, where I had a problem:
 
/usr3/Test/sri-1.4/lm/src/CacheLM.cc:50: multiple definition of `LHash<unsigned, double>::removedData'
../obj/i686/liboolm.a(VocabMap.o):/usr3/Test/sri-1.4/lm/src/VocabMap.cc:39: first defined here
collect2: ld returned 1 exit status
 
Does that mean duplicate definitions of "removedData", originally from LHash.cc?  How can I fix it? Or is there a way to load a cache model in disambig?
 
 

From stolcke at speech.sri.com  Sat Apr 10 22:50:15 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 10 Apr 2004 22:50:15 PDT
Subject: can a cache LM be loaded in disambig? 
In-Reply-To: Your message of Sat, 10 Apr 2004 22:30:12 -0700.
             <20040411053012.75383.qmail@web41601.mail.yahoo.com> 
Message-ID: <200404110550.WAA12707@tonga>


The linking problem can be solved by removing the instantiation
of LHash<unsigned,double> in CacheLM.cc.  

However, it probably won't work as intended.   CacheLM is not Markovian
(it does not use a finite history).  This will cause the DP algorithm in
disambig to degenerate into keeping all histories as distinct states,
which is not feasible except for very short sentences.

--Andreas



From cam at crb.ucp.pt  Sun Apr 11 01:57:55 2004
From: cam at crb.ucp.pt (cam at crb.ucp.pt)
Date: Sun, 11 Apr 2004 09:57:55 +0100 (WEST)
Subject: Log-linear interpolation
Message-ID: <32809.213.58.88.69.1081673875.squirrel@mail.crb.ucp.pt>


Hi!

Does anyone know of a program or toolkit that can do log-linear
interpolation of different language models? SRILM only permits
linear interpolation.
Thanks for your help,

Ciro Martins


From nlp at pobox.sk  Mon Apr 19 08:12:06 2004
From: nlp at pobox.sk (Robert Wagner)
Date: Mon, 19 Apr 2004 17:12:06 +0200
Subject: Factored n-grams
Message-ID: <200404191512.i3JFC59N023398@www7.pobox.sk>


Hello Everybody!
 Please, does anybody know a good paper referring to factored n-grams 
(new SRILM feature)? I absolutely do not know what it is and would 
like to learn more about it:-)
 Thanks
     Robert



From sarahs at cs.washington.edu  Mon Apr 19 08:37:48 2004
From: sarahs at cs.washington.edu (Sarah E. Schwarm)
Date: Mon, 19 Apr 2004 08:37:48 -0700 (PDT)
Subject: Factored n-grams
In-Reply-To: <200404191512.i3JFC59N023398@www7.pobox.sk>
Message-ID: <20040419083134.N13854-100000@scarpia.cs.washington.edu>

Here's the paper:
 J. Bilmes and K. Kirchhoff, "Factored Language Models and Generalized 
Parallel Backoff",  Proceedings of HLT/NAACL 2003, Edmonton, Canada, May 
2003
available on this page:
http://ssli.ee.washington.edu/people/katrin/

There's also quite a bit of information about the factored LM extensions 
to SRILM in the final report for the JHU workshop 2002 Novel Speech 
Recognition Models for Arabic group:
http://www.clsp.jhu.edu/ws2002/groups/arabic/

Hope this helps!

- Sarah


________________________
Sarah Schwarm
sarahs at cs.washington.edu


From Nicholas.Romanyshyn at colorado.edu  Fri Apr 30 13:28:42 2004
From: Nicholas.Romanyshyn at colorado.edu (Nick Romanyshyn)
Date: Fri, 30 Apr 2004 14:28:42 -0600
Subject: remove </s> <s>
Message-ID: <1083356922.4092b6faf0e67@webmail.colorado.edu>

Hi,

   I'm using ngram-count to make a language model, but I don't want </s> or <s>
to be included in the language model.  I couldn't find anything in the
documentation about how to keep this from happening.  Could somebody point me
to the code where </s> and <s> are inserted?

Thanks,
Nick Romanyshyn


From anand at speech.sri.com  Fri Apr 30 13:50:32 2004
From: anand at speech.sri.com (Anand Venkataraman)
Date: Fri, 30 Apr 2004 13:50:32 -0700 (PDT)
Subject: remove </s> <s>
In-Reply-To: <1083356922.4092b6faf0e67@webmail.colorado.edu> (message from
	Nick Romanyshyn on Fri, 30 Apr 2004 14:28:42 -0600)
Message-ID: <200404302050.NAA13199@stockholm>

You should be able to do this without modifying the
code.  There are at least two ways -- Create a file
with lines containing </s> and <s> and give this file
to ngram-count using -nonevents.  Alternately, you can
create count files first (-write), remove the
uninteresting events and create an lm using the count
file (-read).
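
For the second route, the editing step between -write and -read could look
something like the following Python sketch. It assumes each line of the count
file holds the N-gram words followed by the count as the last
whitespace-separated field; the function name drop_sentence_markers and the
sample lines are made up for illustration.

```python
def drop_sentence_markers(count_lines, markers=("<s>", "</s>")):
    """Keep only the count lines whose N-gram contains none of the markers."""
    kept = []
    for line in count_lines:
        fields = line.split()
        ngram = fields[:-1]          # last field is assumed to be the count
        if not any(token in markers for token in ngram):
            kept.append(line)
    return kept

# Hypothetical count-file lines (N-gram, then count).
sample = ["<s> the\t12\n", "the cat\t3\n", "cat </s>\t2\n", "cat\t5\n"]
filtered = drop_sentence_markers(sample)
print("".join(filtered), end="")
```

The filtered lines can then be written back to a file and passed to
ngram-count with -read.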

&


From jachym at kky.zcu.cz  Fri Apr 30 13:54:38 2004
From: jachym at kky.zcu.cz (Jachym Kolar)
Date: Fri, 30 Apr 2004 22:54:38 +0200
Subject: remove </s> <s>
In-Reply-To: <1083356922.4092b6faf0e67@webmail.colorado.edu>
References: <1083356922.4092b6faf0e67@webmail.colorado.edu>
Message-ID: <1083358478.4092bd0e87eae@webmail.zcu.cz>

Hello Nick,
 you should use the script continuous-ngram-count.

E.g.:

continuous-ngram-count order=3 trainingtext | \
ngram-count -read - -write-vocab vocabulary -tolower -write output -lm lmfile

Regards,
 Jachym




From gaston at gastonrcangiano.net  Sun May  2 19:37:31 2004
From: gaston at gastonrcangiano.net (Gaston R. Cangiano)
Date: Sun, 2 May 2004 19:37:31 -0700 (PDT)
Subject: first install
Message-ID: <20040503023731.27942.qmail@web11306.mail.yahoo.com>

Hi,

I am trying to install the toolkit (v 1.4) on an i686
running RH Linux 9 (kernel 2.4). I checked for the
correct versions of gcc, make and Tcl installed in my
machine, and also updated all the variables in the
makefiles correctly (both top level and
machine-specific).

I am not able to build properly, nothing gets
compiled. These are the error messages:

g++: cannot specify -o with -c or -S and multiple
compilations
make [2]: *** [.../obj/i686/tclmain.o] Error

for object files qsort.o matherr.o FDiscount.o
Lattice.o ngram.o fngram.o and lattice-tool.o

Can anyone lend a hint?

thank you!

Gaston.

=====
Gaston R. Cangiano
3024 Deakin St. Apt. #5
Berkeley, CA 94705
tel: 510-486-8271
fax: 425-930-1047
gaston at gastonrcangiano.net


From tanel.alumae at aqris.com  Tue May  4 02:32:25 2004
From: tanel.alumae at aqris.com (Tanel Alumäe)
Date: Tue, 04 May 2004 12:32:25 +0300
Subject: Factored LMs and interpolated models
Message-ID: <1083663145.10474.16.camel@markov>

Hello,

I'm experimenting with the factored language modeling implementation in
SRILM. I got some nice results and now want to compare them with the
traditional approach where a word-trigram LM is interpolated with the
parallel class trigram. Is it possible to create a factored LM that
actually implements such traditional interpolation? 

Thanks in advance,
Tanel A.


From Caroline.Lavecchia at loria.fr  Tue May  4 07:18:11 2004
From: Caroline.Lavecchia at loria.fr (lavecchia)
Date: Tue, 04 May 2004 16:18:11 +0200
Subject: question about vocabulary
Message-ID: <4097A623.1E8E3B88@loria.fr>

Hello everybody,

I would like to know if it's possible with the SRILM toolkit to generate
a vocabulary with the 20000 most frequent words of a corpus for example. 

I know that with -write-vocab  in the ngram-count function I can
generate a vocabulary but only with all the words of the corpus.

Thanks in advance and sorry for my bad English, 

Caroline L.


From anand at speech.sri.com  Tue May  4 08:53:13 2004
From: anand at speech.sri.com (Anand Venkataraman)
Date: Tue, 4 May 2004 08:53:13 -0700 (PDT)
Subject: question about vocabulary
In-Reply-To: <4097A623.1E8E3B88@loria.fr> (message from lavecchia on Tue, 04
	May 2004 16:18:11 +0200)
Message-ID: <200405041553.IAA23335@clara>

> I would like to know if it's possible with the SRILM toolkit to generate
> a vocabulary with the 20000 most frequent words of a corpus for example.

You should be able to achieve this by using "ngram-count -order 1 -write -",
doing reverse sort on field 2 and taking the top 20000.

&


From stolcke at speech.sri.com  Tue May  4 08:57:23 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 04 May 2004 08:57:23 PDT
Subject: question about vocabulary 
In-Reply-To: Your message of Tue, 04 May 2004 16:18:11 +0200.
             <4097A623.1E8E3B88@loria.fr> 
Message-ID: <200405041557.IAA16703@huge>


In message <4097A623.1E8E3B88 at loria.fr>you wrote:
> Hello everybody,
> 
> I would like to know if it's possible with the SRILM toolkit to generate
> a vocabulary with the 20000 most frequent words of a corpus for example. 
> 
> I know that with -write-vocab  in the ngram-count function I can
> generate a vocabulary but only with all the words of the corpus.

How about this:

ngram-count -order 1 -text CORPUS -write - | \
sort -k 2,2rn | awk 'NR <= 20000 { print $1 }' > top20000.vocab
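
If sort and awk are not at hand, the same selection can be sketched in
Python. This is only an illustration: it assumes the unigram counts arrive as
lines of the form "word count", and the function name top_words and the
sample data are hypothetical.

```python
def top_words(count_lines, n):
    """Return the n most frequent words from 'word count' lines."""
    counts = []
    for line in count_lines:
        fields = line.split()
        if len(fields) == 2:                 # skip anything that is not a unigram
            counts.append((fields[0], int(fields[1])))
    # Sort by descending count; the sort is stable, so ties keep input order.
    counts.sort(key=lambda wc: wc[1], reverse=True)
    return [word for word, _ in counts[:n]]

# Hypothetical unigram counts.
sample = ["the\t10\n", "cat\t3\n", "sat\t7\n", "mat\t3\n"]
vocab = top_words(sample, 3)
print("\n".join(vocab))
```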


--Andreas 


From duh at ee.washington.edu  Thu May  6 10:39:27 2004
From: duh at ee.washington.edu (Kevin Duh)
Date: Thu, 06 May 2004 10:39:27 -0700
Subject: Factored LMs and interpolated models
In-Reply-To: <1083663145.10474.16.camel@markov>
References: <1083663145.10474.16.camel@markov>
Message-ID: <409A784F.8060305@ee.washington.edu>

There is no easy way to interpolate word and class ngram models in the 
factored language model framework. Factored language models support only 
interpolation of an N-gram probability estimate and its corresponding 
lower-order estimate, which is similar to the "interpolate" option in 
"ngram-count."

You could conceivably treat the word and the class as your factors and 
perform interpolation whenever you back off from one set of these 
conditioning variables to a subset. However, this backoff nature makes 
the interpolation different from the traditional interpolation of 
parallel n-grams. Probably the best thing to do is to use the usual 
SRILM tools for this.

Hope this helps,
Kevin Duh

Tanel Alumäe wrote:

>Hello,
>
>I'm experimenting with factored language modeling implementation in
>SRILM. I got some nice results and now want to compare them with the
>traditional approach where a word-trigram LM is interpolated with the
>parallel class trigram. Is it possible to create a factored LM that
>actually implements such traditional interpolation? 
>
>Thanks in advance,
>Tanel A.
>  
>


From stolcke at speech.sri.com  Thu May  6 11:33:20 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 06 May 2004 11:33:20 PDT
Subject: Factored LMs and interpolated models 
In-Reply-To: Your message of Thu, 06 May 2004 10:39:27 -0700.
             <409A784F.8060305@ee.washington.edu> 
Message-ID: <200405061833.LAA15228@huge>


Hi,

you SHOULD be able to do this with

	ngram -factored -bayes 0 ...
 
followed by the usual options to specify mixtures of LMs.   This is because
the -factored option causes all LM components to be interpreted as factored
LMs, and this causes the standard interpolation mechanism to be wrapped
around them.

So, all you have to do is implement your standard word ngram and class
ngrams each separately as FLMs.   This is not quite what you asked for
but it should be equivalent.

This is the theory.  I don't think we ever tested this, so there might be 
glitches.  But those could be fixed if necessary, the basic machinery is
there.

There might also be a different approach.  You could engineer the FLM
definition so that at the highest level you always back off.  Then you specify
interpolation as the backoff strategy, emulating word and class ngrams
as two parallel backoff paths.  I'm not sure this will work with the current
functionality, it's just an idea.  Katrin or Jeff should be able to tell you
if it's feasible.

--Andreas



From tanel.alumae at aqris.com  Thu May  6 23:52:35 2004
From: tanel.alumae at aqris.com (Tanel Alumäe)
Date: Fri, 07 May 2004 09:52:35 +0300
Subject: Factored LMs and interpolated models
In-Reply-To: <20040506122755.A21243@duck.ee.washington.edu>
References: <409A784F.8060305@ee.washington.edu>
	 <200405061833.LAA15228@huge> <20040506122755.A21243@duck.ee.washington.edu>
Message-ID: <1083912755.8267.7.camel@NOOL2>


> Let me know if this helps or if I have misunderstood your question...
> 

Hello,

First, thanks to everybody for help.

My goal was, as Katrin correctly assumed, "to interpolate a
traditional class-based model and a standard n-gram model but you want
to express this within a single FLM file". This is currently not
possible, but it's not very important because I learned that I can
use:

ngram -factored -lm <FLM1> -mix-lm <FLM2>

The above really works.

Still, I noticed a strange thing with perplexity calculation. Namely,
the perplexity figures calculated by fngram and ngram are slightly
different.  I used the following options and got following results:

fngram -ppl <testtext> -factor-file tmp/fngram_m.conf

Result: 
61 sentences, 1009 words, 26 OOVs 
0 zeroprobs, logprob= -2760.87 ppl= 441.076 ppl1= 643.604

ngram -factored -ppl <testtext> -lm tmp/fngram_m.conf

Result:
61 sentences, 1009 words, 26 OOVs 
0 zeroprobs, logprob= -2761.16 ppl= 441.359 ppl1= 644.042


-- 

The above is for a FLM that in fact is standard word trigram. The
difference is very small.

However, when I test a FLM that is a word-given-two-previous-classes
trigram, the difference is much larger:

fngram -ppl <testtext> -factor-file tmp/fngram_c.conf 

61 sentences, 1009 words, 26 OOVs 
0 zeroprobs, logprob= -2826.73 ppl= 510.034 ppl1= 750.963

And the same with ngram:

ngram -factored -lm tmp/fngram_c.conf -ppl <testtext>

61 sentences, 1009 words, 26 OOVs 
0 zeroprobs, logprob= -2863.71 ppl= 553.378 ppl1= 818.917


As you see, here the difference (ppl1= 750 vs 818) is significant. Could
this be a configuration issue, a bug, or have I misunderstood something?
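
For reference, the figures quoted above look consistent with the usual way of
deriving perplexity from the base-10 logprob: ppl normalizes by
words - OOVs - zeroprobs + sentences (one end-of-sentence token per sentence),
while ppl1 leaves the sentence count out. A minimal Python sketch under that
assumption (the function name perplexities is made up for illustration):

```python
def perplexities(logprob, sentences, words, oovs, zeroprobs=0):
    """Recompute ppl and ppl1 from the summary counts and base-10 logprob."""
    scored = words - oovs - zeroprobs
    ppl = 10 ** (-logprob / (scored + sentences))   # counts one </s> per sentence
    ppl1 = 10 ** (-logprob / scored)                # leaves the </s> tokens out
    return ppl, ppl1

# Figures from the first fngram run quoted above.
ppl, ppl1 = perplexities(-2760.87, sentences=61, words=1009, oovs=26)
print(round(ppl, 3), round(ppl1, 3))
```

Since all runs share the same sentence, word, and OOV counts, the perplexity
gaps come entirely from the differing logprob totals.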

Regards,

Tanel Alumäe


From stolcke at speech.sri.com  Fri May  7 07:02:40 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 07 May 2004 07:02:40 PDT
Subject: Factored LMs and interpolated models 
In-Reply-To: Your message of Fri, 07 May 2004 09:52:35 +0300.
             <1083912755.8267.7.camel@NOOL2> 
Message-ID: <200405071402.HAA14786@tonga>


There are a few known bugs in the FLM code as last released.
They will be fixed in the next release (1.4.1) which I expect to
be out in a couple days.

--Andreas



From katrin at ssli-mail.ee.washington.edu  Fri May  7 10:25:20 2004
From: katrin at ssli-mail.ee.washington.edu (Katrin Kirchhoff)
Date: Fri, 7 May 2004 10:25:20 -0700
Subject: Factored LMs and interpolated models
In-Reply-To: <1083912755.8267.7.camel@NOOL2>; from tanel.alumae@aqris.com on Fri, May 07, 2004 at 09:52:35AM +0300
References: <409A784F.8060305@ee.washington.edu> <200405061833.LAA15228@huge> <20040506122755.A21243@duck.ee.washington.edu> <1083912755.8267.7.camel@NOOL2>
Message-ID: <20040507102520.A10555@duck.ee.washington.edu>


In order to emulate the exact behaviour of ngram with fngram,
you need to use:

-no-virtual-begin-sentence
-nonull

and make sure that the smoothing options (smoothing method, gtmin, gtmax etc.)
in your FLM file correspond to the same values that ngram uses.

E.g. for a standard trigram 

ngram -lm <non-factored LM> -ppl <text>

and

fngram -factor-file <factored LM> -ppl <text> -no-virtual-begin-sentence -nonull

should give exactly the same perplexities. Andreas might be able to 
say whether these are needed when using ngram with the -factored  
option.

Katrin 


-- 
-----------------------------------------------------------------
Katrin Kirchhoff
Dept of Electrical Engineering, University of Washington
M422 EE/CS Building, Box 352500, Seattle, WA, 98195
Phone: (206) 616 5494
katrin at ee.washington.edu
-----------------------------------------------------------------


From barhaim at cs.technion.ac.il  Thu May 13 08:30:37 2004
From: barhaim at cs.technion.ac.il (Roy Bar Haim)
Date: Thu, 13 May 2004 17:30:37 +0200
Subject: Tagging with disambig
Message-ID: <005a01c438ff$3efbc5c0$34284484@cs.technion.ac.il>

Hi,

I use disambig for POS tagging.

I have two questions:
1.Is there a utility that automatically generates the map file required
for disambig from a tagged corpus?
2.Suppose I want to assume (for a 'didactic' purpose) that Ti (the i'th
tag) depends not only on Ti-1 but also on Wi-1. Is there an easy way to
encode this assumption into the lm file?

Thanks,
Roy.


From stolcke at speech.sri.com  Thu May 13 16:59:48 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 13 May 2004 16:59:48 PDT
Subject: Tagging with disambig 
In-Reply-To: Your message of Thu, 13 May 2004 17:30:37 +0200.
             <005a01c438ff$3efbc5c0$34284484@cs.technion.ac.il> 
Message-ID: <200405132359.QAA25559@teeny>


In message <005a01c438ff$3efbc5c0$34284484 at cs.technion.ac.il>you wrote:
> Hi,
> 
> I use disambig for POS tagging.
> 
> I have two questions:
> 1.Is there a utility that automatically generates the map file required
> for disambig from a tagged corpus?

It's very corpus dependent, just like text conditioning for LM training,
so there are no "standard" tools.  It should require only a moderate
amount of perl or gawk hacking.

> 2.Suppose I want to assume (for a 'didactic' purpose) that Ti (the i'th
> tag) depends not only on Ti-1 but also on Wi-1. Is there an easy way to
> encode this assumption into the lm file?

Depends on what you consider "easy" ;-).

You can do it by including the words in the states of the HMM.
So the "hidden" vocabulary would consist of pairs (Wi,Ti), and 
the observed vocabulary is still the words Wi.  The map file
would enforce consistency between the two.  In other words the
map file just lists the possible correspondences

W	w,t1 w,t2 w,t3 ...

(the probabilities can be omitted and default to 1).
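
A map file of this shape could be produced with a few lines of scripting.
The Python sketch below is only an illustration: it assumes the tagged corpus
stores tokens as word/TAG separated by whitespace, and the function name
build_map_lines and the sample sentences are hypothetical.

```python
from collections import defaultdict

def build_map_lines(tagged_sentences):
    """Collect, for every word, the (word,tag) pairs seen in the corpus."""
    tags_for = defaultdict(set)
    for sentence in tagged_sentences:
        for token in sentence.split():
            word, _, tag = token.rpartition("/")   # split on the last '/'
            tags_for[word].add(tag)
    # One line per observed word: the word, a tab, then its w,t alternatives.
    return [
        word + "\t" + " ".join(f"{word},{tag}" for tag in sorted(tags))
        for word, tags in sorted(tags_for.items())
    ]

# Hypothetical tagged corpus in word/TAG format.
corpus = ["the/DT can/NN rusts/VBZ", "she/PRP can/MD swim/VB"]
map_lines = build_map_lines(corpus)
print("\n".join(map_lines))
```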

If you do this and nothing else you would need an N-gram LM over the 
combined (Wi,Ti) sequence.  But you say you want a more specific model
of the form

	P(Ti | Wi-1, Ti-1)

This, too, can be done but requires some work.
You construct a trigram count file of 3-grams (Wi-1, Ti-1, Ti)
from your training data, and estimate an LM for it (be sure to specify all the
words as non-events so they don't receive any probability).

Then you construct a bigram LM in terms of the (W,T) tokens, such that it
gives exactly the same probabilities as the more constrained model
you just estimated.  So you have to construct a bigram LM file 
and make sure that the bigram

	Wi-1,Ti-1   Wi,Ti

gets the probability  P(Ti | Wi-1, Ti-1) * P(Wi|Ti),
for all Wi-1,Ti-1,Wi,Ti .
You have to write your own program to construct this file 
in ARPA LM format, but it's not rocket science once you understand
the format.

Then you decode using this LM and disambig.
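
As a rough sketch of the arithmetic for one such bigram entry, assuming the
two component probabilities have already been estimated and stored in
dictionaries (all names and numbers below are hypothetical), the log10 value
for the combined tokens can be computed like this in Python:

```python
import math

def pair_bigram_logprob(p_tag, p_word, prev_word, prev_tag, word, tag):
    """log10 of P(Ti | Wi-1, Ti-1) * P(Wi | Ti) for the bigram
    'Wi-1,Ti-1  Wi,Ti' over combined (word,tag) tokens."""
    p = p_tag[(prev_word, prev_tag, tag)] * p_word[(word, tag)]
    return math.log10(p)

# Hypothetical component models.
p_tag = {("the", "DT", "NN"): 0.5}      # P(Ti | Wi-1, Ti-1)
p_word = {("can", "NN"): 0.01}          # P(Wi | Ti)

lp = pair_bigram_logprob(p_tag, p_word, "the", "DT", "can", "NN")
# A bigram line: log10 probability, a tab, then the two tokens.
entry = f"{lp:.6f}\tthe,DT can,NN"
print(entry)
```

A complete file would also need the header, unigram section, and backoff
weights, which this sketch leaves out.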

--Andreas 


From barhaim at cs.technion.ac.il  Mon May 17 11:25:52 2004
From: barhaim at cs.technion.ac.il (Roy Bar Haim)
Date: Mon, 17 May 2004 20:25:52 +0200
Subject: FW: A simple question about SRILM 
Message-ID: <001701c43c3c$65fc62c0$34284484@cs.technion.ac.il>

Hi,

I have the same problem. I want the LM to give maximum-likelihood estimates.
That is, all the backoff weights should be zero.

I applied the solution below, but still I get backoff weights. 

For example, when I build the lm like this:
ngram-count -order 3 -gt1max 0 -gt2max 0 -gt3max 0 -text corpus.tags -lm corpus.tags.lm

I found that the once-occurring trigrams DO NOT APPEAR in the lm, so probability mass is still discounted.

When I turned on the debug messages, I saw many messages like: 
warning: 0 backoff probability mass left for "AT SCLN" -- incrementing denominator 

Does it mean that smoothing is enforced here?

Is there a way to get a pure maximum-likelihood language model, without backoff weights at all, using ngram-count?

Thanks,
Roy.
> -----Original Message-----
> From: owner-srilm-user at speech.sri.com 
> [mailto:owner-srilm-user at speech.sri.com] On Behalf Of Andreas Stolcke
> Sent: Tuesday, April 06, 2004 6:34 PM
> To: David Picó
> Cc: srilm-user at speech.sri.com; Jorge González
> Subject: Re: A simple question about SRILM 
> 
> 
> 
> The ngram-count man page says
> 
>        -gtnmax count
>               where  n  is 1, 2, 3, 4, 5, 6, 7, 8, or 9.  Set the
>               maximal count of N-grams of order n that  are  dis-
>               counted  under  Good-Turing.  All N-grams more fre-
>               quent than that  will  receive  maximum  likelihood
>               estimates.  Discounting can be effectively disabled
>               by setting this to 0.
> 
> Therefore, you can disable smoothing with 
> 
> 	ngram-count -gt1max 0 -gt2max 0 -gt3max 0 ...
> 
> --Andreas
> 
> In message <40726957.3070101 at dsic.upv.es>you wrote:
> > Hello,
> > 
> > I also have a little question about SRILM. How can I infer a trigram
> > (or bigram, or tetragram...) with no smoothing at all? I need to do some
> > experiments to check the effect of n-gram smoothing in my models and I
> > need a pure trigram with no probability mass derived to lower levels. Is
> > this possible in SRILM? I need to be sure that I really get a trigram
> > (with the whole trigram probabilities).
> > 
> > Thank you very much in advance for your help and attention! David
> > 
> > --
> > David Picó-Vila
> > Universitat Politècnica de València
> > Departament de Sistemes Informàtics i Computació
> > València, Spain
> > 
> 
> 


From stolcke at speech.sri.com  Mon May 17 10:37:58 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 17 May 2004 10:37:58 PDT
Subject: FW: A simple question about SRILM 
In-Reply-To: Your message of Mon, 17 May 2004 20:25:52 +0200.
             <001701c43c3c$65fc62c0$34284484@cs.technion.ac.il> 
Message-ID: <200405171737.KAA17206@huge>


In message <001701c43c3c$65fc62c0$34284484 at cs.technion.ac.il>you wrote:
> Hi,
> 
> I have the same problem. I want the LM to give maximum-likelihood estimates.
> That is, all the backoff weights should be zero.
> 
> I applied the solution below, but still I get backoff weights. 
> 
> For example, when I build the lm like this:
> ngram-count -order 3 -gt1max 0 -gt2max 0 -gt3max 0 -text corpus.tags -lm corpus.tags.lm
> 
> I found that the once-occurring trigrams DO NOT APPEAR in the lm, so probability mass is still discounted.

The default minimum occurrence count for trigrams is 2.  Set it to 1 to 
include all trigrams:

-gt3min 1 etc.

Trigrams below that count are dropped, and that's why you still get backoff.
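
A toy illustration of why a count cutoff alone leaves backoff mass, sketched
in Python. This is not SRILM's code; the function name, the corpus and the
assumption that the context total still includes the dropped trigrams are
all made up for the example.

```python
from collections import Counter

def leftover_mass(tokens, context, min_count):
    """Probability mass not covered by the trigrams kept for `context`."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    in_context = {tri: c for tri, c in trigrams.items() if tri[:2] == context}
    total = sum(in_context.values())  # denominator still counts every trigram
    kept = sum(c for c in in_context.values() if c >= min_count)
    return 1.0 - kept / total

tokens = "a b c a b c a b d".split()

# "a b c" occurs twice, "a b d" once.
with_cutoff = leftover_mass(tokens, ("a", "b"), min_count=2)
without_cutoff = leftover_mass(tokens, ("a", "b"), min_count=1)
print(with_cutoff, without_cutoff)
```

With the cutoff at 2 the singleton "a b d" is dropped and about a third of
the mass in context "a b" is left for backoff; with the cutoff at 1 nothing
is left over.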

> 
> When I turned on the debug messages, I saw many messages like: 
> warning: 0 backoff probability mass left for "AT SCLN" -- incrementing denominator
> 
> Does it mean that smoothing is enforced here?
> 
> Is there a way to get a pure maximum-likelihood language model, without
> backoff weights at all, using ngram-count?

see above.

--Andreas 


From barhaim at cs.technion.ac.il  Mon May 17 13:05:31 2004
From: barhaim at cs.technion.ac.il (Roy Bar Haim)
Date: Mon, 17 May 2004 22:05:31 +0200
Subject: FW: A simple question about SRILM 
In-Reply-To: <200405171737.KAA17206@huge>
Message-ID: <002701c43c4a$4f810b00$34284484@cs.technion.ac.il>

Hi Andreas,

Thanks for your super-fast reply!

I tried it like you suggested:
ngram-count -order 3 -gt1max 0 -gt1min 1 -gt2max 0 -gt2min 1 -gt3max 0
-gt3min 1 -text corpus.tags -lm corpus.tags.lm2 -debug 1

Many of the backoff weights indeed became 99 (which is good), but many
remained non-zero (although small: -6,-7,-8...)

Is there a way to make them all 99?

The debug messages I got are listed below.

Thanks a lot,
Roy.
------------------------------------------------------------------------
corpus.tags: line 1892: 1892 sentences, 48332 words, 0 OOVs
0 zeroprobs, logprob= 0 ppl= 1 ppl1= 1
Good-Turing discounting 1-grams
GT-count [0] = 0
GT-count [1] = 0
warning: no singleton counts
GT discounting disabled
Good-Turing discounting 2-grams
GT-count [0] = 0
GT-count [1] = 126
GT discounting disabled
Good-Turing discounting 3-grams
GT-count [0] = 0
GT-count [1] = 2142
GT discounting disabled
discarded 1 2-gram contexts containing pseudo-events
discarded 2 3-gram contexts containing pseudo-events
writing 41 1-grams
writing 800 2-grams
writing 5145 3-grams



From stolcke at speech.sri.com  Tue May 18 20:03:39 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 18 May 2004 20:03:39 PDT
Subject: FW: A simple question about SRILM 
In-Reply-To: Your message of Mon, 17 May 2004 22:05:31 +0200.
             <002701c43c4a$4f810b00$34284484@cs.technion.ac.il> 
Message-ID: <200405190303.UAA16121@huge>


In message <002701c43c4a$4f810b00$34284484 at cs.technion.ac.il>you wrote:
> Hi Andreas,
> 
> Thanks for your super-fast reply!
> 
> I tried it like you suggested:
> ngram-count -order 3 -gt1max 0 -gt1min 1 -gt2max 0 -gt2min 1 -gt3max 0
> -gt3min 1 -text corpus.tags -lm corpus.tags.lm2 -debug 1
> 
> Many of the backoff weights indeed became 99 (which is good), but many
> remained non-zero (although small: -6,-7,-8...)
> 
> Is there a way to make them all 99?

This might not be necessary. 

If the left-over probability mass in some context is 0 (as it should
be when using ML estimates) AND the left-over mass of the lower-order
probabilities for the occurring N-grams is also 0 (since those are also ML
estimates), the backoff weight is 0/0, and due to numerical inaccuracies this
may turn out to be one of the values you observed. (The code catches actual
0/0 divisions and generates -99 in those cases.)
However, this is not a problem because the backoff log prob value for one of 
the non-observed ngrams would be -infinity, and the particular value of 
the backoff weight that gets applied doesn't matter for the outcome
(-infinity plus any value is still -infinity).

To verify that that's the case just feed some of those unobserved
ngrams to ngram -debug 2 -ppl and make sure the log probabilities are -infinity.
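
The argument can be sketched in Python. This only imitates the behaviour
described above and is not taken from the SRILM sources; the function name
and the probability lists are assumptions for illustration.

```python
import math

NEG_INF = float("-inf")

def log10_backoff_weight(higher_probs, lower_probs):
    """log10 of (left-over higher-order mass) / (left-over lower-order mass)."""
    numerator = 1.0 - sum(higher_probs)
    denominator = 1.0 - sum(lower_probs)
    if numerator <= 0.0:
        return -99.0  # no mass left; covers the exact 0/0 case
    if denominator <= 0.0:
        raise ValueError("lower-order mass exhausted but higher-order mass is not")
    return math.log10(numerator / denominator)

# Probabilities that are exactly representable: a true 0/0, reported as -99.
exact = log10_backoff_weight([0.5, 0.5], [0.25, 0.75])

# Ten times 0.1 does not sum to exactly 1.0 in floating point, so a tiny
# spurious residue survives on both sides and the weight comes out finite.
noisy = log10_backoff_weight([0.1] * 10, [0.1] * 10)

# Whatever the weight is, an unobserved N-gram whose lower-order log
# probability is -infinity stays at -infinity after backing off.
for bow in (exact, noisy, -6.0, -8.0):
    print(bow, NEG_INF + bow)
```

The finite weight in the second case is an artifact of rounding, which is
the situation described above for the -6, -7, -8 values.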

--Andreas 




From Caroline.Lavecchia at loria.fr  Wed May 19 01:58:23 2004
From: Caroline.Lavecchia at loria.fr (lavecchia)
Date: Wed, 19 May 2004 10:58:23 +0200
Subject: script select-vocab
Message-ID: <40AB21AF.D96BE599@loria.fr>

Hi,

I would like to use the script "select-vocab" but there is a problem.

When I run "select-vocab -heldout corpus.text corpus2.text", the error
message is 

"
unkown option -heldout
 Usage : select-vocab -heldout corph corp1 corp2 ...
"


Does anyone know what the problem is?

Thanks, 

Caroline


From anand at speech.sri.com  Wed May 19 04:39:20 2004
From: anand at speech.sri.com (Anand Venkataraman)
Date: Wed, 19 May 2004 04:39:20 -0700 (PDT)
Subject: script select-vocab
In-Reply-To: <40AB21AF.D96BE599@loria.fr> (message from lavecchia on Wed, 19
	May 2004 10:58:23 +0200)
Message-ID: <200405191139.EAA23419@clara>

Caroline

I believe there was a slight mismatch between the program
and its man page at one point.  You can use "-held"
instead (the "-" in "held-out" was not optional).  Let
me know if this works.  (You can look inside the
script, btw, to see exactly what the option is).

&


From anair at usc.edu  Fri May 21 19:14:36 2004
From: anair at usc.edu (Anish Nair)
Date: Fri, 21 May 2004 19:14:36 -0700
Subject: srilam class library
Message-ID: <47101927874.20040521191436@usc.edu>


hi,

has anyone successfully compiled the srilm class libraries in visual
c++? i need to compile on win32 and not cygwin because of other
dependencies. it would be nice if someone could send me a pointer to
some pre-built library. that plus sample code would be ideal.

thanks,
anish


From solen.quiniou at irisa.fr  Thu May 27 08:05:05 2004
From: solen.quiniou at irisa.fr (Solen Quiniou)
Date: Thu, 27 May 2004 17:05:05 +0200
Subject: compiling srilm using visual c++
Message-ID: <40B603A1.7090601@irisa.fr>

Hi,
I don't know how to compile srilm using visual c++ but I'm also 
interested in knowing how to do that.

Solen.


From tanel.alumae at aqris.com  Wed Jun 23 03:44:09 2004
From: tanel.alumae at aqris.com (Tanel Alumäe)
Date: Wed, 23 Jun 2004 13:44:09 +0300
Subject: lattice-tool and trigram
Message-ID: <1087987449.4078.13.camel@pc118.host2.starman.ee>

Hello,

I'm trying to expand word lattices with trigrams (and with factored LMs)
and find a best path (using -viterbi-decode). However, I'm quite
confused with the lattice-tool, because I cannot really understand what
it does - sometimes some parameters and options seem to be ignored. 

I have a word lattice produced with HTK (using bigram LM), and a trigram
LM in ARPA format. I run:
lattice-tool -lm trigram.arpa -in-lattice <in-lattice> -read-htk
-viterbi-decode 

When I try this, I get the following output:
reading 60002 1-grams
reading 4038368 2-grams
reading 1184013 3-grams
Lattice::expandToLM: starting expansion to general LM (maxNodes = 0) ...
Lattice::bestWords: processing
Lattice::bestWords: best path prob = -inf
/home/tanel/devel/data/mfc/fts/sentences/aa/s10000_01.mfc </s>


Is this caused by the fact that there are !NULL nodes at the start and
end of the <in_lattice>? I tried adding the -no-htk-nulls -no-nulls
options but this doesn't seem to help...

Maybe somebody can provide a correct workflow to get a trigram-scored
best path from HTK lattices using lattice-tool?

Thanks and best regards,

Tanel Alumäe


From stolcke at speech.sri.com  Wed Jun 23 22:59:52 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 23 Jun 2004 22:59:52 PDT
Subject: lattice-tool and trigram 
In-Reply-To: Your message of Wed, 23 Jun 2004 13:44:09 +0300.
             <1087987449.4078.13.camel@pc118.host2.starman.ee> 
Message-ID: <200406240559.WAA29115@huge>


In message <1087987449.4078.13.camel at pc118.host2.starman.ee>you wrote:
> Hello,
> 
> I'm trying to expand word lattices with trigrams (and with factored LMs)
> and find a best path (using -viterbi-decode). However, I'm quite
> confused with the lattice-tool, because I cannot really understand what
> it does - sometimes some parameters and options seem to be ignored. 
> 
> I have a word lattice produced with HTK (using bigram LM), and a trigram
> LM in ARPA format. I run:
> lattice-tool -lm trigram.arpa -in-lattice <in-lattice> -read-htk
> -viterbi-decode 
> 
> When I try this, I get the following output:
> reading 60002 1-grams
> reading 4038368 2-grams
> reading 1184013 3-grams
> Lattice::expandToLM: starting expansion to general LM (maxNodes = 0) ...
> Lattice::bestWords: processing
> Lattice::bestWords: best path prob = -inf
> /home/tanel/devel/data/mfc/fts/sentences/aa/s10000_01.mfc </s>
> 
> 
> Is this caused by the fact that there are !NULL nodes at the start and
> end of the <in_lattice>? I tried adding the -no-htk-nulls -no-nulls
> options but this doesn't seem to help...
> 
> Maybe somebody can provide a correct workflow to get a trigram-scored
> best path from HTK lattices using lattice-tool?
> 
> Thanks and best regards,
> 
> Tanel Alumäe

You cannot do lattice expansion and decoding in the same run 
(the decoding would happen on the original lattices, not the expanded
ones). So,

1. expand your lattices, store the results
2. decode 1-best words from the expanded lattices

This is also convenient if you want to play with different score weights
in step 2 (since step 1 takes much longer than step 2, typically).

In the first step you might want to also apply some pruning to keep the 
size and runtime manageable.  You CAN do pruning and expansion in the same
run, since the pruning happens before the expansion.
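
The two passes could be driven from a small Python script along these lines.
This is only a sketch: the file names are placeholders, and the option names
(-out-lattice, -write-htk, -order) are given as I understand them from the
lattice-tool man page, so check them against your installed version before
relying on them.

```python
import shlex

def two_pass_commands(lm, in_lattice, expanded_lattice):
    """Build the argument lists for the expansion run and the decoding run."""
    # Pass 1: expand the HTK lattice with the trigram LM and store the result.
    expand = ["lattice-tool", "-lm", lm, "-order", "3",
              "-in-lattice", in_lattice, "-read-htk",
              "-out-lattice", expanded_lattice, "-write-htk"]
    # Pass 2: decode the 1-best words from the stored, expanded lattice.
    decode = ["lattice-tool",
              "-in-lattice", expanded_lattice, "-read-htk",
              "-viterbi-decode"]
    return expand, decode

# Placeholder file names, for illustration only.
expand_cmd, decode_cmd = two_pass_commands("trigram.arpa", "in.lat", "expanded.lat")
print(shlex.join(expand_cmd))
print(shlex.join(decode_cmd))
```

Each list can be handed to subprocess.run on a machine where SRILM is
installed. Keeping the passes separate means pass 2 can be repeated with
different score weights without redoing the expansion.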

The order of application of the various options of lattice-tool needs to
be documented better.  One of these days ...

--Andreas 


From barhaim at cs.technion.ac.il  Tue Jun 29 09:20:20 2004
From: barhaim at cs.technion.ac.il (Roy Bar Haim)
Date: Tue, 29 Jun 2004 18:20:20 +0200
Subject: Lattice tool
Message-ID: <00d701c45df4$f9d70c00$34284484@cs.technion.ac.il>

Hi,

I have a few questions about lattices:

1. Is it possible to get the n-best word sequences from a lattice,
according to Viterbi decoding, and not only the 1-best? If not, what is
the function of n-best lists in SRILM? How are they created?
2. When applying lattice-expansion in lattice-tool: are the original
probabilities in the lattice just ignored?
3. How does lattice-expansion work, for instance, for a trigram backoff
model. How do the states and transitions change? I would be grateful to
see a toy example that clarifies that, or to get a reference for such an
explanation.
4. Just to make sure: if the transition probability is p, should I encode
it as 10000.5*log(p) (log is the natural log) in the pfsg?
5. In pfsg format, are n1 n2 in the transition lines state numbers, or
the actual words? If they are numbers, does the numbering start with 0
or with 1?

Thanks a lot,
Roy.


From stolcke at speech.sri.com  Tue Jun 29 21:34:48 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 29 Jun 2004 21:34:48 PDT
Subject: Lattice tool 
In-Reply-To: Your message of Tue, 29 Jun 2004 18:20:20 +0200.
             <00d701c45df4$f9d70c00$34284484@cs.technion.ac.il> 
Message-ID: <200406300434.VAA27321@huge>


In message <00d701c45df4$f9d70c00$34284484 at cs.technion.ac.il>you wrote:
> Hi,
> 
> I have a few questions about lattices:
> 
> 1. Is it possible to get the n-best word sequences from a lattice,
> according to Viterbi decoding, and not only the 1-best? If not, what is
> the function of n-best lists in SRILM? How are they created?

There is currently no facility in SRILM to generate N-best lists from
lattices, although that is a perfectly legitimate thing to want to do.
It just hasn't been a high-priority thing for us because we generate
N-best with our recognizer, not from lattices directly.

I think HTK has some tool that does it.  So at least for HTK lattices you
could try that.

> 2. When applying lattice-expansion in lattice-tool: are the original
> probabilities in the lattice just ignored?

Yes, except if you preprocess the lattices in some way that relies on
the probabilities, e.g., by pruning.
Then the "old" probabilities are used in the pruning step, prior to
expansion.

> 3. How does lattice-expansion work, for instance, for a trigram backoff
> model. How do the states and transitions change? I would be grateful to
> see a toy example that clarifies that, or to get a reference for such an
> explanation.

Try the paper referenced in the lattice-tool man page.
A postscript file can be found at
http://www.speech.sri.com/papers/icslp98-lattices.ps.gz

> 4. Just to make sure: if the transition probability is p, should I encode
> it as 10000.5*log(p) (log is the natural log) in the pfsg?

Correct.

> 5. In pfsg format, are n1 n2 in the transition lines state numbers, or
> the actual words? If they are numbers, does the numbering start with 0
> or with 1?

The numbers are the state indices, starting at 0.
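
For question 4, the encoding can be sketched in Python as follows. The
function names are made up, and rounding to the nearest integer is my
assumption about how the scaled value is stored.

```python
import math

PFSG_SCALE = 10000.5  # scale factor applied to the natural-log probability

def prob_to_pfsg_weight(p):
    """Encode a transition probability as a scaled natural-log integer."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return round(PFSG_SCALE * math.log(p))  # rounding is an assumption

def pfsg_weight_to_prob(weight):
    """Invert the encoding (up to rounding error)."""
    return math.exp(weight / PFSG_SCALE)

half = prob_to_pfsg_weight(0.5)
certain = prob_to_pfsg_weight(1.0)
print(half, certain, pfsg_weight_to_prob(half))
```

In the transition lines these weights go with 0-based state indices, per the
answer to question 5.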

--Andreas