From svmats at yahoo.com  Sat Jul  7 09:06:51 2007
From: svmats at yahoo.com (Mats Svenson)
Date: Sat, 7 Jul 2007 09:06:51 -0700 (PDT)
Subject: Limited vocabulary causing "no-singletons" problem
Message-ID: <499960.43881.qm@web31612.mail.mud.yahoo.com>

Hi SRILM users,
 I have the following problem. I want to train an LM
for a low-resource speech recognizer. Since the
recognizer can only handle vocabularies of a limited
size (N), I must first restrict my vocabulary to the
N most frequently occurring words in the training
text. However, since all such words occur more than
once in the training corpus, this seems to prevent me
from using the discounting schemes that rely on
singleton counts.

For GT discounting, ngram-count gives a warning about
missing singletons in the training data. For KN no
warning was printed, but I guess the KN discounting is
affected by the missing singletons as well. ngram-count
also has an option "-knn knfile" to calculate the
smoothing parameters in advance using an unlimited
vocabulary, but I guess this does not entirely solve
the problem. Is that true?

 Is there a way to work around this problem using SRILM,
or do I have to use another (generally inferior)
discounting scheme such as Witten-Bell (at least for
counts of order 1)?

Thanks for help,
 Mats




From stolcke at speech.sri.com  Sat Jul  7 09:48:35 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 07 Jul 2007 09:48:35 PDT
Subject: Limited vocabulary causing "no-singletons" problem 
In-Reply-To: Your message of Sat, 07 Jul 2007 09:06:51 -0700.
             <499960.43881.qm@web31612.mail.mud.yahoo.com> 
Message-ID: <200707071648.l67GmaZ12653@huge>


Use the make-big-lm script for training your LM.
(Despite the name, it works for small LMs as well.)

It will compute the GT or KN count-of-count statistics using 
the unlimited vocabulary, and then apply your vocabulary in
building the LM.
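
The effect described in the question, and why it helps to collect count-of-count statistics before the vocabulary is applied, can be illustrated with a small Python sketch. This is not SRILM code: the toy corpus and the helper name count_of_counts are made up for illustration, and only unigram counts are considered.

```python
from collections import Counter

def count_of_counts(counts):
    """Map r -> number of distinct words that occur exactly r times."""
    return Counter(counts.values())

# Toy corpus (hypothetical data, for illustration only).
corpus = "a a a b b b c c d e".split()
unigrams = Counter(corpus)

# Statistics over the unlimited vocabulary: singletons exist.
full = count_of_counts(unigrams)

# Keep only the N most frequent words, as the recognizer requires.
N = 3
vocab = {w for w, _ in unigrams.most_common(N)}
limited = count_of_counts({w: c for w, c in unigrams.items() if w in vocab})

print("n1 with full vocabulary:   ", full[1])
print("n1 with limited vocabulary:", limited[1])
```

In this toy case the singleton count n1 is positive over the full vocabulary and drops to zero once only the N most frequent words are kept, which is the situation that triggers the GT warning.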

--Andreas

In message <499960.43881.qm at web31612.mail.mud.yahoo.com> you wrote:
> Hi SRILM users,
>  I have the following problem. I want to train a LM
> for a low-resource speech recognizer. Since the
> recognizer can only handle vocabularies with a limited
> size (N), I first must fix my vocabulary to only
> contain N most frequently occurring words from the
> training text. However, since all such words occur
> more than once in the training corpus, it seems to
> disables me from using the discounting schemes which
> rely on singleton counts. 
> 
> For GT discounting, ngram-count gives a warning on
> no-singletons in the training data, for KN no warning
> was printed, however, I guess the KN discounting is
> affected by the no-singletons as well. Ngram-count
> also has an option "-knn knfile" to calculate
> smoothing parameters using an unlimited vocabulary in
> advance, however, I guess this does not entirely solve
> this problem... Is it true?
> 
>  Is there a way how to bypass this problem using SRILM
> or do I have to use another (generally inferior)
> discounting scheme such as Witten-Bell (at least for
> counts of order 1)?

> Thanks for help,
>  Mats


From save.climate at gmail.com  Tue Jul 10 07:47:59 2007
From: save.climate at gmail.com (Kamadev Bhanuprasad)
Date: Tue, 10 Jul 2007 16:47:59 +0200
Subject: Estimating mixture LM weights using SRILM
Message-ID: <244d59a50707100747w39e0bcfbt1de04db2dc851f82@mail.gmail.com>

Hi,
 is there an SRILM tool to estimate the weights of a mixture LM (from several
separate LMs in ARPA format) using held-out data? I guess that similar
algorithms (EM, Powell search) are implemented in several places in SRILM, but I
haven't found an SRILM tool that implements this very common
task.

Thanks
 Kama

From shuet at irisa.fr  Tue Jul 10 08:12:40 2007
From: shuet at irisa.fr (Stephane Huet)
Date: Tue, 10 Jul 2007 17:12:40 +0200
Subject: lattice-tool to rescore a lattice
Message-ID: <4693A1E8.2020502@irisa.fr>

Hi,

When rescoring lattices in HTK format with a 4-gram LM using the following 
options:
   lattice-tool -in-lattice <IN> -read-htk -out-lattice <OUT> -write-htk -order 4 -lm <LM>
I noticed that the LM scores written to the output lattice were 
sometimes different from what I expected.

In the lattice-tool man page (I use SRILM 1.5.0), I read that the 
default lattice expansion algorithm is "General LM expansion", which 
expands the lattice "without use of backoff transitions". Does this mean 
that during LM expansion, no backoff is taken into account in the LM 
probabilities?
However, while looking through the source code, I noticed that the 
following line of the Lattice::expandAddTransition function in LatticeExpand.cc:
   transProb += lm.contextBOW(usedContext, usedLength);
does take backoff into account.

To get the expected linguistic scores when expanding the lattice with the 
LM, I commented out the line above and took the LM 
back-off into account by modifying the lm.contextID(nextWord, usedContext, 
usedLength2) call in the Lattice::expandNodeToLM function of 
LatticeExpand.cc. Indeed, I noticed that when lm.contextID returns the 
LM order instead of its original value, the context of the 
conditional probability is no longer truncated and the LM scores of 
the output lattices are consistent with what I expected.

There may be options for rescoring lattices with an LM that I have not 
understood, but the LM scores produced by lattice-tool seem strange to me. 
I can send the files I modified if you want.

Regards,

Stéphane


From stolcke at speech.sri.com  Tue Jul 10 08:43:52 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 10 Jul 2007 08:43:52 PDT
Subject: Estimating mixture LM weights using SRILM 
In-Reply-To: Your message of Tue, 10 Jul 2007 16:47:59 +0200.
             <244d59a50707100747w39e0bcfbt1de04db2dc851f82@mail.gmail.com> 
Message-ID: <200707101543.l6AFhq418932@huge>


Check the ppl-scripts(1) man page, and specifically the 
"compute-best-mix" command.
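
For readers unfamiliar with the underlying idea, the following is a minimal Python sketch of the standard EM update for linear interpolation weights. It is not the compute-best-mix script and makes no claim about its internals; the function name and the held-out probabilities are hypothetical, and each row is assumed to hold the probabilities that the component LMs assign to one held-out word.

```python
def estimate_mixture_weights(probs, iterations=50):
    """EM for linear interpolation weights.

    probs[i][k] is the probability model k assigns to held-out word i
    (all values must be > 0). Returns one weight per model.
    """
    num_models = len(probs[0])
    weights = [1.0 / num_models] * num_models
    for _ in range(iterations):
        expected = [0.0] * num_models
        for word_probs in probs:
            # Probability of this word under the current mixture.
            mix = sum(w * p for w, p in zip(weights, word_probs))
            # E-step: posterior responsibility of each model for this word.
            for k in range(num_models):
                expected[k] += weights[k] * word_probs[k] / mix
        # M-step: new weights are the average responsibilities.
        weights = [e / len(probs) for e in expected]
    return weights

# Hypothetical held-out probabilities for two models (illustration only).
heldout = [(0.20, 0.01), (0.10, 0.02), (0.05, 0.04), (0.01, 0.03)]
weights = estimate_mixture_weights(heldout)
print(weights)
```

The weights stay normalized after every iteration, and the model that explains the held-out words better ends up with the larger weight.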

--Andreas

In message <244d59a50707100747w39e0bcfbt1de04db2dc851f82 at mail.gmail.com> you wrote:
> 
> Hi,
>  is there a SRILM tool to estimate weights of a mixture LM (from several
> separate LMs in ARPA format) using held-out data? I guess that similar
> algorithms (EM, Powell search) are implemented in SRILM many times but I
> haven't found in any SRILM tool implementation of this very common
> particular task.
> 
> Thanks
>  Kama
> 


From shuet at irisa.fr  Wed Jul 11 08:37:01 2007
From: shuet at irisa.fr (Stephane Huet)
Date: Wed, 11 Jul 2007 17:37:01 +0200
Subject: lattice-tool to rescore a lattice
In-Reply-To: <200707102202.l6AM2LU27317@speech.sri.com>
References: <200707102202.l6AM2LU27317@speech.sri.com>
Message-ID: <4694F91D.8070603@irisa.fr>

>What you can verify is that the lattice as a whole assigns the correct
>log probability to a complete path through the lattice.
>For this purpose, the lattice-tool -ppl option allows you to treat the
>lattice as a language model, and you can feed it sentences.
>The -debug 2 option displays scores at the word level.

As you suggested, I compared the output of lattice-tool -ppl for a 
lattice with the output of ngram -ppl. In both cases, I obtained the same 
logprob for the complete sentence. However, the logprobs at the word 
level differ, as I had already noticed in the linguistic 
scores of the HTK lattices.

Here are the results I obtained:

 > ngram -lm <LM> -order 4 -ppl test.ppl -debug 2
appeler les opérateurs marocains
        p( appeler | <s> )      = [2gram] 6.15184e-06 [ -5.211 ]
        p( les | appeler ...)   = [2gram] 0.0759806 [ -1.1193 ]
        p( opérateurs | les ...)        = [2gram] 0.000738462 [ -3.13167 ]
        p( marocains | opérateurs ...)  = [2gram] 0.00450344 [ -2.34646 ]
        p( </s> | marocains ...)        = [2gram] 0.186189 [ -0.730047 ]
1 sentences, 4 words, 0 OOVs
0 zeroprobs, logprob= -12.5385 ppl= 321.879 ppl1= 1363.38

file test.ppl: 1 sentences, 4 words, 0 OOVs
0 zeroprobs, logprob= -12.5385 ppl= 321.879 ppl1= 1363.38


 > lattice-tool -ppl test.ppl -in-lattice <lattice> -read-htk -debug 2 
-order 4
appeler les opérateurs marocains
        p( appeler | <s> )      = [400][405][385][386][390] 5.1187e-06 [ 
-5.29084 ]
        p( les | appeler ...)   = [512][519][531][532][539] 0.0623089 [ 
-1.20545 ]
        p( opérateurs | les ...)        = 
[1120][1121][1122][1067][1068][1069][977][978][979][965][966][967][879][880][881] 
0.00107221 [ -2.96972 ]
        p( marocains | opérateurs ...)  = 
[1123][1124][1125][1126][1127][1128][1129][1130][1131][980][981][982][983][984][985][986][987][988][882][883][884][885][886][887][888][889][890][1070][1071][1072][1073][1074][1075][1076][1077][1078][968][969][970][971][972][973][974][975][976] 
0.00454559 [ -2.34241 ]
        p( </s> | marocains ...)        = 
[1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1] 
0.186188 [ -0.730047 ]
Lattice states: 0 386 532 966 973 1
1 sentences, 4 words, 0 OOVs
0 zeroprobs, logprob= -12.5385 ppl= 321.88 ppl1= 1363.38

The differences might be linked to the way backoffs are taken into 
account in the linguistic scores in the lattice. With the few changes I 
previously made to the source code, the word-level logprobs agree with 
those from ngram:

 > lattice-tool -ppl test.ppl -in-lattice <lattice> -read-htk -debug 2 
-order 4
appeler les opérateurs marocains
        p( appeler | <s> )      = [474][479][459][460][464] 6.15177e-06 
[ -5.211 ]
        p( les | appeler ...)   = [586][593][605][606][613] 0.0759801 [ 
-1.1193 ]
        p( opérateurs | les ...)        = 
[966][967][968][1243][1244][1245][1164][1165][1166][1064][1065][1066][1052][1053][1054] 
0.000738466 [ -3.13167 ]
        p( marocains | opérateurs ...)  = 
[1055][1056][1057][1058][1059][1060][1061][1062][1063][1167][1168][1169][1170][1171][1172][1173][1174][1175][1246][1247][1248][1249][1250][1251][1252][1253][1254][1067][1068][1069][1070][1071][1072][1073][1074][1075][969][970][971][975][976][977][972][973][974] 
0.00450339 [ -2.34646 ]
        p( </s> | marocains ...)        = 
[1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1][1] 
0.186188 [ -0.730047 ]
Lattice states: 0 464 613 967 973 1
1 sentences, 4 words, 0 OOVs
0 zeroprobs, logprob= -12.5385 ppl= 321.881 ppl1= 1363.39


Anyway, what I need are the sentence-level scores from lattice-tool, 
and those are correct.
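
The agreement at the sentence level can be checked by hand. The Python sketch below sums the word-level log10 probabilities copied from the ngram output and from the unmodified lattice-tool output above, then recomputes ppl and ppl1. The formulas assumed here are the ones commonly given for SRILM: ppl = 10^(-logprob / (words - OOVs - zeroprobs + sentences)) and ppl1 = 10^(-logprob / (words - OOVs - zeroprobs)). The function name is made up for illustration.

```python
# Word-level log10 probabilities copied from the two outputs above.
ngram_logprobs = [-5.211, -1.1193, -3.13167, -2.34646, -0.730047]
lattice_logprobs = [-5.29084, -1.20545, -2.96972, -2.34241, -0.730047]

def perplexities(logprob, words, sentences, oovs=0, zeroprobs=0):
    """Return (ppl, ppl1) from a total log10 probability.

    ppl counts one end-of-sentence token per sentence; ppl1 does not.
    """
    scored = words - oovs - zeroprobs
    ppl = 10 ** (-logprob / (scored + sentences))
    ppl1 = 10 ** (-logprob / scored)
    return ppl, ppl1

ngram_total = sum(ngram_logprobs)
lattice_total = sum(lattice_logprobs)
ppl, ppl1 = perplexities(ngram_total, words=4, sentences=1)

print("ngram total logprob:  ", round(ngram_total, 4))
print("lattice total logprob:", round(lattice_total, 4))
print("ppl:", round(ppl, 2), "ppl1:", round(ppl1, 2))
```

Although the individual word scores differ between the two tools, both lists sum to roughly the reported logprob of -12.5385, so the perplexities derived from them agree as well.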

Thanks for your answer.

Stéphane


From brodbd at u.washington.edu  Fri Jul 13 15:26:57 2007
From: brodbd at u.washington.edu (David Brodbeck)
Date: Fri, 13 Jul 2007 15:26:57 -0700
Subject: Test failures on RHEL 5
Message-ID: <AB647B7F-3960-4C52-8E47-F9925C4BA805@u.washington.edu>

I'm trying to build SRILM 1.5.2 on Red Hat Enterprise Linux Server 5.  
The machine type is i686_m64.  Everything builds all right, but the  
tests fail for make-ngram-pfsg, ngram-class, and  
ngram-count-lm-limit-vocab.

make-ngram-pfsg is the most obvious one, so I'll tackle that one  
first.  I get the following in the stderr file:
gawk: /opt/srilm/bin/i686-m64/add-pauses-to-pfsg:22: fatal: Invalid  
collation character: /[[:lower:]-?]/

Has anyone else run into this?  I'm using GNU Awk 3.1.5, and the  
locale is set to en_US.UTF-8.


David Brodbeck
Information Technology Specialist 3
Computational Linguistics
University of Washington



From stolcke at speech.sri.com  Wed Jul 18 13:53:03 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 18 Jul 2007 13:53:03 -0700
Subject: paper on lattice tool 
In-Reply-To: Your message of Wed, 18 Jul 2007 21:40:47 +0100.
             <WuCUZvMQ.1184791247.2874780.jpinto@idiap.ch> 
Message-ID: <200707182053.l6IKr3S18445@speech.sri.com>


In message <WuCUZvMQ.1184791247.2874780.jpinto at idiap.ch> you wrote:
> I am talking about the option compute posteriors. E.g. my options are
> 
>  $lattice_tool -compute-posteriors \
>                   -read-htk -in-lattice $path \
>                   -write-htk -out-lattice ${tmp_lattice_dir}/${file} \
>                   -htk-lmscale 8.3 \
>                   -htk-acscale 1.0 \
>                   -htk-wdpenalty -4.3429 \
>                   -htk-logbase 2.718 \
>                   -posterior-scale 8.3 \

lattice-tool -compute-posteriors computes NODE posterior probabilities,
not WORD posterior probabilities.  In other words, posteriors of the 
same word appearing on different nodes are not summed over.
Therefore, the algorithm is the basic forward-backward (FB) algorithm known
from HMMs, which is explained in many textbooks, as well as in the popular tutorial

L. R. Rabiner and B. H. Juang, An Introduction to Hidden Markov Models,
IEEE ASSP Magazine, 3(1), 4-16, Jan. 1986.

The Wessel et al. paper talks about how to sum over multiple hypotheses
containing the same word in the same "position".
That is something lattice-tool can do also, using word confusion networks
(see the -write-mesh option).
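
The distinction between node and word posteriors can be made concrete with a minimal Python sketch of forward-backward on a toy lattice. It is not lattice-tool's implementation: the node names, weights, and function name are hypothetical, words are assumed to sit on nodes, and weights are kept in the linear domain for readability, whereas real lattices store scaled log scores.

```python
from collections import defaultdict

def node_posteriors(edges, order):
    """Forward-backward over a DAG.

    edges: {node: [(successor, weight), ...]} with linear-domain weights.
    order: all nodes in topological order; first is start, last is end.
    """
    start, end = order[0], order[-1]
    # Forward pass: total weight of all partial paths from start to each node.
    alpha = defaultdict(float)
    alpha[start] = 1.0
    for u in order:
        for v, w in edges.get(u, []):
            alpha[v] += alpha[u] * w
    # Backward pass: total weight of all partial paths from each node to end.
    beta = defaultdict(float)
    beta[end] = 1.0
    for u in reversed(order):
        for v, w in edges.get(u, []):
            beta[u] += w * beta[v]
    total = alpha[end]
    return {n: alpha[n] * beta[n] / total for n in order}

# Toy lattice: two distinct nodes carry the same word "hello".
words = {"n1": "hello", "n2": "hello", "n3": "yellow"}
edges = {
    "start": [("n1", 2.0), ("n2", 1.0), ("n3", 1.0)],
    "n1": [("end", 1.0)],
    "n2": [("end", 1.0)],
    "n3": [("end", 1.0)],
}
post = node_posteriors(edges, ["start", "n1", "n2", "n3", "end"])
for node, word in words.items():
    print(node, word, post[node])
```

The two "hello" nodes keep separate posteriors here; adding them up for the same word in the same position is the additional step that the confusion network construction takes care of.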

Andreas 

> 
> 
> Thanks,
> Joel.
> 
> On 7/18/2007, "Andreas Stolcke" <stolcke at speech.sri.com> wrote:
> 
> >jpinto at idiap.ch wrote:
> >> Dear SRILM user,
> >>
> >> Is there any publication or write up on how exactly the forward-backward
> >> algorithm (for estimation of word posterior probability from a word
> >> lattice) is implemented in lattice-tool ?
> >>
> >What lattice-tool options specifically are you talking about ?
> >
> >Andreas
> >
> >> How similar or different is it from the algorithm described in
> >> "Confidence Measures for Large Vocabulary Continuous Speech
> >> Recognition" by Frank Wessel et al.
> >>
> >> Many thanks,
> >> Joel.
> >>
> >>
> >
> 


From stolcke at speech.sri.com  Thu Jul 19 11:36:47 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 19 Jul 2007 11:36:47 -0700
Subject: Test failures on RHEL 5
In-Reply-To: <AB647B7F-3960-4C52-8E47-F9925C4BA805@u.washington.edu>
References: <AB647B7F-3960-4C52-8E47-F9925C4BA805@u.washington.edu>
Message-ID: <469FAF3F.4070608@speech.sri.com>

David Brodbeck wrote:
> I'm trying to build SRILM 1.5.2 on Redhat Enterprise Linux Server 5.  
> The machine type is i686_m64.  Everything builds all right, but the 
> tests fail for make-ngram-pfsg, ngram-class, and 
> ngram-count-lm-limit-vocab.
>
> make-ngram-pfsg is the most obvious one, so I'll tackle that one 
> first.  I get the following in the stderr file:
> gawk: /opt/srilm/bin/i686-m64/add-pauses-to-pfsg:22: fatal: Invalid 
> collation character: /[[:lower:]-?]/
>
> Has anyone else run into this?  I'm using GNU Awk 3.1.5, and the 
> locale is set to en_US.UTF-8.
This is odd since we're also using gawk 3.1.5 and I cannot replicate the 
problem even when setting LANG to en_US.UTF-8.
It seems that the interpretation of gawk regular expressions should not 
depend on the OS release version, but of course there may always be bugs.

ngram-class is very fickle.  Small changes in the implementation of math 
library functions or machine arithmetic can cause small numerical 
differences and then different clustering decisions as a result. In 
fact, I get different results with 32bit and 64bit Linux binaries, so 
don't worry about that one.

ngram-count-lm-limit-vocab should work. You can send me more details on 
how the output differs.

Andreas


From stolcke at speech.sri.com  Mon Jul 23 12:55:18 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 23 Jul 2007 12:55:18 -0700
Subject: [Fwd: Re: Test failures on RHEL 5  -- fixed!]
Message-ID: <46A507A6.30508@speech.sri.com>


This resolves an issue brought up on this list previously.

Andreas

-------------- next part --------------
An embedded message was scrubbed...
From: David Brodbeck <brodbd at u.washington.edu>
Subject: Re: Test failures on RHEL 5  -- fixed!
Date: Mon, 23 Jul 2007 11:47:32 -0700
Size: 3185
URL: <http://www.speech.sri.com/pipermail/srilm-user/attachments/20070723/8f97c8f9/attachment.eml>

From svmats at yahoo.com  Tue Jul 31 14:43:13 2007
From: svmats at yahoo.com (Mats Svenson)
Date: Tue, 31 Jul 2007 14:43:13 -0700 (PDT)
Subject: tcl and gawk problems when compiling SRIL 1.5.3
Message-ID: <666454.35086.qm@web31602.mail.mud.yahoo.com>

Dear SRILM users,
 I have just tried to compile the current SRILM
version and several problems surprised me.

1)  As to the tcl, the INSTALL file reads that:
"TCL_INCLUDE, TCL_LIBRARY:  to whatever is needed to
find the Tcl header files and library.
    If Tcl is not available, set NO_TCL=X and leave
the above variables empty."

I have an openSUSE system (10.0) with Tcl installed,
but there's no "tcl.h" file present and I couldn't find
an appropriate RPM that provides it. If I use "NO_TCL=X",
will it affect SRILM's functionality? What is Tcl used
for there?

2) As to gawk, the INSTALL file reads that:
Recent versions of gawk may not perform correct
floating-point arithmetic unless either LC_NUMERIC=C
or LC_ALL=C is set in the environment. This affects
many of the scripts in utils/.

Does this mean that with non-standard locales, SRILM
does not work correctly? Does it affect the estimation
of model parameters?

Thanks,
 Mats




From stolcke at speech.sri.com  Tue Jul 31 16:15:56 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 31 Jul 2007 16:15:56 -0700
Subject: tcl and gawk problems when compiling SRIL 1.5.3
In-Reply-To: <666454.35086.qm@web31602.mail.mud.yahoo.com>
References: <666454.35086.qm@web31602.mail.mud.yahoo.com>
Message-ID: <46AFC2AC.4020303@speech.sri.com>

Mats Svenson wrote:
> Dear SRILM users,
>  I have just tried to compile the current SRILM
> version and several problems surprised me.
>
> 1)  As to the tcl, the INSTALL file reads that:
> "TCL_INCLUDE, TCL_LIBRARY:  to whatever is needed to
> find the Tcl header files and library.
>     If Tcl is not available, set NO_TCL=X and leave
> the above variables empty."
>
> I have an openSUSE system (10.0) with tcl installed,
> but there's no "tcl.h" file present and I didn't find
> appropriate rpm to obtain it. If I use "NO_TCL=X",
> will it affect SRILM's functionality? What is tcl good
> for there?
>   
It is only used in some of the development test programs (not needed for 
a regular build and install).
There is no harm in not using it.
> 2) As to gawk, the INSTALL file reads that:
> Recent versions of gawk may not perform correct
> floating-point arithmetic unless either LC_NUMERIC=C
> or LC_ALL=C is set in the environment. This affects
> many of the scripts in utils/.
>
> Does it mean that with non-standard locales, SRILM
> does not work correctly? Does it affect model
> parameters estimation?
>   
One user sent me this observation, so I have no idea how widespread the 
problem is.
SRILM should work fine with exotic locales for almost everything.  The 
only issue is in the gawk scripts that do arithmetic.  You should run the 
test suite to see if it is an issue for you.
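
If the locale does turn out to matter, one way to follow the INSTALL note is to force LC_ALL=C for the child process only. The Python sketch below shows that pattern; the helper name is hypothetical, and the command that is run is a stand-in that merely echoes the variable, where in practice it would be one of the scripts from utils/.

```python
import os
import subprocess
import sys

def run_in_c_locale(command):
    """Run a command with LC_ALL=C, leaving the parent environment untouched."""
    env = dict(os.environ, LC_ALL="C")
    return subprocess.run(command, env=env, capture_output=True,
                          text=True, check=True)

# Stand-in command (assumption): a child Python process that prints LC_ALL.
result = run_in_c_locale(
    [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
)
print(result.stdout.strip())
```

Setting the variable per process avoids changing the locale of the whole login session.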

Andreas


From stolcke at speech.sri.com  Mon Aug  6 10:55:25 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 06 Aug 2007 10:55:25 -0700
Subject: Class-based LM using the SRILM toolkit?
In-Reply-To: <d4929ad00708051130nd386560od20f6c022802f27d@mail.gmail.com>
References: <d4929ad00705211043v78272000odaef19023c4f1e41@mail.gmail.com>	 <200705300356.l4U3u3R26372@huge> <d4929ad00708051130nd386560od20f6c022802f27d@mail.gmail.com>
Message-ID: <46B7608D.6080002@speech.sri.com>

Madhav Krishna wrote:
> Dear Dr. Stolcke,
>
> Thank you for your email. However, we require a little more help. We
> have completed our experiments but have obtained surprising results.
>
> We trained and tested a class-based language model as per your
> instructions. We trained it on 5 training sets drawn from the same
> corpus. These sets were of sizes 300,000 words to 15,000,000 words -
> increasing in steps of 300,000 words. The testing data size was held
> constant at 400,000 sentences. When testing the 5 LMs obtained from
> the training data sets, we observed that the resulting perplexity
> values increased with increase in the size of training data. This is
> indeed contrary to popular findings. In fact, the perplexity values
> obtained were 710, 890, 1150, 1200, 1280.
>
> Could these values have occurred due to my not specifying a vocabulary
> explicitly while training the LMs? I believe that the toolkit adds all
> the words in the training data to the vocabulary by default. But then,
> how does it treat OOVs in the testing set? Also, how does the choice
> of vocabulary affect perplexity?
>   
Indeed, you cannot compare perplexities unless the LM vocabulary is 
constant across models.
That's because a larger vocabulary leads to higher inherent uncertainty 
about the next word.
OOVs and words with zero probability are excluded from the perplexity 
computation, so by fixing the vocabulary you are also fixing the set of 
excluded words, again, making the comparison valid.

So, extract your vocabulary from the smallest or the largest of your 
training sets, and then train all models with -vocab VOCAB.
To handle words properly in the class-based LM you might want to put 
all unseen words in a special class (which you have to construct 
separately from ngram-class and add to the class definition file).
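
Why the vocabulary has to be held constant can be seen from how many test words each model would exclude as OOVs. The Python sketch below uses made-up toy data and a hypothetical helper name; it only counts OOVs and does not train any model.

```python
def oov_count(test_tokens, vocab):
    """Number of test tokens that are not in the vocabulary."""
    return sum(1 for t in test_tokens if t not in vocab)

# Toy data (hypothetical, for illustration only).
small_train = "the cat sat".split()
large_train = "the cat sat on the mat".split()
test = "the dog sat on the mat".split()

# Each model uses the vocabulary of its own training set:
own_small = oov_count(test, set(small_train))
own_large = oov_count(test, set(large_train))

# Both models share one vocabulary, here taken from the smallest set:
fixed_vocab = set(small_train)
fixed_small = oov_count(test, fixed_vocab)
fixed_large = oov_count(test, fixed_vocab)

print("own vocabularies: ", own_small, own_large)
print("fixed vocabulary: ", fixed_small, fixed_large)
```

With per-model vocabularies the two models are scored on different subsets of the test words, so their perplexities are not comparable; with a shared vocabulary the excluded set is identical.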

Andreas

> I would appreciate your help.
>
> Sincerely,
> Madhav Krishna
>
> On 5/30/07, Andreas Stolcke <stolcke at speech.sri.com> wrote:
>   
>>> Dear Dr. Stolcke,
>>>
>>> Thank you once again for your invaluable help.
>>>
>>> I have now developed two LMs using your toolkit - a trigram word-based model
>>> and a class-based model (static models). I now want to interpolate them and
>>> then apply some form of smoothing on the resultant LM. The ngram program in
>>> the toolkit has a -mix-lm option which allows linear interpolation; the
>>> manpages for that option mention:
>>>
>>> "*NOTE: *Unless *-bayes *(see below) is specified, *-mix-lm *triggers a
>>> static interpolation of the models in memory. In most cases a more
>>> efficient, dynamic interpolation is sufficient, requested by *-bayes
>>> 0*.**Also, mixing models of different type (
>>> e.g., word-based and class-based) will *only *work correctly with dynamic
>>> interpolation."
>>>
>>> What is dynamic interpolation? Is it applicable in my case? Can
>>>       
>> Dynamic interpolation means that the probabilities of the interpolated model
>> are computed on-the-fly, at test time.
>> Static interpolation, by contrast, means that a single model is created
>> ahead of testing, containing the interpolated probabilities in the
>> usual backoff format.  This is only possible for models of the same type,
>> as explained in the note above.
>>
>>     
>>> mixing/interpolation of these models be performed only with the -dynamic
>>> option? In that case, how?
>>>       
>> The -dynamic option has nothing to do with dynamic interpolation of the
>> kind we are discussing here.
>> Dynamic interpolation is enabled by the -bayes option.
>>
>>     
>>> Also, what is the -bayes interpolation method about? The manpages say for
>>> the -bayes option:
>>> "Interpolate the second and the main model using posterior probabilities for
>>> local N-gram-contexts of length *length*."
>>> What are you referring to by "N-gram contexts"? Are only the posterior
>>> probabilities interpolated here? If possible, please provide me with a link
>>> to a reference text etc. where I can learn more about this.
>>>       
>> For an explanation of Bayesian interpolation please consult the technical
>> report cited at the bottom of the ngram(1) man page.  You can get it at
>> http://www.speech.sri.com/cgi-bin/run-distill?papers/lm95-report.ps.gz
>> then check Section 2.3.
>>
>> Andreas
>>
>>
>>     
>
>
>   


From janeklwb at yahoo.com.cn  Wed Aug  8 17:47:35 2007
From: janeklwb at yahoo.com.cn (jane)
Date: Thu, 9 Aug 2007 08:47:35 +0800 (CST)
Subject: question about lattice tool
Message-ID: <808074.51076.qm@web15703.mail.cnb.yahoo.com>

hi,

I am trying to use lattice-tool.exe to construct a confusion
network, but I don't know how to use the option
"-init-mesh file". Could you please give me
some hints on this?

Thanks in advance!

Jane
2007-8-9




From stolcke at speech.sri.com  Wed Aug  8 20:57:44 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 08 Aug 2007 20:57:44 -0700
Subject: question about lattice tool
In-Reply-To: <808074.51076.qm@web15703.mail.cnb.yahoo.com>
References: <808074.51076.qm@web15703.mail.cnb.yahoo.com>
Message-ID: <46BA90B8.2010208@speech.sri.com>

jane wrote:
> hi,
>
> I try to use lattice-tool.exe construct a confusion
> network, but I don't know how to use the option
> "-init-mesh file", Would you please possibly give me
> some clue about the problem?
>   
You don't need to use -init-mesh at all. It is used to align a lattice
to a preexisting confusion network.
But usually you just build a CN from scratch using only the lattice.

lattice-tool -in-lattice INPUT -write-mesh OUTPUT (other options)

Andreas

> Thanks in advance!
>
> Jane
> 2007-8-9


From bond at fgan.de  Fri Aug 24 06:47:38 2007
From: bond at fgan.de (Christine de Bond)
Date: Fri, 24 Aug 2007 15:47:38 +0200
Subject: Problem installing SRILM
Message-ID: <46CEE17A.602@fgan.de>

Hello,

I am trying to install SRILM on a Suse Linux 10.1 system.
Whenever I type "make World" I get error prompts. (see below)
It seems the libmisc.a is not created, and therefore
ngram, ngram-count, disambig, fngram-count, fngram, hidden-ngram,
multi-ngram, nbest-lattice, nbest-optimize, ngram-class, anti-ngram,
nbest-mix, nbest-pron-score, ngram-merge, segment, segment-nbest
are not being installed.

I am new to Linux, and new to MT. I tried installing SRILM for weeks,
and I don't know what goes wrong.
(I did change the environment variables and adjusted the gcc, g++, perl,
tcl paths, and I followed the install instructions.)
Does anyone have a clue and can give me a hint? Does someone know where
the libmisc.a file should come from?

With kind regards,
Christine de Bond

----------------------------------------------------------------------------------------
...
make[2]: Entering directory `/home/bond/SMTSystem/srilm/lm/src'
/usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit
-DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64  
-I/home/bond/ActiveTcl8.4.15.0. -I. -I../../include   -u matherr
-L../../lib/i686  -g -O3 -o ../bin/i686/ngram ../obj/i686/ngram.o
../obj/i686/liboolm.a -lm -ldl ../../lib/i686/libflm.a
../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a
/home/bond/ActiveTcl8.4.15.0./lib/libtcl8.4.so -lm 2>&1 | c++filt

g++: ../../lib/i686/libmisc.a: No such file or directory

/home/bond/SMTSystem/srilm/sbin/decipher-install 0555 ../bin/i686/ngram
../../bin/i686
ERROR:  File to be installed (../bin/i686/ngram) does not exist.
ERROR:  File to be installed (../bin/i686/ngram) is not a plain file.
Usage:  decipher-install <mode> <file1> ... <fileN> <directory>
        mode:                 file permission mode, in octal
        file1 ... fileN:      files to be installed
        directory:            where the files should be installed

files =  ../bin/i686/ngram
directory =  ../../bin/i686
mode =  0555

make[2]: [../../bin/i686/ngram] Error 1 (ignored)
touch ../../bin/i686/ngram
...
----------------------------------------------------------------------------------------


From stolcke at speech.sri.com  Fri Aug 24 11:10:01 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 24 Aug 2007 11:10:01 -0700
Subject: Problem installing SRILM
In-Reply-To: <46CEE17A.602@fgan.de>
References: <46CEE17A.602@fgan.de>
Message-ID: <46CF1EF9.5040600@speech.sri.com>

Christine de Bond wrote:
> Hello,
>
> I am trying to install SRILM on a Suse Linux 10.1 system.
> Whenever I type "make World" I get error prompts. (see below)
> It seems the libmisc.a is not created, and therefore
> ngram, ngram-count, disambig, fngram-count, fngram, hidden-ngram,
> multi-ngram, nbest-lattice, nbest-optimize, ngram-class, anti-ngram,
> nbest-mix, nbest-pron-score, ngram-merge, segment, segment-nbest
> are not being installed.
>
> I am new to Linux, and new to MT. I tried installing SRILM for weeks,
> and I don't know what goes wrong.
> (I did change the environment variables and adjusted the gcc, g++, perl,
> tcl paths, and I followed the install instructions.)
> Does anyone have a clue and can give me a hint? Does someone know where
> the libmisc.a file should come from?
>
> With kind regards,
> Christine de Bond
>   

My guess is you are having trouble with the Tcl library.  Please rebuild 
everything after editing the common/Makefile.machine.i686 file to contain:
   NO_TCL = X
   TCL_INCLUDE =
   TCL_LIBRARY =

Andreas


> ----------------------------------------------------------------------------------------
> ...
> make[2]: Entering directory `/home/bond/SMTSystem/srilm/lm/src'
> /usr/bin/g++ -mtune=pentium3 -Wreturn-type -Wimplicit
> -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64  
> -I/home/bond/ActiveTcl8.4.15.0. -I. -I../../include   -u matherr
> -L../../lib/i686  -g -O3 -o ../bin/i686/ngram ../obj/i686/ngram.o
> ../obj/i686/liboolm.a -lm -ldl ../../lib/i686/libflm.a
> ../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a
> /home/bond/ActiveTcl8.4.15.0./lib/libtcl8.4.so -lm 2>&1 | c++filt
>
> g++: ../../lib/i686/libmisc.a: No such file or directory
>
> /home/bond/SMTSystem/srilm/sbin/decipher-install 0555 ../bin/i686/ngram
> ../../bin/i686
> ERROR:  File to be installed (../bin/i686/ngram) does not exist.
> ERROR:  File to be installed (../bin/i686/ngram) is not a plain file.
> Usage:  decipher-install <mode> <file1> ... <fileN> <directory>
>         mode:                 file permission mode, in octal
>         file1 ... fileN:      files to be installed
>         directory:            where the files should be installed
>
> files =  ../bin/i686/ngram
> directory =  ../../bin/i686
> mode =  0555
>
> make[2]: [../../bin/i686/ngram] Error 1 (ignored)
> touch ../../bin/i686/ngram
> ...
> ----------------------------------------------------------------------------------------
>
>   


From gelbart at icsi.berkeley.edu  Mon Sep 10 14:11:44 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Mon, 10 Sep 2007 14:11:44 -0700 (PDT)
Subject: SRI LM archives
Message-ID: <Pine.LNX.4.63.0709101410060.31387@lamb.ICSI.Berkeley.EDU>

Hi Andreas,

As I recall, you expressed dismay once that the only way to get at the 
srilm-user archives is to request them by email to majordomo.

Have you seen this script? 
http://www.lunamorena.net/perl/archives.html

It is a Perl CGI script to render majordomo archive files as web 
pages.  Maybe that would do the trick?  If you don't want to run a CGI 
script, I guess it would be easy enough to modify this script into a 
command line tool that would create a static web page for each archive 
file.  An index page with a link to each archive file could be easily 
generated from the list of archive filenames, since the filenames 
encode the month and year.  Let me know if you'd like me to take a 
look at this, since I may have the time to do it.

Similarly, if you are uneasy about exposing list participants' email 
addresses on the public web, I guess it would also be easy enough to 
modify the script to strip out the domain names from email addresses. 
Again, I might have the time to do it.

Regards,
David


From deliverable at gmail.com  Tue Sep 11 08:11:30 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Tue, 11 Sep 2007 17:11:30 +0200
Subject: memory-resident LMs for ngram?
Message-ID: <3E6206DB-6143-4320-8EDE-87F48C790426@gmail.com>

Is there an easy way to make ngram load an LM into memory and become  
a server of perplexity scores for sentences?

Cheers,
Alexy


From stolcke at speech.sri.com  Tue Sep 11 14:12:16 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 11 Sep 2007 14:12:16 PDT
Subject: memory-resident LMs for ngram? 
In-Reply-To: Your message of Tue, 11 Sep 2007 17:11:30 +0200.
             <3E6206DB-6143-4320-8EDE-87F48C790426@gmail.com> 
Message-ID: <200709112112.l8BLCGB03079@huge>


In message <3E6206DB-6143-4320-8EDE-87F48C790426 at gmail.com> you wrote:
> Is there an easy way to make ngram load an LM into memory and become  
> a server of perplexity scores for sentences?

It should be easy using the existing functionality.

Write a wrapper script (shell, perl, whatever) that 

- invokes
		ngram -lm LM ... -ppl -  -debug 2 
- reads input sentences from some defined place and writes them to the 
	std input of ngram (above)
- reads the std output of ngram and reformats it into whatever format is 
	suitable

Using this approach, ngram is invoked only once and the LM is read only once.
It will terminate after its std input is closed or it sees end-of-file.
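
A minimal Python sketch of such a wrapper, for illustration only.  It assumes
that ngram is on the PATH and that each statistics line it prints contains
"logprob= X ppl= Y" fields; the helper names (build_command, parse_ppl,
score_batch) are made up here.  It sends a whole batch of sentences at once
rather than interleaving reads and writes, because output buffering may
delay replies in a truly interactive setup.

```python
import re
import subprocess

# Matches the "logprob= X ppl= Y" fields assumed to appear in ngram's statistics lines.
PPL_RE = re.compile(r"logprob= (\S+) ppl= (\S+)")


def build_command(lm_path):
    """Command line from the message above: read text from stdin, debug level 2."""
    return ["ngram", "-lm", lm_path, "-ppl", "-", "-debug", "2"]


def _to_float(text):
    """Convert a field to float, or None if it is not numeric (e.g. "undefined")."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_ppl(output):
    """Return (logprob, ppl) pairs for every statistics line found in the output."""
    return [(_to_float(lp), _to_float(ppl)) for lp, ppl in PPL_RE.findall(output)]


def score_batch(lm_path, sentences):
    """Start ngram once, feed it all sentences on stdin, and parse what it prints."""
    result = subprocess.run(
        build_command(lm_path),
        input="\n".join(sentences) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ppl(result.stdout)
```

A call such as score_batch("my.lm", ["hello world"]) would load the LM once
per batch; "my.lm" is a placeholder file name.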

Andreas 


From gelbart at icsi.berkeley.edu  Tue Sep 11 18:22:29 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Tue, 11 Sep 2007 18:22:29 -0700 (PDT)
Subject: SRI LM archives
In-Reply-To: <Pine.LNX.4.63.0709101410060.31387@lamb.ICSI.Berkeley.EDU>
References: <Pine.LNX.4.63.0709101410060.31387@lamb.ICSI.Berkeley.EDU>
Message-ID: <Pine.LNX.4.63.0709111813150.8294@lamb.ICSI.Berkeley.EDU>

Dear srilm-user,

I have placed a copy of the srilm-user archives online at

   http://www.icsi.berkeley.edu/~gelbart/tmp/srilm-user-www/

Please let me know if you notice any problems with the online archives 
(the code I used to convert was mostly written today for this 
purpose).  I have checked several months of messages and I haven't 
noticed any problems so far.

The above URL is just a temporary location until Andreas sets up the 
archives at SRI.  So I don't plan to keep the archives at that URL 
up-to-date in the future.

Regards,
David


From deliverable at gmail.com  Wed Sep 12 08:50:50 2007
From: deliverable at gmail.com (Alexy Khrabrov)
Date: Wed, 12 Sep 2007 17:50:50 +0200
Subject: unicode & many files
Message-ID: <5591C830-E686-4E13-81AB-925A9E8B6140@gmail.com>

How good is the unicode support -- e.g. for utf8?  I've fed it some  
utf8 Cyrillics and it did fine.  How does it know we're using  
multibyte or single byte characters?

Another question -- how do I feed many text files from a directory,  
should I do multiple -text options after cooking them somehow, or use  
-read on an accumulating count file?

Cheers,
Alexy


From stolcke at speech.sri.com  Wed Sep 12 10:07:06 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 12 Sep 2007 10:07:06 -0700
Subject: unicode & many files
In-Reply-To: <5591C830-E686-4E13-81AB-925A9E8B6140@gmail.com>
References: <5591C830-E686-4E13-81AB-925A9E8B6140@gmail.com>
Message-ID: <46E81CBA.8050205@speech.sri.com>

Alexy Khrabrov wrote:
> How good is the unicode support -- e.g. for utf8?  I've fed it some 
> utf8 Cyrillics and it did fine.  How does it know we're using 
> multibyte or single byte characters?
SRILM is oblivious to character sets.  It uses whitespace to delimit 
words, but doesn't analyze them further.  As long as words are separated 
by ASCII whitespace most functions will work with any character set.

An exception to the above is the lower-case mapping enabled by the 
-tolower option of various tools.  This requires that your operating 
system knows how to map characters to lowercase via the tolower() 
library function.  This will interact with the locale setting which is 
typically controlled by environment variables.  But again, this is all 
outside SRILM, it's implemented by the OS and C library functions.
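
As an illustration of why this works for UTF-8 in particular: every byte of a
multibyte UTF-8 sequence has its high bit set, so it can never be mistaken
for an ASCII space, tab, or newline.  The Python lines below only mimic the
idea on raw bytes; they are not SRILM's tokenizer.

```python
# Split raw UTF-8 bytes on ASCII whitespace, without decoding them first.
line = "привет  мир\tworld\n".encode("utf-8")
tokens = line.split()  # bytes.split() with no argument splits on ASCII whitespace

# The Cyrillic words come back intact because none of their bytes is below 0x80.
decoded = [t.decode("utf-8") for t in tokens]
all_high = all(b >= 0x80 for b in "привет".encode("utf-8"))
```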
>
> Another question -- how do I feed many text files from a directory, 
> should I do multiple -text options after cooking them somehow, or use 
> -read on an accumulating count file?
You use Unix tools:  

    cat foo/file.* | ngram-count -text - ...

or

    find directory -type f (other options to select the right files) |
        xargs cat | ngram-count -text - ...

Creating separate count files and then cat-ing them together is also an 
option.
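
If a script is preferred over a shell pipeline, the same idea can be sketched
in Python.  This is only an illustration: the function names are made up,
and the extra arguments (-order 3 -write counts.txt, with a placeholder
output file name) are just one possible choice.

```python
import shutil
import subprocess
from pathlib import Path


def list_text_files(directory, pattern="*"):
    """Regular files under `directory` matching `pattern`, in sorted order."""
    return sorted(p for p in Path(directory).rglob(pattern) if p.is_file())


def count_ngrams(directory, extra_args=("-order", "3", "-write", "counts.txt")):
    """Stream all files into one ngram-count process, like `cat ... | ngram-count -text -`."""
    cmd = ["ngram-count", "-text", "-", *extra_args]
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    for path in list_text_files(directory):
        with open(path, "rb") as f:
            shutil.copyfileobj(f, proc.stdin)  # copy the file's bytes unchanged
    proc.stdin.close()  # end-of-file tells ngram-count that the input is complete
    return proc.wait()
```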

Andreas


From stolcke at speech.sri.com  Wed Sep 12 10:01:55 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 12 Sep 2007 10:01:55 -0700
Subject: SRI LM archives
In-Reply-To: <Pine.LNX.4.63.0709111813150.8294@lamb.ICSI.Berkeley.EDU>
References: <Pine.LNX.4.63.0709101410060.31387@lamb.ICSI.Berkeley.EDU> <Pine.LNX.4.63.0709111813150.8294@lamb.ICSI.Berkeley.EDU>
Message-ID: <46E81B83.7050603@speech.sri.com>

David Gelbart wrote:
> Dear srilm-user,
>
> I have placed a copy of the srilm-user archives online at
>
>   http://www.icsi.berkeley.edu/~gelbart/tmp/srilm-user-www/
>
> Please let me know if you notice any problems with the online archives 
> (the code I used to convert was mostly written today for this 
> purpose).  I have checked several months of messages and I haven't 
> noticed any problems so far.
>
> The above URL is just a temporary location until Andreas sets up the 
> archives at SRI.  So I don't plan to keep the archives at that URL 
> up-to-date in the future.
Thanks very much for doing this, David!

The srilm-user archive is now hosted at SRI in

    http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/

with a link from the SRILM home page.  I also added a search function.

Note that this also makes it possible, which it wasn't before, for people who 
are not subscribed to srilm-user to read the contributions to the list.

Andreas


From gelbart at icsi.berkeley.edu  Fri Sep 14 15:50:55 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Fri, 14 Sep 2007 15:50:55 -0700 (PDT)
Subject: nbest-rover-acoustic test failing
In-Reply-To: <46E81B83.7050603@speech.sri.com>
References: <Pine.LNX.4.63.0709101410060.31387@lamb.ICSI.Berkeley.EDU>
 <Pine.LNX.4.63.0709111813150.8294@lamb.ICSI.Berkeley.EDU>
 <46E81B83.7050603@speech.sri.com>
Message-ID: <Pine.LNX.4.63.0709141536290.31434@lamb.ICSI.Berkeley.EDU>

Hi,

I have built SRILM 1.5.3 under Fedora Core 3 and Ubuntu 6.06. The 
nbest-rover-acoustic test fails for me because stdout differs from the 
reference output.

Below, I have included the beginning of my output and the reference 
output.  On line 10, puh_f and pum_f in the reference output are 
replaced with puh and pum in my output.  On line 16, 0.958381 in the 
reference output is replaced with 0.958202 in my output, and similarly 
for several of the other numbers.  The same kind of differences 
(sometimes missing _f after phone name, and sometimes slightly 
different numbers) continue later on in my output, and there are also 
cases where different words are recognized.  I have placed the full 
outputs at http://www.icsi.berkeley.edu/~gelbart/sriTest.tar

Does anyone have suggestions about what might be causing this? I have 
set LANG=C, LC_NUMERIC=C, and LC_ALL=C.

The beginning of my output, with line numbers:

[root at localhost test]# head -16 output/nbest-rover-acoustic.i686.stdout | cat -n
     1  name sw_40008_A_0015814_0016128
     2  numaligns 16
     3  posterior 1
     4  align 0 *DELETE* 0.999999 uhhuh 1.28131e-06
     5  reference 0 *DELETE*
     6  info 0 uhhuh 0.55 0.54 -401.471 -14.9418 m:hh:pum 3:43:8
     7  align 1 uhhuh 0.998465 um 0.00128149 uh 0.000207527 huh 4.63062e-05 *DELETE* 0
     8  reference 1 *DELETE*
     9  info 1 uhhuh 0.55 0.54 -401.471 -14.9418 m:hh:pum 3:43:8
    10  info 1 um 0.58 0.21 -154.354 -13.163 puh:pum 18:3
    11  info 1 uh 0.58 0.43 -308.707 -17.7878 puh 43
    12  info 1 huh 0.57 0.22 -170.807 -17.7878 hh:pum 12:10
    13  align 2 *DELETE* 1 [laugh] 9.69422e-10
    14  reference 2 *DELETE*
    15  info 2 [laugh] 1.08 0.17 -160.491 -18.1436 lau:lau 14:3
    16  align 3 *DELETE* 0.958202 [mouth] 0.0325037 uhhuh 0.00557467 [laugh] 0.00342208 [noise] 0.000275018 yeah 1.44687e-05 is 4.34731e-06 oh 3.83087e-06 huh 1.56753e-07 @reject@ 1.10134e-12 it 9.69716e-13

The beginning of the reference output, with line numbers:

[root at localhost test]# head -16 reference/nbest-rover-acoustic.stdout | cat -n
     1  name sw_40008_A_0015814_0016128
     2  numaligns 16
     3  posterior 1
     4  align 0 *DELETE* 0.999999 uhhuh 1.28131e-06
     5  reference 0 *DELETE*
     6  info 0 uhhuh 0.55 0.54 -401.471 -14.9418 m:hh:pum 3:43:8
     7  align 1 uhhuh 0.998465 um 0.00128149 uh 0.000207527 huh 4.63062e-05 *DELETE* 0
     8  reference 1 *DELETE*
     9  info 1 uhhuh 0.55 0.54 -401.471 -14.9418 m:hh:pum 3:43:8
    10  info 1 um 0.58 0.21 -154.354 -13.163 puh_f:pum_f 18:3
    11  info 1 uh 0.58 0.43 -308.707 -17.7878 puh 43
    12  info 1 huh 0.57 0.22 -170.807 -17.7878 hh:pum 12:10
    13  align 2 *DELETE* 1 [laugh] 9.69422e-10
    14  reference 2 *DELETE*
    15  info 2 [laugh] 1.08 0.17 -160.491 -18.1436 lau:lau 14:3
    16  align 3 *DELETE* 0.958381 [mouth] 0.0323282 uhhuh 0.00557468 [laugh] 0.00341845 [noise] 0.000274616 yeah 1.44687e-05 is 4.34731e-06 oh 3.83087e-06 huh 1.56753e-07 @reject@ 1.10134e-12 it 9.69716e-13

Thanks,
David


From gelbart at icsi.berkeley.edu  Fri Sep 14 15:59:38 2007
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Fri, 14 Sep 2007 15:59:38 -0700 (PDT)
Subject: nbest-rover-acoustic test failing
In-Reply-To: <Pine.LNX.4.63.0709141536290.31434@lamb.ICSI.Berkeley.EDU>
References: <Pine.LNX.4.63.0709101410060.31387@lamb.ICSI.Berkeley.EDU>
 <Pine.LNX.4.63.0709111813150.8294@lamb.ICSI.Berkeley.EDU>
 <46E81B83.7050603@speech.sri.com> <Pine.LNX.4.63.0709141536290.31434@lamb.ICSI.Berkeley.EDU>
Message-ID: <Pine.LNX.4.63.0709141557290.6942@lamb.ICSI.Berkeley.EDU>

On Fri, 14 Sep 2007, David Gelbart wrote:

> I have built SRILM 1.5.3 under Fedora Core 3 and Ubuntu 6.06. The 
> nbest-rover-acoustic test fails for me because stdout differs from the 
> reference output.

Oops, I should give some more detail.  My CPU is a Pentium 4.  I am 
running these operating systems under VMware, with Windows XP as the 
host.  My gawk is GNU Awk 3.1.5.

Regards,
David


From stolcke at speech.sri.com  Fri Sep 14 20:29:14 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 14 Sep 2007 20:29:14 -0700
Subject: nbest-rover-acoustic test failing
In-Reply-To: <Pine.LNX.4.63.0709141536290.31434@lamb.ICSI.Berkeley.EDU>
References: <Pine.LNX.4.63.0709101410060.31387@lamb.ICSI.Berkeley.EDU> <Pine.LNX.4.63.0709111813150.8294@lamb.ICSI.Berkeley.EDU> <46E81B83.7050603@speech.sri.com> <Pine.LNX.4.63.0709141536290.31434@lamb.ICSI.Berkeley.EDU>
Message-ID: <46EB518A.5030405@speech.sri.com>

David Gelbart wrote:
> Hi,
>
> I have built SRILM 1.5.3 under Fedora Core 3 and Ubuntu 6.06. The 
> nbest-rover-acoustic test fails for me because stdout differs from the 
> reference output.
>
> Below, I have included the beginning of my output and the reference 
> output.  On line 10, puh_f and pum_f in the reference output are 
> replaced with puh and pum in my output.  On line 16, 0.958381 in the 
> reference output is replaced with 0.958202 in my output, and similarly 
> for several of the other numbers.  The same kind of differences 
> (sometimes missing _f after phone name, and sometimes slightly 
> different numbers) continue later on in my output, and there are also 
> cases where different words are recognized.  I have placed the full 
> outputs at http://www.icsi.berkeley.edu/~gelbart/sriTest.tar
>
> Does anyone have suggestions about what might be causing this? I have 
> set LANG=C, LC_NUMERIC=C, and LC_ALL=C.
It's a bug in the reference output.  There was an update to the handling 
of phone labels with diacritics ("_f")  in nbest-rover-acoustic, in 
release 1.5.3, but I never regenerated the reference output for this test.

Your output is in fact correct.  If you want you can download 1.5.4-beta 
and grab the reference output in it.

Andreas

>
> The beginning of my output, with line numbers:
>
> [root at localhost test]# head -16 
> output/nbest-rover-acoustic.i686.stdout | cat -n
>     1  name sw_40008_A_0015814_0016128
>     2  numaligns 16
>     3  posterior 1
>     4  align 0 *DELETE* 0.999999 uhhuh 1.28131e-06
>     5  reference 0 *DELETE*
>     6  info 0 uhhuh 0.55 0.54 -401.471 -14.9418 m:hh:pum 3:43:8
>     7  align 1 uhhuh 0.998465 um 0.00128149 uh 0.000207527 huh 
> 4.63062e-05 *DELETE* 0
>     8  reference 1 *DELETE*
>     9  info 1 uhhuh 0.55 0.54 -401.471 -14.9418 m:hh:pum 3:43:8
>    10  info 1 um 0.58 0.21 -154.354 -13.163 puh:pum 18:3
>    11  info 1 uh 0.58 0.43 -308.707 -17.7878 puh 43
>    12  info 1 huh 0.57 0.22 -170.807 -17.7878 hh:pum 12:10
>    13  align 2 *DELETE* 1 [laugh] 9.69422e-10
>    14  reference 2 *DELETE*
>    15  info 2 [laugh] 1.08 0.17 -160.491 -18.1436 lau:lau 14:3
>    16  align 3 *DELETE* 0.958202 [mouth] 0.0325037 uhhuh 0.00557467 
> [laugh] 0.00342208 [noise] 0.000275018 yeah 1.44687e-05 is 4.34731e-06 
> oh 3.83087e-06 huh 1.56753e-07 @reject@ 1.10134e-12 it 9.69716e-13
>
> The beginning of the reference output, with line numbers:
>
> [root at localhost test]# head -16 reference/nbest-rover-acoustic.stdout 
> | cat -n
>     1  name sw_40008_A_0015814_0016128
>     2  numaligns 16
>     3  posterior 1
>     4  align 0 *DELETE* 0.999999 uhhuh 1.28131e-06
>     5  reference 0 *DELETE*
>     6  info 0 uhhuh 0.55 0.54 -401.471 -14.9418 m:hh:pum 3:43:8
>     7  align 1 uhhuh 0.998465 um 0.00128149 uh 0.000207527 huh 
> 4.63062e-05 *DELETE* 0
>     8  reference 1 *DELETE*
>     9  info 1 uhhuh 0.55 0.54 -401.471 -14.9418 m:hh:pum 3:43:8
>    10  info 1 um 0.58 0.21 -154.354 -13.163 puh_f:pum_f 18:3
>    11  info 1 uh 0.58 0.43 -308.707 -17.7878 puh 43
>    12  info 1 huh 0.57 0.22 -170.807 -17.7878 hh:pum 12:10
>    13  align 2 *DELETE* 1 [laugh] 9.69422e-10
>    14  reference 2 *DELETE*
>    15  info 2 [laugh] 1.08 0.17 -160.491 -18.1436 lau:lau 14:3
>    16  align 3 *DELETE* 0.958381 [mouth] 0.0323282 uhhuh 0.00557468 
> [laugh] 0.00341845 [noise] 0.000274616 yeah 1.44687e-05 is 4.34731e-06 
> oh 3.83087e-06 huh 1.56753e-07 @reject@ 1.10134e-12 it 9.69716e-13
>
> Thanks,
> David


From stolcke at speech.sri.com  Tue Sep 18 09:26:57 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 18 Sep 2007 09:26:57 PDT
Subject: x86_m64 
In-Reply-To: Your message of Tue, 18 Sep 2007 17:13:07 +0100.
             <200709181713.07957.sadafre@computing.dcu.ie> 
Message-ID: <200709181626.l8IGQvh25371@claudio>


In message <200709181713.07957.sadafre at computing.dcu.ie>you wrote:
> Hi,
> 
> I wanted to install SRILM on an x86_m64 SUSE Linux machine. Which of the 
> makefiles is appropriate for, or close to, this platform? I do not see a 
> makefile that corresponds to this machine type.

You can use MACHINE_TYPE=i686-m64 (assuming you have 64-bit gcc installed).

Andreas


From stolcke at speech.sri.com  Thu Sep 20 11:16:21 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 20 Sep 2007 11:16:21 -0700
Subject: srilm toolkit
In-Reply-To: <8CE2D832-8777-4256-BE5D-90A098D10043@ehu.es>
References: <8CE2D832-8777-4256-BE5D-90A098D10043@ehu.es>
Message-ID: <46F2B8F5.4030904@speech.sri.com>

Raquel Justo wrote:
> Dear Dr. Stolcke,
> I have seen in the article "SRILM - AN EXTENSIBLE LANGUAGE MODELING 
> TOOLKIT" that the srilm toolkit deals with class N-gram LMs and that it 
> allows class members to be multiword strings.
> Although I have read the manual pages and seen that the "ngram" 
> command has several options such as "-expand-classes k" and 
> "-expand-exact k" for class expansion, I do not really understand how 
> it works. Would you mind telling me where I could find further 
> information related to this issue?
>
> I am working with class-based LMs and I propose the use of class 
> n-gram LMs (where classes are made up of "multiword" strings or 
> "subsequences of words") in two different ways:
> - In a first approach a multiword string is considered as a new 
> lexical unit generated by joining the words and it is treated as a 
> unique token. (e.g. "san_francisco", P(C_CITY_NAME)*P("san_francisco"| 
> C_CITY_NAME))
> - Instead, in a second approach, the words (taking part in the 
> multiword string) are separately studied and the conditioned 
> probabilities are calculated. Thus, a class n-gram LM is generated on 
> the one hand, and on the other hand a word n-gram LM is generated 
> within each class. (e.g. "san francisco", 
> P(C_CITY_NAME)*P(san|C_CITY_NAME)*P(francisco|san, C_CITY_NAME)).
It looks to me like your second approach is equivalent to the first, 
modulo smoothing effects achieved by the different backing off 
distributions you might use in estimating the component probabilities.
>
> I send in an attached file a paper published in the "IEEE workshop on 
> machine learning and signal processing" explaining better the two 
> approaches.
>
> Does the -expand-classes or the -expand-exact option do something 
> similar to the aforementioned approaches? Or does it adapt the 
> class n-gram LM to a word n-gram LM in which the words take 
> into account the information related to the classes (e.g. 
> P(san#C_CITY_NAME)*P(francisco#C_CITY_NAME|san#C_CITY_NAME))?
Here is a high-level description of what -expand-classes does:

1) generate a list of all word ngrams obtained by replacing the class
   tokens in the given LM with their expansions.
2) for each word ngram thus obtained:
   a) compute the joint probability p of the entire word ngram,
      according to the original class LM
   b) compute the joint probability q of the prefix (excluding the
      last word) of the ngram
   c) compute the conditional ngram probability as p/q.
3) insert the newly generated word ngrams into the original LM and remove
   the class-based ngrams
4) recompute backoff weights (renormalize the model)
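
Step 2 can be illustrated with a toy Python sketch.  The joint probabilities
below are made-up numbers standing in for what the class LM would assign;
the table and the function name are hypothetical and this is not the actual
SRILM code.

```python
# Toy joint probabilities of word ngrams, as a class LM might assign them (made-up numbers).
joint = {
    ("to",): 0.10,
    ("to", "boston"): 0.02,
    ("to", "san"): 0.03,
    ("to", "san", "francisco"): 0.03,
}


def conditional(ngram):
    """Step 2c: P(last word | prefix) = p / q."""
    p = joint[ngram]       # step 2a: joint probability of the whole ngram
    q = joint[ngram[:-1]]  # step 2b: joint probability of its prefix
    return p / q
```

Here conditional(("to", "boston")) comes out at about 0.2, and
conditional(("to", "san", "francisco")) is 1 because in this toy table "san"
is always followed by "francisco".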

Andreas


From stolcke at speech.sri.com  Thu Sep 20 12:22:42 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 20 Sep 2007 12:22:42 PDT
Subject: problem with multiwords splitting and meshes 
In-Reply-To: Your message of Thu, 20 Sep 2007 15:06:56 +0200.
             <F9E77EB0-327F-49BA-A2C2-278247E0A3B5@itc.it> 
Message-ID: <200709201922.l8KJMgc01722@huge>


> 
> Dear Andreas,
> I am working with word-lattices containing multi-words.
> 
> I need to extract meshes from them,
> but I noticed a wrong behavior by just using
> the parameters "-split-multiwords"
> 
> This is due to the fact, I think, that the additional nodes are set  
> with "wrong" timestamps
> (equal to the timestamp of the original endnode) as I can see when  
> saving in htk format instead.

Yes, that is expected.  If you have no sub-word (phone) alignment information,
there is no way to assign time stamps to the components of a multiword.

> 
> This should be solved in version 1.5.3 by means of the parameter
> "-multiword-dictionary".

That's what it was made for.

> Unfortunately I am not able to use it correctly.
> 
> I run the following command
> 
> cat example.lat | lattice-tool -htk-acscale 1 -htk-lmscale 14.766 \
>     -htk-wdpenalty -3 -in-lattice - -read-htk -out-lattice - -write-htk \
>     -split-multiwords -multiword-dictionary multiword.lexicon
> 
> and I got the following error message
> 
> Lattice::splitHTKMultiwordNodes: no pronunciation on multiword node  
> we_will
> 
> I attached a very small (artificial) lattice "example.lat" and a real  
> lattice "example2.lat".
> 
> The file multiword.lexicon contains lines like the following
> we_will w iy | w el
> 
> 
> So I would ask you if you can please help me.
> 
> Specifically, I have some specific questions
> 
> - Is the format of the file with the multiword lexicon correct

Yes.

> - Do I need also the lexicon dictionary? Something like the following?
> we w iy
> will w el

No.

> - Do I miss anything else?

Yes.  Look at the error message: "no pronunciation on multiword node".
If you have no pronunciation information in the original lattice you cannot
infer the alignment of the split multiword.

The pronunciation and phone alignment format for HTK lattices may not be
well documented.   It consists of a string of phone labels and durations 
separated by commas and colons.  In your case, the node for we_will would 
need to look like this:

J=1 S=0 E=1 W=we_will v=3 a=-200 l=-4 d=:w,0.1:iy,0.2:w,0.1:el,0.2:

AND the phone string needs to correspond exactly to an entry in your
multiword dictionary with boundary marker (as it does in this case).

I have no idea how you would get your decoder to output this information.
You might be able to "fake it" by 
(1) looking up the pronunciation variant (3 in this case) in your decoding
dictionary, and (2) making assumptions about the relative durations of the
phones  (you can get the total word duration from the lattice node times).
You would then have to insert properly formatted "d=" fields into the 
lattices before sending the lattice to lattice-tool.
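
A rough Python sketch of step (2), for illustration only.  It assumes the
crudest possible rule, an equal share of the word duration for every phone,
and follows the ":label,duration:" pattern described above; the function
name is made up, and real phone durations will of course differ.

```python
def make_div_field(phones, word_duration):
    """Build a "d=" field, giving every phone an equal share of the word duration."""
    share = word_duration / len(phones)
    parts = "".join(f"{phone},{share:.3f}:" for phone in phones)
    return "d=:" + parts


# Phones of we_will from the multiword dictionary entry, word duration 0.8 s (made up).
field = make_div_field(["w", "iy", "w", "el"], 0.8)
```

The resulting string would still have to be appended to the corresponding
node line of the lattice file.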

> - What happens to the scores of the edge corresponding to the multiword?

All the scores are retained on the first multiword component; the remaining
components get 0 scores (so the total score along the path is unchanged).
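
In Python terms, the rule amounts to something like the following (a sketch
of the described behavior with a made-up function name, not the actual
lattice-tool code):

```python
def split_scores(score, n_components):
    """The first component keeps the full score; the others get 0."""
    return [score] + [0.0] * (n_components - 1)


# Acoustic score -200 on a two-word multiword such as we_will -> we, will.
acoustic = split_scores(-200.0, 2)
```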

> In other words, how can I generate a new lattice with multiwords  
> split over several edges,
> containing "correct" scores and times,  somehow proportional to the  
> "length" of each component word?

If you want to split multiword nodes using a different strategy from what
is described above you can implement it yourself, either as a preprocessing
step or by modifying the function Lattice::splitHTKMultiwordNodes() in
lattice/src/HTKLattice.cc .

Andreas


From stolcke at speech.sri.com  Fri Sep 21 12:53:39 2007
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 21 Sep 2007 12:53:39 PDT
Subject: srilm toolkit 
In-Reply-To: Your message of Fri, 21 Sep 2007 13:16:58 +0200.
             <9F79B713-3894-4455-9179-E29269B4A5EE@ehu.es> 
Message-ID: <200709211953.l8LJrdn11743@huge>


> 
> On 20/09/2007, at 20:16, Andreas Stolcke wrote:
> 
> > Raquel Justo wrote:
> >>
> >> I am working with class-based LMs and I propose the use of class
> >> n-gram LMs (where classes are made up of "multiword" strings or  
> >> "subsequences of words") in two different ways:
> >> - In a first approach a multiword string is considered as a new  
> >> lexical unit generated by joining the words and it is treated as a  
> >> unique token. (e.g. "san_francisco",
> >> P(C_CITY_NAME)*P("san_francisco"| C_CITY_NAME))
> >> - Instead, in a second approach, the words (taking part in the  
> >> multiword string) are separately studied and the conditioned  
> >> probabilities are calculated. Thus, a class n-gram LM is generated  
> >> on the one hand, and on the other hand a word n-gram LM is  
> >> generated within each class. (e.g. "san francisco", P(C_CITY_NAME) 
> >> *P(san|C_CITY_NAME)*P(francisco|san, C_CITY_NAME)).
> > It looks to me like your second approach is equivalent to the  
> > first, modulo smoothing effects achieved by the different backing  
> > off distributions you might use in estimating the component  
> > probabilities.
> 
> I don't know if I have understood very well what you want to say but  
> I think that using backing off smoothing the first approach is  
> different from the second one because different combination of all  
> the words belonging to a class are allowed and in the second approach  
> instead, only the considered subsequences of words are allowed  
> because they are treated as unigrams inside each class. I think that  
> even when no smoothing is considered the first approach can  
> generalize better due to the fact that n-gram models themselves  
> generalize on the training data.

You are right.  That's actually what I meant by "different backing off".

> >>
> >> I send in an attached file a paper published in the "IEEE workshop  
> >> on machine learning and signal processing" explaining better the  
> >> two approaches.
> >>
> >> Does the -expand-classes or the -expand-exact option do something  
> >> similar to the aforementioned approaches? Or does it adapt the  
> >> class n-gram LM to a word n-gram LM in which the words  
> >> take into account the information related to the classes (e.g.
> >> P(san#C_CITY_NAME)*P(francisco#C_CITY_NAME|san#C_CITY_NAME))?
> > Here is a high-level description of what -expand-classes does:
> >
> > 1) generate a list of all word ngrams obtained by replacing the  
> > class tokens in the given LM.
> > 2) for each word ngram thus obtained:
> >          a) compute the joint probability p of the entire word  
> > ngram, according to the original class LM
> 
> Would you mind telling me how you compute this probability when  
> multiwords are considered?
> do you consider the multiword as a unique token or do you estimate  
> the conditional probabilities between the words that make up the  
> multiword?

Are you talking about multiwords that are joined by underscores 
(as handled by the -multiwords option)?  In that case there is no
special processing for them in ngram -expand-classes.  The class mechanism
treats multiwords as regular word tokens.

If you are asking about class expansions that contain multiple words 
separated by spaces (e.g. CITY -> San Francisco)  then the answer is that
the expansion algorithm deals with them just fine.  The algorithm I outlined
above handles this case quite naturally.

I forgot to mention one feature of the expansion algorithm:
If the same word ngram can be generated by expanding different class ngrams
then the corresponding joint probabilities are added, as they should be.
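
As a toy Python illustration of this summation (made-up numbers and
hypothetical class names CITY and NAME, not the SRILM implementation):

```python
from collections import defaultdict

# (word ngram, joint probability) pairs produced by expanding class ngrams.
# Two different class ngrams are assumed to yield the same word ngram here.
expanded = [
    (("to", "paris"), 0.010),   # e.g. from "to CITY"
    (("to", "paris"), 0.002),   # e.g. from "to NAME"
    (("to", "boston"), 0.020),  # e.g. from "to CITY"
]


def merge_joint(pairs):
    """Add up the joint probabilities of identical word ngrams."""
    total = defaultdict(float)
    for ngram, prob in pairs:
        total[ngram] += prob
    return dict(total)


merged = merge_joint(expanded)
```

After merging, ("to", "paris") carries the sum of both contributions, and
that sum is what enters the p/q computation.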

Andreas 


From oatgnaw at gmail.com  Wed Sep 26 05:20:09 2007
From: oatgnaw at gmail.com (=?GB2312?B?zfXMzg==?=)
Date: Wed, 26 Sep 2007 20:20:09 +0800
Subject: some questions about confusion network
Message-ID: <46fa4e78.07ec720a.1112.415a@mx.google.com>

Hi,
I've tried to use SRILM to construct confusion networks, but I ran into some problems.

1. Command: lattice-tool -in-lattice INPUT -write-mesh OUTPUT
   I used this command to construct a confusion network; the input lattice was in PFSG format.
   However, I didn't get the right answer. I wonder whether the problem lies in the probability of each transition.
   What is the meaning of this probability, and how is it calculated?

2. I can't use SRILM to directly convert a lattice in wlat format to a confusion network. Only lattices in PFSG format can be converted to confusion networks. Right?

3. No time information is used in the conversion. Right?

4. How do I combine two confusion networks into a big one?
   nbest-lattice -use-mesh -lattice-files mesh.filelist -write mesh.output


	Thank you very much.