From ma.farajian at gmail.com Tue Oct 5 22:18:47 2010
From: ma.farajian at gmail.com (amin farajian)
Date: Wed, 6 Oct 2010 08:48:47 +0330
Subject: [SRILM User List] problem in installing SRILM

Hi all,

I am trying to install SRILM on my machine (i486 with Debian). Following the instructions in the INSTALL file, I changed the SRILM variable in the top-level Makefile and the CC and CXX variables in Makefile.machine.i686. I also added NO_TCL=X to this file.
But while trying to install SRILM (with the command "make World"), I ran into these problems:

ar: creating ../obj/i686/libmisc.a
ar: creating ../obj/i686/libdstruct.a
make[2]: [/home/amin/MT/srilm//bin/i686/maxalloc] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/ngram] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/ngram-count] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/ngram-merge] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/ngram-class] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/disambig] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/anti-ngram] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/nbest-lattice] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/nbest-mix] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/nbest-optimize] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/nbest-pron-score] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/segment] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/segment-nbest] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/hidden-ngram] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/multi-ngram] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/fngram-count] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/fngram] Error 1 (ignored)
make[2]: [/home/amin/MT/srilm//bin/i686/lattice-tool] Error 1 (ignored)

What is wrong? Since it passes the dependency-checking stage, I don't think the problem is due to missing libraries. All messages from the installation are saved in a file, which is attached.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: log.rtf
Type: application/rtf
Size: 175210 bytes

From stolcke at speech.sri.com Tue Oct 5 22:30:33 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 05 Oct 2010 23:30:33 -0600
Subject: [SRILM User List] problem in installing SRILM
Message-ID: <4CAC0979.7060005@speech.sri.com>

amin farajian wrote:
> I am trying to install SRILM on my machine (i486 with Debian). Following
> the instructions in the INSTALL file, I changed the SRILM variable in the
> top-level Makefile and the CC and CXX variables in Makefile.machine.i686.
> I also added NO_TCL=X to this file. But while trying to install SRILM
> (with the command "make World"), I ran into these problems: [...]

Your log file (in the original post) shows that you are still trying to link with -ltcl. The INSTALL file says:

> TCL_INCLUDE, to whatever is needed to find the Tcl header
> files and library. If Tcl is not available, set NO_TCL=X
> and leave the above variables empty.

So you probably forgot to put

    TCL_LIBRARY =

in Makefile.machine.i686.

Andreas
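For reference, a minimal sketch of the relevant lines in Makefile.machine.i686 when building without Tcl (variable names as used in the INSTALL instructions quoted above; adjust to your setup):

    # build without Tcl support
    NO_TCL = X
    # leave these empty so -ltcl is never passed to the linker
    TCL_INCLUDE =
    TCL_LIBRARY =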
From nakul777 at gmail.com Thu Oct 7 04:37:54 2010
From: nakul777 at gmail.com (nakul sharma)
Date: Thu, 7 Oct 2010 17:07:54 +0530
Subject: [SRILM User List] installing SRILM

Hi all,

I am installing the SRILM software on Ubuntu 10.04 (updated, i686 arch). Going by the INSTALL file, I have set the SRILM variable in the Makefile accordingly and the CC and CXX variables in Makefile.machine.i686, and I have changed NO_TCL to X. It shows the following errors after running the make World command:

make: /sbin/machine-type: Command not found
mkdir include lib bin
mkdir: cannot create directory `include': File exists
mkdir: cannot create directory `lib': File exists
mkdir: cannot create directory `bin': File exists
make: [dirs] Error 1 (ignored)
make init
make[1]: /sbin/machine-type: Command not found
make[1]: Entering directory `/home/nakul/Desktop/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
        (cd $subdir/src; make SRILM= MACHINE_TYPE= OPTION= MAKE_PIC= init) || exit 1; \
done
make[2]: Entering directory `/home/nakul/Desktop/srilm/misc/src'
Makefile:24: /common/Makefile.common.variables: No such file or directory
Makefile:139: /common/Makefile.common.targets: No such file or directory
make[2]: *** No rule to make target `/common/Makefile.common.targets'.  Stop.
make[2]: Leaving directory `/home/nakul/Desktop/srilm/misc/src'
make[1]: *** [init] Error 1
make[1]: Leaving directory `/home/nakul/Desktop/srilm'
make: *** [World] Error 2

There seems to be some dependency problem. Please tell me what should be done.

From stolcke at speech.sri.com Thu Oct 7 09:48:17 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 07 Oct 2010 10:48:17 -0600
Subject: [SRILM User List] installing SRILM
Message-ID: <4CADF9D1.90207@speech.sri.com>

nakul sharma wrote:
> It shows the following errors after running the make World command:
>
> make: /sbin/machine-type: Command not found

This is a FAQ: you don't have tcsh/csh installed, hence the above scripts won't run. tcsh is an optional package on Ubuntu and Cygwin systems.

Andreas
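On Ubuntu, for example, the missing shell can be installed with (standard package name, assuming the usual repositories are enabled):

    sudo apt-get install tcsh

after which $SRILM/sbin/machine-type should run and print the machine type (i686 here).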
From david at unizar.es Thu Oct 14 04:27:39 2010
From: david at unizar.es (david at unizar.es)
Date: Thu, 14 Oct 2010 13:27:39 +0200
Subject: [SRILM User List] nbest-posterior
Message-ID: <20101014132739.2u6lkfow8gocc04o@webmail.unizar.es>

Hi,

I am using nbest-posteriors to process a list of n-best hypotheses. For each one I get the log10 of the acoustic score, and if I convert it to a linear probability, the sum over all hypotheses in each file is always one. So they form the posterior distribution over the chosen number of n-best hypotheses.

Is there a way to obtain just a likelihood, i.e. a score for each hypothesis that does not sum to one over all of them?

Many thanks,
DAVID.

From stolcke at speech.sri.com Thu Oct 14 09:23:21 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 14 Oct 2010 09:23:21 -0700
Subject: [SRILM User List] nbest-posterior
In-Reply-To: <20101014132739.2u6lkfow8gocc04o@webmail.unizar.es>
Message-ID: <4CB72E79.404@speech.sri.com>

david at unizar.es wrote:
> Is there a way to obtain just a likelihood, a score for each hypothesis
> that does not sum to one over all of them?

Just use the original nbest scores then (and add acoustic and LM scores according to whatever weighting you want to apply).

Andreas
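As a concrete sketch of that weighting (log10 scores; the three-column layout assumed here, acoustic score, LM score, word count, then the words, follows the simple SRILM n-best format, and the weight values are placeholders to be tuned for your task):

    # combined score = acoustic + lmw * LM + wtw * #words
    awk -v lmw=8 -v wtw=0 '{ print $1 + lmw * $2 + wtw * $3 }' hyps.nbest

This mirrors the usual n-best rescoring convention of a language model weight plus a word (insertion) penalty.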
From vagarwal at mit.edu Fri Oct 15 10:14:13 2010
From: vagarwal at mit.edu (Vikram Agarwal)
Date: Fri, 15 Oct 2010 13:14:13 -0400
Subject: [SRILM User List] smoothing questions
Message-ID: <4CB88BE5.9020509@mit.edu>

Hello, I am new to SRILM and just have a few questions that I'd greatly appreciate some help on:

1) Trying to use purely ML estimates with no smoothing, I used "ngram-count -cdiscount 0 -order 8 -read counts.txt -lm train.lm", but using this model with ngram -debug 2, I do not get zeroprobs because it automatically backs off. Is there a way to prohibit backoff so that I can retrieve the zeroprob sentences?

2) Suppose I perform: ngram -order 5 -ppl test.txt -lm train.lm. Will I always get the same results if I generated train.lm with ngram-count at order 5 or greater than 5, regardless of which smoothing technique is used and whether backoff/interpolation is employed?

3) My work uses a very small vocabulary (4 letters) but requires smoothing at higher orders (5-8). I read in the FAQ that -ukndiscount -order 7 may be good for modeling OOV words with a letter model. I wonder why ukndiscount was recommended over kndiscount. If kndiscount does not work due to the sparsity of low counts-of-counts, could the extrapolated counts-of-counts generated by "make-big-lm" outperform the ukndiscount method?

Thank you,
Vikram

From stolcke at speech.sri.com Thu Oct 21 10:34:48 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 21 Oct 2010 10:34:48 -0700
Subject: [SRILM User List] [Fwd: Re: Pruning of KN-smoothed models]
Message-ID: <4CC079B8.6010300@speech.sri.com>

This is great. Thanks, Ciprian!

Andreas

-------------- next part --------------
An embedded message was scrubbed...
From: Ciprian Chelba
Subject: Re: Pruning of KN-smoothed models
Date: Wed, 20 Oct 2010 19:50:51 -0700
Size: 8107

From stolcke at speech.sri.com Sun Oct 24 11:05:38 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 24 Oct 2010 11:05:38 -0700
Subject: [SRILM User List] request some question about lattice-tool
Message-ID: <4CC47572.6080207@speech.sri.com>

minglemingle_fight wrote:
> Dear professor,
>
> Sorry to trouble you, and thanks for reading!
>
> I am a student at Zhengzhou University of China; my name is Ming Yin, and
> my major is speech recognition. In recent days I have read a lot of
> articles on confusion networks, and I have done some work with
> lattice-tool. But when I use it to generate the confusion network, there
> is a warning "fail to align 1 word(s), max posterior=4.03825e-013", and
> the 1-best recognition from the CN is worse than 1-pass decoding. I don't
> know why; does the lattice have problems, or is my command wrong?

This message describes a normal condition for confusion network building from lattices, as long as the posterior value printed is small (4e-13 is small). So you did nothing wrong, and you don't have to worry about the message.

Andreas

> my command is:
>
> lattice-tool -in-lattice-list lat_list -read-htk -no-htk-nulls -htk-words-on-nodes -htk-logbase 2.718 -write-mesh-dir out_dir
>
> Thank you for reading this despite your very tight schedule; I look forward to your letter.
> Best wishes!
>
> Yours sincerely, YIN

From ma.farajian at gmail.com Fri Oct 29 06:39:15 2010
From: ma.farajian at gmail.com (amin farajian)
Date: Fri, 29 Oct 2010 17:09:15 +0330
Subject: [SRILM User List] Problem in building a 4gram language model

Hi,

I'm using SRILM to build a 4-gram language model on 570MB of data (about 6,500,000 sentences) using this command:

    tools/srilm/bin/i686/ngram-count -order 4 -interpolate -kndiscount -unk -text work/lm/MonoLing.pe -lm work/lm/Persian4.lm

but even after 24 hours of processing, nothing happens. When I use order 3, the process finishes after 2 minutes and SRILM builds the 3-gram language model; but when I want SRILM to build a 4-gram model on the same file, nothing happens (no output, no message).

What is the problem? How can I fix it?

Bests.

From stolcke at speech.sri.com Fri Oct 29 09:39:15 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 29 Oct 2010 09:39:15 -0700
Subject: [SRILM User List] Problem in building a 4gram language model
In-Reply-To: Your message of Fri, 29 Oct 2010 17:09:15 +0330.
Message-ID: <201010291639.o9TGdGj20857@huge>

Please read the FAQ page (or "man srilm-faq") on the subject of "Large data and memory issues".

--Andreas

In message you wrote:
> but even after 24 hours of processing, nothing happens.
> When I use order 3, the process finishes after 2 minutes and SRILM
> builds the 3-gram language model.
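The FAQ entry in question recommends building large models from merged batch counts rather than directly from text; roughly along these lines (the file and directory names are placeholders, and the exact name of the merged count file depends on the tools' output):

    split -l 500000 work/lm/MonoLing.pe chunk.    # split the corpus
    ls chunk.* > file-list
    make-batch-counts file-list 10 cat counts     # count each batch into ./counts
    merge-batch-counts counts                     # merge into one count file
    make-big-lm -read counts/<merged>.ngrams.gz -name big4 -order 4 \
        -kndiscount -interpolate -unk -lm work/lm/Persian4.lm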
From avneesh.saluja at sv.cmu.edu Wed Nov 10 23:34:54 2010
From: avneesh.saluja at sv.cmu.edu (Avneesh Saluja)
Date: Wed, 10 Nov 2010 23:34:54 -0800
Subject: [SRILM User List] Non-integer weights (-text-has-weights) and KN discounting

Hello SRILM team,

I'm working with your language model, and I'm also using the -text-has-weights feature to provide sentence-level weights for my training corpora. I read in the documentation that modified Kneser-Ney smoothing doesn't support fractional counts (and since I have weights between 0 and 1, I will have fractional counts), so I tried forcing my weights to be integers (by scaling up and rounding), which resulted in a count-of-counts error for modified KN smoothing.

I found this link: http://www-speech.sri.com/pipermail/srilm-user/2006q3/000375.html which discusses a possible reason why this would be the case. So on a whim, I tried using non-scaled, non-rounded weights (i.e. weights between 0 and 1), evaluated using KN discounting, found I didn't get any errors, and got what seems to be a reasonable perplexity for my situation.

My question is: can I trust this number, given that the documentation says fractional counts are only available with absolute or WB discounting? If I can't trust this number, is there any way I can get KN discounting to work in a reliable manner?

Thanks,
Avneesh

From stolcke at speech.sri.com Thu Nov 11 14:54:08 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 11 Nov 2010 14:54:08 -0800
Subject: [SRILM User List] Non-integer weights (-text-has-weights) and KN discounting
Message-ID: <4CDC7410.8030006@speech.sri.com>

Avneesh Saluja wrote:
> My question is: can I trust this number, given that the documentation says
> fractional counts are only available with absolute or WB discounting?

If you read floating-point counts as integers you might just get truncation, unless you have exponent notation (123.4 becomes 123, but 1.23e10 becomes 1!!!), and the results may be sort-of okay.

Are you still collecting counts with -float-counts? You must, because if you use a weight between 0 and 1 without -float-counts you get a weight of 0, and all counts would be zero. You can check this using

% ngram-count -text - -text-has-weights -write -
0.5 a
<s>	0
<s> a	0
<s> a </s>	0
a	0
a </s>	0
</s>	0

but with float counts:

% ngram-count -text - -text-has-weights -write - -float-counts
0.5 a
<s>	0.5
<s> a	0.5
<s> a </s>	0.5
a	0.5
a </s>	0.5
</s>	0.5

In the latter case, if you have sufficient data, the counts will sum up to something > 1 and will then be truncated (see above) when building the LM without -float-counts.

Andreas
From avneesh.saluja at sv.cmu.edu Thu Nov 11 23:28:07 2010
From: avneesh.saluja at sv.cmu.edu (Avneesh Saluja)
Date: Thu, 11 Nov 2010 23:28:07 -0800
Subject: [SRILM User List] Non-integer weights (-text-has-weights) and KN discounting
In-Reply-To: <4CDC7410.8030006@speech.sri.com>

Hi Andreas,

For some reason, -float-counts doesn't seem to work (this is on a file with non-integer sentence weights; since, according to man ngram-count, this works with WB, I tried WB):

$ ~/tools/srilm/bin/i686-m64/ngram-count -order 3 -text training_sets/trainingwithweights.txt -text-has-weights -unk -memuse -wbdiscount -lm weights.lm -float-counts -debug 2
ngram-count: ngram-count.cc:370: int main(int, char**): Assertion `intStats != 0' failed.
Aborted

The code seems OK:

    NgramCounts<FloatCount> *floatStats = !useFloatCounts ? 0 :
        new NgramCounts<FloatCount>(*vocab, order);

    #define USE_STATS(what) (useFloatCounts ? floatStats->what : intStats->what)

    if (useFloatCounts) {
        assert(floatStats != 0);
    } else {
        assert(intStats != 0);
    }

So I'm not sure why I get this error. I haven't found any previous troubleshooting emails related to this.

Thanks for your help,
Avneesh

From stolcke at speech.sri.com Fri Nov 12 12:52:18 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 12 Nov 2010 12:52:18 -0800
Subject: [SRILM User List] Non-integer weights (-text-has-weights) and KN discounting
References: <4CDC7410.8030006@speech.sri.com> <4CDD5A4E.7070205@speech.sri.com>
Message-ID: <4CDDA902.8050602@speech.sri.com>

Avneesh Saluja wrote:
> Hi Andreas, you're right - something is wrong with my 64-bit compilation
> on the current machine - the 32-bit version works fine, and I tried the
> 64-bit version on another machine and it's OK too. Sorry about that, and
> thanks again for your help!

Thanks for verifying that it is not a problem with the code.

Andreas

> On Fri, Nov 12, 2010 at 7:16 AM, Andreas Stolcke wrote:
>> Very strange indeed, since this happens before anything is stored in the
>> counts data structure. Does this happen with a 32-bit binary as well
>> (MACHINE_TYPE=i686)? I cannot replicate the error using either 32- or
>> 64-bit Linux binaries. It could be a compiler problem. What version of
>> gcc are you using?
>>
>> Andreas

From leona at postech.ac.kr Sun Nov 14 15:58:13 2010
From: leona at postech.ac.kr (Hwidong Na)
Date: Mon, 15 Nov 2010 08:58:13 +0900
Subject: [SRILM User List] The backoff weights in the "ngram-count" result
Message-ID: <1289779093.4479.11.camel@pandora>

Hi,

I'm going to utilize the backoff weights given by "ngram-count". According to http://www-speech.sri.com/projects/srilm/manpages/ngram-format.5.html:

> These are optionally followed by the logarithm (base 10)

However, I found that these values range approximately from -5 to +5, which would mean the actual backoff weights range from 10^-5 to 10^+5 instead of from 0.0 to 1.0. What are the actual values of the backoff weights?

--
Hwidong Na
KLE lab, POSTECH, KOREA

From stolcke at speech.sri.com Sun Nov 14 18:30:56 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 14 Nov 2010 18:30:56 -0800
Subject: [SRILM User List] The backoff weights in the "ngram-count" result
In-Reply-To: Your message of Mon, 15 Nov 2010 08:58:13 +0900. <1289779093.4479.11.camel@pandora>
Message-ID: <201011150230.oAF2Uuj08319@huge>

Backoff weights are not probabilities. They can be greater than 1.

--Andreas

From leona at postech.ac.kr Mon Nov 15 00:03:41 2010
From: leona at postech.ac.kr (Hwidong Na)
Date: Mon, 15 Nov 2010 17:03:41 +0900
Subject: [SRILM User List] The backoff weights in the "ngram-count" result
In-Reply-To: <201011150230.oAF2Uuj08319@huge>
Message-ID: <1289808221.7858.12.camel@pandora>

Hi Andreas,

It is confusing to me that backoff weights can be larger than 1. What would be the correct usage of backoff? I used the following model, as (Chen and Goodman, 1998) summarize in the technical report "An Empirical Study of Smoothing Techniques for Language Modeling" (p. 17):

    p_smooth(w_i | w_{i-n+1} ... w_{i-1}) =
          bofw * p_LM(w_i | w_{i-n+1} ... w_{i-1})
        + (1 - bofw) * p_LM(w_i | w_{i-n+2} ... w_{i-1})

where bofw is the backoff weight for the word sequence w_{i-n+1} ... w_{i-1}.

Best regards,
--
Hwidong Na
KLE lab, POSTECH, KOREA
From zeeshankhans at gmail.com Mon Nov 15 08:56:56 2010
From: zeeshankhans at gmail.com (zeeshan khan)
Date: Mon, 15 Nov 2010 17:56:56 +0100
Subject: [SRILM User List] -cache-lambda at 1.0

Hi all,

I am using the SRI toolkit to plot a perplexity curve for a corpus. I am trying to interpolate the main LM with a unigram cache language model based on a history of (let's say) n words. The ngram command provides this via the -cache option to specify the cache size and the -cache-lambda option to specify the interpolation factor. The command looks like this:

    ngram -order 4 -lm <lm file> -cache <cache size> -cache-lambda <interpolation factor> -ppl <test file>

(I have also tried it with -bayes 0; the output is the same.)

The PPL values for some of the interpolation factors (with fixed cache size) are:

    Interpolation factor    PPL
    0.9                     1848.62
    0.999                   93059.1
    0.99999                 4.32174e+06
    1.0                     22.2459

As you can see, the PPL values increase with -cache-lambda from 0.9 up through 0.999 and 0.99999, as expected. But at -cache-lambda = 1.0, the PPL suddenly falls to an extremely low value (from about 4 million at 0.99999 to about 22 at 1.0).

Can you kindly comment on why this happens? Is this behavior at -cache-lambda = 1.0 the result of some error in the way the PPL is calculated by SRILM, or am I missing some options in the command?

Regards,
Zeeshan Khan.

From stolcke at speech.sri.com Mon Nov 15 10:18:35 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 15 Nov 2010 10:18:35 -0800
Subject: [SRILM User List] The backoff weights in the "ngram-count" result
In-Reply-To: <1289808221.7858.12.camel@pandora>
Message-ID: <4CE1797B.2070501@speech.sri.com>

Hwidong Na wrote:
> It is confusing to me that backoff weights can be larger than 1. What
> would be the correct usage of backoff?

In the formula you quote, "bofw" is not the backoff weight; it is the interpolation weight controlling the mixture of higher- and lower-order estimates. It only applies to interpolated smoothing methods (ngram-count -interpolate).

The backoff weight is the \gamma variable in equation (24) on that page. It is computed as described in the 3rd column of the table at the top of p. 17. In fact, the formula given for Katz smoothing will work for all methods (replacing p_katz with the appropriate p-estimate, of course), and it is what is implemented by SRILM.

Andreas
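A worked illustration of how the stored log10 values are used at lookup time (the numbers are made up): suppose the LM has no entry for the trigram "a b c", a bigram entry "b c" with log probability -0.30103, and a backoff weight entry 0.17609 for the history "a b". Then

    log10 p(c | a b) = bow(a b) + log10 p(c | b)
                     = 0.17609 + (-0.30103) = -0.12494

i.e. p(c | a b) = 1.5 * 0.5 = 0.75. The backoff weight 10^0.17609 = 1.5 is greater than 1, yet the resulting probability is perfectly valid.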
From stolcke at speech.sri.com Mon Nov 15 12:10:37 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 15 Nov 2010 12:10:37 -0800
Subject: [SRILM User List] -cache-lambda at 1.0
Message-ID: <4CE193BD.9010804@speech.sri.com>

zeeshan khan wrote:
>     Interpolation factor    PPL
>     0.9                     1848.62
>     0.999                   93059.1
>     0.99999                 4.32174e+06
>     1.0                     22.2459

Check the count of "zeroprob" words. My guess is that your cache LM gives probability 0 to a large number of words (in fact, it does that by design, since it only knows about words that appeared before). When lambda = 1, only few words will have nonzero probability. So the PPL value at 1.0 is really infinity, but ngram -ppl tries to give more useful output by giving you the PPL over the nonzero-probability words and separately reporting the number of zeroprob words.

Andreas
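For reference, the perplexity that ngram -ppl reports is computed from the total log probability with OOV and zero-probability words excluded from the denominator; roughly (treat the exact bookkeeping as an assumption to be checked against the ppl output):

    ppl = 10^(-logprob / (words - OOVs - zeroprobs + sentences))

which is why a flood of zeroprob words can coexist with a small reported PPL value.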
From avneesh.saluja at sv.cmu.edu Thu Nov 18 00:11:17 2010
From: avneesh.saluja at sv.cmu.edu (Avneesh Saluja)
Date: Thu, 18 Nov 2010 00:11:17 -0800
Subject: [SRILM User List] Discrepancies in -text-has-weights

Hi Andreas and SRILM team,

There seem to be some discrepancies when I use the -text-has-weights (and -float-counts) features, or perhaps I'm not understanding them correctly. I ran some experiments to test this. I have a small training set of around 41k sentences and a test set of around 1000 sentences. I ran the following commands:

    ~/tools/srilm/bin/i686/ngram-count -order 3 -text test_baseline.txt -unk -memuse -wbdiscount -gt1min 0 -gt2min 0 -gt3min 0 -lm test_baseline.lm -debug 2

    ~/tools/srilm/bin/i686/ngram -unk -map-unk '<unk>' -lm test_baseline.lm -order 3 > baseline.ppl

I did the same on several training sets with weights prepended to the sentences: 0.1x, 1x, 10x, and 100x (I ran ngram-count on 0.1x with -text-has-weights and -float-counts, and on the others with -text-has-weights only, as those sentences are weighted by integers). I would expect the perplexities in all of these cases to be the same, but instead I got the following results:

            No weights   0.1x      1x        10x       100x
    ppl     61.6418      125.688   61.6418   102.966   102.966

Is there any reason why they are different?

Thanks,
Avneesh

From dianachih at gmail.com Thu Nov 18 09:06:29 2010
From: dianachih at gmail.com (Jie Qi)
Date: Thu, 18 Nov 2010 12:06:29 -0500
Subject: [SRILM User List] make-batch-counts and merge-batch-counts

Hi all and Andreas,

I have a question about the parameters of make-batch-counts and merge-batch-counts. If I have many text files stored in subfolders like 2007/01/01 to 2007/06/31, what is the correct form of file-list, count-dir and start-iter? Can someone give me an example? Thanks!

    make-batch-counts file-list \
        [ batch-size [ filter [ count-dir [ options ... ] ] ] ]
    merge-batch-counts [ -float-counts ] [ -l N ] count-dir [ file-list | start-iter ]

Here is my attempt:

    >> make-batch-counts file-list 10 2007/01/01/*.txt
    >> merge-batch-counts 2007/01/01/*.txt

-Diana

From stolcke at speech.sri.com Thu Nov 18 14:43:05 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 18 Nov 2010 14:43:05 -0800
Subject: [SRILM User List] Discrepancies in -text-has-weights
Message-ID: <4CE5ABF9.8050209@speech.sri.com>

Avneesh Saluja wrote:
> I would expect the perplexities in all of these cases to be the same,
> but instead I got the following results: [...] Is there any reason why
> they are different?

Yes: as you change the scaling factor you are effectively changing the number of samples. The smoothing with Witten-Bell depends on the ratio of type and token frequency. If you duplicate the data x times you have increased the token frequencies by a factor x, but the number of distinct words (types) stays the same.

It is quite intuitive: if you've seen 10 different words in a total sample of 10, your probability estimate that the next word will be a new word type will be much higher than if you had seen 10 word types in a sample of 1000! So scaling up makes your LM overconfident in having seen all the words in training, and scaling down gives too much probability mass to unseen words.

What your results show nicely is that the "natural" counts work best for WB smoothing, which is reassuring since it validates the underlying model (new-word frequency is used to estimate the frequency of unseen words, for each context).

Andreas
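A small numeric sketch of that effect (numbers made up for illustration): Witten-Bell reserves probability mass T / (N + T) for unseen words in a context, where T is the number of distinct word types observed after it and N the total token count. With T = 10 types in N = 10 tokens, the unseen mass is 10/20 = 0.5. Scale the counts up by 100 (N = 1000, T still 10) and it shrinks to 10/1010 ≈ 0.01, the overconfidence described above; scale them down by 10 with float counts (N = 1, T still 10) and it balloons to 10/11 ≈ 0.91, giving far too much mass to unseen words.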
From stolcke at speech.sri.com Thu Nov 18 14:55:15 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 18 Nov 2010 14:55:15 -0800
Subject: [SRILM User List] make-batch-counts and merge-batch-counts
Message-ID: <4CE5AED3.20205@speech.sri.com>

Jie Qi wrote:
> Here is my attempt:
>
>     >> make-batch-counts file-list 10 2007/01/01/*.txt
>     >> merge-batch-counts 2007/01/01/*.txt

You don't list the input files on the command line. You put a list of the input files into a separate file, and then give that to make-batch-counts. The "count-dir" is a new directory that you choose to store the aggregated count information. Make sure it has plenty of disk space. So, for example:

    ls 2007/01/01/*.txt > file-list
    make-batch-counts file-list 10 mycounts
    merge-batch-counts mycounts

should do what you intended to do.

Andreas
From sidhurukku at yahoo.com Fri Nov 19 04:53:54 2010
From: sidhurukku at yahoo.com (Jasleen Sidhu)
Date: Fri, 19 Nov 2010 04:53:54 -0800 (PST)
Subject: [SRILM User List] please help me ...
Message-ID: <163931.2543.qm@web111505.mail.gq1.yahoo.com>

hello,

I am trying to build a language model using SRILM and Moses. The following error message is displayed. I tried installing SRILM 3 times but the same problem arises again, and all the components (gcc, GNU make, Tcl, gawk) are installed properly:

/usr/bin/ld: cannot find -ltcl
collect2: ld returned 1 exit status
/home/srilm/sbin/decipher-install 0555 ../bin/i686/anti-ngram ../../bin/i686
ERROR: File to be installed (../bin/i686/anti-ngram) does not exist.
ERROR: File to be installed (../bin/i686/anti-ngram) is not a plain file.
Usage: decipher-install <mode> <file1> ... <fileN> <directory>
        mode:             file permission mode, in octal
        file1 ... fileN:  files to be installed
        directory:        where the files should be installed
files = ../bin/i686/anti-ngram
directory = ../../bin/i686
mode = 0555
make[2]: [../../bin/i686/anti-ngram] Error 1 (ignored)
g++ -m32 -mtune=pentium3 -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64 -I. -I../../include -u matherr -L../../lib/i686 -g -O3 -o ../bin/i686/nbest-lattice ../obj/i686/nbest-lattice.o ../obj/i686/liboolm.a -lm -ldl ../../lib/i686/libflm.a ../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a -ltcl -lm 2>&1 | c++filt
/usr/bin/ld: cannot find -ltcl
collect2: ld returned 1 exit status

[the same "cannot find -ltcl" link failure and decipher-install error repeat for nbest-lattice, nbest-mix, nbest-optimize, nbest-pron-score, segment, segment-nbest, hidden-ngram, multi-ngram, fngram-count, fngram, and lattice-tool]

make[2]: Leaving directory `/home/srilm/lattice/src'
make[2]: Entering directory `/home/srilm/utils/src'
make[2]: Nothing to be done for `release-programs'.
make[2]: Leaving directory `/home/srilm/utils/src'
make[1]: Leaving directory `/home/srilm'
make release-scripts
make[1]: Entering directory `/home/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
        (cd $subdir/src; make SRILM=/home/srilm MACHINE_TYPE=i686 OPTION= MAKE_PIC= release-scripts) || exit 1; \
done

[in each of the misc, dstruct, lm, flm, lattice, and utils subdirectories, make[2] reports "Nothing to be done for `release-scripts'."]

make[1]: Leaving directory `/home/srilm'

please help me, sir. Thank you.
jasleen

From stolcke at speech.sri.com Fri Nov 19 08:21:26 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 19 Nov 2010 08:21:26 -0800
Subject: [SRILM User List] please help me ...
In-Reply-To: Your message of Fri, 19 Nov 2010 04:53:54 -0800. <163931.2543.qm@web111505.mail.gq1.yahoo.com>
Message-ID: <201011191621.oAJGLQj06387@huge>

please read the FAQ section on building and installation.

--Andreas

In message <163931.2543.qm at web111505.mail.gq1.yahoo.com> you wrote:
> hello
> I am trying to build a language model using SRILM and Moses. The following
> error message is displayed. I tried installing SRILM 3 times but the same
> problem arises again, and all the components are installed properly (gcc,
> GNU make, Tcl, gawk):
>
> /usr/bin/ld: cannot find -ltcl
> collect2: ld returned 1 exit status

From stolcke at speech.sri.com Fri Nov 19 18:32:14 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 19 Nov 2010 18:32:14 -0800
Subject: [SRILM User List] how to create features like probability
Message-ID: <4CE7332E.4050905@speech.sri.com>

Jie Qi wrote:
> Hi all and Andreas,
>
> I have used make-batch-counts (I chose to merge all files) and
> merge-batch-counts to construct large count files for file-list0102.
>
> qijie at minus:~/Project/data/nytimes/2007> ls ./01/01/*.txt > file-list0101
> qijie at minus:~/Project/data/nytimes/2007> make-batch-counts file-list0102 all
> qijie at minus:~/Project/data/nytimes/2007> merge-batch-counts counts
>
> Based on the result file of file-list0102-1.ngrams, I made 3 language models (1-gram, 2-gram and 3-gram):
> qijie at ampersand:~/Project/data/nytimes/2007/counts> ngram-count -order 1 -read file-list0102-1.ngrams -lm file-list0102-1.unigramlm
> qijie at ampersand:~/Project/data/nytimes/2007/counts> ngram-count -order 2 -read file-list0102-1.ngrams -lm file-list0102-1.bigramlm
> qijie at ampersand:~/Project/data/nytimes/2007/counts> ngram-count -order 3 -read file-list0102-1.ngrams -lm file-list0102-1.trigramlm
>
> but the .bigramlm and .trigramlm files usually also contain 1-gram entries; how can I remove them and create features based on these language models? Here is my code, but -ppl file-list0102-1 seems wrong:
> ngram -lm file-list0102-1.trigramlm -ppl file-list0102-1 -debug 3 > file-list0102-1.triprob3
>
I'm not sure what you want to achieve. The ARPA format for backoff LMs contains all ngram probabilities up to the maximum order, by definition and by necessity (lower-order estimates are needed for backing off).

If you want to extract individual ngram probability parameters from the LM you can do that with gawk or perl text processing directly from the LM file. The LM contains explicit probabilities only for ngrams observed in training.

If you want to generate the conditional ngram probabilities for a list of arbitrary ngrams, use the option

        ngram -lm LM -counts C -debug 2

where C contains a list of ngrams, each followed by a count value (e.g., 1). This will dump out the probabilities in a format similar to -ppl. You probably have to reformat the output using gawk/perl to suit your needs.
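For instance, if C contains lines like these (the words here are made up),

        new york 1
        york times 1

then

        ngram -lm file-list0102-1.trigramlm -counts C -debug 2

should print the conditional probability of each listed ngram, which a small gawk or grep filter can then turn into feature values. (This is just an untested sketch; only the LM file name is taken from your commands above.)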
Hope this helps.

Andreas

From oatgnaw at gmail.com  Sat Nov 20 23:00:28 2010
From: oatgnaw at gmail.com (Tao Wang)
Date: Sun, 21 Nov 2010 15:00:28 +0800
Subject: [SRILM User List] Discrepancies in -text-has-weights
In-Reply-To: <4CE5ABF9.8050209@speech.sri.com>
References: <4CE5ABF9.8050209@speech.sri.com>
Message-ID: 

Would anyone help to remove me from the mailing list? Thanks.

2010/11/19 Andreas Stolcke

> Avneesh Saluja wrote:
>> Hi Andreas and SRILM team,
>>
>> There seem to be some discrepancies when I use the -text-has-weights (and -float-counts) feature, or perhaps I'm not understanding these features right. I ran some experiments to test it out. I have a small training set of around 41k sentences and a test set of around 1000 sentences. I ran the following commands:
>>
>> ~/tools/srilm/bin/i686/ngram-count -order 3 -text test_baseline.txt -unk -memuse -wbdiscount -gt1min 0 -gt2min 0 -gt3min 0 -lm test_baseline.lm -debug 2
>>
>> ~/tools/srilm/bin/i686/ngram -unk -map-unk '' -lm test_baseline.lm -order 3 > baseline.ppl
>>
>> On several training sets with weights prepended to the sentences: 0.1x, 1x, 10x, and 100x (I ran ngram-count on 0.1x with -text-has-weights and -float-counts, and for the others I ran -text-has-weights only, as the sentences are weighted by integers). I would expect the perplexities in all of these cases to be the same, but instead I got the following results:
>>
>>        No weights   0.1x      1x        10x       100x
>> ppl    61.6418      125.688   61.6418   102.966   102.966
>>
>> Is there any reason why they are different?
>>
> Yes, as you change the scaling factor you are effectively changing the number of samples. The smoothing with Witten-Bell depends on the ratio of type and token frequency. If you duplicate the data x times you have increased the token frequencies by a factor x, but the number of distinct words (types) stays the same.
>
> It is quite intuitive: if you've seen 10 different words among a total sample of 10, your probability estimate that the next word will be a new word type will be much higher than if you had seen 10 word types among a sample of 1000! So scaling up makes your LM overconfident in having seen all the words in training, and scaling down gives too much probability mass to unseen words.
>
> What your results show nicely is that the "natural" counts work best for WB smoothing, which is reassuring since it validates the underlying model (new word frequency is used to estimate the frequency of unseen words, for each context).
>
> Andreas
>
>> Thanks,
>>
>> Avneesh
>>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From fabian_in_hongkong at hotmail.com  Tue Nov 23 00:38:15 2010
From: fabian_in_hongkong at hotmail.com (Fabian -)
Date: Tue, 23 Nov 2010 09:38:15 +0100
Subject: [SRILM User List] Interpolation of multiple class-based+word-based LMs
Message-ID: 

Hi,

I am interested in interpolating multiple class-based and word-based language models. To be precise: 2 word-based (trivial) and 2 class-based. The classes of the 2 class-based LMs are different, as I computed the classes/LMs on different texts.

As I understand the manual of the ngram tool, it is not possible to do this; quote: "the second and any additional interpolated models can also be class N-grams (using the same -classes definitions)". And I do not satisfy "only one classes file". If I just combine the class files (and rename classes in the classes file and LM, to avoid conflicts) I may have words in 2 classes, which may have an adverse effect. But it should be possible to do this, correct?

So, is there a way to interpolate two class-based LMs with two different class definitions to get one class-based LM and one classes file (where one word is only in one class)?

Thank you,
Fabian
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com  Thu Nov 25 09:41:19 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 25 Nov 2010 09:41:19 -0800
Subject: [SRILM User List] Interpolation of multiple class-based+word-based LMs
In-Reply-To: 
References: 
Message-ID: <4CEE9FBF.6080208@speech.sri.com>

Fabian - wrote:
> Hi,
>
> I am interested in interpolating multiple class-based and word-based language models. To be precise: 2 word-based (trivial) and 2 class-based. The classes of the 2 class-based LMs are different, as I computed the classes/LMs on different texts.
>
> As I understand the manual of the ngram tool it is not possible to do this, quote: the second and any additional interpolated models can also be class N-grams (using the same *-classes* definitions). And I do not satisfy "only one classes file". If I just combine the class files (and rename classes in the classes file and LM, to avoid conflicts) I may have words in 2 classes, which may have an adverse effect.
Having the same words in more than one class has no adverse effect. So renaming the classes to be unique to each model, and then merging the class definitions, is exactly the way to go. (You also want to avoid any name clashes between class and word labels.)
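For instance, since the class label is the first field on each line of a classes file, something along these lines would make the labels unique before merging (an untested sketch; all file names and the LM1_/LM2_ prefixes are made up):

        # prefix every class label so the two definition files cannot clash
        gawk '{ $1 = "LM1_" $1; print }' classes1 > classes1.renamed
        gawk '{ $1 = "LM2_" $1; print }' classes2 > classes2.renamed
        cat classes1.renamed classes2.renamed > classes.merged

The same renaming would then have to be applied to the class tokens inside the corresponding LM files (e.g., with sed), so that each LM stays consistent with the merged classes file.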
Andreas

> But it should be possible to do this, correct?
>
> So, is there a way to interpolate two class-based LMs with two different class definitions to get one class-based LM and one classes file (where one word is only in one class)?
>
> Thank you,
> Fabian
> ------------------------------------------------------------------------
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

From stolcke at speech.sri.com  Fri Nov 26 17:36:37 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 26 Nov 2010 17:36:37 -0800
Subject: [SRILM User List] FW: Re: help on building srilm
In-Reply-To: Your message of Fri, 26 Nov 2010 16:31:09 +0100.
Message-ID: <201011270136.oAR1abH21454@huge>

In message you wrote:
>
> Hi,
>
> I just posted this message on the Moses mailing list and I was referred to you.
>
> Could you help me?
>
> ------------------------------------
>
> Hi,
>
> I have compiled SRILM on a machine type of: ppc64
>
> The make world seems to have finished OK. These files are in place:
>
> libdstruct.a
> libflm.a
> liblattice.a
> libmisc.a
> liboolm.a
>
> The make test seems to perform great. However, it hangs (for more than an hour) on this line:
>
> *** Running test ngram-server ***
>
> I have no idea what might cause this. Can anyone help me solve the problem? I have tried to ignore this and compile Moses anyway, but that generates an error during make moses.
>
I have no idea why this test wouldn't work on your machine, but the ngram-server functionality is more dependent on OS specifics than most because it involves networking.

If you don't need it specifically, just disable the test and then rerun the other tests:

        cd $SRILM
        mkdir lm/test/tests.disabled
        mv lm/test/tests/ngram-server lm/test/tests.disabled
        make test

Andreas

From dianachih at gmail.com  Sat Nov 27 00:30:34 2010
From: dianachih at gmail.com (Jie Qi)
Date: Sat, 27 Nov 2010 03:30:34 -0500
Subject: [SRILM User List] how to create features like probability
In-Reply-To: <4CE7332E.4050905@speech.sri.com>
References: <4CE7332E.4050905@speech.sri.com>
Message-ID: 

Hi all and Andreas,

I wish to build n-gram language models on a set of text files, and then use the model on other corpora to compute their max sentence probability, min sentence probability, average sentence probability, and the overall log probability of each article. But I am not quite sure how to do that with SRILM; I am most confused about how to train the model and get the overall and sentence probabilities. Here is my code:

find $1 -type f -iname 'clean*.txt' > file-list200701
make-batch-counts file-list200701 all
merge-batch-counts counts
#build language models
ngram-count -order 1 -read merge-iter0-1.ngrams -lm file-list200701.unigramlm
#get sentence probability
ngram -lm file-list200701.unigramlm -counts merge-iter0-1.ngrams -debug 3 > nytfinance.unigramlm.prob3

Thanks!

Best,
Jie Qi

On Fri, Nov 19, 2010 at 9:32 PM, Andreas Stolcke wrote:

> Jie Qi wrote:
>> Hi all and Andreas,
>>
>> I have used make-batch-counts (I chose to merge all files) and merge-batch-counts to construct large counts files for the file-list0102.
>> qijie at minus:~/Project/data/nytimes/2007> ls ./01/01/*.txt > file-list0101
>> qijie at minus:~/Project/data/nytimes/2007> make-batch-counts file-list0102 all
>> qijie at minus:~/Project/data/nytimes/2007> merge-batch-counts counts
>>
>> Based on the result file of file-list0102-1.ngrams, I made 3 language models (1-gram, 2-gram and 3-gram):
>> qijie at ampersand:~/Project/data/nytimes/2007/counts> ngram-count -order 1 -read file-list0102-1.ngrams -lm file-list0102-1.unigramlm
>> qijie at ampersand:~/Project/data/nytimes/2007/counts> ngram-count -order 2 -read file-list0102-1.ngrams -lm file-list0102-1.bigramlm
>> qijie at ampersand:~/Project/data/nytimes/2007/counts> ngram-count -order 3 -read file-list0102-1.ngrams -lm file-list0102-1.trigramlm
>>
>> but the .bigramlm and .trigramlm files usually also contain 1-gram entries; how can I remove them and create features based on these language models? Here is my code, but -ppl file-list0102-1 seems wrong:
>> ngram -lm file-list0102-1.trigramlm -ppl file-list0102-1 -debug 3 > file-list0102-1.triprob3
>>
> I'm not sure what you want to achieve. The ARPA format for backoff LMs contains all ngram probabilities up to the maximum order, by definition and by necessity (lower-order estimates are needed for backing off).
>
> If you want to extract individual ngram probability parameters from the LM you can do that with gawk or perl text processing directly from the LM file. The LM contains explicit probabilities only for ngrams observed in training.
>
> If you want to generate the conditional ngram probabilities for a list of arbitrary ngrams use the option
>
> ngram -lm LM -counts C -debug 2
>
> where C contains a list of ngrams each followed by a count value (e.g., 1). This will dump out the probabilities in a format similar to -ppl. You probably have to reformat the output using gawk/perl to suit your needs.
>
> Hope this helps.
>
> Andreas
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com  Sun Nov 28 20:44:38 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 28 Nov 2010 20:44:38 -0800
Subject: [SRILM User List] how to create features like probability
In-Reply-To: 
References: <4CE7332E.4050905@speech.sri.com>
Message-ID: <4CF32FB6.5050803@speech.sri.com>

Jie Qi wrote:
> Hi all and Andreas,
>
> I wish to build n-gram language models on a set of text files, and then use the model on other corpora to compute their max sentence probability, min sentence probability, average sentence probability, and the overall log probability of each article. But I am not quite sure how to do that with SRILM; I am most confused about how to train the model and get the overall and sentence probabilities. Here is my code:
>
> find $1 -type f -iname 'clean*.txt' > file-list200701
> make-batch-counts file-list200701 all
> merge-batch-counts counts
> #build language models
> ngram-count -order 1 -read merge-iter0-1.ngrams -lm file-list200701.unigramlm

Everything looks good up to this point.

> #get sentence probability
> ngram -lm file-list200701.unigramlm -counts merge-iter0-1.ngrams -debug 3 > nytfinance.unigramlm.prob3

I'm still confused about what you want to compute. Since merge-iter0-1.ngrams contains counts from the training data you would be computing the total training set likelihood, as well as that of all the individual ngrams occurring in it.
Also, -debug 3 is very slow because it computes the sum of all the conditional probabilities for all histories.

To compute just sentence probabilities (as indicated by your comment) for a list of test sentences contained in TEST (one sentence per line), use

        ngram -lm LM -debug 1 -ppl TEST > TEST.ppl

Andreas

From stolcke at speech.sri.com  Sun Nov 28 20:59:06 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 28 Nov 2010 20:59:06 -0800
Subject: [SRILM User List] smoothing questions
In-Reply-To: <4CB88BE5.9020509@mit.edu>
References: <4CB88BE5.9020509@mit.edu>
Message-ID: <4CF3331A.7080103@speech.sri.com>

Vikram Agarwal wrote:
> Hello,
>
> I am new to SRILM and just have a few questions that I'd greatly appreciate some help on:
>
> 1) Trying to use purely ML estimates w/ no smoothing, I used
> "ngram-count -cdiscount 0 -order 8 -read counts.txt -lm train.lm"
>
> but using this model with ngram -debug 2, I do not get zeroprobs because it automatically backs off. Is there a way to prohibit backoff so that I can retrieve the zeroprob sentences?

I believe you are seeing backing-off only because the default minimum count for ngrams longer than 2 is 2 (so a singleton 4-gram, for example, is not recorded in the model, and triggers backoff). Try using these additional options:

        -gt3min 1 -gt4min 1 -gt5min 1 -gt6min 1 -gt7min 1 -gt8min 1

(Yes, it would be nice to have a single option to do this.)
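Combined with the command from your question (1), that would be, for instance:

        ngram-count -read counts.txt -order 8 -cdiscount 0 \
                -gt3min 1 -gt4min 1 -gt5min 1 -gt6min 1 -gt7min 1 -gt8min 1 \
                -lm train.lm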
> 2) Suppose I perform: ngram -order 5 -ppl test.txt -lm train.lm. Will I always get the same results if I generated train.lm with ngram-count at order 5 or greater than 5, regardless of which smoothing technique is used and whether backoff/interpolation is employed?

The answer is yes for smoothing methods that apply to different ngram lengths uniformly. That is all the methods except those based on Kneser-Ney. For KN, the lower-order distributions are treated differently from the highest order, hence the 5-grams are smoothed differently depending on whether 5 is the highest order or not.

> 3) My work uses a very small vocabulary (4 letters), but requires smoothing at higher orders (5-8). I read in the FAQ that -ukndiscount -order 7 may be good to use for modeling OOV words with a letter model. I wonder why ukndiscount was recommended over kndiscount? If kndiscounting does not work due to the sparsity of low count-of-counts, could the extrapolated count-of-counts generated by "make-big-lm" outperform the ukndiscounting method?

I don't think there was a particular reason to recommend ukndiscount over kndiscount, but your conjecture makes sense. The count-of-counts extrapolation is designed for cases where the lowest-order counts-of-counts (starting with the count of singletons) are missing, so it wouldn't really be relevant in this case.

Andreas

From stolcke at speech.sri.com  Sun Nov 28 22:32:36 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 28 Nov 2010 22:32:36 -0800
Subject: [SRILM User List] how to create features like probability
In-Reply-To: 
References: <4CE7332E.4050905@speech.sri.com> <4CF32FB6.5050803@speech.sri.com>
Message-ID: <4CF34904.1050300@speech.sri.com>

Jie Qi wrote:
> Hi Andreas,
>
> Thanks for your help! My goal is to compute the max sentence probability, min sentence probability, average sentence probability, and overall log probability of the article for every text file in the folder of New York Times Finance. Would you please teach me how to do that with SRILM? Also, is it better to use make-batch-counts, or to write every text file into one file and use ngram-count?

1. Format the data so that each line of the text file contains one sentence. Use one file per article.

2. Run ngram -lm LM -debug 2 -ppl FILE on each text file. This gives you the log probability for each sentence, as well as the total log prob.

3. Post-process the output to get min/max/avg, etc.
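As a rough illustration of step 3 (an untested gawk sketch, assuming the per-sentence summary lines containing "logprob=" that -ppl prints at -debug 1 and above, and a placeholder LM file name):

        ngram -lm LM -debug 2 -ppl article.txt | gawk '
            /^file /   { exit }                          # stop before the file-level summary
            /logprob=/ {
                for (i = 1; i < NF; i++)
                    if ($i == "logprob=") lp = $(i + 1)  # per-sentence log probability
                n++; sum += lp
                if (n == 1 || lp < min) min = lp
                if (n == 1 || lp > max) max = lp
            }
            END { if (n) printf "min %g max %g avg %g total %g\n", min, max, sum / n, sum }'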
For the first and last steps you need to learn gawk or perl. I cannot help you with that, but there are many books and online articles to teach you. Or ask someone local to help you.

Good luck

Andreas

>
> Best,
> Jie
>
> On Sun, Nov 28, 2010 at 11:44 PM, Andreas Stolcke wrote:
>
> Jie Qi wrote:
> > Hi all and Andreas,
> >
> > I wish to build n-gram language models on a set of text files, and then use the model on other corpora to compute their max sentence probability, min sentence probability, average sentence probability, and the overall log probability of each article. But I am not quite sure how to do that with SRILM; I am most confused about how to train the model and get the overall and sentence probabilities. Here is my code:
> >
> > find $1 -type f -iname 'clean*.txt' > file-list200701
> > make-batch-counts file-list200701 all
> > merge-batch-counts counts
> > #build language models
> > ngram-count -order 1 -read merge-iter0-1.ngrams -lm file-list200701.unigramlm
>
> Everything looks good up to this point.
>
> > #get sentence probability
> > ngram -lm file-list200701.unigramlm -counts merge-iter0-1.ngrams -debug 3 > nytfinance.unigramlm.prob3
>
> I'm still confused about what you want to compute. Since merge-iter0-1.ngrams contains counts from the training data you would be computing the total training set likelihood, as well as that of all the individual ngrams occurring in it.
> Also, -debug 3 is very slow because it computes the sum of all the conditional probabilities for all histories.
> To compute just sentence probabilities (as indicated by your comment) for a list of test sentences contained in TEST (one sentence per line), use
>
> ngram -lm LM -debug 1 -ppl TEST > TEST.ppl
>
> Andreas
>

From zeeshankhans at gmail.com  Mon Nov 29 16:19:27 2010
From: zeeshankhans at gmail.com (zeeshan khan)
Date: Tue, 30 Nov 2010 01:19:27 +0100
Subject: [SRILM User List] lattice rescoring with LM + cache
Message-ID: 

Dear all,

I am trying to investigate the effect of a cache on the word error rate improvements on a corpus.

For this, I want to rescore the HTK lattices with the LM + cache and then extract the CTM from the lattice.

Ideally, it should work in a similar way to calculating perplexity with the ngram tool: while rescoring the lattices, the SRILM tool should take a unigram LM based on a history of n words and interpolate the main LM with it (just like done with the -cache and -cache-lambda options of the ngram tool).

I am using the SRILM lattice-tool to rescore the lattice and produce the CTM. The command currently looks like this:

        lattice-tool -order -lm -bayes 0 -in-lattice -read-htk -posterior-decode -output-ctm

But I can't find the proper set of configuration options to achieve what I want to do - ideally there should be -cache and -cache-lambda options like the ones in the SRILM ngram tool. But there isn't any such option in lattice-tool. Can anyone guide me how I can achieve it?

Thanks & Best Regards,

Zeeshan Khan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From zeeshankhans at gmail.com  Mon Nov 29 18:04:06 2010
From: zeeshankhans at gmail.com (zeeshan khan)
Date: Tue, 30 Nov 2010 03:04:06 +0100
Subject: [SRILM User List] nbest list format
Message-ID: 

Dear all,

There are three formats specified by SRILM for the nbest lists generated by using it:
http://www-speech.sri.com/projects/srilm/manpages/nbest-format.5.html

I want to generate nbest lists in the 2nd format, i.e. the Decipher format specified on this page (to preserve the timing information), from HTK lattices. However, by default, I get the nbest lists in the 3rd format specified on the page.

Ideally, I should be able to specify the output format using some option of lattice-tool or nbest-lattice; however, I couldn't find configuration options to specify the output nbest list's format.

Any suggestions on how to do it?

Thanks and Best Regards,
Zeeshan Khan.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com  Mon Nov 29 23:11:55 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 29 Nov 2010 23:11:55 -0800
Subject: [SRILM User List] nbest list format
In-Reply-To: 
References: 
Message-ID: <4CF4A3BB.3000900@speech.sri.com>

zeeshan khan wrote:
> Dear all,
> There are three formats specified by SRILM for the nbest lists generated by using it:
> http://www-speech.sri.com/projects/srilm/manpages/nbest-format.5.html
> I want to generate nbest lists in the 2nd format, i.e. the Decipher format specified on this page (to preserve the timing information), from HTK lattices.
> However, by default, I get the nbest lists in the 3rd format specified on the page.
> Ideally, I should be able to specify the output format using some option of lattice-tool or nbest-lattice; however, I couldn't find configuration options to specify the output nbest list's format.
> Any suggestions on how to do it?

Unfortunately the N-best generation for lattices does not support the output format with time alignment information. It is possible in principle, but the code just wasn't written to do that. If you have a fair amount of time you could dig into lattice/src/LatticeNBest.cc to change it.

Andreas

> Thanks and Best Regards,
> Zeeshan Khan.
> ------------------------------------------------------------------------
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

From stolcke at speech.sri.com  Tue Nov 30 00:01:16 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 30 Nov 2010 00:01:16 -0800
Subject: [SRILM User List] lattice rescoring with LM + cache
In-Reply-To: 
References: 
Message-ID: <4CF4AF4C.5020309@speech.sri.com>

zeeshan khan wrote:
> Dear all,
>
> I am trying to investigate the effect of a cache on the word error rate improvements on a corpus.
>
> For this, I want to rescore the HTK lattices with the LM + cache and then extract the CTM from the lattice.
>
> Ideally, it should work in a similar way to calculating perplexity with the ngram tool: while rescoring the lattices, the SRILM tool should take a unigram LM based on a history of n words and interpolate the main LM with it (just like done with the -cache and -cache-lambda options of the ngram tool).
>
> I am using the SRILM lattice-tool to rescore the lattice and produce the CTM. The command currently looks like this:
>
> lattice-tool -order -lm -bayes 0 -in-lattice -read-htk -posterior-decode -output-ctm
>
> But I can't find the proper set of configuration options to achieve what I want to do - ideally there should be -cache and -cache-lambda options like the ones in the SRILM ngram tool. But there isn't any such option in lattice-tool. Can anyone guide me how I can achieve it?

Applying a cache LM to lattice rescoring is not straightforward, because you're not processing a linear sequence of words. Strictly speaking you'd have to maintain a different cache LM for each partial path through the lattice, which would be very expensive. Also, you don't know what the correct word string is, so you have to think about which words should go into the cache.

What people usually do when dealing with lattices or nbest lists is to compute a cache based on all utterances preceding the utterance to be rescored. So you would compute a unigram LM specific to each utterance, then interpolate that with the main LM (-bayes 0 -mix-lm CACHELM).

As to the question of what words to cache: a sensible approach would be to weight word unigrams according to their posterior probabilities. So use lattice-tool -order 1 -write-ngrams to dump the weighted counts, and ngram-count -float-counts to build the cache LM (you can also turn off smoothing).
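Put together, the pipeline might look something like this (an untested sketch; the file names, the -in-lattice-list batch option, the mixture weight, and the -cdiscount1 0 way of turning off smoothing are assumptions to be checked against the man pages):

        # posterior-weighted unigram counts from the lattices of the preceding utterances
        lattice-tool -read-htk -in-lattice-list prev-lattices.list -order 1 -write-ngrams cache.counts

        # cache unigram LM from the fractional counts, without smoothing
        ngram-count -read cache.counts -float-counts -order 1 -cdiscount1 0 -lm cache.lm

        # rescore the current lattice, interpolating the main LM with the cache LM
        lattice-tool -read-htk -in-lattice current.lat -order 3 -lm MAIN.lm \
                -bayes 0 -mix-lm cache.lm -lambda 0.9 \
                -posterior-decode -output-ctm > current.ctm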
Andreas

>
> Thanks & Best Regards,
>
> Zeeshan Khan
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

From cwsunshine at gmail.com  Tue Nov 30 01:18:11 2010
From: cwsunshine at gmail.com (wei chen)
Date: Tue, 30 Nov 2010 17:18:11 +0800
Subject: [SRILM User List] lattice-tool a-star
Message-ID: 

Hi all,

Can lattice-tool realize the A-star algorithm, which is often used in the 2nd pass of speech recognition? I am quite confused about that. Thanks a lot!

Best wishes,
Wei Chen
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com  Tue Nov 30 10:11:32 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 30 Nov 2010 10:11:32 -0800
Subject: [SRILM User List] lattice-tool a-star
In-Reply-To: 
References: 
Message-ID: <4CF53E54.6080006@speech.sri.com>

wei chen wrote:
> Hi all,
> Can lattice-tool realize the A-star algorithm, which is often used in the 2nd pass of speech recognition? I am quite confused about that. Thanks a lot!

lattice-tool uses a Viterbi algorithm as the default for 1-best decoding. For nbest decoding you have the option to use A-star or Viterbi. See the man page for details.

Andreas

>
> Best wishes,
> Wei Chen
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

From mehdi_hoseini at comp.iust.ac.ir  Wed Dec 1 09:00:35 2010
From: mehdi_hoseini at comp.iust.ac.ir (Mehdi hoseini)
Date: Wed, 01 Dec 2010 20:30:35 +0330
Subject: [SRILM User List] Basic questions
Message-ID: 

-----Original Message-----
From: "Mehdi hoseini"
To: srilm-user at speech.sri.com
Date: Wed, 01 Dec 2010 18:58:31 +0330
Subject: Basic questions

Hi,

First, thanks for your attention. I am new to SRILM and HTK, and I am sorry for my very basic questions. I couldn't get SRILM to run on Linux or Cygwin, so I searched for a Visual Studio solution and found one here: http://www.keithv.com/software/srilm/

I compiled the files and used them for building my language models. How can I use my results (LMs) in HTK? Does SRILM support topic language models like "PLSA" or "LDA LM"? If not, is there any toolkit that covers these models? Thanks.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From zeeshankhans at gmail.com  Mon Dec 6 07:20:15 2010
From: zeeshankhans at gmail.com (zeeshan khan)
Date: Mon, 6 Dec 2010 16:20:15 +0100
Subject: [SRILM User List] Cache-lambda with lowest perplexity
Message-ID: 

Hi all,

I am using the following options to calculate the perplexity with a cache size of, let's say, 500. All I can do is run it for various values of CACHE-LAMBDA and find out manually for which value of CACHE-LAMBDA the lowest perplexity occurs.

        ngram -unk -map-unk '[UNKNOWN]' -lm LM -cache 500 -cache-lambda LAMBDA -ppl CORPUS

Is it possible with the SRI toolkit to somehow automatically get the CACHE-LAMBDA weight which gives the lowest perplexity for the given corpus?

Best Regards,
Zeeshan Khan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com  Mon Dec 6 10:58:08 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 06 Dec 2010 10:58:08 -0800
Subject: [SRILM User List] Cache-lambda with lowest perplexity
In-Reply-To: 
References: 
Message-ID: <4CFD3240.10409@speech.sri.com>

zeeshan khan wrote:
> Hi all,
>
> I am using the following options to calculate the perplexity with a cache size of, let's say, 500. All I can do is run it for various values of CACHE-LAMBDA and find out manually for which value of CACHE-LAMBDA the lowest perplexity occurs.
>
> ngram -unk -map-unk '[UNKNOWN]' -lm LM -cache 500 -cache-lambda LAMBDA -ppl CORPUS
>
> Is it possible with the SRI toolkit to somehow automatically get the CACHE-LAMBDA weight which gives the lowest perplexity for the given corpus?

Yes. You use the same method as used to optimize the linear interpolation of two arbitrary LMs.

1. Generate the probabilities from the basic LM and from the cache LM alone:

        ngram -unk -map-unk '[UNKNOWN]' -lm LM -ppl CORPUS -debug 2 > lm1.ppl
        ngram -unk -map-unk '[UNKNOWN]' -null -cache 500 -cache-lambda 1.0 -ppl CORPUS -debug 2 > cachelm.ppl

2. Use an EM algorithm to estimate the best lambda:

        compute-best-mix cachelm.ppl lm1.ppl

(See the ppl-scripts(1) man page for details on compute-best-mix.)

Of course, for meaningful results you should use a development set separate from the evaluation data to optimize the mixture weight.

Andreas

From stolcke at speech.sri.com  Tue Dec 21 06:09:26 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 21 Dec 2010 06:09:26 -0800
Subject: [SRILM User List] SRILM-user list maintenance
Message-ID: <201012211409.oBLE9QH20710@huge>

The list (and its web interface) were nonfunctional for a while due to a server crash. If you read this, it means that things are back to working. Sorry for any inconvenience.

--Andreas

From amr_desoky at yahoo.com  Wed Dec 29 10:40:25 2010
From: amr_desoky at yahoo.com (Amr Desoky)
Date: Wed, 29 Dec 2010 10:40:25 -0800 (PST)
Subject: [SRILM User List] ARPA LM with only higher order grams?
Message-ID: <606025.9895.qm@web51001.mail.re2.yahoo.com>

Hi,

I am asking: is it possible to have an ARPA LM storing only 3-gram log probabilities?
Assuming that in my application (in which I will use the LM), I will only require the probabilities of these specific 3-grams. Example of the LM:

\data\
ngram 1=0
ngram 2=0
ngram 3=3

\1-grams:

\2-grams:

\3-grams:

\end\

In other words: if I have some method to estimate the probability of some 3-grams needed for 3-gram lattice rescoring for ASR, is it possible to insert the probabilities of these 3-grams into a normal ARPA backoff LM? I did so, but when I tried to normalize the new LM (after adding the new 3-grams), I got the following warnings, and the new grams are filtered out!

warning: no bow for prefix of ngram "w1 w2 w3"
.........(lots of the above warning)
BOW numerator for context "w4 w5" is -0.535204 < 0
.........(lots of the above warning)

Could you tell me why this is happening? Since if some 3-gram probability is there, I will not need to back off, and I will not need to use the lower-order grams to get the probability of this specific 3-gram... yes?

If I do not normalize the new LM, will it be a correct LM, or do you see some bug? Is there some other way to validate the correctness of this LM?

I will appreciate your help very much.

Best regards,
Amr

Amr Ibrahim El-Desoky Mousa
PhD Student, Computer Science (i6), RWTH Aachen University, Aachen, Germany
Cell: +49 0176 56418470
Office: +49 241 8021620
Fax: +49 241 8022219
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com  Wed Dec 29 14:29:10 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 29 Dec 2010 16:29:10 -0600
Subject: [SRILM User List] ARPA LM with only higher order grams?
In-Reply-To: <606025.9895.qm@web51001.mail.re2.yahoo.com>
References: <606025.9895.qm@web51001.mail.re2.yahoo.com>
Message-ID: <4D1BB636.7030008@speech.sri.com>

Amr Desoky wrote:
> Hi,
> I am asking: is it possible to have an ARPA LM storing only 3-gram log probabilities?
> Assuming that in my application (in which I will use the LM), I will only require the probabilities of these specific 3-grams.
> Example of the LM:
>
> \data\
> ngram 1=0
> ngram 2=0
> ngram 3=3
>
> \1-grams:
>
> \2-grams:
>
> \3-grams:
>
> \end\
>
> In other words: if I have some method to estimate the probability of some 3-grams needed for 3-gram lattice rescoring for ASR, is it possible to insert the probabilities of these 3-grams into a normal ARPA backoff LM? I did so, but when I tried to normalize the new LM (after adding the new 3-grams), I got the following warnings, and the new grams are filtered out!
>
> warning: no bow for prefix of ngram "w1 w2 w3"
> .........(lots of the above warning)

This is a sanity check of the backoff format. For each ngram w1 w2 w3 it is checked that the history "w1 w2" has a corresponding backoff weight.

> BOW numerator for context "w4 w5" is -0.535204 < 0
> .........(lots of the above warning)
>
> Could you tell me why this is happening? Since if some 3-gram probability is there, I will not need to back off, and I will not need to use the lower-order grams to get the probability of this specific 3-gram... yes?
>
> If I do not normalize the new LM, will it be a correct LM, or do you see some bug? Is there some other way to validate the correctness of this LM?

As long as you don't renormalize the LM, AND you only use the trigram probabilities, AND you insert dummy unigrams and bigrams (to satisfy the above sanity check) with arbitrary log probabilities and backoff weights (make them 0) you can use the model in the standard way.

Andreas
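For concreteness, a trigram-only model of the kind described might look like this (a made-up sketch: w1, w2, w3 are dummy words, the -99 unigram and bigram log probabilities are arbitrary "effectively zero" values, and every backoff weight is 0, so that each trigram's history has a bow entry):

\data\
ngram 1=3
ngram 2=2
ngram 3=2

\1-grams:
-99	w1	0
-99	w2	0
-99	w3	0

\2-grams:
-99	w1 w2	0
-99	w2 w3	0

\3-grams:
-0.30103	w1 w2 w3
-0.52288	w2 w3 w1

\end\

Such a file is not a normalized language model; it is only usable under the constraints above (query the trigrams directly, and never rely on the backed-off lower-order estimates).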