From dmytro.prylipko at ovgu.de Mon Oct 1 08:34:28 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Mon, 01 Oct 2012 17:34:28 +0200
Subject: [SRILM User List] Strange log probabilities
Message-ID: <5069B804.5080107@ovgu.de>
Hi,
I am sorry for such a long e-mail, but I found some strange behavior in
the log probability calculation of the unigrams.
I have two language models trained on two text sets. Actually, those
sets are just two different sentences, repeated 100 times each:
ACTION_REJECT_003.train.txt:
der gewünschte artikel ist nicht im koffer enthalten (x 100)
ACTION_REJECT_004.train.txt:
ihre aussage kann nicht verarbeitet werden (x 100)
Also, I have defined a few specific categories to build a class-based LM.
One class is numbers (ein, eine, eins, einundachtzig, etc.), the second
one comprises names of specific items related to the task domain
(achselshirt, blusen), and the last one consists of just two words:
'wurde' and 'wurden'.
So, I am building two expanded class-based LMs using Witten-Bell
discounting (I also tried the default Good-Turing, but with the same result):
replace-words-with-classes classes=wizard.class.defs ACTION_REJECT_003.train.txt > ACTION_REJECT_003.train.class.txt

ngram-count -text ACTION_REJECT_003.train.class.txt -lm ACTION_REJECT_003.lm -order 3 -wbdiscount1 -wbdiscount2 -wbdiscount3

ngram -lm ACTION_REJECT_003.lm -write-lm ACTION_REJECT_003.expanded.lm -order 3 -classes wizard.class.defs -expand-classes 3 -expand-exact 3 -vocab wizard.wlist
The second LM (ACTION_REJECT_004) is built using the same approach. But
these two models are pretty different.
ACTION_REJECT_003.expanded.lm has reasonable smoothed log probabilities
for the unseen unigrams:
\data\
ngram 1=924
ngram 2=9
ngram 3=8
\1-grams:
-0.9542425
-10.34236
-99 -99
-10.34236 ab
-10.34236 abgeben
[...]
-10.34236 überschritten
-10.34236 übertragung
\2-grams:
0 der 0
0 artikel ist 0
0 der gewünschte 0
0 enthalten
0 gewünschte artikel 0
0 im koffer 0
0 ist nicht 0
0 koffer enthalten 0
0 nicht im 0
\3-grams:
0 gewünschte artikel ist
0 der gewünschte
0 koffer enthalten
0 der gewünschte artikel
0 nicht im koffer
0 artikel ist nicht
0 im koffer enthalten
0 ist nicht im
\end\
Whereas in ACTION_REJECT_004.expanded.lm all unseen unigrams have a zero
probability:
\data\
ngram 1=924
ngram 2=7
ngram 3=6
\1-grams:
-0.845098
-99
-99 -99
-99 ab
-99 abgeben
[...]
-0.845098 aussage -99
[...]
-99 überschritten
-99 übertragung
\2-grams:
0 ihre 0
0 aussage kann 0
0 ihre aussage 0
0 kann nicht 0
0 nicht verarbeitet 0
0 sagen
0 verarbeitet sagen 0
\3-grams:
0 ihre aussage kann
0 ihre aussage
0 aussage kann nicht
0 kann nicht verarbeitet
0 verarbeitet sagen
0 nicht verarbeitet sagen
\end\
None of the words in either training sentence belongs to any class.
Also, I found that removing the last word from the second training
sentence fixes the problem.
Thus, for the following sentence:
ihre aussage kann nicht
the corresponding LM has correctly discounted probabilities (also around
-10). Replacing 'werden' with any other word (I tried 'sagen', 'abgeben'
and 'beer') causes the same problem again.
Is it a bug, or am I doing something wrong?
I would appreciate any advice. I can also provide all the necessary
data and scripts if needed.
Sincerely yours,
Dmytro Prylipko.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From dmytro.prylipko at ovgu.de Tue Oct 2 02:48:16 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Tue, 02 Oct 2012 11:48:16 +0200
Subject: [SRILM User List] Strange log probabilities
In-Reply-To:
References: <5069B804.5080107@ovgu.de>
Message-ID: <506AB860.2010606@ovgu.de>
Hi,
Thank you for the quick feedback.
I found out something else remarkable: I tried to run the script on our
cluster under CentOS (my workstation is running Kubuntu 12.04) and
discovered that on the cluster all the LMs have zero probabilities for
unseen 1-grams. No smoothing at all!
The setup is of course different. Output of uname -a on the cluster:
Linux frontend1.service 2.6.18-164.11.1.el5 #1 SMP Wed Jan 20 07:32:21
EST 2010 x86_64 x86_64 x86_64 GNU/Linux
On the workstation:
Linux KS-PC113 3.2.0-31-generic-pae #50-Ubuntu SMP Fri Sep 7 16:39:45
UTC 2012 i686 i686 i386 GNU/Linux
SRILM on the cluster was built with MACHINE_TYPE=i686-m64 (with and
without the _C option; both give the same result), on the workstation with
MACHINE_TYPE=i686-gcc4.
The LANG variable is en_US.UTF-8 on both machines. Replacing umlauts with
regular characters made no difference.
What exactly do you mean by 'behavior of your local awk installation
when it encounters extended chars'?
So, I am sending you the minimal dataset for replicating it. The shell
script buildtaglm.sh does all the work.
Yours,
Dmytro Prylipko.
On Tue 02 Oct 2012 02:24:00 AM CEST, Anand Venkataraman wrote:
>
> On a first reading of your email I'm indeed surprised that the results
> differ between the two texts. Have you tried replacing the umlaut in
> the first corpus with a regular "u" and checked if you still get the
> same behavior? Check the LANG environment variable and the behavior of
> your local awk installation when it encounters extended chars.
>
> If the problem persists, please send me the two corpora, along with
> the class file and I'll be glad to take a look for you.
>
> &
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testbed.zip
Type: application/zip
Size: 21408 bytes
Desc: not available
URL:
From venkataraman.anand at gmail.com Tue Oct 2 13:35:07 2012
From: venkataraman.anand at gmail.com (Anand Venkataraman)
Date: Tue, 2 Oct 2012 13:35:07 -0700
Subject: [SRILM User List] Strange log probabilities
In-Reply-To: <506AB860.2010606@ovgu.de>
References: <5069B804.5080107@ovgu.de>
<506AB860.2010606@ovgu.de>
Message-ID:
The problem is that your final vocabulary is introduced as a surprise in
the last step (to ngram). When the class expansion likelihoods sum to
exactly 1.0, there is no room for novelty in the backoff orders at this
stage.
To get the correct behavior you must prime the initial language model with
a vocabulary of either all the class tags or the individual words themselves.
E.g.:
awk '{print $1}' wizard.class.defs | sort -u > wizard.classnames.txt

cat $datafile \
  | replace-words-with-classes classes=wizard.class.defs - \
  | ngram-count -text - -lm - -order 1 -wbdiscount \
      -vocab wizard.classnames.txt \
  > your-lm.1bo

# Expanding classes in your-lm.1bo now will give you the desired behavior.
HTH
&
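For anyone replaying this fix outside the thread, here is a minimal, self-contained sketch of the vocabulary-priming step. Only the awk | sort -u pipeline is taken from the recipe above; the class-definition entries below are invented for illustration, since the real wizard.class.defs from the thread is not shown.

```shell
# Build a toy class-definitions file in the style SRILM expects:
# each line names the class, an (optional) expansion probability,
# and the member word(s).  These entries are made up.
cat > wizard.class.defs <<'EOF'
NUMBER 0.5 ein
NUMBER 0.5 eins
ITEM 0.5 achselshirt
ITEM 0.5 blusen
WURDE 0.5 wurde
WURDE 0.5 wurden
EOF

# The priming step from the fix: collect the class names
# themselves into a vocabulary file, one per line, deduplicated.
awk '{print $1}' wizard.class.defs | sort -u > wizard.classnames.txt

cat wizard.classnames.txt
# -> ITEM
#    NUMBER
#    WURDE
```

The resulting wizard.classnames.txt is what gets passed to ngram-count via -vocab, so the class tags are in the model's vocabulary from the start rather than appearing as a surprise at expansion time.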
On Tue, Oct 2, 2012 at 2:48 AM, Dmytro Prylipko wrote:
> Hi,
>
> Thank you for the quick feedback.
>
> I found out something else remarkable: I tried to run the script on our
> cluster under CentOS (my workstation is running Kubuntu 12.04) and
> discovered that on the cluster all the LMs have zero probabilities for
> unseen 1-grams. No smoothing at all!
>
> The setup is of course different. Output of the uname -a on the cluster:
>
> Linux frontend1.service 2.6.18-164.11.1.el5 #1 SMP Wed Jan 20 07:32:21 EST
> 2010 x86_64 x86_64 x86_64 GNU/Linux
>
> On the workstation:
>
> Linux KS-PC113 3.2.0-31-generic-pae #50-Ubuntu SMP Fri Sep 7 16:39:45 UTC
> 2012 i686 i686 i386 GNU/Linux
>
> SRILM on the cluster was built with MACHINE_TYPE=i686-m64 (with and
> without the _C option; both give the same result), on the workstation with
> MACHINE_TYPE=i686-gcc4
>
> LANG variable is en_US.UTF-8 on both machines. Replacing umlauts with
> regular characters made no difference.
>
> What exactly do you mean by 'behavior of your local awk installation
> when it encounters extended chars'?
>
> So, I am sending you the minimal dataset for replicating it. Shell script
> buildtaglm.sh does all the work.
>
> Yours,
> Dmytro Prylipko.
>
>
> On Tue 02 Oct 2012 02:24:00 AM CEST, Anand Venkataraman wrote:
>
>
> On a first reading of your email I'm indeed surprised that the results
> differ between the two texts. Have you tried replacing the umlaut in
> the first corpus with a regular "u" and checked if you still get the
> same behavior. Check the LANG environment variable and the behavior of
> your local awk installation when it encounters extended chars.
>
> If the problem persists, please send me the two corpora, along with
> the class file and I'll be glad to take a look for you.
>
> &
From bibek9500 at gmail.com Thu Oct 4 09:11:14 2012
From: bibek9500 at gmail.com (bibek kc)
Date: Thu, 4 Oct 2012 21:56:14 +0545
Subject: [SRILM User List] help regarding Katz backoff bigram and trigram
model
Message-ID:
Hi all,
I am new to the SRILM toolkit.
I want to build a Katz backoff bigram and trigram model where the value
of K=5, and also calculate the Katz backoff bigram and trigram
probabilities.
If possible, please list the steps to build the model and calculate the
probabilities.
Regards,
bibek
From dyuret at ku.edu.tr Sun Oct 7 01:05:31 2012
From: dyuret at ku.edu.tr (Deniz Yuret)
Date: Sun, 7 Oct 2012 11:05:31 +0300
Subject: [SRILM User List] finding likely substitutes quickly
In-Reply-To:
References:
Message-ID:
Dear SRILM users,
I have developed an algorithm (FASTSUBS) that can quickly generate the
most likely word substitutes from an n-gram model. We have used
FASTSUBS to achieve state-of-the-art results in unsupervised
part-of-speech induction (EMNLP 2012). The paper, the code, and a dataset
with the top 100 substitutes of each token in the WSJ section of the
Penn Treebank are available at http://goo.gl/jzKH0.
best,
deniz
From gregor.donaj at uni-mb.si Thu Oct 11 07:59:23 2012
From: gregor.donaj at uni-mb.si (Gregor Donaj)
Date: Thu, 11 Oct 2012 16:59:23 +0200
Subject: [SRILM User List] Why does 'ngram -factored' need the countfile
Message-ID: <5076DECB.4080301@uni-mb.si>
Hi,
I'm trying to rescore factored hypotheses with ngram with the
-factored option. I realized that the program requires the countfile to
be present as specified in the flm definition file and that it also
seems to be loaded into memory. The same happens with fngram. Why is this so?
Since for calculating probabilities and perplexities I only need the
actual language model file and not the counts, this is a bit annoying, as
my countfiles are sometimes larger than my RAM.
I kind of "solved" the problem by creating an empty countfile. I tested
this on a small example and saw that it calculates the rescored
probabilities fine. Is there any way to tell ngram not to look for the
countfile? I guess that would be a better solution than just giving the
program a dummy countfile that doesn't correspond to the language model
file.
Thanks
--
Gregor Donaj, univ. dipl. inž. el., univ. dipl. mat.
Laboratorij za digitalno procesiranje signalov
Fakulteta za elektrotehniko, računalništvo in informatiko
Smetanova ulica 17, 2000 Maribor
Tel.: 02/220 72 05
E-mail: gregor.donaj at uni-mb.si
Digital Signal Processing Laboratory
Faculty of Electrical Engineering and Computer Science
Smetanova ulica 17, 2000 Maribor, Slovenia
Tel.: +386 2 220 72 05
From stolcke at icsi.berkeley.edu Thu Oct 11 09:41:48 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 11 Oct 2012 09:41:48 -0700
Subject: [SRILM User List] Why does 'ngram -factored' need the countfile
In-Reply-To: <5076DECB.4080301@uni-mb.si>
References: <5076DECB.4080301@uni-mb.si>
Message-ID: <5076F6CC.80807@icsi.berkeley.edu>
On 10/11/2012 7:59 AM, Gregor Donaj wrote:
> Hi,
>
> I'm trying to rescore factored hypotheses with ngram with the
> -factored option. I realized that the program requires the countfile
> to be present as specified in the flm definition file and that it also
> seems to be loaded into memory. Same with using fngram. Why is this so?
>
> Since for calculating probabilities and perplexities I only need the
> actual language model file and not the counts, this is a bit annoying
> as my countfiles are sometimes larger than my RAM.
>
> I kind of "solved" the problem by creating an empty countfile. I
> tested this on a small example and saw that it calculates the rescored
> probabilities fine. Is there any way to tell ngram not to look for the
> countfile? I guess that would be a better solution than just giving
> the program a dummy countfile that doesn't correspond to the language
> model file.
>
> Thanks
>
>
I would agree with you, but I'm cc-ing Jeff Bilmes, who wrote the
original code and might know of other reasons for handling the
countfiles the way it is done now.
If empty countfiles work for you then a quick workaround is to write a
few lines of perl that replace the count files with /dev/null (no need
to create actual empty files) in any given FLM model file.
Andreas
From stolcke at icsi.berkeley.edu Thu Oct 11 11:15:22 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 11 Oct 2012 11:15:22 -0700
Subject: [SRILM User List] Why does 'ngram -factored' need the countfile
In-Reply-To: <5076F6CC.80807@icsi.berkeley.edu>
References: <5076DECB.4080301@uni-mb.si> <5076F6CC.80807@icsi.berkeley.edu>
Message-ID: <50770CBA.5050707@icsi.berkeley.edu>
FYI, here is Jeff's response, which didn't get propagated to the list
since he isn't subscribed:
On 10/11/2012 10:37 AM, Jeff Bilmes wrote:
> For some backoff strategies (which can only be determined based on the
> options associated with the backoff graph), one does need the count
> file to determine how to do backoff. If I remember correctly, I think
> that the check for existence of count file is done at a stage in the
> code far different than when it is determined if it is needed or not
> which might be the reason why it just, by default, always asks for
> one. But if you are certain that in your backoff options associated
> with the backoff graph it is not necessary to have a count file, then
> it should be safe to use the /dev/null solution mentioned by Andreas
> below ...
Andreas
From gregor.donaj at uni-mb.si Fri Oct 12 01:25:38 2012
From: gregor.donaj at uni-mb.si (Gregor Donaj)
Date: Fri, 12 Oct 2012 10:25:38 +0200
Subject: [SRILM User List] Why does 'ngram -factored' need the countfile
In-Reply-To: <50770CBA.5050707@icsi.berkeley.edu>
References: <5076DECB.4080301@uni-mb.si> <5076F6CC.80807@icsi.berkeley.edu>
<50770CBA.5050707@icsi.berkeley.edu>
Message-ID: <5077D402.9080905@uni-mb.si>
Thank you for your answers. I already suspected it had something to do with
backoff strategies. I am currently experimenting only on models with
fixed backoff paths, so I will use /dev/null.
Gregor
On 10/11/2012 08:15 PM, Andreas Stolcke wrote:
> FYI, here is Jeff's response, which didn't get propagated to the list
> since he isn't subscribed:
>
> On 10/11/2012 10:37 AM, Jeff Bilmes wrote:
>> For some backoff strategies (which can only be determined based on
>> the options associated with the backoff graph), one does need the
>> count file to determine how to do backoff. If I remember correctly, I
>> think that the check for existence of count file is done at a stage
>> in the code far different than when it is determined if it is needed
>> or not which might be the reason why it just, by default, always asks
>> for one. But if you are certain that in your backoff options
>> associated with the backoff graph it is not necessary to have a count
>> file, then it should be safe to use the /dev/null solution mentioned
>> by Andreas below ...
>
> Andreas
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
>
From alphabnu at gmail.com Fri Oct 12 10:17:52 2012
From: alphabnu at gmail.com (Xiaolin Xie)
Date: Fri, 12 Oct 2012 12:17:52 -0500
Subject: [SRILM User List] ask help with calculating word conditional
probability
Message-ID:
Hi SRILM users,
I am working on a project that needs to calculate the conditional
probability of each word given its previous two words in a paragraph. A
language model has been trained on a training set. Do you guys have any
idea how to directly calculate the conditional probability
p(W_k | W_k-1, W_k-2) using the SRILM toolkit and the trained language
model? Thanks a lot! I really appreciate any help you can offer.
Xiaolin.
From stolcke at icsi.berkeley.edu Fri Oct 12 10:45:41 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 12 Oct 2012 10:45:41 -0700
Subject: [SRILM User List] ask help with calculating word conditional
probability
In-Reply-To:
References:
Message-ID: <50785745.8090909@icsi.berkeley.edu>
On 10/12/2012 10:17 AM, Xiaolin Xie wrote:
>
> Hi SRILM users.
>
> I am working on a project that needs to calculate the conditional
> probability of each word given its previous two words in a paragraph.
> A language model has been trained from a training set. Do you guys
> have any idea about how to directly calculate the conditional
> probability p(W_k|W_k-1, W_k-2), using the SRILM toolkit and the
> trained language model? Thanks a lot! I really appreciate any help you
> can offer.
>
One method is described in
http://www.speech.sri.com/pipermail/srilm-user/2012q3/001314.html .
Hope that helps,
Andreas
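As a complement to the linked post, a common way to get per-word conditional probabilities out of SRILM is ngram -ppl with -debug 2, which prints p(word | preceding context) for every word of the input. This is only a sketch: trained.lm is a placeholder for the already-trained model, and the snippet deliberately skips the ngram call when SRILM is not on the PATH.

```shell
# One sentence per line, as ngram -ppl expects.
cat > paragraph.txt <<'EOF'
this is a small test paragraph
EOF

# With -debug 2, ngram -ppl reports the conditional probability of
# each word given its context under the loaded model -- for a
# trigram model, exactly p(W_k | W_k-1, W_k-2).
# trained.lm is a placeholder for your trained trigram model.
if command -v ngram >/dev/null 2>&1; then
    ngram -lm trained.lm -order 3 -ppl paragraph.txt -debug 2
else
    echo "SRILM's ngram not found on PATH; skipping"
fi
```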
From yuan at ks.cs.titech.ac.jp Sat Oct 13 02:37:39 2012
From: yuan at ks.cs.titech.ac.jp (yuan liang)
Date: Sat, 13 Oct 2012 18:37:39 +0900
Subject: [SRILM User List] lattice rescoring with conventional LM and FLM
Message-ID:
Hi srilm users,
Now I'm using 'lattice-tool' to rescore lattices; my goal is to use a
Factored Language Model (FLM) score to replace the original language
model score in the word lattice.
1) First, in the baseline system, I used a conventional bigram LM to do
speech recognition and generate the HTK word lattice (we name it "Lattice_1").
Then I tried to use a conventional trigram LM to rescore "Lattice_1",
using:
"lattice-tool -in-lattice Lattice_1 -unk -vocab [voc_file] -read-htk
-no-nulls -no-htk-nulls -lm [Trigram_file] -htk-lmscale 15 -htk-logbase
2.71828183 -posterior-scale 15 -write-htk -out-lattice Lattice_2"
I just want the new trigram LM scores to replace the old LM scores in
"Lattice_1", so I think "Lattice_2" and "Lattice_1" should have the same
size; only each word's LM score should differ. But I found that
"Lattice_2" is larger than "Lattice_1". Did I miss something? How can I
replace only the LM scores without expanding the size of the lattice?
2) I used a trigram in FLM format to rescore "Lattice_1":
First I converted all word nodes (HTK format) to the FLM representation;
then I rescored with:
"lattice-tool -in-lattice Lattice_1 -unk -vocab [voc_file]
-read-htk -no-nulls -no-htk-nulls -factored -lm
[FLM_specification_file] -htk-lmscale 15 -htk-logbase 2.71828183
-posterior-scale 15 -write-htk -out-lattice Lattice_3"
I think "Lattice_2" and "Lattice_3" should be the same, since the
perplexity of the trigram and of the trigram in FLM format is the same.
However, they are different. Did I miss something?
3) I also checked the accuracy of the decoding results obtained with
"Lattice_2" and "Lattice_3":
the Viterbi decoding result is the same;
the n-best lists are almost the same, but using "Lattice_2" is
better than using "Lattice_3";
the posterior decoding result is quite different, and using
"Lattice_2" is better than using "Lattice_3".
Did I miss something when using the FLM to rescore the lattice?
Thank you very much!
Yuan
From alphabnu at gmail.com Sun Oct 14 10:15:42 2012
From: alphabnu at gmail.com (Xiaolin Xie)
Date: Sun, 14 Oct 2012 12:15:42 -0500
Subject: [SRILM User List] ask help with calculating word conditional
probability
In-Reply-To: <50785745.8090909@icsi.berkeley.edu>
References:
<50785745.8090909@icsi.berkeley.edu>
Message-ID:
Hi Andreas
Thank you very much. This information is very helpful.
Xiaolin.
On Fri, Oct 12, 2012 at 12:45 PM, Andreas Stolcke wrote:
> On 10/12/2012 10:17 AM, Xiaolin Xie wrote:
>
>>
>> Hi SRILM users.
>>
>> I am working on a project that needs to calculate the conditional
>> probability of each word given its previous two words in a paragraph. A
>> language model has been trained from a training set. Do you guys have any
>> idea about how to directly calculate the conditional probability
>> p(W_k|W_k-1, W_k-2), using the SRILM toolkit and the trained language
>> model? Thanks a lot! I really appreciate any help you can offer.
>>
>>
> One method is described in
> http://www.speech.sri.com/pipermail/srilm-user/2012q3/001314.html
> Hope that helps,
>
> Andreas
>
>
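For intuition about what the linked method computes, here is a minimal sketch (not SRILM code) of how a backoff model yields p(w | w-2 w-1) from an ARPA file: use the explicit trigram if present, otherwise add the context's backoff weight and recurse on a shorter context. The tiny ARPA model below is invented for illustration.

```python
# Minimal sketch (not SRILM itself) of conditional probability lookup
# in an ARPA-format backoff model. All log values are log10.
ARPA = r"""
\data\
ngram 1=4
ngram 2=2
ngram 3=1

\1-grams:
-0.8 a -0.3
-0.8 b -0.3
-0.8 c -0.3
-0.8 </s>

\2-grams:
-0.5 a b -0.2
-0.5 b c

\3-grams:
-0.4 a b c

\end\
"""

def parse_arpa(text):
    """Return (probs, bows): log10 probabilities and backoff weights."""
    probs, bows, order = {}, {}, 0
    for line in text.splitlines():
        line = line.strip()
        if line.endswith("-grams:"):            # e.g. "\2-grams:"
            order = int(line[1])
            continue
        if not line or line.startswith("\\") or "=" in line:
            continue                            # \data\, \end\, counts
        parts = line.split()
        ngram = tuple(parts[1:1 + order])
        probs[ngram] = float(parts[0])
        if len(parts) > 1 + order:              # trailing backoff weight
            bows[ngram] = float(parts[-1])
    return probs, bows

def cond_logprob(ngram, probs, bows):
    """log10 p(ngram[-1] | ngram[:-1]), with backoff."""
    if ngram in probs:
        return probs[ngram]
    if len(ngram) == 1:
        return float("-inf")                    # OOV; SRILM would use <unk>
    bow = bows.get(ngram[:-1], 0.0)             # absent bow counts as 0
    return bow + cond_logprob(ngram[1:], probs, bows)

probs, bows = parse_arpa(ARPA)
print(cond_logprob(("a", "b", "c"), probs, bows))     # explicit trigram
print(cond_logprob(("b", "c", "</s>"), probs, bows))  # backs off twice
```

This is the same quantity `ngram -debug 2 -ppl` prints per word, just stripped to its essentials.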
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From stolcke at icsi.berkeley.edu Tue Oct 16 09:59:46 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 16 Oct 2012 09:59:46 -0700
Subject: [SRILM User List] lattice rescoring with conventional LM and FLM
In-Reply-To:
References:
Message-ID: <507D9282.1040306@icsi.berkeley.edu>
On 10/13/2012 2:37 AM, yuan liang wrote:
> Hi srilm users,
>
> Now I'm using the 'lattice-tool' to rescore the lattice, my goal is
> using a Factor Language Model(FLM) score to replace the original
> language model score in the word lattice.
>
> 1) First in the baseline system, I used conventional Bigram LM to do
> speech recognition and generate the htk word lattice (we name it
> "Lattice_1"). Then I try to use a conventional Trigram LM to rescore
> the "Lattice_1", using:
>
> "lattice-tool -in-lattice Lattice_1 -unk -vocab [voc_file]
> -read-htk -no-nulls -no-htk-nulls -lm [Trigram_file] -htk-lmscale 15
> -htk-logbase 2.71828183 -posterior-scale 15 -write-htk -out-lattice
> Lattice_2"
Two factors come into play here:
1) When you apply a trigram model to a bigram lattice, the lattice is
expanded so that trigram contexts (i.e., the last two words) are encoded
uniquely at each node. Hence the size increase.
2) The options -no-nulls -no-htk-nulls actually imply a size increase
all on their own because of the way HTK lattices are represented
internally (arcs are encoded as nodes, and then mapped back to arcs on
output). You should not use them.
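Point 1) can be illustrated with a toy example (a sketch of the idea only, not lattice-tool's actual algorithm): to attach a trigram probability to each arc, every node must be split per distinct predecessor word, so that each expanded node encodes a unique two-word history. The lattice below is made up.

```python
# Toy bigram lattice: word-labeled nodes and arcs.
# <s> -> the -> {cat, dog} -> sat -> </s>
words = {0: "<s>", 1: "the", 2: "cat", 3: "dog", 4: "sat", 5: "</s>"}
edges = [(0, 1), (1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]

def expand_trigram(words, edges):
    """Split each node into (node, predecessor-word) copies so every
    expanded node encodes a unique two-word history."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
    start = (0, None)
    new_nodes, new_edges = set(), set()
    frontier, seen = [start], {start}
    while frontier:
        node, prev_word = frontier.pop()
        new_nodes.add((node, prev_word))
        for nxt in succ.get(node, []):
            tgt = (nxt, words[node])   # history entering tgt is now unique
            new_edges.add(((node, prev_word), tgt))
            if tgt not in seen:
                seen.add(tgt)
                frontier.append(tgt)
    return new_nodes, new_edges

nodes2, edges2 = expand_trigram(words, edges)
print(len(words), "nodes ->", len(nodes2), "nodes after expansion")
```

Here the "sat" node is duplicated (one copy reached via "cat", one via "dog"), because p(sat | the cat) and p(sat | the dog) must live on different arcs; on real lattices this duplication compounds, which is why "Lattice_2" grows.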
>
> I just want to use the new Trigram LM score to replace the old LM
> score in "Lattice_1", so I think "Lattice_2" and "Lattice_1" should
> have the same size, with only each word's LM score being different. But
> I found that the size of "Lattice_2" is larger than "Lattice_1". Did I
> miss something? How can I only replace the LM score without expanding
> the size of the lattice?
>
>
>
> 2) I used a Trigram in FLM format to rescore "Lattice_1":
>
> First I converted all word nodes (HTK format) to FLM representation;
>
> Then rescored with:
>
> " lattice-tool -in-lattice Lattice_1 -unk -vocab [voc_file]
> -read-htk -no-nulls -no-htk-nulls -factored -lm
> [FLM_specification_file] -htk-lmscale 15 -htk-logbase 2.71828183
> -posterior-scale 15 -write-htk -out-lattice Lattice_3"
>
> I think "Lattice_2" and "Lattice_3" should be the same, since the
> perplexities of the Trigram and of the Trigram in FLM format are the same.
> However, they are different. Did I miss something?
This is a question about the equivalent encoding of standard word-based
LMs as FLMs, and I'm not an expert here.
However, as a sanity check, I would first do a simple perplexity
computation (ngram -debug 2 -ppl) with both models on some test set and
make sure you get the same word-for-word conditional probabilities. If
not, you can spot where the differences are and present a specific case
of different probabilities to the group for debugging.
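One way to carry out this check mechanically is to parse the two `-debug 2` outputs and diff the per-word probabilities. This is only a sketch: the regex assumes lines of roughly the form `p( the | <s> ) = [2gram] 0.0937 [ -1.028 ]`; verify it against your actual ngram output, and the two sample outputs below are invented.

```python
import re

# Matches the word and the bracketed log10 probability at line end.
LINE = re.compile(r"p\( (\S+) \|.*?\[\s*(-?[0-9.eE+-]+)\s*\]")

def word_logprobs(debug_output):
    """Return (word, log10 prob) pairs from `ngram -debug 2 -ppl` output."""
    return [(m.group(1), float(m.group(2)))
            for m in (LINE.search(l) for l in debug_output.splitlines())
            if m]

# Invented fragments standing in for the two models' outputs:
out_a = """\
p( the | <s> ) \t= [2gram] 0.0937 [ -1.028 ]
p( cat | the ...) \t= [3gram] 0.0012 [ -2.921 ]
"""
out_b = """\
p( the | <s> ) \t= [2gram] 0.0937 [ -1.028 ]
p( cat | the ...) \t= [2gram] 0.0011 [ -2.959 ]
"""

diffs = [(w, pa, pb) for (w, pa), (_, pb)
         in zip(word_logprobs(out_a), word_logprobs(out_b))
         if abs(pa - pb) > 1e-6]
print(diffs)   # each entry is a word where the two models disagree
```

Each surviving entry is a concrete case of differing probabilities that can be posted to the list for debugging.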
>
>
>
> 3) Also I checked the accuracy of the decoding results using
> "Lattice_2" and "Lattice_3"; the results are:
>
>             the Viterbi decoding result is the same;
>             the n-best lists are almost the same, but using "Lattice_2"
> is better than using "Lattice_3";
>             the posterior decoding results are quite different; using
> "Lattice_2" is better than using "Lattice_3".
>
> Did I miss something when I use the FLM to rescore the lattice?
You need to resolve question 2 above first before tackling this one.
Andreas
From yuan at ks.cs.titech.ac.jp Tue Oct 16 17:33:03 2012
From: yuan at ks.cs.titech.ac.jp (yuan liang)
Date: Wed, 17 Oct 2012 09:33:03 +0900
Subject: [SRILM User List] lattice rescoring with conventional LM and FLM
In-Reply-To: <507D9282.1040306@icsi.berkeley.edu>
References:
<507D9282.1040306@icsi.berkeley.edu>
Message-ID:
Hi Andreas,
Thank you very much!
>> 2) I used a Trigram in FLM format to rescore "Lattice_1":
>>
>> First I converted all word nodes (HTK format) to FLM representation;
>>
>> Then rescored with:
>>
>> " lattice-tool -in-lattice Lattice_1 -unk -vocab [voc_file]
>> -read-htk -no-nulls -no-htk-nulls -factored -lm
>> [FLM_specification_file] -htk-lmscale 15 -htk-logbase 2.71828183
>> -posterior-scale 15 -write-htk -out-lattice Lattice_3"
>>
>> I think "Lattice_2" and "Lattice_3" should be the same, since the
>> perplexities of the Trigram and of the Trigram in FLM format are the same.
>> However, they are different. Did I miss something?
>>
>
> This is a question about the equivalent encoding of standard word-based
> LMs as FLMs, and I'm not an expert here.
> However, as a sanity check, I would first do a simple perplexity
> computation (ngram -debug 2 -ppl) with both models on some test set and
> make sure you get the same word-for-word conditional probabilities. If
> not, you can spot where the differences are and present a specific case of
> different probabilities to the group for debugging.
>
>
Actually I did the perplexity test on a test set of 6564 sentences (72854
words). The total perplexity is the same with the standard word-based
Trigram LM as with the FLM Trigram. I also checked the details of the
word-for-word conditional probabilities: of these 72854 words, only 442
words' conditional probabilities are not exactly the same, and even there
the difference is negligible (e.g. 0.00531048 vs. 0.00531049, or
5.38809e-07 vs. 5.38808e-07). So I think we can say both models produce
the same word-for-word conditional probabilities.
I also considered that it may be due to the FLM format: lattice expansion
with the standard Trigram seems to differ from the FLM Trigram. With the
FLM Trigram the expanded lattice is around 300 times larger than with the
standard Trigram, so maybe the expansion works differently. I'm not sure;
I still need to investigate more.
Thank you very much for your advice!
Regards,
Yuan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From stolcke at icsi.berkeley.edu Tue Oct 16 21:52:44 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 16 Oct 2012 21:52:44 -0700
Subject: [SRILM User List] lattice rescoring with conventional LM and FLM
In-Reply-To:
References:
<507D9282.1040306@icsi.berkeley.edu>
Message-ID: <507E399C.8030809@icsi.berkeley.edu>
On 10/16/2012 5:33 PM, yuan liang wrote:
> Hi Andreas,
>
> Thank you very much!
>
>
> 2) I used a Trigram in FLM format to rescore "Lattice_1":
>
> First I converted all word nodes (HTK format) to FLM
> representation;
>
> Then rescored with:
>
> " lattice-tool -in-lattice Lattice_1 -unk -vocab
> [voc_file] -read-htk -no-nulls -no-htk-nulls -factored
> -lm [FLM_specification_file] -htk-lmscale 15 -htk-logbase
> 2.71828183 -posterior-scale 15 -write-htk -out-lattice
> Lattice_3"
>
> I think "Lattice_2" and "Lattice_3" should be the same,
> since the perplexities of the Trigram and of the Trigram in FLM
> format are the same. However, they are different. Did I miss
> something?
>
>
> This is a question about the equivalent encoding of standard
> word-based LMs as FLMs, and I'm not an expert here.
> However, as a sanity check, I would first do a simple perplexity
> computation (ngram -debug 2 -ppl) with both models on some test
> set and make sure you get the same word-for-word conditional
> probabilities. If not, you can spot where the differences are and
> present a specific case of different probabilities to the group
> for debugging.
>
>
> Actually I did the perplexity test on a test set of 6564 sentences
> (72854 words). The total perplexity is the same with the standard
> word-based Trigram LM as with the FLM Trigram. I also checked the
> details of the word-for-word conditional probabilities: of these 72854
> words, only 442 words' conditional probabilities are not exactly the
> same, and even there the difference is negligible (e.g. 0.00531048 vs.
> 0.00531049, or 5.38809e-07 vs. 5.38808e-07). So I think we can say both
> models produce the same word-for-word conditional probabilities.
>
> I also considered that it may be due to the FLM format: lattice
> expansion with the standard Trigram seems to differ from the FLM
> Trigram. With the FLM Trigram the expanded lattice is around 300 times
> larger than with the standard Trigram, so maybe the expansion works
> differently. I'm not sure; I still need to investigate more.
The lattice expansion algorithm makes use of the backoff structure of
the standard LM to minimize the number of nodes that need to be
duplicated to correctly apply the probabilities. The FLM makes more
conservative assumptions and always assumes you need two words of
context, leading to more nodes after expansion. That would explain the
size difference.
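A back-of-the-envelope illustration of this point (all data invented; this is not the actual expansion code): two-word histories without an explicit trigram can share an expanded state, because their probability comes from the backed-off bigram anyway, while a context-blind expander pays for every distinct history.

```python
# Contexts for which the (hypothetical) LM stores explicit trigrams:
explicit_trigram_contexts = {("the", "cat"), ("the", "dog")}
# Distinct two-word histories reaching some lattice node:
histories = [("the", "cat"), ("the", "dog"),
             ("a", "cat"), ("a", "dog"), ("one", "cat")]

# Conservative (FLM-like) expansion: one state per distinct history.
full = len(set(histories))

# Backoff-aware expansion: a history without an explicit trigram
# collapses to its last word, since p(w | h) = bow(h) * p(w | h[-1]).
aware = len({h if h in explicit_trigram_contexts else h[-1:]
             for h in histories})

print("states needed:", full, "(conservative) vs", aware, "(backoff-aware)")
```

With real LMs, where most possible trigrams are absent, the gap between the two counts becomes dramatic, consistent with the ~300x size difference observed.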
You can also check the probabilities in expanded lattices. The command
lattice-tool -in-lattice LATTICE -ppl TEXT -debug 2 ...
will compute the probabilities assigned to the words in TEXT by
traversing the lattice. It is worth checking first that expansion with
FLMs yields the right probabilities.
You say that viterbi decoding gives almost the same results (this
suggests the expansion works correctly), but posterior (confusion
network) decoding doesn't. It is possible there is a problem with
building CNs from lattices with factored vocabularies. I don't think I
ever tried that. It would help to find a minimal test case that shows
the problem.
Andreas
>
>
> Thank you very much for your advice!
>
> Regards,
> Yuan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From yuan at ks.cs.titech.ac.jp Wed Oct 17 03:04:53 2012
From: yuan at ks.cs.titech.ac.jp (yuan liang)
Date: Wed, 17 Oct 2012 19:04:53 +0900
Subject: [SRILM User List] lattice rescoring with conventional LM and FLM
In-Reply-To: <507E399C.8030809@icsi.berkeley.edu>
References:
<507D9282.1040306@icsi.berkeley.edu>
<507E399C.8030809@icsi.berkeley.edu>
Message-ID:
Hi Andreas,
Thank you very much!
I will test more.
Regards,
Yuan
On Wed, Oct 17, 2012 at 1:52 PM, Andreas Stolcke
wrote:
> On 10/16/2012 5:33 PM, yuan liang wrote:
>
> Hi Andreas,
>
> Thank you very much!
>
>
>>> 2) I used a Trigram in FLM format to rescore "Lattice_1":
>>>
>>> First I converted all word nodes (HTK format) to FLM representation;
>>>
>>> Then rescored with:
>>>
>>> " lattice-tool -in-lattice Lattice_1 -unk -vocab [voc_file]
>>> -read-htk -no-nulls -no-htk-nulls -factored -lm
>>> [FLM_specification_file] -htk-lmscale 15 -htk-logbase 2.71828183
>>> -posterior-scale 15 -write-htk -out-lattice Lattice_3"
>>>
>>> I think "Lattice_2" and "Lattice_3" should be the same, since the
>>> perplexities of the Trigram and of the Trigram in FLM format are the same.
>>> However, they are different. Did I miss something?
>>>
>>
>> This is a question about the equivalent encoding of standard word-based
>> LMs as FLMs, and I'm not an expert here.
>> However, as a sanity check, I would first do a simple perplexity
>> computation (ngram -debug 2 -ppl) with both models on some test set and
>> make sure you get the same word-for-word conditional probabilities. If
>> not, you can spot where the differences are and present a specific case of
>> different probabilities to the group for debugging.
>>
>>
> Actually I did the perplexity test on a test set of 6564 sentences
> (72854 words). The total perplexity is the same with the standard
> word-based Trigram LM as with the FLM Trigram. I also checked the
> details of the word-for-word conditional probabilities: of these 72854
> words, only 442 words' conditional probabilities are not exactly the
> same, and even there the difference is negligible (e.g. 0.00531048 vs.
> 0.00531049, or 5.38809e-07 vs. 5.38808e-07). So I think we can say both
> models produce the same word-for-word conditional probabilities.
>
> I also considered that it may be due to the FLM format: lattice
> expansion with the standard Trigram seems to differ from the FLM
> Trigram. With the FLM Trigram the expanded lattice is around 300 times
> larger than with the standard Trigram, so maybe the expansion works
> differently. I'm not sure; I still need to investigate more.
>
>
> The lattice expansion algorithm makes use of the backoff structure of the
> standard LM to minimize the number of nodes that need to be duplicated to
> correctly apply the probabilities. The FLM makes more conservative
> assumptions and always assumes you need two words of context, leading to
> more nodes after expansion. That would explain the size difference.
>
> You can also check the probabilities in expanded lattices. The command
>
> lattice-tool -in-lattice LATTICE -ppl TEXT -debug 2 ...
>
> will compute the probabilities assigned to the words in TEXT by traversing
> the lattice. It is worth checking first that expansion with FLMs yields
> the right probabilities.
>
> You say that viterbi decoding gives almost the same results (this suggests
> the expansion works correctly), but posterior (confusion network) decoding
> doesn't. It is possible there is a problem with building CNs from lattices
> with factored vocabularies. I don't think I ever tried that. It would
> help to find a minimal test case that shows the problem.
>
> Andreas
>
>
>
>
> Thank you very much for your advice!
>
> Regards,
> Yuan
>
>
>
From stolcke at icsi.berkeley.edu Tue Oct 23 09:52:47 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 23 Oct 2012 09:52:47 -0700
Subject: [SRILM User List] Commands not found
In-Reply-To: <1350977625.53790.YahooMailNeo@web194705.mail.sg3.yahoo.com>
References: <1350977625.53790.YahooMailNeo@web194705.mail.sg3.yahoo.com>
Message-ID: <5086CB5F.5000109@icsi.berkeley.edu>
On 10/23/2012 12:33 AM, Don Erick Bonus wrote:
> Hi everyone.
>
> I'm new to SRILM and I have SRILM installed on an Ubuntu machine.
> Based on what I found on the Internet, make World and make test did
> work, displaying a lot of information. However, when I try to run the
> commands in the bin/i686 folder for testing, I always get a COMMAND
> NOT FOUND message. When I try to run man to display the manual for the
> commands, it says "No manual entry ...". I've been searching the
> Internet for a solution and can't find one.
>
> Please help me with this one... I need to build a statistics-based
> spell and grammar checker for Tagalog as a project. You may also
> suggest steps for how I can do this, since I'm new to statistical NLP.
>
> Your help will be highly appreciated. Thanks.
> Erick
Try invoking ./bin/i686/ngram -version. Assuming that works, the only
problem is that your shell's executable search path is not set to
include $SRILM/bin and $SRILM/bin/i686 . This is item 6 in the INSTALL
instructions. Please consult a local Linux/Unix expert if needed.
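For reference, item 6 of the INSTALL file amounts to something like the following (a sketch assuming SRILM is unpacked at $HOME/srilm and the machine type is i686; adjust both to your setup, e.g. to whatever `sbin/machine-type` reports):

```shell
# Add to ~/.bashrc or ~/.profile, then open a new shell:
export SRILM=$HOME/srilm
export PATH=$PATH:$SRILM/bin:$SRILM/bin/i686
# Afterwards `ngram -version` should work from any directory.
# For the man pages, also add:
export MANPATH=$MANPATH:$SRILM/man
```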
Andreas
From lisepul at gmail.com Fri Oct 26 05:36:44 2012
From: lisepul at gmail.com (Lianet Sepulveda Torres)
Date: Fri, 26 Oct 2012 10:36:44 -0200
Subject: [SRILM User List] SRILM install problem
Message-ID:
Hi,
I tried to install SRILM on Windows 7 using Cygwin.
The following errors show up when I run make World:
mkdir -p include lib bin
make init
make[1]: Entering directory `/cygdrive/c/cygwin/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
    (cd $subdir/src; make SRILM=/cygdrive/c/cygwin/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= init) || exit 1; \
done
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/misc/src'
cd ..; /cygdrive/c/cygwin/srilm/sbin/make-standard-directories
make ../obj/cygwin/STAMP ../bin/cygwin/STAMP
make[3]: Entering directory `/cygdrive/c/cygwin/srilm/misc/src'
make[3]: `../obj/cygwin/STAMP' is up to date.
make[3]: `../bin/cygwin/STAMP' is up to date.
make[3]: Leaving directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/dstruct/src'
cd ..; /cygdrive/c/cygwin/srilm/sbin/make-standard-directories
make ../obj/cygwin/STAMP ../bin/cygwin/STAMP
make[3]: Entering directory `/cygdrive/c/cygwin/srilm/dstruct/src'
make[3]: `../obj/cygwin/STAMP' is up to date.
make[3]: `../bin/cygwin/STAMP' is up to date.
make[3]: Leaving directory `/cygdrive/c/cygwin/srilm/dstruct/src'
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/dstruct/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/lm/src'
cd ..; /cygdrive/c/cygwin/srilm/sbin/make-standard-directories
make ../obj/cygwin/STAMP ../bin/cygwin/STAMP
make[3]: Entering directory `/cygdrive/c/cygwin/srilm/lm/src'
make[3]: `../obj/cygwin/STAMP' is up to date.
make[3]: `../bin/cygwin/STAMP' is up to date.
make[3]: Leaving directory `/cygdrive/c/cygwin/srilm/lm/src'
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/lm/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/flm/src'
cd ..; /cygdrive/c/cygwin/srilm/sbin/make-standard-directories
make ../obj/cygwin/STAMP ../bin/cygwin/STAMP
make[3]: Entering directory `/cygdrive/c/cygwin/srilm/flm/src'
make[3]: `../obj/cygwin/STAMP' is up to date.
make[3]: `../bin/cygwin/STAMP' is up to date.
make[3]: Leaving directory `/cygdrive/c/cygwin/srilm/flm/src'
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/flm/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/lattice/src'
cd ..; /cygdrive/c/cygwin/srilm/sbin/make-standard-directories
make ../obj/cygwin/STAMP ../bin/cygwin/STAMP
make[3]: Entering directory `/cygdrive/c/cygwin/srilm/lattice/src'
make[3]: `../obj/cygwin/STAMP' is up to date.
make[3]: `../bin/cygwin/STAMP' is up to date.
make[3]: Leaving directory `/cygdrive/c/cygwin/srilm/lattice/src'
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/lattice/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/utils/src'
cd ..; /cygdrive/c/cygwin/srilm/sbin/make-standard-directories
make ../obj/cygwin/STAMP ../bin/cygwin/STAMP
make[3]: Entering directory `/cygdrive/c/cygwin/srilm/utils/src'
make[3]: `../obj/cygwin/STAMP' is up to date.
make[3]: `../bin/cygwin/STAMP' is up to date.
make[3]: Leaving directory `/cygdrive/c/cygwin/srilm/utils/src'
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/utils/src'
make[1]: Leaving directory `/cygdrive/c/cygwin/srilm'
make release-headers
make[1]: Entering directory `/cygdrive/c/cygwin/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
    (cd $subdir/src; make SRILM=/cygdrive/c/cygwin/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= release-headers) || exit 1; \
done
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Nothing to be done for `release-headers'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/dstruct/src'
make[2]: Nothing to be done for `release-headers'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/dstruct/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/lm/src'
make[2]: Nothing to be done for `release-headers'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/lm/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/flm/src'
make[2]: Nothing to be done for `release-headers'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/flm/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/lattice/src'
make[2]: Nothing to be done for `release-headers'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/lattice/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/utils/src'
make[2]: Nothing to be done for `release-headers'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/utils/src'
make[1]: Leaving directory `/cygdrive/c/cygwin/srilm'
make depend
make[1]: Entering directory `/cygdrive/c/cygwin/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
    (cd $subdir/src; make SRILM=/cygdrive/c/cygwin/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= depend) || exit 1; \
done
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/misc/src'
rm -f Dependencies.cygwin
C:/cygwin/bin -I. -I../../include -MM ./option.c ./zio.c ./fcheck.c
./fake-rand48.c ./version.c ./ztest.c |
sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin
/bin/sh: C:/cygwin/bin: is a directory
C:/cygwin/bin -I. -I../../include -MM ./Debug.cc ./File.cc
./MStringTokUtil.cc ./testFile.cc |
sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin
/bin/sh: C:/cygwin/bin: is a directory
/cygdrive/c/cygwin/srilm/sbin/generate-program-dependencies ../bin/cygwin
../obj/cygwin ".exe" ztest testFile |
sed -e "s&\.o&.o&g" >> Dependencies.cygwin
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/dstruct/src'
rm -f Dependencies.cygwin
C:/cygwin/bin -I. -I../../include -MM ./qsort.c ./BlockMalloc.c
./maxalloc.c |
sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin
/bin/sh: C:/cygwin/bin: is a directory
C:/cygwin/bin -I. -I../../include -MM ./MemStats.cc ./LHashTrie.cc
./SArrayTrie.cc ./Array.cc ./IntervalHeap.cc ./Map.cc ./SArray.cc
./LHash.cc ./Map2.cc ./Trie.cc ./CachedMem.cc ./testArray.cc ./testMap.cc
./benchHash.cc ./testHash.cc ./testSizes.cc ./testCachedMem.cc
./testBlockMalloc.cc |
sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin
/bin/sh: C:/cygwin/bin: is a directory
/cygdrive/c/cygwin/srilm/sbin/generate-program-dependencies ../bin/cygwin
../obj/cygwin ".exe" maxalloc testArray testMap benchHash testHash
testSizes testCachedMem testBlockMalloc |
sed -e "s&\.o&.o&g" >> Dependencies.cygwin
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/dstruct/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/lm/src'
rm -f Dependencies.cygwin
C:/cygwin/bin -I. -I../../include -MM ./matherr.c |
sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin
/bin/sh: C:/cygwin/bin: is a directory
C:/cygwin/bin -I. -I../../include -MM ./Prob.cc ./Counts.cc ./XCount.cc
./Vocab.cc ./VocabMap.cc ./VocabMultiMap.cc ./VocabDistance.cc
./SubVocab.cc ./MultiwordVocab.cc ./TextStats.cc ./LM.cc ./LMClient.cc
./LMStats.cc ./RefList.cc ./Bleu.cc ./NBest.cc ./NBestSet.cc ./NgramLM.cc
./NgramStatsInt.cc ./NgramStatsShort.cc ./NgramStatsLong.cc
./NgramStatsLongLong.cc ./NgramStatsFloat.cc ./NgramStatsDouble.cc
./NgramStatsXCount.cc ./NgramCountLM.cc ./Discount.cc ./ClassNgram.cc
./SimpleClassNgram.cc ./DFNgram.cc ./SkipNgram.cc ./HiddenNgram.cc
./HiddenSNgram.cc ./VarNgram.cc ./DecipherNgram.cc ./TaggedVocab.cc
./TaggedNgram.cc ./TaggedNgramStats.cc ./StopNgram.cc ./StopNgramStats.cc
./MultiwordLM.cc ./NonzeroLM.cc ./BayesMix.cc ./LoglinearMix.cc
./AdaptiveMix.cc ./AdaptiveMarginals.cc ./CacheLM.cc ./DynamicLM.cc
./HMMofNgrams.cc ./WordAlign.cc ./WordLattice.cc ./WordMesh.cc
./simpleTrigram.cc ./NgramStats.cc ./Trellis.cc ./testBinaryCounts.cc
./testHash.cc ./testProb.cc ./testXCount.cc ./testParseFloat.cc
./testVocabDistance.cc ./testNgram.cc ./testNgramAlloc.cc
./testMultiReadLM.cc ./hoeffding.cc ./tolower.cc ./testLattice.cc
./testError.cc ./testNBest.cc ./testMix.cc ./ngram.cc ./ngram-count.cc
./ngram-merge.cc ./ngram-class.cc ./disambig.cc ./anti-ngram.cc
./nbest-lattice.cc ./nbest-mix.cc ./nbest-optimize.cc
./nbest-pron-score.cc ./segment.cc ./segment-nbest.cc ./hidden-ngram.cc
./multi-ngram.cc |
sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin
/bin/sh: C:/cygwin/bin: is a directory
/cygdrive/c/cygwin/srilm/sbin/generate-program-dependencies ../bin/cygwin
../obj/cygwin ".exe" testBinaryCounts testHash testProb testXCount
testParseFloat testVocabDistance testNgram testNgramAlloc testMultiReadLM
hoeffding tolower testLattice testError testNBest testMix ngram
ngram-count ngram-merge ngram-class disambig anti-ngram nbest-lattice
nbest-mix nbest-optimize nbest-pron-score segment segment-nbest
hidden-ngram multi-ngram |
sed -e "s&\.o&.o&g" >> Dependencies.cygwin
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/lm/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/flm/src'
rm -f Dependencies.cygwin
C:/cygwin/bin -I. -I../../include -MM ./FDiscount.cc ./FNgramStats.cc
./FNgramStatsInt.cc ./FNgramSpecs.cc ./FNgramSpecsInt.cc
./FactoredVocab.cc ./FNgramLM.cc ./ProductVocab.cc ./ProductNgram.cc
./wmatrix.cc ./pngram.cc ./fngram-count.cc ./fngram.cc |
sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin
/bin/sh: C:/cygwin/bin: is a directory
/cygdrive/c/cygwin/srilm/sbin/generate-program-dependencies ../bin/cygwin
../obj/cygwin ".exe" pngram fngram-count fngram |
sed -e "s&\.o&.o&g" >> Dependencies.cygwin
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/flm/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/lattice/src'
rm -f Dependencies.cygwin
C:/cygwin/bin -I. -I../../include -MM ./Lattice.cc ./LatticeAlign.cc
./LatticeExpand.cc ./LatticeIndex.cc ./LatticeNBest.cc ./LatticeNgrams.cc
./LatticeReduce.cc ./HTKLattice.cc ./LatticeLM.cc ./LatticeDecode.cc
./testLattice.cc ./lattice-tool.cc |
sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin
/bin/sh: C:/cygwin/bin: is a directory
/cygdrive/c/cygwin/srilm/sbin/generate-program-dependencies ../bin/cygwin
../obj/cygwin ".exe" testLattice lattice-tool |
sed -e "s&\.o&.o&g" >> Dependencies.cygwin
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/lattice/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/utils/src'
rm -f Dependencies.cygwin
/cygdrive/c/cygwin/srilm/sbin/generate-program-dependencies ../bin/cygwin
../obj/cygwin ".exe" |
sed -e "s&\.o&.o&g" >> Dependencies.cygwin
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/utils/src'
make[1]: Leaving directory `/cygdrive/c/cygwin/srilm'
make release-libraries
make[1]: Entering directory `/cygdrive/c/cygwin/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
    (cd $subdir/src; make SRILM=/cygdrive/c/cygwin/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= release-libraries) || exit 1; \
done
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Nothing to be done for `release-libraries'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/dstruct/src'
make[2]: Nothing to be done for `release-libraries'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/dstruct/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/lm/src'
make[2]: Nothing to be done for `release-libraries'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/lm/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/flm/src'
make[2]: Nothing to be done for `release-libraries'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/flm/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/lattice/src'
make[2]: Nothing to be done for `release-libraries'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/lattice/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/utils/src'
make[2]: Nothing to be done for `release-libraries'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/utils/src'
make[1]: Leaving directory `/cygdrive/c/cygwin/srilm'
make release-programs
make[1]: Entering directory `/cygdrive/c/cygwin/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
    (cd $subdir/src; make SRILM=/cygdrive/c/cygwin/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= release-programs) || exit 1; \
done
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Nothing to be done for `release-programs'.
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/misc/src'
make[2]: Entering directory `/cygdrive/c/cygwin/srilm/dstruct/src'
C:/cygwin/bin -I. -I../../include -L../../lib/cygwin -g -O2 -o
../bin/cygwin/maxalloc.exe ../obj/cygwin/maxalloc.o
../obj/cygwin/libdstruct.a -lm ../../lib/cygwin/libmisc.a -lm C:/cygwin/bin
make[2]: C:/cygwin/bin: Command not found
/cygdrive/c/cygwin/srilm/common/Makefile.common.targets:108: recipe for
target `../bin/cygwin/maxalloc.exe' failed
make[2]: *** [../bin/cygwin/maxalloc.exe] Error 127
make[2]: Leaving directory `/cygdrive/c/cygwin/srilm/dstruct/src'
Makefile:105: recipe for target `release-programs' failed
make[1]: *** [release-programs] Error 1
make[1]: Leaving directory `/cygdrive/c/cygwin/srilm'
Makefile:54: recipe for target `World' failed
make: *** [World] Error 2
Any ideas?
Regards,
Lisepul
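[For what it's worth, the log itself hints at the cause: every failing line starts with `C:/cygwin/bin', i.e. a directory is being executed where gcc/g++ should appear, which suggests the compiler variables (CC/CXX) ended up misdefined. A hedged checklist, not a guaranteed fix:]

```shell
# Check that Cygwin's compilers are actually installed and on PATH:
which gcc g++
gcc --version
# Look for stray CC/CXX settings in the environment that could replace
# the compiler with a bare directory name:
env | grep -E '^(CC|CXX)='
# Then rebuild from a clean tree:
make cleanest
make World
```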
From jmelvinjose73 at yahoo.com Sun Oct 28 17:24:08 2012
From: jmelvinjose73 at yahoo.com (Melvin Jose)
Date: Sun, 28 Oct 2012 17:24:08 -0700 (PDT)
Subject: [SRILM User List] FLM Training takes too long!
Message-ID: <1351470248.78427.YahooMailNeo@web125501.mail.ne1.yahoo.com>
Hey,
I am presently working with Tamil, a morphologically rich language. I am trying to build an FLM from approximately 3 million entries, but it has been running for more than a day and a half now. The FLM specification is
W : W(-1) W(-2) B(-1) S(-1) using generalized backoff, where B is the word base and S is the suffix.
Below is the output of -debug 2
warning: distributing 0.0989813 left-over probability mass over all 577519 words
discarded 1 0x4-gram probs predicting pseudo-events
discarded 1587186 0x4-gram probs discounted to zero
discarded 1 0x8-gram probs predicting pseudo-events
discarded 1 0xc-gram probs predicting pseudo-events
discarded 4721615 0xc-gram probs discounted to zero
Starting estimation of general graph-backoff node: LM 0 Node 0xC, children: 0x8 0x4
Finished estimation of multi-child graph-backoff node: LM 0 Node 0xC
This was the last message I received, a day and a half ago. Is it normal for it to take so long? I read that Katrin had no problem training on 5 million entries. Did it take this long? I am using a cluster in my lab for the computation, so there shouldn't be a problem with memory or computational power.
Is there any way I can tell fngram-count to use as much memory as it wants, or to parallelize the computation?
Thanks,
Melvin
From stolcke at icsi.berkeley.edu Sun Oct 28 23:16:35 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sun, 28 Oct 2012 23:16:35 -0700
Subject: [SRILM User List] FLM Training takes too long!
In-Reply-To: <1351470248.78427.YahooMailNeo@web125501.mail.ne1.yahoo.com>
References: <1351470248.78427.YahooMailNeo@web125501.mail.ne1.yahoo.com>
Message-ID: <508E1F43.2050405@icsi.berkeley.edu>
On 10/28/2012 5:24 PM, Melvin Jose wrote:
>
>
> Hey,
>
> I am presently working with Tamil, a morphologically rich language. I
> am trying to build an FLM from approximately 3 million entries, but it
> has been running for more than a day and a half now. The FLM
> specification is
>
> W : W(-1) W(-2) B(-1) S(-1) using generalized backoff, where B is the
> word base and S is the suffix.
>
> Below is the output of -debug 2
>
> warning: distributing 0.0989813 left-over probability mass over all
> 577519 words
> discarded 1 0x4-gram probs predicting pseudo-events
> discarded 1587186 0x4-gram probs discounted to zero
> discarded 1 0x8-gram probs predicting pseudo-events
> discarded 1 0xc-gram probs predicting pseudo-events
> discarded 4721615 0xc-gram probs discounted to zero
> Starting estimation of general graph-backoff node: LM 0 Node 0xC,
> children: 0x8 0x4
> Finished estimation of multi-child graph-backoff node: LM 0 Node 0xC
>
> This was the last message I received, a day and a half ago. Is it
> normal for it to take so long? I read that Katrin had no problem
> training on 5 million entries. Did it take this long? I am using a
> cluster in my lab for the computation, so there shouldn't be a
> problem with memory or computational power.
I have no experience myself to tell you how long it should take.
However, in cases like this I would run some experiments increasing the
amount of data from, say, 10k to 100k entries to see how the runtime grows as
a function of input size. Then you can extrapolate to the full data set
instead of just waiting.
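The extrapolation suggested above can be sketched as follows (a hypothetical Python illustration, not part of SRILM): time the tool on increasing subsets, fit a power law in log-log space, and project to the full corpus. The sizes and times in the example are made up.

```python
import math

def extrapolate_runtime(sizes, times, target_size):
    """Fit t = a * n^b by least squares in log-log space and project.

    sizes: input sizes (e.g. entry counts) of the timing runs
    times: measured wall-clock times for those runs
    target_size: full data set size to extrapolate to
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    k = len(xs)
    mx = sum(xs) / k
    my = sum(ys) / k
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a * target_size ** b

# Hypothetical example: runs at 10k, 50k, and 100k entries took
# 2, 12, and 26 minutes; project to 3 million entries.
est_minutes = extrapolate_runtime([10_000, 50_000, 100_000],
                                  [2.0, 12.0, 26.0], 3_000_000)
```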
>
> Is there any way by which I can tell the fngram-count to utilize as
> much memory as it wants or parallelize the computation?
It will take as much memory as it needs to, and there is no easy way to
parallelize.
Andreas
From chenmengdx at gmail.com Mon Oct 29 03:09:55 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Mon, 29 Oct 2012 18:09:55 +0800
Subject: [SRILM User List] About the -prune option
Message-ID:
Hi, I need to obtain a small LM for ASR decoding by pruning from a large
LM. The original large LM contains about 1.6 billion n-grams, and the small
one should contain about 30 million n-grams. The -prune option in SRILM
can do this. However, I want to ask whether pruning in one pass is the same
as pruning in several passes. For example, there are two approaches to this
pruning task.
1) Set a proper value and prune in one pass to get the target LM:
ngram -lm LM_Large -prune 1e-9 -order 5 -write-lm LM_Small
2) Set several values and prune gradually to get the target LM:
ngram -lm LM_Large -prune 1e-10 -order 5 -write-lm LM_Small1
... ...
ngram -lm LM_Small1 -prune 1e-9 -order 5 -write-lm LM_Small
Are there any differences between the above two approaches? Does the pruned LM
have a lower perplexity with the second method?
Thanks!
Meng CHEN
From tsuki_stefy at yahoo.com Mon Oct 29 09:15:25 2012
From: tsuki_stefy at yahoo.com (Stefy D.)
Date: Mon, 29 Oct 2012 09:15:25 -0700 (PDT)
Subject: [SRILM User List] lm interpolation
Message-ID: <1351527325.44053.YahooMailNeo@web112503.mail.gq1.yahoo.com>
Hello everyone,
I am trying to interpolate 2 language models because I want to do an experiment in domain adaptation. Below are the commands that I used. When I try to compute lambda, I get the error "mismatch in number of samples (60001 != 67708)". I don't know what to fix... please help me.
~/local/tools/srilm/bin/i686/ngram -order 3 -unk -lm
~/local/test1/lm/lm1.lm -ppl ~/local/test1/lm/de-en_corpus1.lowercased.en -debug 2 > ppl1.ppl
~/local/tools/srilm/bin/i686/ngram -order 3 -unk -lm ~/local/test2/lm/lm2.lm -ppl ~/local/test2/lm/de-en_corpus2.lowercased.en -debug 2 > ppl2.ppl
~/local/tools/srilm/bin/i686/compute-best-mix ~/local/test1/ppl1.ppl ~/local/test2/ppl2.ppl
The ppl1.ppl file contains: "2082 sentences, 57919 words, 0 OOVs
0 zeroprobs, logprob= -100036 ppl= 46.4762 ppl1= 53.3534" and
the ppl2.ppl file contains: "2091 sentences, 65617 words, 0 OOVs
0 zeroprobs, logprob= -89850.8 ppl= 21.2341 ppl1= 23.4057"
I apologise for asking such a basic question...I have just started reading about machine translation.
Thank you very much for your time!
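Incidentally, the numbers in that error message are consistent with the ppl files quoted above: 60001 = 57919 words + 2082 sentences from ppl1.ppl, and 67708 = 65617 + 2091 from ppl2.ppl, since compute-best-mix expects one probability per token from each file. The ppl figures themselves follow from the logprob; a quick Python check using the figures quoted above (which are printed rounded, so the match is approximate):

```python
# Figures from ppl1.ppl above: 2082 sentences, 57919 words,
# 0 OOVs, 0 zeroprobs, logprob = -100036
sentences, words, logprob = 2082, 57919, -100036.0

# SRILM: ppl counts end-of-sentence events, ppl1 excludes them
# (denominator is words - OOVs - zeroprobs [+ sentences for ppl]).
ppl = 10 ** (-logprob / (words + sentences))
ppl1 = 10 ** (-logprob / words)
# These agree with the quoted 46.4762 and 53.3534 to within ~0.01,
# the residue of the rounded logprob.
```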
From stolcke at icsi.berkeley.edu Mon Oct 29 12:44:11 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 29 Oct 2012 12:44:11 -0700
Subject: [SRILM User List] About the -prune option
In-Reply-To:
References:
Message-ID: <508EDC8B.90903@icsi.berkeley.edu>
On 10/29/2012 3:09 AM, Meng Chen wrote:
> Hi, I need to obtain a small LM for ASR decoding by pruning from a
> large LM. The original large LM contains about 1.6 billion n-grams,
> and the small one should contain about 30 million n-grams. The -prune
> option in SRILM can do this. However, I want to ask whether pruning in
> one pass is the same as pruning in several passes. For example, there
> are two approaches to this pruning task.
>
> 1) Set a proper value and prune in one pass to get the target LM:
> ngram -lm LM_Large -prune 1e-9 -order 5 -write-lm LM_Small
>
> 2) Set several values and prune gradually to get the target LM:
> ngram -lm LM_Large -prune 1e-10 -order 5 -write-lm LM_Small1
> ... ...
> ngram -lm LM_Small1 -prune 1e-9 -order 5 -write-lm LM_Small
>
> Are there any differences between the above two approaches? Does the
> pruned LM have a lower perplexity with the second method?
Pruning tries to minimize the cross-entropy between the original and the
pruned model. Therefore, you can expect the best results if you do the
pruning in one step (approach 1), since then you have the original model
to compare against for all pruning decisions (at the n-gram level). I
have not investigated how much worse approach 2 would do, so it might be
just fine in practice.
Andreas
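The idea behind the per-n-gram pruning decision can be sketched roughly as follows. This is a deliberately simplified Python illustration, not SRILM's implementation: it scores each n-gram by the history-weighted log-probability change of backing it off, and omits the backoff-weight adjustment term that the real entropy-based criterion also includes.

```python
import math

def prune_scores(p_hist, p_cond, p_backoff):
    """Simplified relative-entropy score for dropping each n-gram.

    p_hist:    probability of each history h
    p_cond:    explicit probabilities p(w|h), keyed by (h, w)
    p_backoff: lower-order probabilities p(w)
    """
    return {(h, w): p_hist[h] * p * abs(math.log(p) - math.log(p_backoff[w]))
            for (h, w), p in p_cond.items()}

def prune(p_hist, p_cond, p_backoff, threshold):
    """Keep only the n-grams whose removal would cost at least threshold."""
    s = prune_scores(p_hist, p_cond, p_backoff)
    return {hw: p for hw, p in p_cond.items() if s[hw] >= threshold}
```

Under approach 2, the second pass computes such scores against the intermediate pruned model rather than the original, which is why one-step pruning is expected to track the original model more closely.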
From stolcke at icsi.berkeley.edu Mon Oct 29 12:46:33 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 29 Oct 2012 12:46:33 -0700
Subject: [SRILM User List] lm interpolation
In-Reply-To: <1351527325.44053.YahooMailNeo@web112503.mail.gq1.yahoo.com>
References: <1351527325.44053.YahooMailNeo@web112503.mail.gq1.yahoo.com>
Message-ID: <508EDD19.9060901@icsi.berkeley.edu>
On 10/29/2012 9:15 AM, Stefy D. wrote:
> Hello everyone,
>
> I am trying to interpolate 2 language models because I want to do an
> experiment in domain adaptation. Below are the commands that I used.
> When I try to compute lambda, I get the error "mismatch in number of
> samples (60001 != 67708)". I don't know what to fix... please help me.
>
> ~/local/tools/srilm/bin/i686/ngram -order 3 -unk -lm
> ~/local/test1/lm/lm1.lm -ppl
> ~/local/test1/lm/de-en_corpus1.lowercased.en -debug 2 > ppl1.ppl
> ~/local/tools/srilm/bin/i686/ngram -order 3 -unk -lm
> ~/local/test2/lm/lm2.lm -ppl
> ~/local/test2/lm/de-en_corpus2.lowercased.en -debug 2 > ppl2.ppl
> ~/local/tools/srilm/bin/i686/compute-best-mix ~/local/test1/ppl1.ppl
> ~/local/test2/ppl2.ppl
You need to collect ppl1.ppl and ppl2.ppl on the SAME EXACT DATA. Same
data, different models. compute-best-mix will then find the interpolation
weights that minimize the perplexity of the combined model on that data.
Andreas
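compute-best-mix works from the aligned per-word probabilities in the two ppl files, which is why the sample counts must match. A minimal sketch of that weight estimation via EM (hypothetical Python, not SRILM's actual code):

```python
def best_mix(probs1, probs2, iters=200):
    """EM for the weight lam maximizing sum(log(lam*p1 + (1-lam)*p2)).

    probs1, probs2: per-word probabilities from the two models,
    computed on the SAME data so the sequences align word for word.
    """
    lam = 0.5
    for _ in range(iters):
        # E-step: posterior probability that model 1 generated each word
        post = [lam * p1 / (lam * p1 + (1 - lam) * p2)
                for p1, p2 in zip(probs1, probs2)]
        # M-step: the new weight is the average posterior
        lam = sum(post) / len(post)
    return lam
```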
From akmalcuet00 at yahoo.com Sat Nov 3 11:10:01 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Sat, 3 Nov 2012 11:10:01 -0700 (PDT)
Subject: [SRILM User List] Fw: nbest-error option
In-Reply-To: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
Message-ID: <1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
Hi,
I have found an error when I pass nbest-lattice options to nbest-error in the nbest scripts.
I use the following command:
/srilm/bin/./nbest-error nbestfilelist refs [-wer]
It gives the error: line 44: nbest-lattice: command not found
gawk: cmd. line 10: fatal: division by zero attempted.
Could anyone please tell me where the fault is?
By the way, I want to compute the WER of a set of N-best lists.
Thanks
Best Regards
Akmal
From venkataraman.anand at gmail.com Sat Nov 3 11:30:50 2012
From: venkataraman.anand at gmail.com (Anand Venkataraman)
Date: Sat, 3 Nov 2012 11:30:50 -0700
Subject: [SRILM User List] Fw: nbest-error option
In-Reply-To: <1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
Message-ID:
Because you invoke nbest-error using the full path, I suspect that it's not
in your $PATH environment variable. It needs to be because the script
nbest-error invokes nbest-lattice directly. Try (bash):
export PATH=/srilm/bin:$PATH; nbest-error nbestfilelist refs [-wer]
HTH
&
On Sat, Nov 3, 2012 at 11:10 AM, Md. Akmal Haidar wrote:
>
> Hi,
>
> I have found an error when I pass nbest-lattice options to the nbest-error
> in the nbest-scripts.
>
> I use the following command:
> /srilm/bin/./nbest-error nbestfilelist refs [-wer]
>
> It gives the error line 44: nbest-lattice command not found
> gawk: cmd. line 10: fatal: division by zero attempted.
>
> Could anyone please tell where is the fault?
> By the way, I want to compute the WER of a set of N-best list.
>
> Thanks
> Best Regards
> Akmal
>
>
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
>
From akmalcuet00 at yahoo.com Sun Nov 4 12:52:05 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Sun, 4 Nov 2012 12:52:05 -0800 (PST)
Subject: [SRILM User List] Fw: nbest-error option
In-Reply-To:
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
Message-ID: <1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
Thanks Anand,
It works now.
But I found the same WER for the original n-best list and the rescored nbest list.
For rescoring, I use the following command:
rescore-decipher nbestfilelist new_rescored_nbestlist_dir -lm updated_lm
Is there anything wrong in the above command?
Thanks
Best Regards
Akmal
________________________________
From: Anand Venkataraman
To: Md. Akmal Haidar
Cc: "srilm-user at speech.sri.com"
Sent: Saturday, November 3, 2012 2:30:50 PM
Subject: Re: [SRILM User List] Fw: nbest-error option
Because you invoke nbest-error using the full path, I suspect that it's not in your $PATH environment variable. It needs to be because the script nbest-error invokes nbest-lattice directly. Try (bash):
export PATH=/srilm/bin:$PATH; nbest-error nbestfilelist refs [-wer]
HTH
&
On Sat, Nov 3, 2012 at 11:10 AM, Md. Akmal Haidar wrote:
>
>Hi,
>
>
>I have found an error when I pass nbest-lattice options to the nbest-error in the nbest-scripts.
>
>
>I use the following command:
>
>/srilm/bin/./nbest-error nbestfilelist refs [-wer]
>
>
>It gives the error line 44: nbest-lattice command not found
>gawk: cmd. line 10: fatal: division by zero attempted.
>
>
>
>Could anyone please tell where is the fault?
>
>By the way, I want to compute the WER of a set of N-best list.
>
>
>Thanks
>Best Regards
>Akmal
>
>
>
From akmalcuet00 at yahoo.com Sun Nov 4 13:40:20 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Sun, 4 Nov 2012 13:40:20 -0800 (PST)
Subject: [SRILM User List] rescore-decipher option
In-Reply-To: <1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
Message-ID: <1352065220.65029.YahooMailNeo@web161004.mail.bf1.yahoo.com>
Hi,
I have found the same WER for the original n-best list and the rescored nbest list.
For rescoring, I use the following command:
rescore-decipher nbestfilelist new_rescored_nbestlist_dir -lm updated_lm
Is there anything wrong in the above command?
Thanks
Best Regards
Akmal
________________________________
From: Anand Venkataraman
To: Md. Akmal Haidar
Cc: "srilm-user at speech.sri.com"
Sent: Saturday, November 3, 2012 2:30:50 PM
Subject: Re: [SRILM User List] Fw: nbest-error option
Because you invoke nbest-error using the full path, I suspect that it's not in your $PATH environment variable. It needs to be because the script nbest-error invokes nbest-lattice directly. Try (bash):
export PATH=/srilm/bin:$PATH; nbest-error nbestfilelist refs [-wer]
HTH
&
On Sat, Nov 3, 2012 at 11:10 AM, Md. Akmal Haidar wrote:
>
>Hi,
>
>
>I have found an error when I pass nbest-lattice options to the nbest-error in the nbest-scripts.
>
>
>I use the following command:
>
>/srilm/bin/./nbest-error nbestfilelist refs [-wer]
>
>
>It gives the error line 44: nbest-lattice command not found
>gawk: cmd. line 10: fatal: division by zero attempted.
>
>
>
>Could anyone please tell where is the fault?
>
>By the way, I want to compute the WER of a set of N-best list.
>
>
>Thanks
>Best Regards
>Akmal
>
>
>
From akmalcuet00 at yahoo.com Sun Nov 4 15:36:28 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Sun, 4 Nov 2012 15:36:28 -0800 (PST)
Subject: [SRILM User List] nbest-lattice
In-Reply-To: <1352065220.65029.YahooMailNeo@web161004.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<1352065220.65029.YahooMailNeo@web161004.mail.bf1.yahoo.com>
Message-ID: <1352072188.43964.YahooMailNeo@web161004.mail.bf1.yahoo.com>
Hi,
Is there any way to find the overall WER using nbest-lattice?
Akmal
From stolcke at icsi.berkeley.edu Sun Nov 4 16:02:38 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sun, 04 Nov 2012 16:02:38 -0800
Subject: [SRILM User List] Fw: nbest-error option
In-Reply-To: <1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
Message-ID: <5097021E.5090402@icsi.berkeley.edu>
On 11/4/2012 12:52 PM, Md. Akmal Haidar wrote:
> Thanks Anand,
> It works now.
> But I found the same WER for the original n-best list and the rescored
> nbest list.
>
> For rescoring, I use the following command:
> rescore-decipher nbestfilelist new_rescored_nbestlist_dir -lm updated_lm
nbest-error computes the "N-best error rate", meaning the best possible
error rate that can be achieved by picking a hypothesis from anywhere
among the N best. This is sometimes called the "oracle" error. It
doesn't change as a result of rescoring.
What you probably want is the error rate of the highest-scoring
hypothesis. For this, you first extract the highest-scoring hyps using
the "rescore-reweight" command (see the nbest-scripts(1) man page), then
score them using your favorite WER scoring program. If you have NIST
sclite installed, you could use the compute-sclite wrapper, which takes
care of format differences.
Andreas
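The distinction between the two error rates can be made concrete with a small sketch (hypothetical Python; the function names are illustrative): a word-level edit distance, the oracle error that nbest-error reports, and the 1-best error that rescoring can actually change.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (sub/ins/del all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def oracle_errors(ref, nbest):
    """Best achievable errors picking from anywhere in the N-best
    (the 'oracle' error); unchanged by rescoring."""
    return min(edit_distance(ref, hyp) for hyp in nbest)

def one_best_errors(ref, nbest, scores):
    """Errors of the highest-scoring hypothesis; this is what
    rescoring with a new LM can improve."""
    best_hyp = max(zip(scores, nbest))[1]
    return edit_distance(ref, best_hyp)
```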
From akmalcuet00 at yahoo.com Mon Nov 5 07:53:07 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Mon, 5 Nov 2012 07:53:07 -0800 (PST)
Subject: [SRILM User List] Fw: nbest-error option
In-Reply-To: <5097021E.5090402@icsi.berkeley.edu>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<5097021E.5090402@icsi.berkeley.edu>
Message-ID: <1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
Hi ,
Thanks.
I used the NIST sclite command:
./sclite -r refs -i wsj -h hyps -o dtl
but I got the same scoring result for the baseline 1-best hypothesis and the updated 1-best hypothesis obtained with the updated LM using rescore-decipher and rescore-reweight.
I tried compute-sclite of SRILM with the command: compute-sclite -r refs -h hyps -i wsj -o dtl
but it shows: line 247: sclite: command not found.
Can anyone tell me where the problem is?
Thanks
Best Regards
Akmal
________________________________
From: Andreas Stolcke
To: Md. Akmal Haidar
Cc: Anand Venkataraman ; "srilm-user at speech.sri.com"
Sent: Sunday, November 4, 2012 7:02:38 PM
Subject: Re: [SRILM User List] Fw: nbest-error option
On 11/4/2012 12:52 PM, Md. Akmal Haidar wrote:
Thanks Anand,
>It works now.
>
>But I found the same WER for the original n-best list and the rescored nbest list.
>
>
>For rescoring, I use the following command:
>rescore-decipher nbestfilelist new_rescored_nbestlist_dir -lm updated_lm
nbest-error computes the "N-best error rate", meaning the best
possible error rate that can be achieved by picking a hypothesis
from anywhere among the N best. This is sometimes called the
"oracle" error. It doesn't change as a result of rescoring.
What you probably want is the error rate of the highest-scoring
hypothesis. For this, you first extract the highest-scoring hyps
using the "rescore-reweight" command (see the nbest-scripts(1) man
page), then score them using your favorite WER scoring program. If you
have NIST sclite installed, you could use the compute-sclite
wrapper, which takes care of format differences.
Andreas
From akmalcuet00 at yahoo.com Mon Nov 5 08:40:02 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Mon, 5 Nov 2012 08:40:02 -0800 (PST)
Subject: [SRILM User List] compute-sclite option
In-Reply-To: <1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<5097021E.5090402@icsi.berkeley.edu>
<1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
Message-ID: <1352133602.34840.YahooMailNeo@web161002.mail.bf1.yahoo.com>
Hi ,
Thanks.
I used the NIST sclite command for rescoring:
./sclite -r refs -i wsj -h hyps -o dtl
but I got the same scoring result for the baseline 1-best hypothesis and the updated 1-best hypothesis obtained with the updated LM using rescore-decipher and rescore-reweight.
I tried compute-sclite of SRILM with the command: compute-sclite -r refs -h hyps -i wsj -o dtl
but it shows: line 247: sclite: command not found.
Can anyone tell me where the problem is?
Thanks
Best Regards
Akmal
________________________________
From: Andreas Stolcke
To: Md. Akmal Haidar
Cc: Anand Venkataraman ; "srilm-user at speech.sri.com"
Sent: Sunday, November 4, 2012 7:02:38 PM
Subject: Re: [SRILM User List] Fw: nbest-error option
On 11/4/2012 12:52 PM, Md. Akmal Haidar wrote:
Thanks Anand,
>It works now.
>
>But I found the same WER for the original n-best list and the rescored nbest list.
>
>
>For rescoring, I use the following command:
>rescore-decipher nbestfilelist new_rescored_nbestlist_dir -lm updated_lm
nbest-error computes the "N-best error rate", meaning the best
possible error rate that can be achieved by picking a hypothesis
from anywhere among the N best. This is sometimes called the
"oracle" error. It doesn't change as a result of rescoring.
What you probably want is the error rate of the highest-scoring
hypothesis. For this, you first extract the highest-scoring hyps
using the "rescore-reweight" command (see the nbest-scripts(1) man
page), then score them using your favorite WER scoring program. If you
have NIST sclite installed, you could use the compute-sclite
wrapper, which takes care of format differences.
Andreas
From stolcke at icsi.berkeley.edu Mon Nov 5 10:22:42 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 05 Nov 2012 10:22:42 -0800
Subject: [SRILM User List] Fw: nbest-error option
In-Reply-To: <1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<5097021E.5090402@icsi.berkeley.edu>
<1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
Message-ID: <509803F2.9060106@icsi.berkeley.edu>
On 11/5/2012 7:53 AM, Md. Akmal Haidar wrote:
> Hi ,
>
> Thanks.
>
> I used NIST sclite command:
> ./sclite -r refs -i wsj -h hyps -o dtl
> but I got the same scoring result for the baseline 1-best hypothesis
> and updated 1-best hypothesis obtained by updated LM using
> rescore-decipher and rescore-reweight.
Compare the rescored and the original nbest lists. Are they different?
Make sure you specify the NEW nbest directory created by
rescore-decipher as the input to rescore-reweight, not the original one.
>
> I tried with compute-sclite of SRILM with command: compute-sclite -r
> refs -h hyps -i wsj -o dtl
> but it shows line 247: sclite: command not found.
You need to have the sclite binary in your executable search path.
Modify the PATH environment variable so you can just type "sclite" and
find the executable.
Andreas
From akmalcuet00 at yahoo.com Mon Nov 5 16:17:09 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Mon, 5 Nov 2012 16:17:09 -0800 (PST)
Subject: [SRILM User List] Fw: nbest-error option
In-Reply-To: <509803F2.9060106@icsi.berkeley.edu>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<5097021E.5090402@icsi.berkeley.edu>
<1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<509803F2.9060106@icsi.berkeley.edu>
Message-ID: <1352161029.49955.YahooMailNeo@web161001.mail.bf1.yahoo.com>
Hi,
Yes, the n-best lists are different.
Is an LM weight (lmw) required for rescore-reweight? The original n-best lists were generated from HTK lattices, which were generated with a language model scale factor of 15. Should I use this in rescore-reweight?
Is it possible to compute the WER using the nbest-lattice -wer option? The total error (sum of substitutions, insertions, and deletions) for the original n-best lists is greater than the total error for the n-best lists obtained with the updated LM.
Thanks
Best Regards
Akmal
________________________________
From: Andreas Stolcke
To: Md. Akmal Haidar
Cc: "srilm-user at speech.sri.com" ; "venkataraman.anand at gmail.com"
Sent: Monday, November 5, 2012 1:22:42 PM
Subject: Re: [SRILM User List] Fw: nbest-error option
On 11/5/2012 7:53 AM, Md. Akmal Haidar wrote:
Hi ,
>
>
>Thanks.
>
>
>I used the NIST sclite command:
>./sclite -r refs -i wsj -h hyps -o dtl
>but I got the same scoring result for the baseline 1-best hypothesis and the updated 1-best hypothesis obtained with the updated LM using rescore-decipher and rescore-reweight.
Compare the rescored and the original nbest lists. Are they different? Make sure you specify the NEW nbest directory created by rescore-decipher as the input to rescore-reweight, not the original one.
>
>I tried compute-sclite of SRILM with the command: compute-sclite -r refs -h hyps -i wsj -o dtl
>but it shows: line 247: sclite: command not found.
You need to have the sclite binary in your executable search path. Modify the PATH environment variable so you can just type "sclite" and find the executable.
Andreas
From chenmengdx at gmail.com Mon Nov 5 22:37:03 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Tue, 6 Nov 2012 14:37:03 +0800
Subject: [SRILM User List] How to compare LMs training with different
vocabularies?
Message-ID:
Hi, I'm training LMs for a Mandarin Chinese ASR task with two different
vocabularies, vocab1 (100635 entries) and vocab2 (102541 entries).
In order to compare the performance of the two vocabularies, the training
corpus is the same, the test corpus is the same, and the word segmentation
method is also the same, namely Forward Maximum Match. The only
difference is the segmentation vocabulary and LM training vocabulary. I
trained LM1 and LM2 with vocab1 and vocab2 and evaluated them on the test
set. The result is as follows:
LM1: logprob = -84069.7, PPL = 416.452
LM2: logprob = -82921.7, PPL = 189.564
It seems LM2 is much better than LM1, by either logprob or PPL.
However, when I do decoding with the corresponding acoustic model,
the WER of LM2 is higher than that of LM1. So I'm really confused. What's
the relationship between PPL and WER? How can one compare LMs with
different vocabularies? Can you give me some suggestions or references?
Thanks!
Meng CHEN
From chenmengdx at gmail.com Mon Nov 5 22:43:51 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Tue, 6 Nov 2012 14:43:51 +0800
Subject: [SRILM User List] How to compare LMs training with different
vocabularies?
Message-ID:
Hi, I'm training LMs for a Mandarin Chinese ASR task with two different
vocabularies, vocab1 (100635 entries) and vocab2 (102541 entries).
In order to compare the performance of the two vocabularies, the training
corpus is the same, the test corpus is the same, and the word segmentation
method is also the same, namely Forward Maximum Match. The only
difference is the segmentation vocabulary and LM training vocabulary. I
trained LM1 and LM2 with vocab1 and vocab2 and evaluated them on the test
set. The result is as follows:
LM1: logprob = -84069.7, PPL = 416.452
LM2: logprob = -82921.7, PPL = 189.564
It seems LM2 is much better than LM1, by either logprob or PPL.
However, when I do decoding with the corresponding acoustic model,
the CER (Character Error Rate) of LM2 is higher than that of LM1. So I'm
really confused. What's the relationship between PPL and CER? How can one
compare LMs with different vocabularies? Can you give me some suggestions
or references?
ps: There was a mistake in the last mail, so I sent it again.
Thanks!
Meng CHEN
From akmalcuet00 at yahoo.com Tue Nov 6 06:23:54 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Tue, 6 Nov 2012 06:23:54 -0800 (PST)
Subject: [SRILM User List] Fw: Fw: nbest-error option
In-Reply-To: <1352161029.49955.YahooMailNeo@web161001.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<5097021E.5090402@icsi.berkeley.edu>
<1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<509803F2.9060106@icsi.berkeley.edu>
<1352161029.49955.YahooMailNeo@web161001.mail.bf1.yahoo.com>
Message-ID: <1352211834.76589.YahooMailNeo@web161006.mail.bf1.yahoo.com>
Hi,
I found the problem and it works now.
Thanks.
Best Regards
Akmal
----- Forwarded Message -----
From: Md. Akmal Haidar
To: Andreas Stolcke
Cc: "srilm-user at speech.sri.com"
Sent: Monday, November 5, 2012 7:17:09 PM
Subject: Re: [SRILM User List] Fw: nbest-error option
Hi,
Yes. The nbest lists are different.
Is an LM weight (lmw) required for rescore-reweight? The original n-best lists were generated from HTK lattices, which were generated with a language model scale factor of 15. Should I use this in rescore-reweight?
Is it possible to compute the WER using the nbest-lattice -wer option? The total error (sum of substitutions, insertions, and deletions) for the original n-best lists is greater than the total error for the n-best lists obtained with the updated LM.
Thanks
Best Regards
Akmal
________________________________
From: Andreas Stolcke
To: Md. Akmal Haidar
Cc: "srilm-user at speech.sri.com" ; "venkataraman.anand at gmail.com"
Sent: Monday, November 5, 2012 1:22:42 PM
Subject: Re: [SRILM User List] Fw: nbest-error option
On 11/5/2012 7:53 AM, Md. Akmal Haidar wrote:
Hi ,
>
>
>Thanks.
>
>
>I used the NIST sclite command:
>./sclite -r refs -i wsj -h hyps -o dtl
>but I got the same scoring result for the baseline 1-best hypothesis and the updated 1-best hypothesis obtained with the updated LM using rescore-decipher and rescore-reweight.
Compare the rescored and the original nbest lists. Are they different? Make sure you specify the NEW nbest directory created by rescore-decipher as the input to rescore-reweight, not the original one.
>
>I tried compute-sclite of SRILM with the command: compute-sclite -r refs -h hyps -i wsj -o dtl
>but it shows: line 247: sclite: command not found.
You need to have the sclite binary in your executable search path. Modify the PATH environment variable so you can just type "sclite" and find the executable.
Andreas
From akmalcuet00 at yahoo.com Tue Nov 6 06:50:57 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Tue, 6 Nov 2012 06:50:57 -0800 (PST)
Subject: [SRILM User List] nbest rescoring for LM with different smoothing
In-Reply-To: <1352211834.76589.YahooMailNeo@web161006.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<5097021E.5090402@icsi.berkeley.edu>
<1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<509803F2.9060106@icsi.berkeley.edu>
<1352161029.49955.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<1352211834.76589.YahooMailNeo@web161006.mail.bf1.yahoo.com>
Message-ID: <1352213457.66682.YahooMailNeo@web161006.mail.bf1.yahoo.com>
Hi,
I have found the same WER scoring result using LMs with two different smoothing methods (additive/Witten-Bell).
First I created the HTK lattice using the LM. Then I used lattice-tool to find the n-best list.
How can two LMs trained on the same text with different smoothing give the same WER result?
Thanks
Best Regards
Akmal
From stolcke at icsi.berkeley.edu Tue Nov 6 09:54:38 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 06 Nov 2012 09:54:38 -0800
Subject: [SRILM User List] nbest rescoring for LM with different
smoothing
In-Reply-To: <1352213457.66682.YahooMailNeo@web161006.mail.bf1.yahoo.com>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<5097021E.5090402@icsi.berkeley.edu>
<1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<509803F2.9060106@icsi.berkeley.edu>
<1352161029.49955.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<1352211834.76589.YahooMailNeo@web161006.mail.bf1.yahoo.com>
<1352213457.66682.YahooMailNeo@web161006.mail.bf1.yahoo.com>
Message-ID: <50994EDE.5030502@icsi.berkeley.edu>
On 11/6/2012 6:50 AM, Md. Akmal Haidar wrote:
> Hi,
>
> I have found the same WER scoring result using LMs with two
> different smoothing methods (additive/Witten-Bell).
> First I created the HTK lattice using the LM. Then I used
> lattice-tool to find the n-best list.
>
> How can two LMs trained on the same text with different smoothing
> give the same WER result?
>
> Thanks
> Best Regards
> Akmal
Do the LM probabilities differ in the details? (Compare the rescored
nbest lists.)
If so then it could just be that your data is such that the smoothing
method by itself does not make enough of a difference to change the top
hypothesis choice.
Andreas
From akmalcuet00 at yahoo.com Tue Nov 6 12:19:13 2012
From: akmalcuet00 at yahoo.com (Md. Akmal Haidar)
Date: Tue, 6 Nov 2012 12:19:13 -0800 (PST)
Subject: [SRILM User List] nbest rescoring for LM with different
smoothing
In-Reply-To: <50994EDE.5030502@icsi.berkeley.edu>
References: <1351899860.57268.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<1351966201.9409.YahooMailNeo@web161005.mail.bf1.yahoo.com>
<1352062325.43159.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<5097021E.5090402@icsi.berkeley.edu>
<1352130787.65494.YahooMailNeo@web161004.mail.bf1.yahoo.com>
<509803F2.9060106@icsi.berkeley.edu>
<1352161029.49955.YahooMailNeo@web161001.mail.bf1.yahoo.com>
<1352211834.76589.YahooMailNeo@web161006.mail.bf1.yahoo.com>
<1352213457.66682.YahooMailNeo@web161006.mail.bf1.yahoo.com>
<50994EDE.5030502@icsi.berkeley.edu>
Message-ID: <1352233153.39948.YahooMailNeo@web161001.mail.bf1.yahoo.com>
Hi,
Using the -no-expansion option in the lattice-tool command, I got a different result.
Thanks
Best Regards
Akmal
________________________________
From: Andreas Stolcke
To: Md. Akmal Haidar
Cc: "srilm-user at speech.sri.com"
Sent: Tuesday, November 6, 2012 12:54:38 PM
Subject: Re: nbest rescoring for LM with different smoothing
On 11/6/2012 6:50 AM, Md. Akmal Haidar wrote:
Hi,
>
>I have found the same WER scoring result using LMs with two
different smoothing methods (additive/Witten-Bell).
>First I created the HTK lattice using the LM. Then I
used lattice-tool to find the n-best list.
>
>How can two LMs trained on the same text with different
smoothing give the same WER result?
>
>Thanks
>Best Regards
>Akmal
>
Do the LM probabilities differ in the details? (Compare the rescored n-best lists.)
If so then it could just be that your data is such that the
smoothing method by itself does not make enough of a difference to
change the top hypothesis choice.
Andreas
From stolcke at icsi.berkeley.edu Wed Nov 14 13:27:23 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 14 Nov 2012 13:27:23 -0800
Subject: [SRILM User List] How to compare LMs training with different
vocabularies?
In-Reply-To:
References:
Message-ID: <50A40CBB.1080609@icsi.berkeley.edu>
On 11/5/2012 10:46 PM, Meng Chen wrote:
> Hi, I'm training LMs for a Mandarin Chinese ASR task with two different
> vocabularies, vocab1 (100,635 words) and vocab2 (102,541 words). In
> order to compare the performance of the two vocabularies, the training
> corpus is the same, the test corpus is the same, and the word
> segmentation method is also the same (Forward Maximum Match). The only
> difference is the segmentation vocabulary and LM training vocabulary. I
> trained LM1 and LM2 with vocab1 and vocab2, and evaluated them on the
> test set. The results are as follows:
>
> LM1: logprobs = -84069.7, PPL = 416.452.
> LM2: logprobs =-82921.7, PPL = 189.564.
>
> It seems LM2 is much better than LM1, whether measured by logprobs or by PPL.
> However, when decoding with the corresponding acoustic model, the
> CER (character error rate) of LM2 is higher than that of LM1. So I'm
> really confused. What's the relationship between PPL and CER? How
> can I compare LMs with different vocabularies? Can you give me some
> suggestions or references?
>
> ps: There was a mistake in my last mail, so I sent it again.
It is hard or impossible to compare two LMs with different vocabularies
even when word segmentation is not an issue.
But you are comparing two LMs using different segmentations (because the
vocabularies differ), so the problem is even harder.
The fact that your log probs differ by only a small amount (relatively)
but the perplexities by a lot means that the segmentations (the
number of tokens in particular) in the two systems must be quite
different. Is that the case? Can you devise an experiment where the
segmentations are kept as similar as possible? For example, you could
apply the same segmenter to both test cases, and then split OOV words
into their single-character components where needed to apply the LM.
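To make the arithmetic concrete: ngram -ppl computes perplexity from the total log probability and the number of scored tokens (a sketch of the standard relation; N counts words plus end-of-sentence tokens, minus OOVs):

```latex
\mathrm{PPL} = 10^{-\log_{10} P(\text{test set}) / N}
```

Back-solving from your reported numbers as a rough check gives N ≈ 84069.7 / log10(416.452) ≈ 32,100 tokens under vocab1 but N ≈ 82921.7 / log10(189.564) ≈ 36,400 under vocab2 — roughly 13% more tokens, which by itself accounts for most of the PPL gap.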
Anecdotally, PPL and WER are not always well correlated, though when
comparing a large range of models the correlation is strong (if not
perfect). See
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=659013 .
I do not recall any systematic studies of the effect of Mandarin word
segmentation on CER but given the amount of work in this area in the
last decade there must be some. Maybe someone else has some pointers ?
Andreas
From chenmengdx at gmail.com Mon Nov 19 18:40:59 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Tue, 20 Nov 2012 10:40:59 +0800
Subject: [SRILM User List] How to compare LMs training with different
vocabularies?
In-Reply-To: <50A40CBB.1080609@icsi.berkeley.edu>
References:
<50A40CBB.1080609@icsi.berkeley.edu>
Message-ID:
Yes, the number of tokens in the training corpus and test set segmented with
vocab2 is larger than with vocab1, which is why the word PPLs differed so
much. I also did an experiment as follows:
I compared each sentence's logprobs in the test set under LM1 and LM2, and
separated the sentences into three sets:
A>B: sentences whose logprob with LM2 is higher than with LM1
A=B: sentences whose logprob with LM2 is equal to that with LM1
A<B: sentences whose logprob with LM2 is lower than with LM1
The CER with LM2 is lower than with LM1 in the A>B set. It seems
sentences with higher logprobs can have lower CER, assuming the acoustic model
is the same under the two vocabs. However, I also found that the CER with LM2
is higher than LM1 in the A=B set. So I was wondering whether the acoustic
model is also influenced by the vocab and segmentation.
Thanks!
Meng CHEN
2012/11/15 Andreas Stolcke
> On 11/5/2012 10:46 PM, Meng Chen wrote:
>
> Hi, I'm training LMs for Mandarin Chinese ASR task with two different
> vocabularies, vocab1(100635 vocabularies) and vocab2(102541
> vocabularies). In order to compare the performance of two vocabularies, the
> training corpus is the same, the test corpus is the same, and the word
> segmentation method is also the same, which is Forward Maximum Match. The
> only difference is the segmentation vocabulary and LM training vocabulary.
> I trained LM1 and LM2 with vocab1 and vocab2, and evaluate them on test
> set. The result is as follows:
>
> LM1: logprobs = -84069.7, PPL = 416.452.
> LM2: logprobs = -82921.7, PPL = 189.564.
>
> It seems LM2 is much better than LM1, either by logprobs or by PPL.
> However, when I am doing decoding with the corresponding Acoustic Model.
> The CER(Character Error Rate) of LM2 is higher than LM1. So I'm really
> confused. What's the relationship between the PPL and CER? How to compare
> LMs with different vocabularies? Can you give me some suggestions or
> references? I'm really confused.
>
> ps: There is a mistake in last mail, so I sent it gain.
>
>
> It is hard or impossible to compare two LMs with different vocabularies
> even when word segmentation is not an issue.
> But you are comparing two LMs using different segmentations (because the
> vocabularies differ), so the problem is even harder.
> The fact that your log probs differ by only a small amount (relatively)
> but the perplexities by a lot means that the segmentations (the
> number of tokens in particular) in the two systems must be quite different.
> Is that the case? Can you devise an experiment where the segmentations are
> kept as similar as possible? For example, you could apply the same
> segmenter to both test cases, and then split OOV words into their
> single-character components where needed to apply the LM.
>
> Anecdotally, PPL and WER are not always well correlated, though when
> comparing a large range of models the correlation is strong (if not
> perfect). See
> http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=659013 .
>
> I do not recall any systematic studies of the effect of Mandarin word
> segmentation on CER but given the amount of work in this area in the last
> decade there must be some. Maybe someone else has some pointers ?
>
> Andreas
>
>
>
From chenmengdx at gmail.com Tue Nov 20 06:05:20 2012
From: chenmengdx at gmail.com (Meng CHEN)
Date: Tue, 20 Nov 2012 22:05:20 +0800
Subject: [SRILM User List] How to use disambig tool to convert Pinyin to
character?
Message-ID:
Hi, I want to use the disambig tool to build a Pinyin-to-character conversion demo. Can you give me an example? Pinyin is the romanized pronunciation of Chinese characters.
Thanks!
Meng CHEN
From dmytro.prylipko at ovgu.de Thu Nov 22 05:06:38 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Thu, 22 Nov 2012 14:06:38 +0100
Subject: [SRILM User List] SRILM trigram worse than HTK bigram?
Message-ID: <50AE235E.7060400@ovgu.de>
Hi,
I found that the accuracy of the recognition results obtained with HVite
is about 5% better than that of the hypotheses obtained after rescoring
the lattices with lattice-tool.
HVite does not really use an N-gram, it uses a word net, but I cannot
figure out why it works so much better than the SRILM models.
I use the following script to generate lattices (60-best):
HVite -A -T 1 \
-C GENLATTICES.conf \
-n 20 60 \
-l outLatDir \
-z lat \
-H hmmDefs \
-S test.list \
-i out.bigram.HLStats.mlf \
-w bigram.HLStats.lat \
-p 0.0 \
-s 8.0 \
lexicon \
hmm.mono.list
Which are then rescored with:
lattice-tool \
-read-htk \
-write-htk \
-htk-lmscale 10.0 \
-htk-words-on-nodes \
-order 3 \
-in-lattice-list srclat.list \
-out-lattice-dir rescoredLatDir \
-lm trigram.SRILM.lm \
-overwrite
find rescoredLatDir -name "*.lat" > rescoredLat.list
lattice-tool \
-read-htk \
-write-htk \
-htk-lmscale 10.0 \
-htk-words-on-nodes \
-order 3 \
-in-lattice-list rescoredLat.list\
-viterbi-decode \
-output-ctm | ctm2mlf_r > out.trigram.SRILM.mlf
Decoded with HVite (92.86%):
LAB: wie sieht es aus mit einem weiteren zweitaegigen mit einer
weiteren zweitaegigen arbeitssitzu
REC: wie sieht es aus mit einem weiteren zweitaegigen in einer
weiteren zweitaegigen arbeitssitzu
... and with lattice-tool (64.29%):
LAB: wie sieht es aus mit einem weiteren zweitaegigen mit einer
weiteren zweitaegigen arbeitssitzu
REC: wie sieht es aus mit einen weiteren zweitaegigen dann bei
einem zweitaegigen arbeitssitzung
Corresponding word nets and LMs have been built using the same
vocabulary and training data. I should say that for some sentences SRILM
outperforms HTK, but in general it is roughly 5-7% behind.
Could you please suggest why this is so? Maybe some parameter values are
wrong?
Or is this to be expected?
I would greatly appreciate any help.
Yours,
Dmytro Prylipko.
From stolcke at icsi.berkeley.edu Fri Nov 23 10:12:50 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 23 Nov 2012 10:12:50 -0800
Subject: [SRILM User List] SRILM trigram worse than HTK bigram?
In-Reply-To: <50AE235E.7060400@ovgu.de>
References: <50AE235E.7060400@ovgu.de>
Message-ID: <50AFBCA2.7030706@icsi.berkeley.edu>
You need to run a few sanity checks to make sure things are working as
you expect them to.
1. Decode 1-best from the HTK lattice WITHOUT rescoring. The results
should be the same as from the HTK decoder. If not there might be a
difference in the LM scaling factor, and you may have to adjust it via
the command line option. There might also be issues with the CTM output
and conversion back to MLF.
2. Rescore the lattices with the same LM that is used in the HTK
decoder. Again, the results should be essentially identical.
I'm not familiar with the bigram format used by HTK, but you may have to
convert it to ARPA format.
3. Then try rescoring with a trigram.
Approaching your goal in steps hopefully will help you pinpoint the
problem(s).
Andreas
On 11/22/2012 5:06 AM, Dmytro Prylipko wrote:
> Hi,
>
> I found that the accuracy of the recognition results obtained with
> HVite is about 5% better with comparison to the hypothesis got after
> rescoring the lattices with lattice-tool.
>
> HVite do not really use an N-gram, it is a word net, but I cannot
> really figure out why does it work so much better than SRILM models.
>
> I use the following script to generate lattices (60-best):
>
> HVite -A -T 1 \
> -C GENLATTICES.conf \
> -n 20 60 \
> -l outLatDir \
> -z lat \
> -H hmmDefs \
> -S test.list \
> -i out.bigram.HLStats.mlf \
> -w bigram.HLStats.lat \
> -p 0.0 \
> -s 8.0 \
> lexicon \
> hmm.mono.list
>
> Which are then rescored with:
>
> lattice-tool \
> -read-htk \
> -write-htk \
> -htk-lmscale 10.0 \
> -htk-words-on-nodes \
> -order 3 \
> -in-lattice-list srclat.list \
> -out-lattice-dir rescoredLatDir \
> -lm trigram.SRILM.lm \
> -overwrite
>
> find rescoredLatDir -name "*.lat" > rescoredLat.list
>
> lattice-tool \
> -read-htk \
> -write-htk \
> -htk-lmscale 10.0 \
> -htk-words-on-nodes \
> -order 3 \
> -in-lattice-list rescoredLat.list\
> -viterbi-decode \
> -output-ctm | ctm2mlf_r > out.trigram.SRILM.mlf
>
> Decoded with HVite (92.86%):
>
> LAB: wie sieht es aus mit einem weiteren zweitaegigen mit einer
> weiteren zweitaegigen arbeitssitzu
> REC: wie sieht es aus mit einem weiteren zweitaegigen in einer
> weiteren zweitaegigen arbeitssitzu
>
> ... and with lattice-tool (64.29%):
>
> LAB: wie sieht es aus mit einem weiteren zweitaegigen mit einer
> weiteren zweitaegigen arbeitssitzu
> REC: wie sieht es aus mit einen weiteren zweitaegigen dann bei
> einem zweitaegigen arbeitssitzung
>
> Corresponding word nets and LMs have been built using the same
> vocabulary and training data. I should say that for some sentences
> SRILM outperforms HTK, but in general it is roughly 5-7% behind.
> Could you please suggest why is it so? Maybe some parameter values are
> wrong?
> Or should it be like this?
>
> I would be greatly appreciated for help.
>
> Yours,
> Dmytro Prylipko.
>
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
From dmytro.prylipko at ovgu.de Sun Nov 25 08:51:03 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Sun, 25 Nov 2012 17:51:03 +0100
Subject: [SRILM User List] SRILM trigram worse than HTK bigram?
In-Reply-To: <50AFBCA2.7030706@icsi.berkeley.edu>
References: <50AE235E.7060400@ovgu.de> <50AFBCA2.7030706@icsi.berkeley.edu>
Message-ID: <50B24C77.6040007@ovgu.de>
1. Output is identical. Thus, LM scale factor does not play a decisive
role. Conversion to MLF from CTM is fine too.
2. I built a bigram in ARPA format with HTK (using HLStats). Here, after
rescoring and decoding, I got the same recognition result as for the LM
built with SRILM. I tried to change the LM scale factor from 10 to 8
(the lattice was obtained with LM scale factor 8), but it made no
difference.
Thus, the changes are introduced when rescoring.
I suspected the reason is the difference between the start/end sentence
markers. For HTK they are !ENTER and !EXIT respectively, and for SRILM
<s> and </s>. I do take this into account: I replace !ENTER and !EXIT
with <s> and </s> in the lattice file.
The SRILM models are trained on data where <s> and </s> denote the
boundaries.
Also, I replaced these markers in the language model built with HTK in
order to let it process the existing lattice correctly.
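That substitution can be sketched as follows (assuming HTK SLF lattices with one W= field per node line; shown on inline sample lines rather than a real file, whose name would be hypothetical anyway):

```shell
# Replace HTK sentence markers with SRILM ones in lattice node lines.
# '|' is used as the sed delimiter so '</s>' needs no escaping.
printf 'I=0 W=!ENTER\nI=5 W=!EXIT\n' \
  | sed -e 's/!ENTER/<s>/g' -e 's|!EXIT|</s>|g'
# prints:
# I=0 W=<s>
# I=5 W=</s>
```

For a real lattice, the same sed command reads the lattice file and writes the converted copy.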
However, when I tried to play around with those markers, it made no
difference.
Namely, I tried to use the HTK format only: the generated lattice and
the language model both use !ENTER and !EXIT. Unfortunately, the output
was the same.
Do you have any further suggestions?
Yours,
Dmytro.
On Fri 23 Nov 2012 07:12:50 PM CET, Andreas Stolcke wrote:
> You need to run a few sanity checks to make sure things are working as
> you expect them to.
>
> 1. Decode 1-best from the HTK lattice WITHOUT rescoring. The results
> should be the same as from the HTK decoder. If not there might be a
> difference in the LM scaling factor, and you may have to adjust is via
> the command line option. There might also be issues with the CTM
> output and conversion back to MLF.
>
> 2. Rescore the lattices with the same LM that is used in the HTK
> decoder. Again, the results should be essentially identical.
> I'm not familiar with the bigram format used by HTK, but you may have
> to convert it to ARPA format.
>
> 3. Then try rescoring with a trigram.
>
> Approaching your goal in steps hopefully will help you pinpoint the
> problem(s).
>
> Andreas
>
> On 11/22/2012 5:06 AM, Dmytro Prylipko wrote:
>> Hi,
>>
>> I found that the accuracy of the recognition results obtained with
>> HVite is about 5% better with comparison to the hypothesis got after
>> rescoring the lattices with lattice-tool.
>>
>> HVite do not really use an N-gram, it is a word net, but I cannot
>> really figure out why does it work so much better than SRILM models.
>>
>> I use the following script to generate lattices (60-best):
>>
>> HVite -A -T 1 \
>> -C GENLATTICES.conf \
>> -n 20 60 \
>> -l outLatDir \
>> -z lat \
>> -H hmmDefs \
>> -S test.list \
>> -i out.bigram.HLStats.mlf \
>> -w bigram.HLStats.lat \
>> -p 0.0 \
>> -s 8.0 \
>> lexicon \
>> hmm.mono.list
>>
>> Which are then rescored with:
>>
>> lattice-tool \
>> -read-htk \
>> -write-htk \
>> -htk-lmscale 10.0 \
>> -htk-words-on-nodes \
>> -order 3 \
>> -in-lattice-list srclat.list \
>> -out-lattice-dir rescoredLatDir \
>> -lm trigram.SRILM.lm \
>> -overwrite
>>
>> find rescoredLatDir -name "*.lat" > rescoredLat.list
>>
>> lattice-tool \
>> -read-htk \
>> -write-htk \
>> -htk-lmscale 10.0 \
>> -htk-words-on-nodes \
>> -order 3 \
>> -in-lattice-list rescoredLat.list\
>> -viterbi-decode \
>> -output-ctm | ctm2mlf_r > out.trigram.SRILM.mlf
>>
>> Decoded with HVite (92.86%):
>>
>> LAB: wie sieht es aus mit einem weiteren zweitaegigen mit einer
>> weiteren zweitaegigen arbeitssitzu
>> REC: wie sieht es aus mit einem weiteren zweitaegigen in einer
>> weiteren zweitaegigen arbeitssitzu
>>
>> ... and with lattice-tool (64.29%):
>>
>> LAB: wie sieht es aus mit einem weiteren zweitaegigen mit einer
>> weiteren zweitaegigen arbeitssitzu
>> REC: wie sieht es aus mit einen weiteren zweitaegigen dann bei
>> einem zweitaegigen arbeitssitzung
>>
>> Corresponding word nets and LMs have been built using the same
>> vocabulary and training data. I should say that for some sentences
>> SRILM outperforms HTK, but in general it is roughly 5-7% behind.
>> Could you please suggest why is it so? Maybe some parameter values
>> are wrong?
>> Or should it be like this?
>>
>> I would be greatly appreciated for help.
>>
>> Yours,
>> Dmytro Prylipko.
>>
>>
>> _______________________________________________
>> SRILM-User site list
>> SRILM-User at speech.sri.com
>> http://www.speech.sri.com/mailman/listinfo/srilm-user
>
From s.bakhshaei at yahoo.com Sun Nov 25 22:37:55 2012
From: s.bakhshaei at yahoo.com (Somayeh Bakhshaei)
Date: Sun, 25 Nov 2012 22:37:55 -0800 (PST)
Subject: [SRILM User List] ngram option
Message-ID: <1353911875.88765.YahooMailNeo@web111717.mail.gq1.yahoo.com>
Hello All,
I want to know if it is possible to pass a variable to ngram for perplexity
computation?
sen="this is my sentence."
ngram -ppl $sen
------------------
Best Regards,
S.Bakhshaei
From stolcke at icsi.berkeley.edu Sun Nov 25 22:49:16 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sun, 25 Nov 2012 22:49:16 -0800
Subject: [SRILM User List] ngram option
In-Reply-To: <1353911875.88765.YahooMailNeo@web111717.mail.gq1.yahoo.com>
References: <1353911875.88765.YahooMailNeo@web111717.mail.gq1.yahoo.com>
Message-ID: <50B310EC.4020905@icsi.berkeley.edu>
On 11/25/2012 10:37 PM, Somayeh Bakhshaei wrote:
> Hello All,
>
> I want to know if it is possible to pass a variable to ngram for ppl
> counting?
>
> 4sen="this is my sentence."
> ngram -ppl $sen
You could use
echo "this is my sentence" | ngram -ppl -
The input data cannot be passed via command line options, but it can be
read from stdin.
Andreas
From dmytro.prylipko at ovgu.de Tue Nov 27 09:49:30 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Tue, 27 Nov 2012 18:49:30 +0100
Subject: [SRILM User List] SRILM trigram worse than HTK bigram?
In-Reply-To: <50B4BC60.5020802@ovgu.de>
References: <50B4BC60.5020802@ovgu.de>
Message-ID: <50B4FD2A.90308@ovgu.de>
Dear Andreas,
I checked everything one more time, under 'clean' test conditions.
Under these conditions, the results are predictable:
- Output from HTK recognizer - 73.71%
- Just decoding of the generated lattices with lattice-tool - 73.71%
- Rescoring with HTK bigram and decoding - 73.78%
- Rescoring with SRILM trigram and decoding - 75.72%
I guess my previous results were so contradictory due to specific test
conditions: I was playing with OOVs, which had a particular influence
on the construction of the word list.
Thank you for the help and sorry for the inconvenience.
From chenmengdx at gmail.com Sat Dec 1 07:37:32 2012
From: chenmengdx at gmail.com (Meng CHEN)
Date: Sat, 01 Dec 2012 23:37:32 +0800
Subject: [SRILM User List] Why there are "_meta_1" in LM?
Message-ID:
Hi, I trained LMs with the -write-binary-lm option; however, when I converted the LM from binary format to ARPA format, I found there were 4 extra 1-grams in the ARPA LM, as follows:
-8.988857 _meta_1
-8.988857 _meta_2
-9.201852 _meta_3
-9.201852 _meta_4
In fact, these four words do not exist in my vocab. So where do they come from? What should I do to remove them?
Thanks!
Meng CHEN
From stolcke at icsi.berkeley.edu Sat Dec 1 09:08:50 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sat, 01 Dec 2012 09:08:50 -0800
Subject: [SRILM User List] Why there are "_meta_1" in LM?
In-Reply-To:
References:
Message-ID: <50BA39A2.9050602@icsi.berkeley.edu>
On 12/1/2012 7:37 AM, Meng CHEN wrote:
> Hi, I trained LMs with the write-binary-lm option, however, when I converted the LM of bin format into arpa format, I found there were 4 more 1-grams in the arpa LM as follows:
> -8.988857 _meta_1
> -8.988857 _meta_2
> -9.201852 _meta_3
> -9.201852 _meta_4
> In fact, these four words do not exist in my vocab. So where do they come from? What should I do to remove them?
> Thanks!
Counts for _META_1 etc. (note the uppercase) are used by ngram-count to
keep track of counts-of-counts required for smoothing. They should
never appear in the LM.
I suspect you lowercased the strings in the counts file somewhere in
your processing, causing these special tokens to no longer be recognized.
Andreas
From chenmengdx at gmail.com Sun Dec 2 20:06:54 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Mon, 3 Dec 2012 12:06:54 +0800
Subject: [SRILM User List] Why there are "_meta_1" in LM?
In-Reply-To: <50BA39A2.9050602@icsi.berkeley.edu>
References:
<50BA39A2.9050602@icsi.berkeley.edu>
Message-ID:
I have checked the make-big-lm shell script and found that the "_meta_"
tag should be lowercase.
Line 56 of the make-big-lm script says:
metatag=__meta__ #lowercase so it works with ngram-count -tolower
In fact, when I used make-big-lm to train the LM, there were no "__meta__1"
entries in the final ARPA LM without -write-binary-lm. So I guess it's
possibly related to the binary format.
2012/12/2 Andreas Stolcke
> On 12/1/2012 7:37 AM, Meng CHEN wrote:
>
>> Hi, I trained LMs with the write-binary-lm option, however, when I
>> converted the LM of bin format into arpa format, I found there were 4 more
>> 1-grams in the arpa LM as follows:
>> -8.988857 _meta_1
>> -8.988857 _meta_2
>> -9.201852 _meta_3
>> -9.201852 _meta_4
>> In facter, these four words do not exisit in my vocab. So where are they
>> come from? What should I do to remove them ?
>> Thanks!
>>
>
> Counts for _META_1 etc. (note the uppercase) are used by ngram-count to
> keep track of counts-of-counts required for smoothing. They should never
> appear in the LM.
>
> I suspect you lowercased the strings in the counts file somewhere in your
> processing, causing these special tokens to no longer be recognized.
>
> Andreas
>
>
From stolcke at icsi.berkeley.edu Thu Dec 6 09:55:42 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 06 Dec 2012 09:55:42 -0800
Subject: [SRILM User List] Why there are "_meta_1" in LM?
In-Reply-To:
References:
<50BA39A2.9050602@icsi.berkeley.edu>
Message-ID: <50C0DC1E.3020503@icsi.berkeley.edu>
This happened because the binary LM file contains a record of the full
vocabulary at the time the LM was created, not just the words that
appear as unigrams (as in the ARPA format). You must have done ngram
-renorm or something similar later, which causes unigrams to be created
for all words in the vocabulary.
Attached is a patch that prevents the _meta_ tokens from being included
in that vocabulary. Check that it fixes your problem.
(You can also grab the beta version off the web site.)
Andreas
On 12/2/2012 8:06 PM, Meng Chen wrote:
> I have checked the make-big-lm shell script and found that the
> "_meta_" should be lowercase.
> In line 56 of make-big-lm script. It says:
> metatag=__meta__ #lowercase so it works with ngram-count -tolower
>
> In fact, when I used make-big-lm to train LM, there are not
> "__meta__1" in final arpa LM without the write-binary-lm. So I guess
> it's possible related to the binary format.
>
>
2012/12/2 Andreas Stolcke
>
> On 12/1/2012 7:37 AM, Meng CHEN wrote:
>
> Hi, I trained LMs with the write-binary-lm option, however,
> when I converted the LM of bin format into arpa format, I
> found there were 4 more 1-grams in the arpa LM as follows:
> -8.988857 _meta_1
> -8.988857 _meta_2
> -9.201852 _meta_3
> -9.201852 _meta_4
> In facter, these four words do not exisit in my vocab. So
> where are they come from? What should I do to remove them ?
> Thanks!
>
>
> Counts for _META_1 etc. (note the uppercase) are used by
> ngram-count to keep track of counts-of-counts required for
> smoothing. They should never appear in the LM.
>
> I suspect you lowercased the strings in the counts file somewhere
> in your processing, causing these special tokens to no longer be
> recognized.
>
> Andreas
>
>
-------------- next part --------------
*** lm/src/NgramLM.cc.dist 2012-10-18 20:31:21.198065100 -0400
--- lm/src/NgramLM.cc 2012-12-05 18:08:22.701858000 -0500
***************
*** 875,881 ****
/*
* Vocabulary index
*/
! vocab.writeIndexMap(file);
long long offset = ftello(file);
--- 875,881 ----
/*
* Vocabulary index
*/
! vocab.writeIndexMap(file, true);
long long offset = ftello(file);
***************
*** 1051,1057 ****
fprintf(file, "data: %s\n", dataFile);
}
! vocab.writeIndexMap(file);
long long offset = ftello(dat);
--- 1051,1057 ----
fprintf(file, "data: %s\n", dataFile);
}
! vocab.writeIndexMap(file, true);
long long offset = ftello(dat);
*** lm/src/Vocab.cc.dist 2012-10-29 17:44:22.423039800 -0400
--- lm/src/Vocab.cc 2012-12-05 18:11:11.745755000 -0500
***************
*** 841,855 ****
* The format is ascii with one word per line:
* index string
* The mapping is terminated by EOF or a line consisting only of ".".
*/
void
! Vocab::writeIndexMap(File &file)
{
// Output index map in order of internal vocab indices.
// This ensures that vocab strings are assigned indices in the same order
// on reading, and ensures faster insertions into SArray-based tries.
for (unsigned i = byIndex.base(); i < nextIndex; i ++) {
! if (byIndex[i]) {
fprintf(file, "%u %s\n", i, byIndex[i]);
}
}
--- 841,856 ----
* The format is ascii with one word per line:
* index string
* The mapping is terminated by EOF or a line consisting only of ".".
+ * If writingLM is true, omit words that should not appear in LMs.
*/
void
! Vocab::writeIndexMap(File &file, Boolean writingLM)
{
// Output index map in order of internal vocab indices.
// This ensures that vocab strings are assigned indices in the same order
// on reading, and ensures faster insertions into SArray-based tries.
for (unsigned i = byIndex.base(); i < nextIndex; i ++) {
! if (byIndex[i] && !(writingLM && isMetaTag(i))) {
fprintf(file, "%u %s\n", i, byIndex[i]);
}
}
From chenmengdx at gmail.com Tue Dec 11 20:31:26 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Wed, 12 Dec 2012 12:31:26 +0800
Subject: [SRILM User List] Are there any trigger-based language modeling
open-source tools?
Message-ID:
Hi, I need to train a trigger-based language model, and it seems SRILM
doesn't support this task. Are there any open-source tools which can do
this job? Please give me some suggestions.
Thanks!
Meng CHEN
From yhifny at yahoo.com Fri Dec 14 12:51:36 2012
From: yhifny at yahoo.com (yasser hifny)
Date: Fri, 14 Dec 2012 12:51:36 -0800 (PST)
Subject: [SRILM User List] how do the script compute-best-mix work?
Message-ID: <1355518296.89321.YahooMailNeo@web125801.mail.ne1.yahoo.com>
Hello,
I would like to ask how the compute-best-mix script works? I mean, what is the idea behind it. Can you refer me to any paper explaining how it works?
Thanks in advance,
Yasser
From mohammed.mediani at kit.edu Fri Dec 14 13:21:59 2012
From: mohammed.mediani at kit.edu (Mohammed Mediani)
Date: Fri, 14 Dec 2012 22:21:59 +0100
Subject: [SRILM User List] gtmin and kndiscount
In-Reply-To:
References:
Message-ID: <20121214222159.c7gbsnki880oc8kk@webmail.ira.uni-karlsruhe.de>
Could anybody please tell me how the discounting parameters for
modified Kneser-Ney smoothing (D1, D2, D3+) are computed when the
gtmin parameter is greater than 1?
In that case, the corresponding n_i would be zero, and we eventually
have to divide by this n_i to get one of the D_i's.
Many thanks,
From stolcke at icsi.berkeley.edu Fri Dec 14 13:22:05 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 14 Dec 2012 13:22:05 -0800
Subject: [SRILM User List] how do the script compute-best-mix work?
In-Reply-To: <1355518296.89321.YahooMailNeo@web125801.mail.ne1.yahoo.com>
References: <1355518296.89321.YahooMailNeo@web125801.mail.ne1.yahoo.com>
Message-ID: <50CB987D.8020101@icsi.berkeley.edu>
On 12/14/2012 12:51 PM, yasser hifny wrote:
> Hello,
> I would like to ask how the script compute-best-mix works, i.e., what
> is the idea behind it. Can you refer me to a paper explaining how it
> works?
It's an EM algorithm, where the underlying mixture distributions (the
individual LMs) are held fixed.
You can find the general theory at
https://en.wikipedia.org/wiki/Mixture_model .
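[A minimal sketch of the EM update described above, holding the component LMs fixed and re-estimating only the interpolation weights. Function and variable names here are illustrative, not SRILM's internal API; in practice the per-token probabilities would come from each model's ngram -debug 2 -ppl output.]

```python
def best_mix(probs, n_models, iters=50):
    """EM re-estimation of mixture weights for fixed component LMs.

    probs: list of tuples, probs[i][m] = probability that model m
           assigns to token i of the held-out text.
    Returns the interpolation weights (they always sum to 1).
    """
    lam = [1.0 / n_models] * n_models  # start from uniform weights
    for _ in range(iters):
        # E-step: posterior responsibility of each model for each token
        post_sums = [0.0] * n_models
        for p in probs:
            mix = sum(l * pm for l, pm in zip(lam, p))
            for m in range(n_models):
                post_sums[m] += lam[m] * p[m] / mix
        # M-step: new weight = average responsibility over all tokens
        lam = [s / len(probs) for s in post_sums]
    return lam
```

Because each token's posteriors sum to 1, the weights stay normalized at every iteration, and the held-out likelihood never decreases.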
Andreas
From stolcke at icsi.berkeley.edu Fri Dec 14 13:43:51 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 14 Dec 2012 13:43:51 -0800
Subject: [SRILM User List] gtmin and kndiscount
In-Reply-To: <20121214222159.c7gbsnki880oc8kk@webmail.ira.uni-karlsruhe.de>
References:
<20121214222159.c7gbsnki880oc8kk@webmail.ira.uni-karlsruhe.de>
Message-ID: <50CB9D97.4090000@icsi.berkeley.edu>
On 12/14/2012 1:21 PM, Mohammed Mediani wrote:
> Could anybody please tell me how the discounting parameters for
> modified Kneser-Ney smoothing (D1, D2, D3+) are computed in case the
> gtmin parameter is greater than 1?
> In such a case, the corresponding ni would be zero, and we eventually
> have to divide by this ni to get one of the Di's.
> Many thanks,
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
The gtmin parameter is applied (i.e., the ngrams with frequency below
the threshold are omitted from the model) AFTER the discounting
constants are computed, so the gtmin options don't affect the D1,D2,D3
computation.
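[For concreteness, the modified-KN discounts are functions of the counts-of-counts n1..n4 taken from the full (uncut) count data, which is why gtmin cannot affect them. A sketch following the standard Chen & Goodman formulas; the function name is illustrative:]

```python
def kn_discounts(n1, n2, n3, n4):
    """Modified Kneser-Ney discount constants D1, D2, D3+ from the
    counts-of-counts n_i = number of ngrams occurring exactly i times.
    These are computed BEFORE any gtmin cutoff is applied."""
    y = n1 / (n1 + 2.0 * n2)
    d1 = 1.0 - 2.0 * y * n2 / n1
    d2 = 2.0 - 3.0 * y * n3 / n2
    d3plus = 3.0 - 4.0 * y * n4 / n3
    return d1, d2, d3plus
```

If the input counts were pre-cutoff (as with the Google data), n1 or n2 is zero and these formulas break, which is what the make-big-lm extrapolation works around.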
You have a problem when frequency cutoffs have been applied to the Ngram
data BEFORE SRILM gets to see it. This is the case, e.g., with the
Google N-gram data. In that case, if you use the make-big-lm wrapper
script, an attempt will be made to extrapolate the low count-of-counts
from the higher ones, according to an empirical law that is described in
Figure 1 / Equation 1 of this paper.
Andreas
From medmediani at gmail.com Sat Dec 15 06:48:21 2012
From: medmediani at gmail.com (Mohammed Mediani)
Date: Sat, 15 Dec 2012 15:48:21 +0100
Subject: [SRILM User List] Interpolation of Unigrams
Message-ID:
Hi,
Are the unigrams always interpolated with 0-gram (probability of any word
from the vocab)?
I got the same probabilities for unigrams with and without -interpolate
(both with -kndiscount). Is it meant to be this way?
Many thanks for your help.
Mohammed
From stolcke at icsi.berkeley.edu Sat Dec 15 23:34:37 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sat, 15 Dec 2012 23:34:37 -0800
Subject: [SRILM User List] Interpolation of Unigrams
In-Reply-To:
References:
Message-ID: <50CD798D.30109@icsi.berkeley.edu>
On 12/15/2012 6:48 AM, Mohammed Mediani wrote:
> Hi,
> Are the unigrams always interpolated with 0-gram (probability of any
> word from the vocab)?
> I got the same probabilities for unigrams with and without
> -interpolate (both with -kndiscount). Is it meant to be this way?
> Many thanks for your help.
> Mohammed
The KN discounting strategy for unigrams only interpolates with the
zero-gram (uniform) estimate if the -interpolate flag is given.
This is just a special case of the interpolation happening at all
N-gram levels.
However, there is an independent step whereby unallocated unigram
probability mass is filled in by adding a uniform probability increment
to all words in the vocabulary. When this happens you see a message like
warning: distributing 0.0659302 left-over probability mass over all
26573 words
This happens for unigrams only, and regardless of what discounting
method is in effect, because otherwise that probability mass would be
"lost" and the model would be deficient.
It so happens that the effect of both strategies is the same when it
comes to unigrams, and that explains your observation.
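[The fill-in step can be sketched as follows; the function and argument names are illustrative, not SRILM's internal code:]

```python
def distribute_leftover(unigram_probs, vocab):
    """Add a uniform increment to every vocabulary word so the unigram
    distribution sums to 1, mirroring SRILM's 'distributing ...
    left-over probability mass over all N words' warning."""
    leftover = 1.0 - sum(unigram_probs.values())
    increment = leftover / len(vocab)
    return {w: unigram_probs.get(w, 0.0) + increment for w in vocab}
```

Words unseen in training end up with exactly the uniform increment, which is why this coincides with zero-gram interpolation at the unigram level.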
Andreas
From medmediani at gmail.com Mon Dec 17 01:41:04 2012
From: medmediani at gmail.com (Mohammed Mediani)
Date: Mon, 17 Dec 2012 10:41:04 +0100
Subject: [SRILM User List] Cutoff, probabilities, and backoffs
Message-ID:
Could anybody please tell me how the probabilities and the backoff weights
are computed in case we use -gtmin (with -kndiscount). Following Chen's
paper and the ngram-count man pages, I was unable to reproduce the same
results as ngram-count.
Many thanks,
Mohammed
From stolcke at icsi.berkeley.edu Mon Dec 17 12:46:15 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 17 Dec 2012 12:46:15 -0800
Subject: [SRILM User List] Cutoff, probabilities, and backoffs
In-Reply-To:
References:
Message-ID: <50CF8497.6090600@icsi.berkeley.edu>
On 12/17/2012 1:41 AM, Mohammed Mediani wrote:
> Could anybody please tell me how the probabilities and the backoff
> weights are computed in case we use -gtmin (with -kndiscount).
> Following Chen's paper and the ngram-count man pages, I was unable to
> reproduce the same results as ngram-count.
As I explained in a previous email, the -gtmin parameter doesn't change
the way discounting is computed. It just eliminates ngrams from the
model AFTER you compute their probabilities. Of course this frees up
probability mass, which is then reallocated using the backoff mechanism
(that is, the backoff weights change as a result). You can think of the
process in three steps, plus the 0th step that is particular to KN methods:
0. Replace the lower-order counts based on the ngram type frequencies
(if you use the -write option you can save these modified counts to a
file to see what the effect is).
1. compute discounts for each ngram, and then their probabilities (use
ngram-count -debug 4 to get a detailed record of the quantities involved
in this step)
2. remove ngrams due to the -gtmin (or entropy pruning criterion, if
specified)
3. compute backoff weights (to normalize the model).
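[Step 3 can be sketched as follows for a single backoff context; this is an illustrative reconstruction of the normalization, not SRILM's internal code:]

```python
def backoff_weight(retained_probs, lower_probs):
    """Backoff weight for one context, computed AFTER pruning.

    retained_probs: p(w | context) for ngrams kept in the model
                    (survivors of the gtmin / pruning step).
    lower_probs:    p(w | shorter context) for all words.
    """
    # Probability mass freed up by ngrams removed or never included
    num = 1.0 - sum(retained_probs.values())
    # Lower-order mass not already covered by the retained ngrams
    den = 1.0 - sum(lower_probs[w] for w in retained_probs)
    return num / den
```

Removing ngrams in step 2 shrinks the retained set, which changes both sums, so the backoff weights necessarily differ from the uncut model even though the surviving explicit probabilities do not.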
Andreas
From medmediani at gmail.com Mon Dec 17 13:52:18 2012
From: medmediani at gmail.com (Mohammed Mediani)
Date: Mon, 17 Dec 2012 22:52:18 +0100
Subject: [SRILM User List] Cutoff, probabilities, and backoffs
In-Reply-To: <50CF8497.6090600@icsi.berkeley.edu>
References:
<50CF8497.6090600@icsi.berkeley.edu>
Message-ID:
Thank you very much Andreas,
In fact, I have done everything you have just suggested:
- Modify the counts
- Compute smoothing parameters (discount constants)
- Compute the probabilities
- Remove the rare ngrams according to gtmin
- Compute the backoffs.
I get the exact same numbers for both probabilities and backoffs if no
gtmin is specified. But in the presence of cutoffs, I get slightly
different numbers (e.g. if gt3min=2 I get slightly different backoffs for
2-grams). I thought I had done something wrong, since I still can't get
the backoffs right. If there is no special attention to be paid to
different cases, then I just need to look into it further.
Once again, many many thanks for your kind help and great cooperation.
Mohammed
On Mon, Dec 17, 2012 at 9:46 PM, Andreas Stolcke
wrote:
> On 12/17/2012 1:41 AM, Mohammed Mediani wrote:
>
>> Could anybody please tell me how the probabilities and the backoff
>> weights are computed in case we use -gtmin (with -kndiscount). Following
>> Chen's paper and the ngram-count man pages, I was unable to reproduce the
>> same results as ngram-count.
>>
>
> As I explained in a previous email, the -gtmin parameter doesn't change
> the way discounting is computed. It just eliminates ngrams from the model
> AFTER you compute their probabilities. Of course this frees up probability
> mass, which is then reallocated using the backoff mechanism (that is, the
> backoff weights change as a result). You can think of the process in three
> steps, plus the 0th step that is particular to KN methods:
>
> 0. Replace the lower-order counts based on the ngram type frequencies (if
> you use the -write option you can save these modified counts to a file to
> see what the effect is).
> 1. compute discounts for each ngram, and then their probabilities (use
> ngram-count -debug 4 to get a detailed record of the quantities involved in
> this step)
> 2. remove ngrams due to the -gtmin (or entropy pruning criterion, if
> specified)
> 3. compute backoff weights (to normalize the model).
>
> Andreas
>
>
From stolcke at icsi.berkeley.edu Mon Dec 17 14:05:45 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 17 Dec 2012 14:05:45 -0800
Subject: [SRILM User List] Cutoff, probabilities, and backoffs
In-Reply-To:
References:
<50CF8497.6090600@icsi.berkeley.edu>
Message-ID: <50CF9739.2080900@icsi.berkeley.edu>
On 12/17/2012 1:52 PM, Mohammed Mediani wrote:
> Thank you very much Andreas,
> In fact, I have done all what you have just suggested.
> - Modify the counts
> - Compute smoothing parameters (discount constants)
> - Compute the probabilities
> - Remove the rare ngrams according to gtmin
> - Compute the backoffs.
>
> I get the exact same numbers for both probabilities and backoffs if no
> gtmin is specified. But in the presence of cutoffs, I get slightly
> different numbers (e.g. if gt3min=2 I get slightly different backoffs for
> 2-grams). I thought I had done something wrong, since I still can't get
> the backoffs right. If there is no special attention to be paid to
> different cases, then I just need to look into it further.
The ngram probabilities should be the same. The backoff weights MUST be
different, since you are backing off for more of the ngrams when choosing
a higher gtmin threshold.
Andreas
From is13-noreply at inria.fr Sun Dec 23 14:23:55 2012
From: is13-noreply at inria.fr (Interspeech 2013 - First announcement)
Date: Sun, 23 Dec 2012 23:23:55 +0100
Subject: [SRILM User List] Interspeech 2013 - First Announcement
Message-ID: <50D7847B.8080101@inria.fr>
First Announcement
http://www.interspeech2013.org/calls/
INTERSPEECH is the world's largest and most comprehensive conference on
challenges surrounding the science and technology of Spoken Language
Processing (SLP) both in humans and machines. It is our great pleasure
to announce that Interspeech 2013 will be hosted by the Center of
Congress of Lyon (France), under the sponsorship of the International
Speech Communication Association (ISCA).
Interspeech 2013 will be the 14th conference in the annual series of
Interspeech events and will be held in Lyon, France, 25-29 August 2013.
The theme of Interspeech 2013 is "Speech in Life Sciences and Human
Societies". Life sciences cover a large set of disciplines, such as
biology, medicine, anthropology, or ecology, and deal with living
organisms and their organization, life processes, and relationships to
each other and their environment. In that respect, speech appears as a
key aspect of human activity that runs across these multiple sides of
life sciences. Besides the conventional acoustic and linguistic
viewpoints of speech communication, biology, psychology and sociology
provide additional angles to approach speech as the keystone means of
communication, cognition and interaction in human societies. Under
this theme the conference will emphasize an interdisciplinary approach
covering all aspects of speech science and technology spanning the basic
theories to applications.
Besides regular oral and poster sessions, plenary talks by
internationally renowned experts, tutorials, exhibits, and special
sessions are planned. For the complete Call for Papers and other Calls
please visit our website at: www.interspeech2013.org
We look forward to welcoming you to INTERSPEECH 2013 in Lyon!
Sincerely,
-----------------------------------------------------------------------------------------------------------
CALL FOR SATELLITE WORKSHOPS (Workshops at Interspeech2013.org)
The call for Satellite workshops is now closed
Notification of acceptance and ISCA approval / sponsorship will be
launched on October 30, 2012
CALL FOR TUTORIALS
Submission Deadline: December 15, 2012
Notification of acceptance: January 15, 2013
Tutorial proposals covering interdisciplinary topics and/or important
new emerging areas of interest related to the main conference topics are
encouraged. Visit the Tutorial Page of the conference website for more
information and to submit a tutorial proposal.
CALL FOR SHOW AND TELL AND OTHER SPECIAL EVENTS
Submission Deadline: January 4, 2013
Notification of acceptance: January 11, 2013
The Show & Tell sessions gather researchers, engineering groups, and
practitioners from academia, industry, and governmental institutes in a
unique opportunity to demonstrate their most advanced research systems
and interact with the conference attendees in an informal way.
Other special events: less formal events about "SPEECH SCIENCES AND
INNOVATIONS IN OUR FUTURE SOCIETY" are encouraged.
CALL FOR SPECIAL SESSIONS
Submission Deadline: January 15, 2013
Notification of pre-acceptance: February 15, 2013
Special sessions can be an opportunity to bring researchers in relevant
fields of interest outside the traditional speech and language fields,
together with the Interspeech community. Click here to learn more about
the updated 2013 special session submission process.
Special Session Topics will be defined after May 13, 2013
CALL FOR PAPERS
Submission Deadline: March 10, 2013
Notification of acceptance: May 13, 2013
We invite you to submit original papers in any related area, including -
but not limited to:
* Speech Perception and Production
* Phonology, Phonetics
* Para-/Non- linguistic Information
* Language Processing
* Analysis, Enhancement and Coding of Speech and Audio Signals
* Speaker and Language Identification
* Speech & Spoken Language Generation, Speech Synthesis
* Automatic Speech Recognition (ASR)
* Technologies and Systems for New Applications
* Spoken Dialogue System, Spoken Language Understanding, Speech Translation,
* Information Retrieval
* Application, Evaluation, Standardization, Spoken Language Resource
* Prostheses and aids for speech communication disorders
* Diagnosis and tools for speech therapy
* Emotional speech and affective computing
* Cognitive models of speech production and perception
* Natural and artifactual speech interaction
* Models for the origin and development of speech
* Speech communication diversity among people, languages and dialects
Paper Submission Procedure
The working language of the conference is English. Papers for the
INTERSPEECH 2013 proceedings should be up to 4 pages of content plus one
page of references in length and conform to the format given in the
paper preparation guidelines and author kits which will be available on
the INTERSPEECH 2013 website along with the Final Call for Papers.
Optionally, authors may submit additional files, such as multimedia
files, to be included on the Proceedings USB key. Authors shall also
declare that their contributions are original and not being submitted
for publication elsewhere (e.g. another conference, workshop, or
journal). Papers must be submitted via the on-line paper submission
system, which will open on February 24, 2013. The deadline for
submitting a paper is March 10, 2013. This date will not be extended.
2013 CONFERENCE VENUE
The Conference will be held at the Cité Centre des Congrès, Lyon,
France. Click here (http://www.ccc-lyon.com/) to learn more about the CCC.
PAPER SUBMISSIONS AND REGISTRATION INFORMATION
Conference registration and the call-for-papers submission will open on
February 24, 2013. Visit the Conference website to keep abreast of
program developments.
SPONSORSHIP AND EXHIBIT OPPORTUNITIES
Want to increase your visibility and access a market of over 800
conference attendees? Apply online to become a sponsor or exhibitor of
the 2013 Conference. Benefits are level based and are on a first come,
first served basis.