From stolcke at speech.sri.com Sun Jul 11 09:03:49 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 11 Jul 2010 09:03:49 -0700
Subject: [SRILM User List] SRILM-User list working again
Message-ID: <4C39EB65.7040006@speech.sri.com>

I'm happy to report that the mailing list is working again. The problem was that a python upgrade on our systems crashed the mailman queue processing daemon back in May -- sorry for the long wait!

Andreas

From stolcke at speech.sri.com Sun Jul 11 09:18:33 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 11 Jul 2010 09:18:33 -0700
Subject: [SRILM User List] lattice-tool related issues
In-Reply-To: 
References: <0BED8387-8953-419D-9436-E57FC5CC662C@jhu.edu>
Message-ID: <4C39EED9.2050206@speech.sri.com>

Anoop Deoras wrote:
> Hello Andreas,
>
> My message got bounced from the mailing list. I am hence sending the
> email directly to you.

Probably due to the problems with the mailing list software that are now solved.

> Thanks and Regards
> Anoop
>
> Begin forwarded message:
>
>> From: Anoop Deoras
>> Date: July 10, 2010 2:51:07 PM EDT
>> To: srilm-user at speech.sri.com
>> Cc: Anoop Deoras
>> Subject: lattice-tool related issues
>>
>> Hello Andreas,
>>
>> I need to generate an N-best list from an HTK format lattice and
>> unfortunately I am not able to suppress the default operation of
>> deleting the duplicates.
>>
>> I give the following command:
>>
>> lattice-tool -read-htk -in-lattice test.lat -nbest-decode 10
>> -nbest-duplicates 10 -out-nbest-dir my_nbest_dir
>>
>> I have the following test lattice:
>> test.lat
>> **************************************************
>> VERSION=1.0
>> UTTERANCE=test.mfc
>> lmname=test.bg
>> lmscale=16.00 wdpenalty=0.00
>> acscale=1.00
>> vocab=test
>> N=439 L=1108
>> I=0 t=0.00 W=!NULL
>> I=1 t=0.02 W=b v=1
>> I=2 t=0.08 W=b v=1
>> I=3 t=0.14 W=c v=1
>> J=0 S=0 E=1 a=0.00 l=0.000
>> J=1 S=1 E=3 a=-382.52 l=-3.730
>> J=2 S=0 E=2 a=-669.26 l=-3.730
>> J=3 S=2 E=3 a=0.00 l=0.00
>>
>> **************************************************
>>
>> I get the following nbest hypotheses:
>> $: less my_nbest_dir/test.mfc.gz
>>
>> -166.126 -1.61992 2 b c
>>
>> (the scores get divided by the natural log of 10.)
>> ******************
>> The nbest file contains just one hypothesis instead of two. lattice-tool
>> has deleted the duplicate hypothesis. In spite of specifying the
>> -nbest-duplicates option, I don't see the duplicates.
>>
>> To check whether deletion of duplicates is the only issue: if we replace
>> the word at node I=2 by, say, 'd', then we do get 2 hypotheses, i.e.,
>> b c AND d c
>>
>> Am I missing any specific flag required to get duplicates in N-best
>> lists?

The -nbest-duplicates option is no longer supported by the new nbest implementation that performs LM rescoring on the fly (as of SRILM 1.5.7). However, it still works if you enable use of the "old" decoding method. So use

lattice-tool -old-decoding -nbest-duplicates 10 ...

Andreas

>>
>> Thanks and Regards
>> Anoop
>

From stolcke at speech.sri.com Sun Jul 11 10:37:20 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 11 Jul 2010 10:37:20 -0700
Subject: [SRILM User List] sausage edges merging
In-Reply-To: <003e01caf692$dc50a050$94f1e0f0$@ac.uk>
References: <003e01caf692$dc50a050$94f1e0f0$@ac.uk>
Message-ID: <4C3A0150.5080204@speech.sri.com>

Tim Kempton wrote:
>
> Hi
>
> I'm trying to prevent the merging of sausage (i.e. confusion network)
> edges. I thought the option "-no-merge" might work, e.g.
> I was hoping to see two edges labelled "hello" in the following
> sausage but there is just one:
>
> -bash-3.2$ echo -e "-5 -1 2 hello world\n-5 -2 2 hello mould"
> |nbest-lattice -use-mesh -nbest - -write - -no-merge
>
> hello world
> name -
> numaligns 2
> posterior 1
> align 0 hello 1
> align 1 world 0.909091 mould 0.0909091
>
> I get the same result whether I use the "-no-merge" option or not.
> Maybe I've got the wrong end of the stick and this option is for
> something else. I am using SRILM version 1.5.8, but I don't believe
> there have been any relevant changes to nbest-lattice since then.
>
> The reason I want to do this is to preserve timing information from an
> NBestList2.0 list; when the edges get merged there is also a loss of
> backtrace information (when using -nbest-backtrace).
>
Sorry, but -no-merge has no effect with -use-mesh, because word confusion networks as implemented only support unique word labels per alignment position (all the information is hashed on the word type).

However, you can work around this by (1) making the word labels unique -- hello-1, hello-2, etc. -- and (2) using the -dictionary option to specify an alignment cost based on dictionary pronunciations. The pronunciations could be real ones (so hello-1, hello-2, etc. all have the same pronunciation and hence align), or you could even use dummy pronunciations that just consist of the "real" word labels: hello-1 -> hello, hello-2 -> hello, etc. That way the alignment cost will exactly mimic the usual word identity.

Note I haven't tried this, but it should work.

Andreas

> Thanks,
>
> Tim
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

From stolcke at speech.sri.com Sun Jul 11 11:35:33 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 11 Jul 2010 11:35:33 -0700
Subject: [SRILM User List] lattice-tool
In-Reply-To: <642280.27633.qm@web28604.mail.ukl.yahoo.com>
References: <642280.27633.qm@web28604.mail.ukl.yahoo.com>
Message-ID: <4C3A0EF5.7070203@speech.sri.com>

ali sadiqui wrote:
> thank you for your answer.
> Indeed, I knew that ngram-count was the right command for creating a
> language model, but my confusion comes from this:
>
> During the segmentation of an Arabic word following the Prefix-Stem-Suffix model,
> a word "B" can give several results.
>
> Suppose that the word B gives rise to 3 segmentation results:
>
> b1 = mot1 + suf1 (mot1 can be noted stem1)
>
> b2 = pref1 + mot2
>
> b3 = mot3
>
> Starting from the corpus "A B C D E" I create a file (by programming):
>
> A mot1 suf1 C D E
>
> A pref1 mot2 C D E
>
> A mot3 C D E
>
> (to create all the possible ways)
>
> Then, using SRILM, I will create a language model of order 3 (for example)
> to use afterwards to support the decomposition of other words.
>
> My question is:
>
> - I supposed that I would need to create lattices; is that true or false?
>
> - If it is true, how do I proceed to use lattice-tool?
>
> I am very grateful for your help.
>
> Ali Sadiqui
> --- On Thu, 22.4.10, Andreas Stolcke wrote:

Ali, sorry for not responding earlier.

Your desire to use lattices now makes sense. You need to encode your morphologically analyzed training data as lattices in either the HTK or the PFSG format. PFSG is more limited but should be enough in your case.
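As an untested sketch (word labels taken from your example above; for a shortened sentence "A B C"; transition weights are in SRILM's bytelog scale and are set to 0, i.e., probability 1, just to show the topology -- a real lattice would put weights on the three variants), a PFSG could look like:

name sent1
nodes 9 NULL A mot1 suf1 pref1 mot2 mot3 C NULL
initial 0
final 8
transitions 10
0 1 0
1 2 0
2 3 0
3 7 0
1 4 0
4 5 0
5 7 0
1 6 0
6 7 0
7 8 0

The three paths through this graph spell out A mot1 suf1 C, A pref1 mot2 C, and A mot3 C.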
See the pfsg-format(5) man page for a description. There are also some examples in $SRILM/lattice/test/tests/lattice-expansion/.

After each sentence is encoded as a lattice, you would use

lattice-tool -in-lattice-list ... -write-ngrams NGRAMS

to generate ngram counts from the corpus. Then you can train the LM using

ngram-count -float-counts -read NGRAMS -lm ...

Note that the counts will be fractional, so you can only use certain smoothing methods, like -wbdiscount.

If you have trouble with the lattice generation you can also generate the ngram counts yourself.

Note there are more sophisticated ways to model Arabic morphology, using factored LMs (FLMs). Google the work of Katrin Kirchhoff; she developed FLMs partly for this purpose, and they are now incorporated in SRILM (if you have questions about this approach contact her directly).

Andreas

>> From: Andreas Stolcke
>> Subject: Re: [SRILM User List] lattice-tool
>> To: "ali sadiqui"
>> Cc: srilm-user at speech.sri.com
>> Date: Thursday, April 22, 2010, 6:42 AM
>> ali sadiqui wrote:
>>
>>> hi,
>>> I am a beginner with SRILM.
>>> I would like to create a lattice from the corpus
>>> "A B{b1, b2, b3} C" and then create a language model.
>>> I know I have to use the tool lattice-tool, but how do
>>> I proceed? I was stuck there. I guess I should create
>>> a file in PFSG format, but:
>>> If so:
>>> How to define the nodes?
>>> How to calculate the cost?
>>> Is this done manually or using a command?
>>> In short, how to fill it?
>>>
>>> I am very grateful for your help.
>>> thank you for your help
>>>
>> I think you are confused about how to build language
>> models. You typically create LMs directly from ngram
>> counts extracted from a corpus, with no need to build
>> lattices.
>> Please consult the file $SRILM/doc/lm-intro for the most
>> basic procedures, and the FAQ file and recommended
>> textbooks for more details.
>>
>> Andreas
>>
>>> _______________________________________________
>>> SRILM-User site list
>>> SRILM-User at speech.sri.com
>>> http://www.speech.sri.com/mailman/listinfo/srilm-user
>>>
>

------------------------------------------------------------------------
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/bmp
Size: 18194 bytes
Desc: not available
URL: 

From jp4work at gmail.com Sat Jul 17 18:17:35 2010
From: jp4work at gmail.com (JIA Pei)
Date: Sun, 18 Jul 2010 09:17:35 +0800
Subject: [SRILM User List] How to compile srilm-1.5.11 under Ubuntu 10.04?
Message-ID: 

Hi, all:

I'm trying to install srilm-1.5.11 on my laptop.
My environment:
Ubuntu 10.04
gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3
srilm-1.5.11

Following the description in the file "INSTALL", I just ran "make" on srilm from its top-level folder, but obtained the following error messages.
jiapei at jiapei-laptop:~/Tools/speechrecog/srilm-1.5.11$ make
make: /sbin/machine-type: Command not found
mkdir include lib bin
mkdir: cannot create directory `lib': File exists
make: [dirs] Error 1 (ignored)
make init
make[1]: /sbin/machine-type: Command not found
make[1]: Entering directory `/home/jiapei/Tools/speechrecog/srilm-1.5.11'
for subdir in misc dstruct lm flm lattice utils; do \
        (cd $subdir/src; make SRILM= MACHINE_TYPE= OPTION= MAKE_PIC= init) || exit 1; \
done
make[2]: Entering directory `/home/jiapei/Tools/speechrecog/srilm-1.5.11/misc/src'
Makefile:24: /common/Makefile.common.variables: No such file or directory
Makefile:139: /common/Makefile.common.targets: No such file or directory
make[2]: *** No rule to make target `/common/Makefile.common.targets'.  Stop.
make[2]: Leaving directory `/home/jiapei/Tools/speechrecog/srilm-1.5.11/misc/src'
make[1]: *** [init] Error 1
make[1]: Leaving directory `/home/jiapei/Tools/speechrecog/srilm-1.5.11'
make: *** [World] Error 2

Why are there so many error messages, and is there a standard way to "configure; make; make install" this package?

Best Regards
JIA

--
Welcome to Vision Open
http://www.visionopen.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com Sat Jul 17 19:09:01 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 17 Jul 2010 19:09:01 -0700
Subject: [SRILM User List] How to compile srilm-1.5.11 under Ubuntu 10.04?
In-Reply-To: Your message of Sun, 18 Jul 2010 09:17:35 +0800.
Message-ID: <201007180209.o6I291j05186@huge>

I think someone had this problem recently.
Make sure you have csh/tcsh installed. The machine-type script uses it.

To check, run $SRILM/sbin/machine-type and make sure it outputs something sensible.

--Andreas

In message you wrote:
>
> Hi, all:
>
> I'm trying to install srilm-1.5.11 on my laptop.
> My environment:
> Ubuntu 10.04
> gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3
> srilm-1.5.11
>
> Following the description in the file "INSTALL", I just ran "make" on
> srilm from its top-level folder, but obtained the following error messages.
>
> jiapei at jiapei-laptop:~/Tools/speechrecog/srilm-1.5.11$ make
> make: /sbin/machine-type: Command not found
> mkdir include lib bin
> mkdir: cannot create directory `lib': File exists
> make: [dirs] Error 1 (ignored)
> make init
> make[1]: /sbin/machine-type: Command not found
> make[1]: Entering directory `/home/jiapei/Tools/speechrecog/srilm-1.5.11'
> for subdir in misc dstruct lm flm lattice utils; do \
>         (cd $subdir/src; make SRILM= MACHINE_TYPE= OPTION= MAKE_PIC=
> init) || exit 1; \
> done
> make[2]: Entering directory
> `/home/jiapei/Tools/speechrecog/srilm-1.5.11/misc/src'
> Makefile:24: /common/Makefile.common.variables: No such file or directory
> Makefile:139: /common/Makefile.common.targets: No such file or directory
> make[2]: *** No rule to make target `/common/Makefile.common.targets'.
> Stop.
> make[2]: Leaving directory
> `/home/jiapei/Tools/speechrecog/srilm-1.5.11/misc/src'
> make[1]: *** [init] Error 1
> make[1]: Leaving directory `/home/jiapei/Tools/speechrecog/srilm-1.5.11'
> make: *** [World] Error 2
>
> Why are there so many error messages, and is there a standard way to
> "configure; make; make install" this package?
>
> Best Regards
> JIA
>
> --
> Welcome to Vision Open
> http://www.visionopen.com
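As a quick sanity check (a sketch -- the exact tag depends on your platform, but on the 32-bit Linux systems seen elsewhere on this list it is i686), a working setup run from the SRILM top-level directory looks like:

$ csh -c 'echo csh is installed'
csh is installed
$ ./sbin/machine-type
i686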

From wqfengnlpr at gmail.com Thu Jul 22 07:49:18 2010
From: wqfengnlpr at gmail.com (Wang Qiufeng)
Date: Thu, 22 Jul 2010 22:49:18 +0800
Subject: [SRILM User List] why some trigrams are lost?
Message-ID: <201007222249153729808@gmail.com>

Hi, all:
I'm trying to get the trigrams from the text with the command:
"ngram-count -text char.txt -lm char.tri -order 3"
and the content of char.txt is: a b a b a d a e
Only one trigram appears in the result file char.tri. Why are the other trigrams lost, like "b a b", "b a d", ...?
The content of char.tri is:

\data\
ngram 1=6
ngram 2=7
ngram 3=1

\1-grams:
-0.9208187 </s>
-99 <s> -0.06445797
-0.3767507 a -0.4313637
-0.6575773 b -0.2405493
-0.9208187 d -0.06445797
-0.9208187 e -0.2455126

\2-grams:
-0.30103 <s> a
-0.39794 a b 0
-0.69897 a d
-0.69897 a e
-0.1760913 b a
-0.30103 d a
-0.30103 e </s>

\3-grams:
-0.1760913 a b a

\end\

2010-07-22
Wang Qiufeng
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com Thu Jul 22 09:11:51 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 22 Jul 2010 09:11:51 -0700
Subject: [SRILM User List] why some trigrams are lost?
In-Reply-To: <201007222249153729808@gmail.com>
References: <201007222249153729808@gmail.com>
Message-ID: <4C486DC7.2020506@speech.sri.com>

By default, trigrams, 4-grams, ..., must occur at least twice in the training data to be included in the model.
To change that use

ngram-count -gt3min 1 -gt4min 1 ...

Andreas

PS. This has become a FAQ ...

Wang Qiufeng wrote:
> Hi, all:
> I'm trying to get the trigrams from the text with the command:
> "ngram-count -text char.txt -lm char.tri -order 3"
> and the content of char.txt is: a b a b a d a e
> Only one trigram appears in the result file char.tri. Why are the
> other trigrams lost, like "b a b", "b a d", ...?
> The content of char.tri is:
>
> \data\
> ngram 1=6
> ngram 2=7
> ngram 3=1
> \1-grams:
> -0.9208187 </s>
> -99 <s> -0.06445797
> -0.3767507 a -0.4313637
> -0.6575773 b -0.2405493
> -0.9208187 d -0.06445797
> -0.9208187 e -0.2455126
> \2-grams:
> -0.30103 <s> a
> -0.39794 a b 0
> -0.69897 a d
> -0.69897 a e
> -0.1760913 b a
> -0.30103 d a
> -0.30103 e </s>
> \3-grams:
> -0.1760913 a b a
> \end\
> 2010-07-22
> ------------------------------------------------------------------------
> Wang Qiufeng
> ------------------------------------------------------------------------
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

From ravi.ranjan.coer at gmail.com Sat Jul 31 07:14:24 2010
From: ravi.ranjan.coer at gmail.com (Ravi Ranjan)
Date: Sat, 31 Jul 2010 19:44:24 +0530
Subject: [SRILM User List] help on building srilm
Message-ID: 

While building SRILM on my 32-bit Windows Vista Home Premium machine I got the following errors:

$ make World

mkdir include lib bin
mkdir: cannot create directory `include': File exists
mkdir: cannot create directory `lib': File exists
mkdir: cannot create directory `bin': File exists
make: [dirs] Error 1 (ignored)
make init
make[1]: Entering directory `/cygdrive/c/cygwin/home/maa/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
        (cd $subdir/src; make SRILM= MACHINE_TYPE=cygwin OPTION= MAKE_PIC= init) || exit 1; \
done
make[2]: Entering directory `/cygdrive/c/cygwin/home/maa/srilm/misc/src'
Makefile:24: /common/Makefile.common.variables: No such file or directory
Makefile:139: /common/Makefile.common.targets: No such file or directory
make[2]: *** No rule to make target `/common/Makefile.common.targets'.  Stop.
make[2]: Leaving directory `/cygdrive/c/cygwin/home/maa/srilm/misc/src'
make[1]: *** [init] Error 1
make[1]: Leaving directory `/cygdrive/c/cygwin/home/maa/srilm'
make: *** [World] Error 2

Ravi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com Sat Jul 31 08:44:24 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 31 Jul 2010 08:44:24 -0700
Subject: [SRILM User List] help on building srilm
In-Reply-To: Your message of Sat, 31 Jul 2010 19:44:24 +0530.
Message-ID: <201007311544.o6VFiOj19143@huge>

This is the very first FAQ.
Please check http://www-speech.sri.com/projects/srilm/manpages/srilm-faq.7.html

--Andreas

In message you wrote:
>
> While building SRILM on my 32-bit Windows Vista Home Premium machine I got the
> following errors:
>
> $ make World
>
> mkdir include lib bin
> mkdir: cannot create directory `include': File exists
> mkdir: cannot create directory `lib': File exists
> mkdir: cannot create directory `bin': File exists
> make: [dirs] Error 1 (ignored)
> make init
> make[1]: Entering directory `/cygdrive/c/cygwin/home/maa/srilm'
> for subdir in misc dstruct lm flm lattice utils; do \
>         (cd $subdir/src; make SRILM= MACHINE_TYPE=cygwin OPTION= MAKE_PIC= init) || exit 1; \
> done
> make[2]: Entering directory `/cygdrive/c/cygwin/home/maa/srilm/misc/src'
> Makefile:24: /common/Makefile.common.variables: No such file or directory
> Makefile:139: /common/Makefile.common.targets: No such file or directory
> make[2]: *** No rule to make target `/common/Makefile.common.targets'.  Stop.
> make[2]: Leaving directory `/cygdrive/c/cygwin/home/maa/srilm/misc/src'
> make[1]: *** [init] Error 1
> make[1]: Leaving directory `/cygdrive/c/cygwin/home/maa/srilm'
> make: *** [World] Error 2
>
> Ravi
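Note the empty SRILM= in the recursive make above: the make variable pointing at the source tree was never set, which is why /common/Makefile.common.variables is looked up at the filesystem root. A minimal sketch of the fix the FAQ describes, using the path from the transcript:

cd /cygdrive/c/cygwin/home/maa/srilm
make SRILM=$PWD MACHINE_TYPE=cygwin World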

From adeoras at jhu.edu Mon Aug 2 10:33:17 2010
From: adeoras at jhu.edu (Anoop Deoras)
Date: Mon, 2 Aug 2010 13:33:17 -0400
Subject: [SRILM User List] lattice-tool related issues
Message-ID: <5D1CA95A-F9E4-417E-B276-DE8056B3F254@jhu.edu>

Hello,

I am trying to rescore htk lattices using lattice-tool and am running into the following issues:

1. I pass a 3gm language model and a vocabulary file to rescore the lattice (encoding bigram information) and then write back the updated and expanded lattice in the htk format.

However, when I specify the -unk and -keep-unk flags, the OOV words get mapped to <unk> without preserving the original label. I was under the impression that -keep-unk would preserve the label of the OOV word, but it does not do so.

2. Before I rescore the lattice, I want to split some words (multiword units). The multiwords are connected by an underscore character. I hence use the flags -split-multiwords -multi-char _

All goes well as long as I do not use -unk -keep-unk in conjunction with -split-multiwords. If I use the -unk -keep-unk flags (for point 1 above) and also use the -split-multiwords flag, then the multiword functionality does not work; moreover, the OOV words get mapped to <unk>.

I should point out that the multiword unit is NOT in my vocabulary, but after the split all the individual words are found in the vocabulary. Hence, I suspect that the functionality for the flag -unk takes place before the splitting, and since no multiword unit is in the vocabulary, the -split-multiwords functionality does not have anything to split.

I was wondering if there is any way we can invoke the split-multiword functionality before mapping unknown words?

I apologize if I am not understanding lattice-tool well enough and am passing wrong arguments in the first place.

Thanks and Regards
-Anoop

From stolcke at speech.sri.com Tue Aug 3 16:29:55 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 03 Aug 2010 16:29:55 -0700
Subject: [SRILM User List] lattice-tool related issues
In-Reply-To: Your message of Mon, 02 Aug 2010 13:33:17 -0400. <5D1CA95A-F9E4-417E-B276-DE8056B3F254@jhu.edu>
Message-ID: <201008032329.o73NTtj15149@huge>

In message <5D1CA95A-F9E4-417E-B276-DE8056B3F254 at jhu.edu> you wrote:
> Hello,
>
> I am trying to rescore htk lattices using lattice-tool and am
> running into the following issues:
>
> 1. I pass a 3gm language model and a vocabulary file to rescore the
> lattice (encoding bigram information) and
> then write back the updated and expanded lattice in the htk format.
>
> However, when I specify the -unk and -keep-unk flags, the OOV words get
> mapped to <unk> without preserving the
> original label. I was under the impression that -keep-unk would
> preserve the label of the OOV word, but it does not do so.

I just looked at the code, and it seems that -keep-unk is only implemented when reading HTK format lattices, not for PFSGs.
Is that what you are using?

If you are using HTK lattices then please prepare some small input data files that demonstrate the problem, and I can look into it when I get a chance.

>
> 2. Before I rescore the lattice, I want to split some words (multiword
> units). The multiwords are connected by an
> underscore character. I hence use the flags -split-multiwords -multi-char _
>
> All goes well as long as I do not use -unk -keep-unk in
> conjunction with -split-multiwords. If I use the -unk -keep-unk flags
> (for point 1 above) and also use the -split-multiwords flag, then the
> multiword functionality does not work; moreover, the OOV
> words get mapped to <unk>.
>
> I should point out that the multiword unit is NOT in my vocabulary,
> but after the split all the individual words are found
> in the vocabulary. Hence, I suspect that the functionality for
> the flag -unk takes place before the splitting,
> and since no multiword unit is in the vocabulary, the -split-multiwords
> functionality does not have anything to split.
>
> I was wondering if there is any way we can invoke the split-multiword
> functionality before mapping unknown words?

The way it works is that upon reading the lattice (before any operation on it), word labels are converted to integers. Normally a new word generates a new integer automatically, but with -unk and -keep-unk unknown words are mapped to the <unk> integer code.

So therefore, the splitting won't work if the multiwords themselves are not in the vocabulary.

A workaround is to do the multiword splitting in a separate processing pass, where lattice-tool is invoked WITHOUT -unk.

Andreas

>
> I apologize if I am not understanding lattice-tool well enough and
> am passing wrong arguments in the first place.
>
> Thanks and Regards
> -Anoop
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

From adeoras at jhu.edu Tue Aug 3 18:02:03 2010
From: adeoras at jhu.edu (Anoop Deoras)
Date: Tue, 3 Aug 2010 21:02:03 -0400
Subject: [SRILM User List] lattice-tool related issues
In-Reply-To: <201008032329.o73NTtj15149@huge>
References: <201008032329.o73NTtj15149@huge>
Message-ID: 

On Aug 3, 2010, at 7:29 PM, Andreas Stolcke wrote:
>
> In message <5D1CA95A-F9E4-417E-B276-DE8056B3F254 at jhu.edu> you wrote:
>> Hello,
>>
>> I am trying to rescore htk lattices using lattice-tool and am
>> running into the following issues:
>>
>> 1. I pass a 3gm language model and a vocabulary file to rescore the
>> lattice (encoding bigram information) and
>> then write back the updated and expanded lattice in the htk format.
>>
>> However, when I specify the -unk and -keep-unk flags, the OOV words get
>> mapped to <unk> without preserving the
>> original label. I was under the impression that -keep-unk would
>> preserve the label of the OOV word, but it does not do so.
>
> I just looked at the code, and it seems that -keep-unk is only implemented
> when reading HTK format lattices, not for PFSGs.
> Is that what you are using?
>
> If you are using HTK lattices then please prepare some small input data
> files that demonstrate the problem, and I can look into it when I get a chance.
>

Hi Andreas,

I am, in fact, using HTK lattices.

I was doing some debugging myself and noticed that when the rescoring LM is of the same order as that of the lattice (i.e. if the lattice expansion is not required), then -keep-unk works fine. When I use a higher order LM, it fails.

I have uploaded the data at: 

Please run RescoreLattice.sh to process the HTK lattice file.
I have kept the necessary vocabulary and trigram and bigram LM files too. (Note: the input lattice encodes a bigram history, and hence a trigram rescoring LM expands the lattice.)

The word 'slash' is out of vocabulary. A bigram rescoring keeps it intact while a trigram rescoring maps it to <unk>

>> 2. Before I rescore the lattice, I want to split some words
>> (multiword units). The multiwords are connected by an
>> underscore character. I hence use the flags -split-multiwords
>> -multi-char _
>>
>> All goes well as long as I do not use -unk -keep-unk in
>> conjunction with -split-multiwords. If I use the -unk -keep-unk flags
>> (for point 1 above) and also use the -split-multiwords flag, then the
>> multiword functionality does not work; moreover, the OOV
>> words get mapped to <unk>.
>>
>> I should point out that the multiword unit is NOT in my vocabulary,
>> but after the split all the individual words are found
>> in the vocabulary. Hence, I suspect that the functionality for
>> the flag -unk takes place before the splitting,
>> and since no multiword unit is in the vocabulary, the -split-multiwords
>> functionality does not have anything to split.
>>
>> I was wondering if there is any way we can invoke the split-multiword
>> functionality before mapping unknown words?
>
> The way it works is that upon reading the lattice (before any operation
> on it), word labels are converted to integers. Normally a new word
> generates a new integer automatically, but with -unk and -keep-unk
> unknown words are mapped to the <unk> integer code.
>
> So therefore, the splitting won't work if the multiwords themselves
> are not in the vocabulary.
>
> A workaround is to do the multiword splitting in a separate processing
> pass, where lattice-tool is invoked WITHOUT -unk.
>
> Andreas

Yes, that makes sense. Thank you.

-Anoop

From cwsunshine at gmail.com Wed Aug 4 20:34:31 2010
From: cwsunshine at gmail.com (wei chen)
Date: Thu, 5 Aug 2010 11:34:31 +0800
Subject: [SRILM User List] ngram -loglinear-mix problem
Message-ID: 

Hi, all,
I was trying to implement LM interpolation using -loglinear-mix:

ngram -lm lm1 -lambda 0.5 -loglinear-mix -mix-lm lm2 -write-lm gen.lm

However, an error occurred, "write() method not implemented", and the interpolated LM cannot be generated. I do not know why.
Thanks a lot.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com Wed Aug 4 22:42:48 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 04 Aug 2010 22:42:48 -0700
Subject: [SRILM User List] ngram -loglinear-mix problem
In-Reply-To: 
References: 
Message-ID: <4C5A4F58.7070409@speech.sri.com>

wei chen wrote:
> Hi, all,
> I was trying to implement LM interpolation using -loglinear-mix:
> ngram -lm lm1 -lambda 0.5 -loglinear-mix -mix-lm lm2 -write-lm gen.lm
> However, an error occurred, "write() method not implemented", and the
> interpolated LM cannot be generated. I do not know why.

Unlike with linear mixtures, it is not supported to directly create a single merged LM that implements the mixture. You can perform the mixture dynamically, computing probabilities on the fly, e.g., with ngram -ppl.

There is a roundabout way to achieve what you want.

1) do a linear mixture and dump out the merged LM (the purpose is to generate an LM that has the union of the ngrams of the input LMs).
ngram -lm lm1 -mix-lm lm2 -write-lm MIXLM

2) "rescore" the LM by recomputing the probabilities according to the log-linear mixture

ngram -rescore-ngram MIXLM -lm lm1 -lambda 0.5 -loglinear-mix -mix-lm lm2 -write-lm gen.lm

This step can take a LONG time, since the normalization of the log-linear mixture has to be performed for every ngram context in the LM.

Andreas

> Thanks a lot.
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

From jan.gebhardt at student.kit.edu Thu Aug 5 10:03:40 2010
From: jan.gebhardt at student.kit.edu (Gebhardt, Jan)
Date: Thu, 5 Aug 2010 19:03:40 +0200
Subject: [SRILM User List] Factored Language Model - Backoff Weight Differs
Message-ID: <32C565BC8990FF4E99A5B746707D7F70DE340534A3@KIT-MSX-15.kit.edu>

Hello,

I am working with Factored Language Models and want to start with a Factored Language Model which is equal to a standard 4-gram Language Model. Therefore I use the following factored language model specification file:

1
W : 3 W(-1) W(-2) W(-3) trainW.count trainW.flm.lm 4
 W1,W2,W3 W3 ukndiscount gtmin 0
 W1,W2 W2 ukndiscount gtmin 0
 W1 W1 ukndiscount gtmin 0
 0 0 ukndiscount gtmin 0

When I build the factored language model and write it out using fngram-count -lm, I realized that the backoff weights in the language model differ significantly from the backoff weights in the standard 4-gram. Both language models use ukndiscount and a cutoff of 0.

For example, while my normal 4-gram contains the following entries:

-2.401827 A BEAUTIFUL 0.01767567
-2.401827 A BETTER 0.01767567

the factored language model has:

-2.401827 W-A W-BEAUTIFUL -0.1628703
-2.401827 W-A W-BETTER

So both language models have the same probabilities but different or even missing backoff weights.

If I evaluate the language model written by fngram-count using ngram, I get a lot of warnings like:
trainWX.flm.lm: line 2678: warning: no bow for prefix of ngram "A BEAUTIFUL"

If I use the factored language model for decoding I get a higher WER than with the standard 4-gram.

I would like to know how to get the backoff weights for FLMs like for a standard n-gram. Also an explanation why the backoff weights are missing or different in the FLM would help.

Thank you for your help.

Jan

From cwsunshine at gmail.com Tue Aug 17 19:50:31 2010
From: cwsunshine at gmail.com (wei chen)
Date: Wed, 18 Aug 2010 10:50:31 +0800
Subject: [SRILM User List] Question about ngram-count
Message-ID: 

Hi all,
I trained an LM using the default discount algorithm of ngram-count successfully, but in one experiment I removed some training data and then increased the amount of other training data to keep the size of the total training set fixed. This time the following messages occurred:

warning: discount coeff 1 is out of range: -0
warning: discount coeff 1 is out of range: -2.09472
warning: discount coeff 3 is out of range: 0.966989
warning: discount coeff 5 is out of range: 0.990832
warning: discount coeff 7 is out of range: 0.998723
warning: discount coeff 1 is out of range: -4.55137
warning: discount coeff 3 is out of range: 0.988902

And the training process became very slow; I do not know why.
Thanks in advance!

wei chen
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From stolcke at speech.sri.com Wed Aug 18 16:28:13 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 18 Aug 2010 16:28:13 -0700
Subject: [SRILM User List] Question about ngram-count
In-Reply-To: 
References: 
Message-ID: <4C6C6C8D.5010701@speech.sri.com>

wei chen wrote:
> Hi all,
> I trained an LM using the default discount algorithm of ngram-count
> successfully, but in one experiment I removed some training data and
> then increased the amount of other training data to keep the size of
> the total training set fixed. This time the following messages occurred:
>
> warning: discount coeff 1 is out of range: -0
> warning: discount coeff 1 is out of range: -2.09472
> warning: discount coeff 3 is out of range: 0.966989
> warning: discount coeff 5 is out of range: 0.990832
> warning: discount coeff 7 is out of range: 0.998723
> warning: discount coeff 1 is out of range: -4.55137
> warning: discount coeff 3 is out of range: 0.988902
> And the training process became very slow; I do not know why.

The FAQ says:

> C3) Why am I getting errors or warnings from the smoothing method I'm using?
>    The Good-Turing and Kneser-Ney smoothing methods rely on
>    statistics called "count-of-counts", the number of words occurring
>    once, twice, three times, etc. The formulae for these methods
>    become undefined if the counts-of-counts are zero, or not strictly
>    decreasing. Some conditions are fatal (such as when the count of
>    singleton words is zero), others lead to less smoothing (and
>    warnings). To avoid these problems, check for the following
>    possibilities:
>
>    a)
>       The data could be very sparse, i.e., the training corpus very
>       small. Try using the Witten-Bell discounting method.
>    b)
>       The vocabulary could be very small, such as when training an
>       LM based on characters or parts-of-speech. Smoothing is less
>       of an issue in those cases, and the Witten-Bell method should
>       work well.
>    c)
>       The data was manipulated in some way, or artificially
>       generated. For example, duplicating data eliminates the
>       odd-numbered counts-of-counts.

This is my guess as to what happened. Did you duplicate some of your data? Even if it is an artificial mix of several sources, you can get count-of-count statistics that lead to errors in the GT discount estimator.

>    d)
>       The vocabulary is limited during counts collection using the
>       ngram-count -vocab option, with the effect that many
>       low-frequency N-grams are eliminated. The proper approach is
>       to compute smoothing parameters on the full vocabulary. This
>       happens automatically in the make-big-lm wrapper script,
>       which is preferable to direct use of ngram-count for other
>       reasons (see issue B3-a above).
>    e)
>       You are estimating an LM from N-gram counts that have been
>       truncated beforehand, e.g., by removing singleton events. If
>       you cannot go back to the original data and recompute the
>       counts there is a heuristic to extrapolate low
>       counts-of-counts from higher ones. The heuristic is invoked
>       automatically (and an informational message is output) when
>       make-big-lm is used to estimate LMs with Kneser-Ney
>       smoothing. For details see the paper by W. Wang et al. in
>       ASRU-2007, listed under "SEE ALSO".

If you cannot fix the problem, try using a different smoothing method, like Witten-Bell.
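A minimal sketch of such a Witten-Bell run (file names here are placeholders):

ngram-count -order 3 -text train.txt -wbdiscount -interpolate -lm wb.3bo.lm

Witten-Bell estimates the smoothing from the number of distinct word types following each context rather than from count-of-counts, which is why it is robust to duplicated or artificially mixed data.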
Andreas

From alfonso at iet.ntnu.no Fri Aug 20 06:32:20 2010
From: alfonso at iet.ntnu.no (Alfonso MHC)
Date: Fri, 20 Aug 2010 15:32:20 +0200
Subject: [SRILM User List] Nbest list to lattice
Message-ID: <4C6E83E4.4000409@iet.ntnu.no>

Hello,

I would like to build a lattice in HTK format from an N-best list in HTK format. They told me I should use SRILM for this, so I have looked at the documentation. I think I should use nbest-lattice to construct the lattice from the N-best list, and then use lattice-tool to read the previous lattice and output it in HTK format. Is this correct?

If I understood correctly, nbest-lattice supports three formats for nbest lists, but none of them is the HTK format. Is there any tool that can transform an HTK master label file with the N-best list into one of the formats that nbest-lattice supports?

I haven't really found any examples of how to use the tools in SRILM; is there anything available?

Thanks in advance!

Alfonso MHC, Norwegian University of Science and Technology.

From stolcke at speech.sri.com Fri Aug 20 13:49:59 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 20 Aug 2010 13:49:59 -0700
Subject: [SRILM User List] Nbest list to lattice
In-Reply-To: <4C6E83E4.4000409@iet.ntnu.no>
References: <4C6E83E4.4000409@iet.ntnu.no>
Message-ID: <4C6EEA77.102@speech.sri.com>

Alfonso MHC wrote:
> Hello,
>
> I would like to build a lattice in HTK format from an N-best list in
> HTK format. They told me I should use SRILM for this, so I have looked
> at the documentation. I think I should use nbest-lattice to construct
> the lattice from the N-best list, and then use lattice-tool to read the
> previous lattice and output it in HTK format. Is this correct?

Yes, this should be possible. Note that you will lose the distinction between acoustic scores, LM scores, etc., along the way, because nbest-lattice has no way to keep those separate. (The resulting lattices will encode the posterior probabilities obtained from combining all the scores.)

>
> If I understood correctly, nbest-lattice supports three formats for
> nbest lists, but none of them is the HTK format. Is there any tool
> that can transform an HTK master label file with the N-best list into
> one of the formats that nbest-lattice supports?

Someone (not me) must have written a perl script or something to do this...

>
> I haven't really found any examples of how to use the tools in SRILM;
> is there anything available?

Look around the test directories

$SRILM/*/test/tests

Each subdirectory has a script that exercises a particular aspect of SRILM.

Andreas

>
> Thanks in advance!
>
> Alfonso MHC, Norwegian University of Science and Technology.
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

From Maria.Georgescul at unige.ch Tue Aug 24 06:37:02 2010
From: Maria.Georgescul at unige.ch (Maria.Georgescul at unige.ch)
Date: Tue, 24 Aug 2010 15:37:02 +0200
Subject: [SRILM User List] compiling SRILM err: File to be installed (../bin/i686/maxalloc) does not exist
Message-ID: <20100824153702.9chnnuvw0swkcog8@webmail.unige.ch>

Dear SRILM users,

When compiling SRILM on my machine (running OpenSuse) the following type of errors occur:
"ERROR: File to be installed (../bin/i686/maxalloc) does not exist."
"ERROR: File to be installed (../bin/i686/ngram) does not exist."
Here are my platform and compiler specs:
----------------------------------
uname -m

i686

--------------------------------
uname -a

Linux linux 2.6.31.12-0.2-desktop #1 SMP PREEMPT 2010-03-16 21:25:39 +0100 i686 i686 i386 GNU/Linux

----------------------------------
gcc -v
Using built-in specs.
Target: i586-suse-linux
Configured with: ../configure --prefix=/usr --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib --libexecdir=/usr/lib --enable-languages=c,c++,objc,fortran,obj-c++,java,ada --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.4 --enable-ssp --disable-libssp --with-bugurl=http://bugs.opensuse.org/ --with-pkgversion='SUSE Linux' --disable-libgcj --disable-libmudflap --with-slibdir=/lib --with-system-zlib --enable-__cxa_atexit --enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-version-specific-runtime-libs --program-suffix=-4.4 --enable-linux-futex --without-system-libunwind --with-arch-32=i586 --with-tune=generic --build=i586-suse-linux
Thread model: posix
gcc version 4.4.1 [gcc-4_4-branch revision 150839] (SUSE Linux)
----------------------------------
make --version
GNU Make 3.81
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

This program built for i686-pc-linux-gnu
----------------------------------
I have tcl8.5.8 installed.
-----------------------------------
Here in the attachment is the output from running make.
---------------------------------

Thank you in advance for any suggestions.
Best,
Maria

-------------- next part --------------
A non-text attachment was scrubbed...
Name: make.output
Type: application/octet-stream
Size: 22919 bytes
Desc: not available
URL: 

From Maria.Georgescul at unige.ch Tue Aug 24 08:46:08 2010
From: Maria.Georgescul at unige.ch (Maria.Georgescul at unige.ch)
Date: Tue, 24 Aug 2010 17:46:08 +0200
Subject: [SRILM User List] compiling SRILM err: File to be installed (../bin/i686/maxalloc) does not exist
In-Reply-To: <20100824153702.9chnnuvw0swkcog8@webmail.unige.ch>
References: <20100824153702.9chnnuvw0swkcog8@webmail.unige.ch>
Message-ID: <20100824174608.ihzxrprsw000gwko@webmail.unige.ch>

After a closer look, I realize that the problem was a typo in my Makefile.machine.i686. That is, I typed:
TCL_INCLUDE='-I/usr/local/include'
instead of:
TCL_INCLUDE=-I/usr/local/include

Building and testing work fine now. So please ignore my previous message.

best,
Maria

Quoting Maria.Georgescul at unige.ch:

> Dear SRILM users,
>
> When compiling SRILM on my machine (running OpenSuse) the following
> type of errors occur:
> "ERROR: File to be installed (../bin/i686/maxalloc) does not exist."
> "ERROR: File to be installed (../bin/i686/ngram) does not exist."
>
> Here are my platform and compiler specs:
> ----------------------------------
> uname -m
> i686
> --------------------------------
> uname -a
> Linux linux 2.6.31.12-0.2-desktop #1 SMP PREEMPT 2010-03-16 21:25:39
> +0100 i686 i686 i386 GNU/Linux
> ----------------------------------
> gcc -v
> Using built-in specs.
> Target: i586-suse-linux
> Configured with: ../configure --prefix=/usr --infodir=/usr/share/info
> --mandir=/usr/share/man --libdir=/usr/lib --libexecdir=/usr/lib
> --enable-languages=c,c++,objc,fortran,obj-c++,java,ada
> --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.4
> --enable-ssp --disable-libssp --with-bugurl=http://bugs.opensuse.org/
> --with-pkgversion='SUSE Linux' --disable-libgcj --disable-libmudflap
> --with-slibdir=/lib --with-system-zlib --enable-__cxa_atexit
> --enable-libstdcxx-allocator=new --disable-libstdcxx-pch
> --enable-version-specific-runtime-libs --program-suffix=-4.4
> --enable-linux-futex --without-system-libunwind --with-arch-32=i586
> --with-tune=generic --build=i586-suse-linux
> Thread model: posix
> gcc version 4.4.1 [gcc-4_4-branch revision 150839] (SUSE Linux)
> ----------------------------------
> make --version
> GNU Make 3.81
> Copyright (C) 2006 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.
> There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
> PARTICULAR PURPOSE.
>
> This program built for i686-pc-linux-gnu
> ----------------------------------
> I have tcl8.5.8 installed.
> -----------------------------------
> Here in the attachment is the output from running make.
> ---------------------------------
>
> Thank you in advance for any suggestions.
> Best,
> Maria

From stolcke at speech.sri.com Tue Aug 24 14:49:30 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 24 Aug 2010 14:49:30 -0700
Subject: [SRILM User List] Question Regarding SRILM N-gram tools
In-Reply-To: 
References: 
Message-ID: <4C743E6A.9080004@speech.sri.com>

Ryan,

I suggested you use the -limit-vocab option with ngram, and write out your LM in binary.
Reading a binary LM with -limit-vocab is very efficient in processing only the portions of the LM parameters that pertain to your test set vocabulary.
You can generate the vocabulary used by your test data using

ngram-count -text DATA -write-vocab VOCAB

There is a tradeoff between processing small batches of data (hence small vocabularies, hence fast loading of the LM) and large batches (larger vocabularies, but loading the LM fewer times), so you might want to tune the batch size empirically for best overall throughput.

If LM load time is still a limiting factor with this approach you should use an LM server (see the ngram -use-server option), which effectively means you load the LM into memory only once.

I suggest you join the srilm-user list and direct future questions there.

Andreas

Ryan Roth wrote:
> Hello:
>
> My name is Ryan Roth and I work at Columbia University's Center for
> Computational Learning Systems. My research focus currently is on
> Arabic Natural Language Processing.
>
> I have a question about the SRILM toolkit that I hope you'll be able to
> help me with.
>
> My problem is the following. I have a large N-gram LM file
> (non-binary) that I built from a collection of about 200 million
> words. I want to be able to read a given input text file (containing one
> sentence per line), and for every N-gram that I find there, extract
> the probability for that N-gram from the LM file.
>
> Currently, I am solving this problem by reading the entire LM file
> into memory first, and then reading the N-grams from the input text
> file and referencing the memory structure to get the probability for
> that N-gram. This works fine, but is very slow and memory intensive.
> I can reduce the memory issues by reading the input text file into
> memory instead, and reading the LM file line-by-line, but this is
> somewhat less convenient due to the other processing I need to perform
> on the input file.
>
> I've looked through the SRILM toolkit, and another option would seem
> to be to filter the large LM file first using the "make-lm-subset"
> script and a counts file built from the input text file. I would then use
> the filtered output LM in place of the larger LM and proceed as
> before. This method would seem to avoid the large memory
> requirements. My initial tests, however, show that the filtering step
> is still a bit slower than I'd like.
>
> I was wondering if there is another, more time-efficient way of
> solving this particular problem (that is, extracting a specific subset
> of N-gram probabilities from a large LM file) using the other tools in
> the SRILM toolkit. Is there some option combination for "ngram", for
> example, that would work? I don't currently see a direct solution.
>
> Thank you very much,
>
> Ryan Roth
> CCLS
> Columbia University

From alfonso at iet.ntnu.no Wed Aug 25 08:12:04 2010
From: alfonso at iet.ntnu.no (Alfonso MHC)
Date: Wed, 25 Aug 2010 17:12:04 +0200
Subject: [SRILM User List] Nbest list to lattice
In-Reply-To: <4C6EEA77.102@speech.sri.com>
References: <4C6E83E4.4000409@iet.ntnu.no> <4C6EEA77.102@speech.sri.com>
Message-ID: <4C7532C4.6070605@iet.ntnu.no>

Hello,

I'm trying to build a lattice in HTK format from an N-best list also in HTK format. I have run a simple example where I start with this N-best list in the file ex2.nbest:

NBestList2.0
(0) c1 ( st: 0 et: 4 g: 0 a: 0 ) c2 ( st: 5 et: 10 g: 0 a: 0 )
(0) c3 ( st: 0 et: 1 g: 0 a: 0 ) c4 ( st: 1.01 et: 2 g: 0 a: 0 ) c5 ( st: 2.1 et: 3 g: 0 a: 0 ) c6 ( st: 3.1 et: 4 g: 0 a: 0 ) c7 ( st: 4.1 et: 10 g: 0 a: 0 )

Given the time information, c1 happens at the same time as c3 c4 c5 c6, and c2 happens at the same time as c7. Then I hope to obtain a lattice that allows only four combinations:

c1 c2
c3 c4 c5 c6 c2
c1 c7
c3 c4 c5 c6 c7

However, the lattice I obtained allows other paths, so I wonder if I have misunderstood something. This is what I did:

To build the lattice I run:
./nbest-lattice -nbest ex2.nbest -use-mesh -write out2.lat

And get this lattice in out2.lat:

name ex2.nbest
numaligns 5
posterior 1
align 0 *DELETE* 0.5 c3 0.5
align 1 *DELETE* 0.5 c4 0.5
align 2 *DELETE* 0.5 c5 0.5
align 3 c1 0.5 c6 0.5
align 4 c2 0.5 c7 0.5

Then I build the HTK lattice with:
./lattice-tool -read-mesh -in-lattice out2.lat -out-lattice out_htk_2.lat -write-htk

And get (I copy only the Links part...):

# Links
J=0 S=0 E=1 W=!NULL x1=-0.30103 a=-0.30103
J=1 S=0 E=1 W=c3 x1=-0.30103 a=-0.30103
J=2 S=1 E=2 W=!NULL x1=-0.30103 a=-0.30103
J=3 S=1 E=2 W=c4 x1=-0.30103 a=-0.30103
J=4 S=2 E=3 W=!NULL x1=-0.30103 a=-0.30103
J=5 S=2 E=3 W=c5 x1=-0.30103 a=-0.30103
J=6 S=3 E=4 W=c1 x1=-0.30103 a=-0.30103
J=7 S=3 E=4 W=c6 x1=-0.30103 a=-0.30103
J=8 S=4 E=5 W=c2 x1=-0.30103 a=-0.30103
J=9 S=4 E=5 W=c7 x1=-0.30103 a=-0.30103

It seems that this lattice allows other paths than the ones I expected, e.g. c4 c1 c7, so I think that the time information is not used the way I was expecting. Maybe I have not used the tools correctly, or I have misunderstood something. Could anyone let me know where the problem is?
Thanks in advance,

Alfonso

From rmr4848 at gmail.com Wed Aug 25 11:16:14 2010
From: rmr4848 at gmail.com (Ryan Roth)
Date: Wed, 25 Aug 2010 14:16:14 -0400
Subject: [SRILM User List] Question Regarding SRILM N-gram tools
In-Reply-To: <4C743E6A.9080004@speech.sri.com>
References: <4C743E6A.9080004@speech.sri.com>
Message-ID: 

Thank you Andreas. This was very helpful. I will make use of the SRI mailing list from now on.

Ryan Roth
CCLS
Columbia University

On Tue, Aug 24, 2010 at 5:49 PM, Andreas Stolcke wrote:

> Ryan,
>
> I suggested you use the -limit-vocab option with ngram, and write out your
> LM in binary.
> Reading a binary LM with -limit-vocab is very efficient in processing only
> the portions of the LM parameters that pertain to your test set vocabulary.
> You can generate the vocabulary used by your test data using
>
> ngram-count -text DATA -write-vocab VOCAB
>
> There is a tradeoff between processing small batches of data (hence small
> vocabularies, hence fast loading of the LM) and large batches (larger
> vocabularies, but loading the LM fewer times), so you might want to tune
> the batch size empirically for best overall throughput.
>
> If LM load time is still a limiting factor with this approach you should
> use an LM server (see the ngram -use-server option), which effectively
> means you load the LM into memory only once.
>
> I suggest you join the srilm-user list and direct future questions there.
>
> Andreas
>
> Ryan Roth wrote:
>
>> Hello:
>>
>> My name is Ryan Roth and I work at Columbia University's Center for
>> Computational Learning Systems. My research focus currently is on Arabic
>> Natural Language Processing.
>>
>> I have a question about the SRILM toolkit that I hope you'll be able to
>> help me with.
>>
>> My problem is the following. I have a large N-gram LM file (non-binary)
>> that I built from a collection of about 200 million words. I want to be
>> able to read a given input text file (containing one sentence per line),
>> and for every N-gram that I find there, extract the probability for that
>> N-gram from the LM file.
>>
>> Currently, I am solving this problem by reading the entire LM file into
>> memory first, and then reading the N-grams from the input text file and
>> referencing the memory structure to get the probability for that N-gram.
>> This works fine, but is very slow and memory intensive. I can reduce the
>> memory issues by reading the input text file into memory instead, and
>> reading the LM file line-by-line, but this is somewhat less convenient due
>> to the other processing I need to perform on the input file.
>>
>> I've looked through the SRILM toolkit, and another option would seem to be
>> to filter the large LM file first using the "make-lm-subset" script and a
>> counts file built from the input text file. I would then use the filtered
>> output LM in place of the larger LM and proceed as before. This method
>> would seem to avoid the large memory requirements. My initial tests,
>> however, show that the filtering step is still a bit slower than I'd like.
>>
>> I was wondering if there is another, more time-efficient way of solving
>> this particular problem (that is, extracting a specific subset of N-gram
>> probabilities from a large LM file) using the other tools in the SRILM
>> toolkit. Is there some option combination for "ngram", for example, that
>> would work? I don't currently see a direct solution.
>>
>> Thank you very much,
>>
>> Ryan Roth
>> CCLS
>> Columbia University
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mail at chrismandery.de Tue Sep 7 08:59:57 2010
From: mail at chrismandery.de (Christian A. Mandery)
Date: Tue, 7 Sep 2010 17:59:57 +0200
Subject: [SRILM User List] make-big-lm produces different LM than ngram-count
Message-ID: 

Hello,

I am trying to use the make-big-lm script in order to get a way of building modified Kneser-Ney LMs that scales better with larger corpora.

However, make-big-lm produces different LMs for me than ngram-count, although I am using the same parameters.

Not only do probabilities and back-off values differ; the LM built with ngram-count also contains more {2,3,4}-grams than the LM built with make-big-lm.

I invoke ngram-count with these parameters:
ngram-count -order 4 -debug 4 -unk -map-unk "<unk>" -vocab vocab-lm -gt1min 1 -gt2min 2 -gt3min 2 -gt4min 2 -kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -text corpus.gz -lm ngram-count.lm

And make-big-lm:
make-big-lm -read counts -name zzz-make-big-lm -order 4 -debug 4 -unk -map-unk "<unk>" -vocab vocab-lm -gt1min 1 -gt2min 2 -gt3min 2 -gt4min 2 -kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -lm make-big-lm.lm

Why are there differences in the generated LM using these two calls?

Best regards
Christian Mandery

PS: counts-new.gz is built using "ngram-count -text corpus.gz -write counts -order 4 -sort", so nothing should go wrong there.

From stolcke at speech.sri.com Tue Sep 7 10:07:37 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 07 Sep 2010 10:07:37 -0700
Subject: [SRILM User List] make-big-lm produces different LM than ngram-count
In-Reply-To: 
References: 
Message-ID: <4C867159.4020001@speech.sri.com>

Christian A. Mandery wrote:
> Hello,
>
> I am trying to use the make-big-lm script in order to get a way of
> building modified Kneser-Ney LMs that scales better with larger
> corpora.
>
> However, make-big-lm produces different LMs for me than ngram-count,
> although I am using the same parameters.
>
> Not only do probabilities and back-off values differ; the LM built
> with ngram-count also contains more {2,3,4}-grams than the LM built
> with make-big-lm.
>
> I invoke ngram-count with these parameters:
> ngram-count -order 4 -debug 4 -unk -map-unk "<unk>" -vocab vocab-lm
> -gt1min 1 -gt2min 2 -gt3min 2 -gt4min 2 -kndiscount1 -kndiscount2
> -kndiscount3 -kndiscount4 -text corpus.gz -lm ngram-count.lm
>
> And make-big-lm:
> make-big-lm -read counts -name zzz-make-big-lm -order 4 -debug 4 -unk
> -map-unk "<unk>" -vocab vocab-lm -gt1min 1 -gt2min 2 -gt3min 2 -gt4min
> 2 -kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -lm
> make-big-lm.lm
>
> Why are there differences in the generated LM using these two calls?

As explained in the FAQ, make-big-lm will compute the discounting parameters from the training corpus's full vocabulary, whereas ngram-count invoked directly will perform the mapping of OOVs and THEN compute the discounting parameters. The first method is usually better.

Andreas

> Best regards
> Christian Mandery
>
> PS: counts-new.gz is built using "ngram-count -text corpus.gz -write
> counts -order 4 -sort", so nothing should go wrong there.
From stolcke at speech.sri.com Mon Sep 13 19:05:23 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 13 Sep 2010 19:05:23 -0700
Subject: [SRILM User List] Problem with srilm on 64bit machine
In-Reply-To: 
References: 
Message-ID: <4C8ED863.5040301@speech.sri.com>

Veton Kepuska wrote:
>
> Andreas,
>
> I have managed to build, albeit not fully successfully.
>
> The problem I am having is that the 64-bit architecture is not being
> recognized by your build process.
>
> 1. I am using a virtual machine (Oracle VirtualBox).
>
> 2. The command line:
>    make SRILM=$PWD MACHINE_TYPE=i686 World

try

make SRILM=$PWD MACHINE_TYPE=i686-m64 World

Andreas

> 3. The error log reports incompatibility of the libraries. This occurs
>    due to incorrect build parameters (see highlighted below).
>
> g++ *-m32 -mtune=pentium3* -Wall -Wno-unused-variable
> -Wno-uninitialized -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_BITS=64
> -I. -I../../include -u matherr -L../../lib/i686 -g -O3 -o
> ../bin/i686/ngram-count ../obj/i686/ngram-count.o
> ../obj/i686/liboolm.a -lm -ldl ../../lib/i686/libflm.a
> ../../lib/i686/libdstruct.a ../../lib/i686/libmisc.a -lm 2>&1 | c++filt
>
> /usr/bin/ld: skipping incompatible
> /usr/lib/gcc/x86_64-linux-gnu/4.4.3/../../../libdl.so when searching for -ldl
> /usr/bin/ld: skipping incompatible
> /usr/lib/gcc/x86_64-linux-gnu/4.4.3/../../../libdl.a when searching for -ldl
> /usr/bin/ld: skipping incompatible /usr/lib/libdl.so when searching for -ldl
> /usr/bin/ld: skipping incompatible /usr/lib/libdl.a when searching for -ldl
> /usr/bin/ld: cannot find -ldl
> collect2: ld returned 1 exit status
>
> Could you please help me set the parameters right?
>
> Thanks
>
> --Veton
>
> -- Dr. Këpuska
>
> "The learning and knowledge that we have is, at the most, but little
> compared with that of which we are ignorant." - Plato
>
> "Those that would give up essential liberty to obtain a little
> temporary safety deserve neither liberty nor safety." - Benjamin
> Franklin, An Historical Review of Pennsylvania, 1759
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Dr. Veton Këpuska, Associate Professor
> ECE Department
> Florida Institute of Technology
> Olin Engineering Building
> 150 West University Blvd.
> Melbourne, FL 32901-6975
> Tel. (321) 674-7183
> Mob. (321) 759-3157
> E-mail: vkepuska at fit.edu
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
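As a sketch of the fix in context (the checkout path below is hypothetical):
SRILM's Makefile auto-detects the platform with the sbin/machine-type script,
and the MACHINE_TYPE variable overrides that detection:

cd /opt/srilm                  # hypothetical checkout location
sbin/machine-type              # shows what the build would auto-detect
make SRILM=$PWD MACHINE_TYPE=i686-m64 World
ls bin/i686-m64                # binaries land here on success

The -m64 machine type switches the compiler flags from -m32 to -m64, which is
why it cures the "skipping incompatible ... libdl" errors: those come from
trying to link 32-bit objects against the system's 64-bit libdl.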
From kyawkyawzinn at gmail.com Wed Sep 22 02:00:04 2010
From: kyawkyawzinn at gmail.com (kyawkyaw zin)
Date: Wed, 22 Sep 2010 15:30:04 +0630
Subject: [SRILM User List] SRILM ngram-count error
Message-ID: 

Hello there,

When I used ngram-count to generate a language model as follows:

../srilm/bin/i686/ngram-count -interpolate -kndiscount -unk -text
en2my/corpus/clean.my -lm en2my/corpus/clean.my.lm

where clean.my is a text file that contains Burmese sentences,

I got the following error:

one of modified KneserNey discounts is negative
error in discount estimator for order 1

Please let me know how to solve this problem.

Thank you all in advance.

Best Regards,
Kyaw Kyaw Zin

From stolcke at speech.sri.com Wed Sep 22 10:58:02 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 22 Sep 2010 10:58:02 -0700
Subject: [SRILM User List] SRILM ngram-count error
In-Reply-To: 
References: 
Message-ID: <4C9A43AA.9090905@speech.sri.com>

kyawkyaw zin wrote:
> Hello there,
>
> When I used ngram-count to generate a language model as follows:
>
> ../srilm/bin/i686/ngram-count -interpolate -kndiscount -unk -text
> en2my/corpus/clean.my -lm en2my/corpus/clean.my.lm
>
> where clean.my is a text file that contains Burmese sentences,
>
> I got the following error:
>
> one of modified KneserNey discounts is negative
> error in discount estimator for order 1
>
> Please let me know how to solve this problem.

This is in the FAQ, item C3.
http://www-speech.sri.com/projects/srilm/manpages/srilm-faq.7.html

Andreas

From stolcke at speech.sri.com Tue Sep 28 14:38:36 2010
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 28 Sep 2010 14:38:36 -0700
Subject: [SRILM User List] Pruning of KN-smoothed models
Message-ID: <201009282138.o8SLcaj21457@huge>

Ciprian Chelba and colleagues have a nice paper at Interspeech showing
how KN smoothing interacts badly with N-gram pruning, especially if the
pruning is severe. The reason is that the N-gram history marginal
probabilities are poorly estimated by the lower-order distributions
produced by KN smoothing.

To remedy the problem pointed out in the paper, I added a way to specify
a separate model for computing the history marginals, different from the
model being pruned. For example, when pruning a KN-smoothed 4-gram model
M you could specify a GT-smoothed 3-gram model H using

ngram -lm M -prune ... -prune-history-lm H

(the history LM only needs to be of order one less than the model being
pruned, so it can be much smaller).

This also gives you the option of pruning an LM in a way that is
targeted at a specific domain. For example, if you have a large LM M and
want to create a smaller version that works well in some domain for
which you have a specialized LM D, you would use

ngram -lm M -prune ... -prune-history-lm D

This makes sure you retain the N-grams that matter for your target
domain. Of course, D should not be a KN-smoothed model!

The new option is implemented in the beta version on the download
server, and a test case is in
$SRILM/lm/test/tests/ngram-prune-history-lm.

Comments welcome.

--Andreas
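A minimal sketch of the new option in use; the file names and the pruning
threshold are made up, and ngram-count's default Good-Turing discounting is
used here only as one convenient way to get a non-KN history model:

# Train a small GT-smoothed 3-gram to provide the history marginals:
ngram-count -order 3 -text corpus.txt -lm hist.gt.3gram.lm

# Prune the KN-smoothed 4-gram, computing marginals from the GT model:
ngram -order 4 -lm kn.4gram.lm -prune 1e-8 \
    -prune-history-lm hist.gt.3gram.lm -write-lm kn.4gram.pruned.lm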