From dugast at systran.fr Mon Jan 6 07:45:35 2014
From: dugast at systran.fr (DUGAST Loic)
Date: Mon, 6 Jan 2014 15:45:35 +0000
Subject: [SRILM User List] class based model

Hi,

In the FAQ (http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html) you advise to:

    c) Lower the minimum counts for N-grams included in the LM, i.e., the values of the
       options -gt2min, -gt3min, -gt4min, etc. The higher order N-grams typically get
       higher minimum counts.

Do you not mean *raise* the minimum counts instead?

Also, I am not sure I understand why gt2min should be set higher than gt1min, and so on.
Higher-order ngrams are naturally less frequent, so the same cutoff value (gt2min equal to
gt1min) is harsher on bigrams than on unigrams. Can you explain?

Thank you!
Loic

From stolcke at icsi.berkeley.edu Mon Jan 6 11:31:42 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 06 Jan 2014 11:31:42 -0800
Subject: [SRILM User List] class based model

On 1/6/2014 7:45 AM, DUGAST Loic wrote:
> Do you not mean *raise* the minimum counts instead?

You are correct. It should say raise the min counts. We'll fix the documentation ASAP.

> Also, I am not sure I understand why gt2min should be set higher than gt1min, and so on.
> Higher-order ngrams are naturally less frequent, so the same cutoff value will be
> harsher on bigrams than on unigrams. Can you explain?

The minimum counts are a crude way to trade off performance for space, and since there are
a lot more long ngrams than short ngrams you get more space savings with the higher order
ngrams. It is typically not worth it to eliminate unigrams and bigrams, but it is a decent
tradeoff to remove singleton trigrams and fourgrams. The default values were chosen based
on historical practice (I think they might even have been inherited from the CMU LM
toolkit).

The better and more principled way to remove ngrams is entropy-based pruning (the
ngram/ngram-count -prune option). So the best strategy given limited memory is to make the
gtmin values as low as you can afford to fit into memory, then use -prune (you can do this
in the same invocation of ngram-count or make-big-lm).

Andreas
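A minimal sketch of the strategy Andreas describes (editorial illustration, not from the
original thread; the corpus name, LM name, and pruning threshold are placeholder values):

    # keep all unigrams/bigrams, drop singleton trigrams/4-grams, then entropy-prune,
    # all in one ngram-count invocation
    ngram-count -order 4 -text train.txt \
        -gt1min 1 -gt2min 1 -gt3min 2 -gt4min 2 \
        -kndiscount -interpolate \
        -prune 1e-8 \
        -lm pruned.lm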
From asad_12204 at yahoo.com Tue Jan 7 12:14:06 2014
From: asad_12204 at yahoo.com (Asad A.Malik)
Date: Tue, 7 Jan 2014 12:14:06 -0800 (PST)
Subject: [SRILM User List] Cannot Execute SRILM

Hi All,

I am trying to install MOSES, and for that I have to install SRILM. I have downloaded
SRILM version 1.7.0 and extracted it, and have run:

    sudo apt-get install libc6-dev-i386

The MOSES installation guide also suggests that "if your machine is of x86_64 type, you
should edit the sbin/machine-type file" to change

    else if (`uname -m` == x86_64) then
            set MACHINE_TYPE = i686

to

    else if (`uname -m` == x86_64) then
            set MACHINE_TYPE = i686-m64

But my file already had i686-m64, so I did not have to change it.

For the build I enter:

    make MAKE_PIC=1 SRILM=`pwd` NO_TCL=X World

It gives me the following error:

    make: pwd/sbin/machine-type: Command not found
    Makefile:13: pwd/common/Makefile.common.variables: No such file or directory
    make: *** No rule to make target `pwd/common/Makefile.common.variables'.  Stop.

I've attached a screenshot as well. Kindly tell me what I should do.

Regards,
Asad A.Malik

From stolcke at icsi.berkeley.edu Tue Jan 7 14:53:59 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 07 Jan 2014 14:53:59 -0800
Subject: [SRILM User List] Cannot Execute SRILM

On 1/7/2014 12:14 PM, Asad A.Malik wrote:
>     make MAKE_PIC=1 SRILM=`pwd` NO_TCL=X World
>
>     make: pwd/sbin/machine-type: Command not found

You must have used SRILM='pwd' (forward quotes) instead of SRILM=`pwd`.

Alternatively, use SRILM=$PWD

Andreas
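A quick way to see the difference (an illustrative sketch, not part of the thread; the
SRILM path is a placeholder):

    cd /path/to/srilm
    echo 'pwd'      # single quotes: prints the literal string pwd
    echo `pwd`      # backquotes: prints the current directory
    # so either of these passes the right value to make:
    make SRILM=`pwd` MAKE_PIC=1 NO_TCL=X World
    make SRILM=$PWD MAKE_PIC=1 NO_TCL=X World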
From alexx.tudor at gmail.com Sun Jan 26 22:01:31 2014
From: alexx.tudor at gmail.com (alex tudor)
Date: Mon, 27 Jan 2014 08:01:31 +0200
Subject: [SRILM User List] lattice-tool for FLM

Hi,

I wish to use FLM in speech recognition, so I have built a factored language model for a
small text:

    fngram-count -factor-file flm_config_kn.txt -text textLM.txt -write-counts textFLM_kn.count -lm textFLM_kn.lm

flm_config_kn.txt:

    W : 2 W(-1) P(-1) textFLM_kn.count textFLM_kn.lm 3
      W1,P1 W1 kndiscount gtmin 1 interpolate
      P1 P1 kndiscount gtmin 1
      0 0 kndiscount gtmin 1

The resulting textFLM_kn.lm is:

    \data\
    ngram 0x0=510
    ngram 0x1=0
    ngram 0x2=1378
    ngram 0x3=1978

    \0x0-grams:
    -1.746398 </s>
    -99 <s>
    -2.252263 W-A
    -3.047544 W-ABUNDENTE
    ....

Afterwards I tried to rescore an HTK word lattice for a bigram using lattice-tool and an
HTK word lattice (bigram.lat):

    lattice-tool -read-htk -write-htk -in-lattice bigram.lat -htk-lmscale 10 -posterior-scale 10 -factored -lm textFLM_kn.lm -out-lattice bigramFLM_kn.lat

But I get an error:

    textFLM_kn.lm: line 2: error: couldn't form int for number of factored LMs in when reading FLM spec file

What's wrong? I would appreciate any advice.

Sincerely yours,
Alex

From omm.tayma at yahoo.fr Mon Jan 27 10:18:07 2014
From: omm.tayma at yahoo.fr (omm tayma)
Date: Mon, 27 Jan 2014 18:18:07 +0000 (GMT)
Subject: [SRILM User List] install srilm

Hi,

I am trying to install SRILM. These are the steps I took:

1- I ran sudo ./sbin/machine-type and my machine type is i686-m64.

2- In the file /home/lenovo/Documents/srilm/common/Makefile.machine.i686-m64 I replaced

    GCC_FLAGS = -mtune=pentium3 -Wreturn-type -Wimplicit
    CC = /usr/local/lang/gcc-3.4.3/bin/gcc $(GCC_FLAGS)
    CXX = /usr/local/lang/gcc-3.4.3/bin/g++ -Wno-deprecated $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES

with:

    # Use the GNU C compiler.
    GCC_FLAGS = -march=i686-m64 -Wreturn-type -Wimplicit
    CC = /usr/bin/gcc $(GCC_FLAGS)
    CXX = /usr/bin/g++ -Wno-deprecated $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES

and

    #TCL_INCLUDE =
    #TCL_LIBRARY = -ltcl

with:

    # Tcl support (standard in Linux)
    TCL_INCLUDE = -I/usr/include/tcl8.5
    TCL_LIBRARY = -ltcl8.5

3- I ran make SRILM=/home/lenovo/Documents/srilm World, but I get this error:

    cc1plus: error: bad value (i686-m64) for -march= switch

Can someone help me, please?
From stolcke at icsi.berkeley.edu Mon Jan 27 11:08:21 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 27 Jan 2014 11:08:21 -0800
Subject: [SRILM User List] lattice-tool for FLM

Alex,

I haven't looked at your example in detail, but two hints:

1) Debug your LM evaluation using ngram -factored first (before moving on to lattice
rescoring).

2) Follow the steps in $SRILM/flm/test/tests/ngram-factored and make sure you are
(a) inputting and producing files in a similar format and (b) supplying the right options.

Once you have that working, move on to lattice-tool. The word strings in your lattices
should be the same as the input to ngram -factored -ppl ...

Andreas
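A sketch of that debugging order, reusing Alex's file names (editorial illustration, not
Andreas's commands; test_factored.txt is a placeholder for held-out text in the same
factored W-...:P-... tuple format used for training). Note that with -factored the -lm
option is expected to name the FLM specification file (here flm_config_kn.txt) rather than
the ARPA file it produces, which would be consistent with the "couldn't form int ... FLM
spec file" message above:

    # step 1: check that the FLM loads and scores plain (factored) text
    ngram -debug 2 -factored -lm flm_config_kn.txt -ppl test_factored.txt

    # step 2: only then rescore the lattices with the same spec file
    lattice-tool -read-htk -write-htk -in-lattice bigram.lat \
        -htk-lmscale 10 -posterior-scale 10 \
        -factored -lm flm_config_kn.txt -out-lattice bigramFLM_kn.lat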
From omm.tayma at yahoo.fr Mon Jan 27 11:21:56 2014
From: omm.tayma at yahoo.fr (omm tayma)
Date: Mon, 27 Jan 2014 19:21:56 +0000 (GMT)
Subject: [SRILM User List] install srilm

Hi,

I am trying to install SRILM. These are the steps I took:

1- I ran sudo ./sbin/machine-type and my machine type is i686-m64.

2- In the file /home/lenovo/Documents/srilm/common/Makefile.machine.i686-m64 I replaced
the GCC_FLAGS, CC, and CXX settings with

    # Use the GNU C compiler.
    GCC_FLAGS = -march=i686-m64 -Wreturn-type -Wimplicit
    CC = /usr/bin/gcc $(GCC_FLAGS)
    CXX = /usr/bin/g++ -Wno-deprecated $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES

and the Tcl settings with

    # Tcl support (standard in Linux)
    TCL_INCLUDE = -I/usr/include/tcl8.5
    TCL_LIBRARY = -ltcl8.5

3- I ran make SRILM=/home/lenovo/Documents/srilm World, but I get this error:

    make[2]: Entering directory `/home/lenovo/Documents/srilm/utils/src'
    rm -f Dependencies.i686-m64
    /home/lenovo/Documents/srilm/sbin/generate-program-dependencies ../bin/i686-m64 ../obj/i686-m64 "" | sed -e "s&\.o&.o&g" >> Dependencies.i686-m64
    make[2]: Leaving directory `/home/lenovo/Documents/srilm/utils/src'
    make[1]: Leaving directory `/home/lenovo/Documents/srilm'
    make release-libraries
    make[1]: Entering directory `/home/lenovo/Documents/srilm'
    for subdir in misc dstruct lm flm lattice utils; do \
        (cd $subdir/src; make SRILM=/home/lenovo/Documents/srilm MACHINE_TYPE=i686-m64 OPTION= MAKE_PIC= release-libraries) || exit 1; \
    done
    make[2]: Entering directory `/home/lenovo/Documents/srilm/misc/src'
    /usr/bin/gcc -arch=i686-m64 -Wreturn-type -Wimplicit -I/usr/include/tcl8.5 -I. -I../../include -c -g -O3 -o ../obj/i686-m64/option.o option.c
    gcc: error: unrecognized option '-arch=i686-m64'
    make[2]: *** [../obj/i686-m64/option.o] Error 1
    make[2]: Leaving directory `/home/lenovo/Documents/srilm/misc/src'
    make[1]: *** [release-libraries] Error 1
    make[1]: Leaving directory `/home/lenovo/Documents/srilm'
    make: *** [World] Error 2

Can someone help me, please?
From stolcke at icsi.berkeley.edu Mon Jan 27 12:23:12 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 27 Jan 2014 12:23:12 -0800
Subject: [SRILM User List] install srilm

On 1/27/2014 11:21 AM, omm tayma wrote:
>     # Use the GNU C compiler.
>     GCC_FLAGS = -march=i686-m64 -Wreturn-type -Wimplicit
>     CC = /usr/bin/gcc $(GCC_FLAGS)
>     CXX = /usr/bin/g++ -Wno-deprecated $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES

Just remove -march=i686-m64 from the GCC_FLAGS.

Andreas

PS. Please don't send your posts multiple times. I received 4 copies of it.

From omm.tayma at yahoo.fr Wed Jan 29 02:32:45 2014
From: omm.tayma at yahoo.fr (omm tayma)
Date: Wed, 29 Jan 2014 10:32:45 +0000 (GMT)
Subject: [SRILM User List] error in Makefile.machine.i686-m64

Hi,

My machine type is i686-m64, so I changed the file Makefile.machine.i686-m64 as follows:

    # Use the GNU C compiler.
    GCC_FLAGS = -m64 -Wall -Wno-unused-variable -Wno-uninitialized
    CC = $(GCC_PATH)gcc $(GCC_FLAGS) -Wimplicit-int
    CXX = $(GCC_PATH)g++ $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES

and

    # Tcl support (standard in Linux)
    TCL_INCLUDE = -I/usr/include/tcl8.5
    TCL_LIBRARY = -ltcl8.5
    NO_TCL = 1

but I think this is wrong because I still get errors. Please, can someone help me?

From stolcke at icsi.berkeley.edu Wed Jan 29 10:02:10 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 29 Jan 2014 10:02:10 -0800
Subject: [SRILM User List] Error in usage of make-batch-counts

On 1/28/2014 12:40 AM, Rajen Chatterjee wrote:
> Hello,
> I want to pass the options "-order 5 -interpolate -kndiscount" to make-batch-counts.
> How can I do this?
>
> When I give the command "./make-batch-counts /home/rajen/file_name 1
> /home/rajen/count-dir -order 5" I get this error: mkdir: invalid option -- 'o'
> Try `mkdir --help' for more information.
>
> Can you help me fix this problem?

There are two problems:

1. According to the training-scripts(1) man page, the usage is

    make-batch-counts file-list [batch-size [filter [count-dir [options ...]]]]

so you need to pass 4 parameters before any options that are passed to ngram-count. For
the "filter" parameter you can use the "cat" program.

2. The options -interpolate and -kndiscount are inappropriate since make-batch-counts does
not create LMs, it only collects the counts.

Andreas
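A sketch of the full large-corpus workflow implied here (editorial illustration; the file
list, batch size, directory, and merged-count file names are placeholders, and the merged
file is whatever merge-batch-counts actually produces):

    # four positional arguments, then options that are passed through to ngram-count
    make-batch-counts filelist.txt 10 cat counts -order 5

    # merge the per-batch count files into one
    merge-batch-counts counts

    # estimate the LM from the merged counts; smoothing options belong in this step
    make-big-lm -read counts/merged-counts.gz -order 5 -kndiscount -interpolate -lm big.lm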
From asooryampasya at gmail.com Fri Feb 7 05:14:06 2014
From: asooryampasya at gmail.com (Asooryampasya)
Date: Fri, 7 Feb 2014 14:14:06 +0100
Subject: [SRILM User List] Errors while trying to use fngram-count, fngram with Estonian Tagged data

Dear fellow users,

I am trying to build a factored model for Estonian (which is morphologically tagged using
TreeTagger). The fngram-count program seems to run without issues. However, when I use the
fngram program to estimate the perplexity of a test sample, I get an error. I found the
same question asked here before
(http://www.speech.sri.com/pipermail/srilm-user/2011q3/001088.html), but I could not find
a response to that email, hence I am posting it to the list again.

Below is the error I get while running fngram, and also the contents of the factor file
that I used with both fngram-count and fngram. Please let me know if any more information
is needed.

The error:

    w_g4_w1w2m1m2.count.gz: line 14172: malformed N-gram count or more than 100 words per line
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    s_g4_w1w2m1m2.lm.gz: line 21: error, ngram line has invalid number (1) of fields, expecting either 2 or 3
    format error in lm file

I am still new to using factored models, and so far I am only using the example settings
given in the Kirchhoff, Bilmes and Duh tutorial. Here is my factor file:

    ## word given word-1 word-2 morph-1 morph-2
    1
    W : 4 W(-1) W(-2) M(-1) M(-2) w_g4_w1w2m1m2.count.gz s_g4_w1w2m1m2.lm.gz 5
      0b0111 0b0010 wbdiscount gtmin 4 interpolate
      0b1101 0b1000 wbdiscount gtmin 3 interpolate
      0b0101 0b0001 wbdiscount gtmin 2 interpolate
      0b0100 0b0100 wbdiscount gtmin 1 interpolate
      0b0000 0b0000 wbdiscount gtmin 1

My training data look like this:

    W-Eksamitöö:M-S.com.pl.nom W-I.:M-Y.nominal.?
    W-Pange:M-V.main.imper.pres W-sulgudes:M-S.com.pl.in W-olevad:M-A.pos.pl.nom W-sõnad:M-S.com.pl.nom W-õigesse:M-A.pos.sg.ill W-vormi:M-S.com.sg.adit W-!:M-Z.Exc
    W-Piret:M-S.prop.sg.nom W-Toomet:M-S.prop.sg.abl W-on:M-V.main.indic.pres.ps3 W-ettevõtlik:M-A.pos.sg.nom W-naine:M-S.com.sg.nom W-.:M-Z.Fst

Thanks,
Pasya.

From junfei.guo at gmail.com Tue Mar 4 02:31:14 2014
From: junfei.guo at gmail.com (Junfei Guo)
Date: Tue, 4 Mar 2014 11:31:14 +0100
Subject: [SRILM User List] Does the calculation of Back Off Weight use the special property of different Discount?

Hi All,

From the source code I can see that, to calculate the lower-order weight for
interpolation, SRILM uses special properties of the different discount functions. For
example, for modified KN it uses the fact that the discount is a constant for all trigrams
that occur more than 3 times.

My question is whether the calculation of the back-off weight also uses this sort of
information, or whether it only assumes a general discount function.
Thanks,
Jeff

From stolcke at icsi.berkeley.edu Tue Mar 4 09:58:15 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 04 Mar 2014 09:58:15 -0800
Subject: [SRILM User List] Does the calculation of Back Off Weight use the special property of different Discount?

On 3/4/2014 2:31 AM, Junfei Guo wrote:
> My question is whether the calculation of the back-off weight also uses this sort of
> information, or whether it only assumes a general discount function.

The computation of backoff weights is independent of the discounting method. It is only
determined by the requirement that the sum of all probabilities for a given history sum
to 1.

Andreas
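As a worked illustration of that normalization requirement (editorial addition, written in
generic Katz-style backoff notation rather than SRILM source terms; p_lower denotes the
next-lower-order model, h' the history with its earliest word dropped, and the sums run
over the words w that have an explicit, already-discounted entry for history h):

    bow(h) = \frac{1 - \sum_{w:\,c(h,w)>0} p(w \mid h)}
                  {1 - \sum_{w:\,c(h,w)>0} p_{lower}(w \mid h')}

Choosing bow(h) this way makes \sum_w p(w \mid h) = 1 no matter how the explicit
probabilities p(w \mid h) were discounted.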
From kruza at ufal.mff.cuni.cz Tue Mar 18 08:20:05 2014
From: kruza at ufal.mff.cuni.cz (Oldrich Kruza)
Date: Tue, 18 Mar 2014 16:20:05 +0100
Subject: [SRILM User List] ngram crashing

Hello everybody,

I'm trying to reduce the vocabulary of a huge (64GB) trigram language model.

I ran the script change-lm-vocab, and the ngram process died with this error message:

    include/LHash.cc:141: void LHash::alloc(unsigned int) [with KeyT = unsigned int,
    DataT = Trie]: Assertion `body != 0' failed.

I'm positive this is not due to insufficient memory.

Thanks for any insights,
Oldrich Kruza

From stolcke at icsi.berkeley.edu Tue Mar 18 11:21:04 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 18 Mar 2014 11:21:04 -0700
Subject: [SRILM User List] ngram crashing

On 3/18/2014 8:20 AM, Oldrich Kruza wrote:
> I ran the script change-lm-vocab, and the ngram process died with this error message:
>
>     include/LHash.cc:141: void LHash::alloc(unsigned int) [with KeyT = unsigned int,
>     DataT = Trie]: Assertion `body != 0' failed.
>
> I'm positive this is not due to insufficient memory.

This is the error message when SRILM fails to allocate more memory. The reasons could be:

- you are using a 32bit binary and running up against the 4GB limit of the architecture
- you have a memory resource limit in force (set by you or your sysadmin) - check the
  ulimit or limit (csh) command
- your system is actually, really out of memory (which also depends on what other users
  are doing)

By running top or some similar tool concurrently you can see how big your ngram process
actually grows before crashing, and this can give you additional clues.

Andreas

From tsuki_stefy at yahoo.com Tue Mar 18 12:44:56 2014
From: tsuki_stefy at yahoo.com (Stefy D.)
Date: Tue, 18 Mar 2014 12:44:56 -0700 (PDT)
Subject: [SRILM User List] compute perplexity

Dear all,

I have some questions regarding perplexity. I am very thankful for your time and answers.

Settings:
- one language model LM_A estimated using training corpus A
- one language model LM_B estimated using training corpus B (B = corpus_A + corpus_X)

My intention is to prove that model B is better than model A, so I thought I should show
that the perplexity decreased (which can be seen from the ppl files).

Commands used to estimate ppl:

    $NGRAM_FILE -order 3 -lm $WORKING_DIR"lm_A/lmodel.lm" -ppl $WORKING_DIR"test.lowercased."$TARGET > $WORKING_DIR"ppl_A.ppl"
    $NGRAM_FILE -order 3 -lm $WORKING_DIR"lm_B/lmodel.lm" -ppl $WORKING_DIR"test.lowercased."$TARGET > $WORKING_DIR"ppl_B.ppl"

The contents of the two ppl files are (A, then B):

    1000 sentences, 21450 words, 0 OOVs
    0 zeroprobs, logprob= -57849.4 ppl= 377.407 ppl1= 497.67
    -------------------------------------------------------
    1000 sentences, 21450 words, 0 OOVs
    0 zeroprobs, logprob= -55535.3 ppl= 297.67 ppl1= 388.204

Questions:

1. Why do I get 0 OOVs?
I checked, using the compute-oov-rate script, how many OOVs there are in the test data
compared to the training data, and it gave me the result "OOV tokens: 393 / 21450 (1.83%)
excluding fragments: 390 / 21442 (1.82%)".

2. I read in the srilm-faq that "Note that perplexity comparisons are only ever meaningful
if the vocabularies of all LMs are the same." Since I want to compare the perplexities of
two LMs, I am wondering if my settings and commands were right. The two LMs were estimated
on different training corpora, so the vocabularies are not identical, right? Please tell
me what I am doing wrong.

3. If those two perplexities were computed correctly, could you please tell me whether
their difference means that the LM has really been improved, and whether there is a
measure that says if this improvement is significant?

Thank you very much for your time.

From stolcke at icsi.berkeley.edu Tue Mar 18 22:02:22 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 18 Mar 2014 22:02:22 -0700
Subject: [SRILM User List] compute perplexity

On 3/18/2014 12:44 PM, Stefy D. wrote:
> 1. Why do I get 0 OOVs?

You didn't say how you trained the LMs. Did you include an unknown-word probability? The
exact options used for LM training matter here.

> 2. The two LMs were estimated on different training corpora, so the vocabularies are
> not identical, right? Please tell me what I am doing wrong.

Again, we don't know how you trained the LMs, hence we don't know the vocabularies. The
best way to make the perplexities comparable would be to extract the vocabulary from
corpus A + corpus X, and then specify that for training LM_A (using -vocab).

> 3. If those two perplexities were computed correctly, could you please tell me whether
> their difference means that the LM has really been improved, and whether there is a
> measure that says if this improvement is significant?

The perplexities look quite different. Differences of 10-20% are usually considered
non-negligible.

For statistical significance there are a number of tests you can apply, although none are
built into SRILM. The most straightforward tests would be nonparametric ones that compare
the probabilities output by the two LMs for corresponding words or sentences.

Generate a table of word-level probabilities for LM_A and then LM_B, on the same test set.
Then ask: how many words had lower/same/greater probability under LM_B? From those
statistics you can apply either the Sign test or the stronger Wilcoxon test (for the
latter you need the differences of the probabilities, not just their sign).

The Sign test is extremely simple and can be computed with a small helper script included
in SRILM. For example, if LM_B gives higher probability for 1080 out of 2000 words (and
there are no ties), then the significance levels are computed by

    % $SRILM/bin/cumbin 2000 1080
    One-tailed:   P(k >= 1080 | n=2000, p=0.5) = 0.00018750253721029
    Two-tailed: 2*P(k >= 1080 | n=2000, p=0.5) = 0.00037500507442058

Doing this at the word level assumes that all the words in a sentence are assigned
probabilities independently, which is plainly not true (the same word occurs in several
ngrams). So a more conservative approach would be to compare the sentence-level
probabilities.

Andreas

From kruza at ufal.mff.cuni.cz Wed Mar 19 04:52:15 2014
From: kruza at ufal.mff.cuni.cz (Oldrich Kruza)
Date: Wed, 19 Mar 2014 12:52:15 +0100
Subject: [SRILM User List] ngram crashing

Eh, yes. It was a 32-bit executable. Thank you.

On Tue, Mar 18, 2014 at 7:21 PM, Andreas Stolcke wrote:
> This is the error message when SRILM fails to allocate more memory. The reasons could be:
> - you are using a 32bit binary and running up against the 4GB limit of the architecture
From asosimi at unilag.edu.ng Sat Mar 22 05:12:33 2014
From: asosimi at unilag.edu.ng (Adeyanju Sosimi)
Date: Sat, 22 Mar 2014 13:12:33 +0100 (WAT)
Subject: [SRILM User List] Using HTK LM Score and externally computed Tone n-gram score

I am currently working on a tone-based language. The language has a CV and V syllabic
structure. I have decided to adopt both a tone n-gram and a word n-gram as the prior
probability in developing the ASR system, that is, to combine the HTK LM score with a tone
n-gram score. I don't know how to accomplish this with HTK. I have developed a routine for
computing the tone n-gram in MATLAB, but how to use the computed tone n-gram together with
the HTK LM score has remained a challenge.

Also, I need your assistance with regard to tutorial materials/manuals on the SRILM
toolkit, or scripting files, for easier usage of the scripts.

From rimlaatar at yahoo.fr Thu Mar 27 07:38:10 2014
From: rimlaatar at yahoo.fr (Laatar Rim)
Date: Thu, 27 Mar 2014 14:38:10 +0000 (GMT)
Subject: [SRILM User List] Calculate perplexity

Dear Andreas,

To calculate perplexity I do this:

    lenovo at ubuntu:~/Documents/srilm$ ngram -lm class_based_model '/home/lenovo/Documents/srilm/ML_N_Class/IN_SRILM' -ppl '/home/lenovo/Documents/srilm/ML_N_Class/titi.txt'

titi.txt is my training data.

1- Should I also calculate perplexity on my test data?

2- How can I interpret this result:

    file /home/lenovo/Documents/srilm/ML_N_Class/titi.txt: 18657 sentences, 66817 words, 5285 OOVs
    0 zeroprobs, logprob= -259950 ppl= 1744.69 ppl1= 16773.8

What is the difference between ppl and ppl1?

Best regards,
Rim LAATAR
Computer engineer, École Nationale d'Ingénieurs de Sfax (ENIS)
Research master's student, Information Systems & New Technologies, FSEGS - NLP (TALN) track

From stolcke at icsi.berkeley.edu Thu Mar 27 10:25:35 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 27 Mar 2014 10:25:35 -0700
Subject: [SRILM User List] compute perplexity

On 3/19/2014 10:57 AM, Stefy D. wrote:
> Dear Andreas,
>
> thank you very much for replying.
>
> I trained both LMs using the "-unk" option like this:
>
>     $NGRAMCOUNT_FILE -order 3 -interpolate -kndiscount -unk -text $WORKING_DIR$OUT_CORPUS"lowercased."$TARGET -lm $WORKING_DIR"lm_a/lmodel.lm"

That explains why you are not getting OOVs reported in the ppl output.
Unknown words are mapped to <unk> and thus the LM has a probability for <unk>.

> For the OOV rate I created a vocabulary list for the training data and I used the
> unigram counts of the test set and the compute-oov-rate script like this:
>
>     $NGRAMCOUNT_FILE -order 1 -write-vocab "vocabularyTargetUnigram.txt" -text $WORKING_DIR$OUT_CORPUS"lowercased."$TARGET -sort
>     $NGRAMCOUNT_FILE -order 1 -text $WORKING_DIR"test.lowercased."$TARGET -write "unigramCounts_testdata.txt" -sort
>     $OOVCOUNT_FILE vocabularyTargetUnigram.txt unigramCounts_testdata.txt
>
> This is how I got the OOV rate mentioned in the first mail. Could you please let me
> know if I used the right commands to compute that?

You did it right.

> You said I should train LM_A using the vocabulary of corpus A + corpus X so that the
> perplexities can be compared. So I should train LM_A using only corpus A but the
> vocabulary of A + X? Sorry to be confused, but I thought that for estimating the LM the
> vocabulary should come from the same corpus used for estimation. I am using these LMs
> in SMT systems (a baseline and an adapted one). If I influence the baseline LM with
> vocabulary from the adaptation data, then the baseline is not really a baseline. Please
> tell me if I am thinking incorrectly.

You are right. What this illustrates is that perplexity alone is not a sufficient metric
for comparing LMs. In your scenario (LM adaptation) the expansion of the vocabulary is a
key component of the adaptation process, but LMs with different vocabularies are no longer
comparable by ppl. My suggestion to unify the vocabularies was a workaround to allow you
to still use a perplexity comparison.

> Thank you for introducing me to statistical significance. To generate a table of
> word-level probabilities on the same test set, should I use get-unigram-probs? But
> where do I specify the test set?
>
>     $UNIGRAMPROBS_FILE linear=1 $WORKING_DIR"lm_a/lmodel.arpa."$TARGET > table_A.out

No, you get the word probabilities from the output of ngram -debug 2 -ppl (you need to
write some perl or whatever script to extract the probabilities).

> To get how many words had lower/same/greater probability in LM_B, is using the
> compare-ppls script ok? For example, I get this output when applying it to my 2 LMs
> (ngram -debug 2 on the same test set as in the previous commands):
>
>     $COMPARE_PPLS $WORKING_DIR"ppl_files/ppl_A_detail.ppl" $WORKING_DIR"ppl_files/ppl_B_detail.ppl"
>     output: total 22450, equal 0, different 22450, greater 11447

Yes, it seems compare-ppls extracts exactly the statistics I was talking about. I had
forgotten about it ...

Andreas
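Putting those last two answers together, the end-to-end significance check looks roughly
like this (an editorial sketch reusing the file names and counts already quoted above;
compare-ppls and cumbin are the SRILM helper tools referred to in the thread):

    # per-word probabilities for both models on the same test set
    ngram -debug 2 -order 3 -lm lm_A/lmodel.lm -ppl test.lowercased.$TARGET > ppl_A_detail.ppl
    ngram -debug 2 -order 3 -lm lm_B/lmodel.lm -ppl test.lowercased.$TARGET > ppl_B_detail.ppl

    # count for how many tokens model B assigns higher probability
    compare-ppls ppl_A_detail.ppl ppl_B_detail.ppl
    # -> total 22450, equal 0, different 22450, greater 11447

    # Sign test on those counts
    cumbin 22450 11447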
From stolcke at icsi.berkeley.edu Thu Mar 27 10:44:54 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 27 Mar 2014 10:44:54 -0700
Subject: [SRILM User List] Calculate perplexity

On 3/27/2014 7:38 AM, Laatar Rim wrote:
> titi.txt is my training data.
>
> 1- Should I also calculate perplexity on my test data?

Yes; in fact, perplexity is usually reported on test data (data not used in training the
model), since otherwise you get a very biased estimate.

> 2- How can I interpret this result:
>
>     file /home/lenovo/Documents/srilm/ML_N_Class/titi.txt: 18657 sentences, 66817 words, 5285 OOVs
>     0 zeroprobs, logprob= -259950 ppl= 1744.69 ppl1= 16773.8
>
> What is the difference between ppl and ppl1?

OOVs is the count of words that don't occur in the vocabulary (technically, that are
mapped to <unk>) and have zero probability. zeroprobs refers to any other words that have
zero probability. These counts are reported because they are not included in the
perplexity computation.

ppl is the standard perplexity, where end-of-sentence tokens (</s>) are counted in the
denominator. ppl1 is the same thing, but </s> tokens are not counted in the denominator.

Andreas
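In formulas (an editorial addition based on the definitions above and the srilm-faq; with
N = sentences, W = words, O = OOVs, Z = zeroprobs, L = logprob, all logs base 10):

    ppl  = 10^{-L / (W - O - Z + N)}
    ppl1 = 10^{-L / (W - O - Z)}

Plugging in the numbers quoted above (L = -259950, W = 66817, O = 5285, Z = 0, N = 18657)
gives 10^{259950/80189} and 10^{259950/61532}, which reproduce the reported 1744.69 and
16773.8 up to the rounding of logprob.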
From stolcke at icsi.berkeley.edu Fri Mar 28 10:18:11 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 28 Mar 2014 10:18:11 -0700
Subject: [SRILM User List] Calculate perplexity

On 3/28/2014 2:05 AM, Laatar Rim wrote:
> Thanks.
> So my replace-words-with-classes file should not contain the words from the test data?

Knowing which words should be in which class should be considered part of the training
process, or come from prior knowledge. If your application gives you the class membership
of the words in the test data then you can add it; otherwise it would be "training on test
data".

Andreas

From stolcke at icsi.berkeley.edu Mon Mar 31 17:44:48 2014
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 31 Mar 2014 17:44:48 -0700
Subject: [SRILM User List] perplexity

On 3/31/2014 4:26 AM, Laatar Rim wrote:
> Dear Andreas,
>
> Please, I have a question. You say: "Knowing which words should be in which class should
> be considered part of the training process, or come from prior knowledge. If your
> application gives you the class membership of the words in the test data then you can
> add it; otherwise it would be 'training on test data'."
>
> Do you mean that my IN_SRILM file (in classes-format, the file format for word class
> definitions: class [p] word1 word2 ...) should contain both the words that exist in my
> training data and those in my test data, or should it contain only words from the
> training data?

You should only use words in the training data, plus any other knowledge sources or
databases that are different from the test data.

In many application domains that involve semantic knowledge you have additional
information about the task domain from which you can infer class membership. For example,
if you are working in the air travel domain, you probably have a list of all airport
cities, and you can create a word class from that.

Andreas
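To make the classes-format concrete, here is a hypothetical class-definition file and the
usual way it is applied (an editorial sketch; the class names, words, probabilities, and
file names are invented for illustration, not taken from the thread):

    airtravel.classes (classes-format: CLASS [prob] expansion words):

        CITY 0.5 boston
        CITY 0.3 denver
        CITY 0.2 seattle
        AIRLINE lufthansa
        AIRLINE united airlines

    # replace words by their classes in the training text, train, then evaluate
    replace-words-with-classes classes=airtravel.classes train.txt > train.classes.txt
    ngram-count -order 3 -text train.classes.txt -lm class.lm
    ngram -lm class.lm -classes airtravel.classes -ppl test.txt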