From bassam_qatab at hotmail.com Mon Jul 9 17:20:35 2012
From: bassam_qatab at hotmail.com (Bassam Al_Qatab)
Date: Tue, 10 Jul 2012 03:20:35 +0300
Subject: [SRILM User List] Confidence Measure
Message-ID: 

Dear all,

I want to develop a pronunciation verification system. I have already developed an automatic speech recognition (ASR) system using the HTK toolkit. I used SRILM to convert the HTK lattice to a confusion network, and then the word posterior probabilities were calculated with SRILM. My understanding is that, first, I have to save the word posterior probabilities for the words (based on the given sentences). Next, I obtain the word posterior probabilities for the given utterance, which should be among the saved sentences. Finally, I divide the obtained word posterior probability by the saved one; the output should be between 0 and 1, and this value is compared against a threshold for accepting or rejecting the word. My question: is that all we need, or do I also have to calculate a confidence measure? For the confidence measure I want to know how to calculate it (if there is any tutorial). Can anyone help, or send me a link or a paper describing the procedure? Thank you in advance.

Bassam Ali Qasem Al-Qatab
Master Of Software Engineering
Faculty Of Computer Science and Information Technology
University of Malaya
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pranavshriram at gmail.com Mon Jul 9 20:59:34 2012
From: pranavshriram at gmail.com (Pranav Jawale)
Date: Tue, 10 Jul 2012 09:29:34 +0530
Subject: [SRILM User List] Confidence Measure
In-Reply-To: References: Message-ID: 

> Or should I have to calculate the confidence measure?

Word posterior probability computed using confusion network itself IS a confidence measure. But it is not the only one, there are many others too. e.g. see
[] H. Jiang, "Confidence measures for speech recognition: A survey", Speech Communication, 2005
[] F. Wessel, R. Schlüter, K. Macherey and H. Ney, "Confidence measures for large vocabulary continuous speech recognition", IEEE Transactions on Speech and Audio Processing, 2001
For pronunciation evaluation at phone-level, you may need to compute phone posterior probability using a phone decoder.
--
The best way to get something done is to begin. ~Author Unknown
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From bassam_qatab at hotmail.com Mon Jul 9 23:09:10 2012
From: bassam_qatab at hotmail.com (Bassam Al_Qatab)
Date: Tue, 10 Jul 2012 09:09:10 +0300
Subject: [SRILM User List] Confidence Measure
In-Reply-To: References: Message-ID: 

Dear Pranav, first, thank you for your reply. I have read the two papers (especially the second one), but I will read them again to figure out the other things to include. For the phone decoder, you mean that the output of the recognizer (decoder) will be in the phone level (not a word level)? Thank you.

Bassam.

From: pranavshriram at gmail.com
Date: Tue, 10 Jul 2012 09:29:34 +0530
Subject: Re: [SRILM User List] Confidence Measure
To: bassam_qatab at hotmail.com
CC: srilm-user at speech.sri.com

Or should I have to calculate the confidence measure?
Word posterior probability computed using confusion network itself IS a confidence measure.
But it is not the only one, there are many others too. e.g. see
[] H. Jiang, "Confidence measures for speech recognition: A survey", Speech Communication, 2005
[] F. Wessel, R. Schlüter, K. Macherey and H. Ney, "Confidence measures for large vocabulary continuous speech recognition", IEEE Transactions on Speech and Audio Processing, 2001
For pronunciation evaluation at phone-level, you may need to compute phone posterior probability using a phone decoder.
--
The best way to get something done is to begin. ~Author Unknown
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pranavshriram at gmail.com Mon Jul 9 23:54:52 2012
From: pranavshriram at gmail.com (Pranav Jawale)
Date: Tue, 10 Jul 2012 12:24:52 +0530
Subject: [SRILM User List] Confidence Measure
In-Reply-To: References: Message-ID: 

> For the phone decoder, you mean that the output of the recognizer (decoder)
> will be in the phone level (not a word level)?

Yes. For example, see S.M. Witt, S.J. Young, "Phone-level pronunciation scoring and assessment for interactive language learning", Speech Communication, Volume 30, Issues 2-3, February 2000
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lqin at cs.cmu.edu Wed Jul 11 15:15:11 2012
From: lqin at cs.cmu.edu (Long Qin)
Date: Wed, 11 Jul 2012 18:15:11 -0400
Subject: [SRILM User List] lattice-tool error while loading mesh
Message-ID: <4FFDFAEF.8070208@cs.cmu.edu>

Hi,

I tried to load a confusion network in the mesh format using the lattice-tool. The mesh contains word level time mark and scores. The command I used is

"lattice-tool -read-mesh -in-lattice mesh"

And the lattice-tool outputs the following error message:

mesh: line 5: invalid word info
error reading mesh

The mesh file was generated using the "nbest-lattice" tool. So I guess the format should be correct. Then the question is can lattice-tool work with mesh file with word level time mark and scores?

Another question I want to ask is how can I convert nbest hypotheses into a word lattice with time marks, acoustic and LM scores? It seems the nbest-lattice tool can only produce a lattice without those word level information.

Thanks,
Long Qin

From stolcke at icsi.berkeley.edu Wed Jul 11 17:25:37 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 11 Jul 2012 17:25:37 -0700
Subject: [SRILM User List] lattice-tool error while loading mesh
In-Reply-To: <4FFDFAEF.8070208@cs.cmu.edu>
References: <4FFDFAEF.8070208@cs.cmu.edu>
Message-ID: <4FFE1981.7010403@icsi.berkeley.edu>

On 7/11/2012 3:15 PM, Long Qin wrote:
> Hi,
>
> I tried to load a confusion network in the mesh format using the
> lattice-tool. The mesh contains word level time mark and scores. The
> command I used is "lattice-tool -read-mesh -in-lattice mesh" And the
> lattice-tool outputs the following error message:
>
> mesh: line 5: invalid word info
> error reading mesh
>
> The mesh file was generated using the "nbest-lattice" tool. So I guess
> the format should be correct. Then the question is can lattice-tool
> work with mesh file with word level time mark and scores?

Did you generate the mesh using nbest-lattice -use-mesh? Can you send a small sample file? It should not happen.

> Another question I want to ask is how can I convert nbest hypotheses
> into a word lattice with time marks, acoustic and LM scores? It seems
> the nbest-lattice tool can only produce a lattice without those word
> level information.
Actually, nbest-lattice will do this when (1) the -nbest-backtrace option is given and (2) the nbest lists are in the 'NBestList2.0' format. See the nbest-format(5) man page. The format is awkward if you don't happen to be using the SRI Decipher recognizer, but it should work if you carefully convert your data into this format. Andreas > > Thanks, > Long Qin > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From stolcke at icsi.berkeley.edu Wed Jul 11 19:03:13 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 11 Jul 2012 19:03:13 -0700 Subject: [SRILM User List] lattice-tool error while loading mesh In-Reply-To: <4FFE22AE.7080501@cs.cmu.edu> References: <4FFDFAEF.8070208@cs.cmu.edu> <4FFE1981.7010403@icsi.berkeley.edu> <4FFE22AE.7080501@cs.cmu.edu> Message-ID: <4FFE3061.9000305@icsi.berkeley.edu> On 7/11/2012 6:04 PM, Long Qin wrote: > Hi Andreas, > > Thanks for answering my question. > > Yes, I did use the nbest-lattice -use-mesh -nbest-backtrace. The > attachment files are the nbest file and the mesh file. Is there > anything wrong with it? Your nbest file is fine, but there was a bug in the nbest list parser that would lead to incorrect mesh files when no pronunciation information was given (as in your case). The attached patch should fix this. You will have to rebuild nbest-lattice and then regenerate the mesh file. Andreas -------------- next part -------------- diff -c -r1.83 NBest.cc *** lm/src/NBest.cc 6 Jul 2012 06:43:26 -0000 1.83 --- lm/src/NBest.cc 12 Jul 2012 01:57:40 -0000 *************** *** 620,626 **** /* * save pronunciation info for previous word */ ! if (prevWordInfo) { prevWordInfo->phones = strdup(phones); assert(prevWordInfo->phones != 0); --- 620,626 ---- /* * save pronunciation info for previous word */ ! if (prevWordInfo && phones[0] != '\0') { prevWordInfo->phones = strdup(phones); assert(prevWordInfo->phones != 0); *************** *** 695,701 **** /* * save pronunciation info for last word */ ! if (prevWordInfo) { prevWordInfo->phones = strdup(phones); assert(prevWordInfo->phones != 0); --- 695,701 ---- /* * save pronunciation info for last word */ ! if (prevWordInfo && phones[0] != '\0') { prevWordInfo->phones = strdup(phones); assert(prevWordInfo->phones != 0); From chenmengdx at gmail.com Wed Jul 18 03:40:37 2012 From: chenmengdx at gmail.com (Meng Chen) Date: Wed, 18 Jul 2012 18:40:37 +0800 Subject: [SRILM User List] How to train LM fast with large corpus Message-ID: Hi, I want to ask how to train N-gram language model with SRILM if the corpus is very large (100GB). Should I still use the command of *ngram-count *? Or use *make-big-lm* instead? I also want to know if there is any limitation of training corpus in vocabulary and size with SRILM? Thanks! Meng CHEN -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Wed Jul 18 05:13:23 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 18 Jul 2012 05:13:23 -0700 Subject: [SRILM User List] How to train LM fast with large corpus In-Reply-To: References: Message-ID: <5006A863.9030903@icsi.berkeley.edu> On 7/18/2012 3:40 AM, Meng Chen wrote: > Hi, I want to ask how to train N-gram language model with SRILM if the > corpus is very large (100GB). Should I still use the command of > *ngram-count*? Or use *make-big-lm* instead? 
I also want to know if > there is any limitation of training corpus in vocabulary and size with > SRILM? > Thanks! Definitely make-big-lm. Read the FAQ on handling large data. You are limited by computer memory but it is not possible to give a hard limit, it depends on the properties of your data. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From shinichiro.hamada at gmail.com Thu Jul 19 16:47:19 2012 From: shinichiro.hamada at gmail.com (shinichiro.hamada) Date: Fri, 20 Jul 2012 08:47:19 +0900 Subject: [SRILM User List] counts in ngram-count output Message-ID: <2EEDEC4377C14CA694D50D5DD5ACCCAC@f91> Hi, I have a question if my outputs of ngram-count are correct or not. I made a fractional word-count file by my own program and executed ngram-count command with wb discount. The header of outputs were bellow: -------------------------- [4gram wb float-count] ngram-count -read countfile_float -float-counts -order 4 -lm outfile \ -wbdiscount -wbdiscount1 -wbdiscount2 -wbdiscount3 ngram 1=780387 ngram 2=20321 ngram 3=2692 ngram 4=2622 .. -------------------------- I thought higher order models have always more counts than lower order ones, but the above result wasn't so. Does this result designate that my word-count file has bug? ---------------------------------------------------------------------- For further investigation, I made a integer word-count file by scaling and truncating (I know it is inappropriate approximation) and executed ngram-count with other discount methods. But higher order models doesn't have always more counts than lower order ones in this result too. -------------------------- [4gram none int-count] ngram-count -read countfile_int -order 3 -lm outfile \ -gt1min 0 -gt1max 0 -gt2min 0 -gt2max 0 -gt3min 0 -gt3max 0 ngram 1=780387 ngram 2=871835 ngram 3=1310979 ngram 4=1038980 -------------------------- [4gram gt int-count] ngram-count -read countfile_int -order 3 -lm outfile \ ngram 1=780387 ngram 2=871835 ngram 3=1170462 ngram 4=1038980 -------------------------- [4gram natural int-count] ngram-count -read countfile_int -order 3 -lm outfile \ -ndiscount -ndiscount1 -ndiscount2 -ndiscount3 ngram 1=780387 ngram 2=871835 ngram 3=1170339 ngram 4=1038858 Any advices will help me very much. Thank you in advance. -- Shincihiro Hamada From nouf.alharbi at yahoo.com Fri Jul 20 05:04:55 2012 From: nouf.alharbi at yahoo.com (Nouf Al-Harbi) Date: Fri, 20 Jul 2012 13:04:55 +0100 (BST) Subject: [SRILM User List] Predicting words Message-ID: <1342785895.7010.YahooMailNeo@web171304.mail.ir2.yahoo.com> Hello, I am new to language modeling and was hoping that someone can help me with the following. I try to predict a word given an input sentence. For example, I would like to get a word replacing the ... that has the highest probability in sentences such as ' A man is ...' (e.g. sitting). I try to use disambig tool but I couldn't found any example illustrate how to use it especially how how I can create the map file and what is the type of this file ( e.g. txt, arpa, ...). Many thanks in advance, Nouf -------------- next part -------------- An HTML attachment was scrubbed... 
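Regarding the word-prediction question above, one simple way to rank candidate completions with standard SRILM tools is sketched here. It is only an illustration, not necessarily what disambig does; lm.arpa and candidates.txt are placeholder names, and a trained language model is assumed.

  # candidates.txt holds one full candidate sentence per line, e.g.
  #   A man is sitting
  #   A man is running
  ngram -lm lm.arpa -order 3 -ppl candidates.txt -debug 1

With -debug 1, ngram prints the log probability of each sentence, and the candidate whose sentence gets the highest log probability is the model's preferred completion.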
URL: From stolcke at icsi.berkeley.edu Fri Jul 20 08:55:42 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 20 Jul 2012 10:55:42 -0500 Subject: [SRILM User List] counts in ngram-count output In-Reply-To: <2EEDEC4377C14CA694D50D5DD5ACCCAC@f91> References: <2EEDEC4377C14CA694D50D5DD5ACCCAC@f91> Message-ID: <50097F7E.5090206@icsi.berkeley.edu> On 7/19/2012 6:47 PM, shinichiro.hamada wrote: > Hi, I have a question if my outputs of ngram-count are correct or not. > > I made a fractional word-count file by my own program and executed > ngram-count command with wb discount. The header of outputs were > bellow: > > -------------------------- > [4gram wb float-count] > ngram-count -read countfile_float -float-counts -order 4 -lm outfile \ > -wbdiscount -wbdiscount1 -wbdiscount2 -wbdiscount3 > > ngram 1=780387 > ngram 2=20321 > ngram 3=2692 > ngram 4=2622 > .. > -------------------------- > > I thought higher order models have always more counts than lower > order ones, but the above result wasn't so. Does this result > designate that my word-count file has bug? This is probably because the defaults for minimum count frequency are higher for trigrams and 4grams than for bigrams. For bigrams it is 1, whereas for 3grams and higher it is 2. You should see the expected behavior if you add -gt3min 1 -gt4min 1 to the options. (As explained in the man page, -gtXmin options apply to all discounting methods, not just GT.) Andreas From saraelec at yahoo.com Sun Jul 22 20:05:56 2012 From: saraelec at yahoo.com (sara) Date: Sun, 22 Jul 2012 20:05:56 -0700 (PDT) Subject: [SRILM User List] create LM for one topic Message-ID: <1343012756.90490.YahooMailClassic@web162303.mail.bf1.yahoo.com> Hi, I am new to SRILM and I want to create language model for one topic. I have used the online tool. The results shows two different probabilities. could you please help me how I can build the language model for one topic? Many thanks, Sara The results: \1-grams: -1.2884 -0.3010 -1.2884 -0.2781 -1.6564 A -0.2913 -2.1335 ABBREVIATIONS -0.2978 -2.1335 ACRONYMS -0.2978 -2.1335 AN -0.2946 -2.1335 AND -0.2978 -2.1335 ARE -0.2978 -1.8325 AS -0.2881 -2.1335 BE -0.2978 -2.1335 BEST -0.2978 -2.1335 BUT -0.2978 -2.1335 CAN -0.2978 -2.1335 CETERA -0.2781 -2.1335 EACH -0.2978 -2.1335 ENTERED -0.2946 -2.1335 ET -0.2978 -1.8325 EXAMPLE -0.2747 -2.1335 FEW -0.2978 -2.1335 FOR -0.2946 -2.1335 HUNDRED -0.2978 -1.6564 IS -0.2848 -2.1335 LETTERS -0.2978 -2.1335 LIMIT -0.2781 -2.1335 LINE -0.2913 -2.1335 NUMBERS -0.2978 -2.1335 OUGHT -0.2946 -2.1335 OUT -0.2978 -2.1335 PRONOUNCED -0.2946 -2.1335 RECOGNIZE -0.2781 -2.1335 SENTENCE -0.2781 -------------- next part -------------- An HTML attachment was scrubbed... URL: From shinichiro.hamada at gmail.com Tue Jul 24 11:05:08 2012 From: shinichiro.hamada at gmail.com (shinichiro.hamada) Date: Wed, 25 Jul 2012 03:05:08 +0900 Subject: [SRILM User List] counts in ngram-count output In-Reply-To: <50097F7E.5090206@icsi.berkeley.edu> References: <2EEDEC4377C14CA694D50D5DD5ACCCAC@f91> <50097F7E.5090206@icsi.berkeley.edu> Message-ID: <1A01F11B0513446E84A5D179F713FF26@f91> I haven't understood the specifications of the options. Thank you very much for pointing it out. I'll try it. 
Best regards, Shinichiro > -----Original Message----- > From: Andreas Stolcke [mailto:stolcke at icsi.berkeley.edu] > Sent: Saturday, July 21, 2012 12:56 AM > To: shinichiro.hamada > Cc: srilm-user at speech.sri.com > Subject: Re: [SRILM User List] counts in ngram-count output > > On 7/19/2012 6:47 PM, shinichiro.hamada wrote: > > Hi, I have a question if my outputs of ngram-count are > correct or not. > > > > I made a fractional word-count file by my own program and executed > > ngram-count command with wb discount. The header of outputs were > > bellow: > > > > -------------------------- > > [4gram wb float-count] > > ngram-count -read countfile_float -float-counts -order 4 > -lm outfile \ > > -wbdiscount -wbdiscount1 -wbdiscount2 -wbdiscount3 > > > > ngram 1=780387 > > ngram 2=20321 > > ngram 3=2692 > > ngram 4=2622 > > .. > > -------------------------- > > > > I thought higher order models have always more counts than > lower order > > ones, but the above result wasn't so. Does this result > designate that > > my word-count file has bug? > This is probably because the defaults for minimum count > frequency are higher for trigrams and 4grams than for bigrams. > For bigrams it is 1, whereas for 3grams and higher it is 2. > You should see the expected behavior if you add > > -gt3min 1 -gt4min 1 > > to the options. (As explained in the man page, -gtXmin > options apply to all discounting methods, not just GT.) > > Andreas From ma.farajian at gmail.com Tue Jul 24 23:40:09 2012 From: ma.farajian at gmail.com (amin farajian) Date: Wed, 25 Jul 2012 11:10:09 +0430 Subject: [SRILM User List] Error in compiling SRILM Message-ID: Hi all, I recently changed my machine, and I'm now trying to install the latest version of SRILM on it. I installed all the required tools and libraries (at least I hope so). but I couldn't finish the installation correctly. I checked everything that I thought could cause the problem, but I couldn't find anything. Some information about my new machine are: Machine Type: i686 (according to output of this script: srilm/sbin/machine-type) OS: kubuntu 12.04 (output of uname: 36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux) compiler version (output of "gcc -v"): gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) I changed the contents of srilm/common/Makefile.machine.i686 as described in installation instruction: CC = /usr/bin/gcc $(GCC_FLAGS) CXX = /usr/bin/g++ $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES and added this line to the file: NO_TCL = X but nothing changed in installation procedure. I also attached the output of make command. As could be seen in the file, the first error occurs in line 158: ERROR: File to be installed (../bin/i686/maxalloc) does not exist. Usage: decipher-install [-p] ... mode: file permission mode, in octal file1 ... fileN: files to be installed directory: where the files should be installed files = ../bin/i686/maxalloc directory = ../../bin/i686 mode = 0555 May I ask you to help me in this problem? Thank you in advance. Regards, M. Amin -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: make.output Type: application/octet-stream Size: 36535 bytes Desc: not available URL: From stolcke at icsi.berkeley.edu Thu Jul 26 00:07:36 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 26 Jul 2012 00:07:36 -0700 Subject: [SRILM User List] Error in compiling SRILM In-Reply-To: References: Message-ID: <5010ECB8.1010409@icsi.berkeley.edu> On 7/24/2012 11:40 PM, amin farajian wrote: > Hi all, > > I recently changed my machine, and I'm now trying to install the > latest version of SRILM on it. I installed all the required tools and > libraries (at least I hope so). but I couldn't finish the installation > correctly. I checked everything that I thought could cause the > problem, but I couldn't find anything. > Some information about my new machine are: > > Machine Type: i686 (according to output of this script: > srilm/sbin/machine-type) > OS: kubuntu 12.04 (output of uname: 36-Ubuntu SMP Tue Apr 10 20:39:51 > UTC 2012 x86_64 x86_64 x86_64 GNU/Linux) > compiler version (output of "gcc -v"): gcc version 4.6.3 > (Ubuntu/Linaro 4.6.3-1ubuntu5) > > I changed the contents of srilm/common/Makefile.machine.i686 as > described in installation instruction: > CC = /usr/bin/gcc $(GCC_FLAGS) > CXX = /usr/bin/g++ $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES > and added this line to the file: > NO_TCL = X > but nothing changed in installation procedure. > I also attached the output of make command. As could be seen in the > file, the first error occurs in line 158: > > ERROR: File to be installed (../bin/i686/maxalloc) does not exist. > Usage: decipher-install [-p] ... > mode: file permission mode, in octal > file1 ... fileN: files to be installed > directory: where the files should be installed > > files = ../bin/i686/maxalloc > directory = ../../bin/i686 > mode = 0555 > > May I ask you to help me in this problem? Based on the error message from the linker > /usr/bin/g++ -m32 -mtune=pentium3 -Wall -Wno-unused-variable > -Wno-uninitialized -DINSTANTIATE_TEMPLATES -D_FILE_OFFSET_B > ITS=64 -I. -I../../include -u matherr -L../../lib/i686 -g -O3 -o > ../bin/i686/lattice-tool ../obj/i686/lattice-tool > .o ../obj/i686/liblattice.a -lm -ldl ../../lib/i686/libflm.a > ../../lib/i686/liboolm.a ../../lib/i686/libdstruct.a ../../ > lib/i686/libmisc.a -lm 2>&1 | c++filt > /usr/bin/ld: skipping incompatible > /usr/lib/gcc/x86_64-linux-gnu/4.6/libstdc++.so when searching for -lstdc++ > /usr/bin/ld: skipping incompatible > /usr/lib/gcc/x86_64-linux-gnu/4.6/libstdc++.a when searching for -lstdc++ > /usr/bin/ld: cannot find -lstdc++ you don't have the 32bit version of libstdc++ installed. Try building 64bit binaries: make MACHINE_TYPE=i686-m64 World If that shows similar problem seek the advice of someone familiar with your Ubuntu installation. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From ma.farajian at gmail.com Thu Jul 26 01:29:51 2012 From: ma.farajian at gmail.com (Amin Farajian) Date: Thu, 26 Jul 2012 12:59:51 +0430 Subject: [SRILM User List] Error in compiling SRILM In-Reply-To: <5010ECB8.1010409@icsi.berkeley.edu> References: <5010ECB8.1010409@icsi.berkeley.edu> Message-ID: <5010FFFF.7080805@gmail.com> An HTML attachment was scrubbed... URL: From ee07b282 at gmail.com Thu Jul 26 14:00:59 2012 From: ee07b282 at gmail.com (xinrui yu) Date: Thu, 26 Jul 2012 14:00:59 -0700 Subject: [SRILM User List] Question about lattice-tool -nbest-decode Message-ID: Hi All, I'm new to srilm and I have some questions about finding nbest result by using srilm. 
I try Srilm by the command "./lattice-tool -read-htk -in-lattice test.lat -nbest-decode 10 -out-nbest-dir my_nbest_dir". I indeed get 10 results. But are the result placed in order? I read from manual page said that they are placed in order by default, I think they should placed according to the score (combine the acoustic and lm score) in front of it. But from what i have get, it's not. Am I misunderstanding something? I also notice that there are another command callled "nbest-lattice". I try to use it as well but it seems it does not accept HTK lattice. So could it be used to find nbest result. And what's the different bettwen lattice-tool and nbest-lattice? How to decide which one should be used? Another question is that I read from manual page there is one option called *-nbest-backtrace *which could preserve word-level timemarks and scores. Is there similar option for lattice tool? What if I want to keep those information while using lattice-tool? Thanks! Liz -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Thu Jul 26 15:35:00 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 26 Jul 2012 15:35:00 -0700 Subject: [SRILM User List] Question about lattice-tool -nbest-decode In-Reply-To: References: Message-ID: <5011C614.9090505@icsi.berkeley.edu> On 7/26/2012 2:00 PM, xinrui yu wrote: > Hi All, > I'm new to srilm and I have some questions about finding nbest result > by using srilm. > I try Srilm by the command "./lattice-tool -read-htk -in-lattice > test.lat -nbest-decode 10 -out-nbest-dir my_nbest_dir". I indeed get > 10 results. But are the result placed in order? I read from manual > page said that they are placed in order by default, I think they > should placed according to the score (combine the acoustic and lm > score) in front of it. But from what i have get, it's not. Am I > misunderstanding something? The output is sorted by score. You are probably not considering the way that the combined score is computed. You need to take the acoustic score, and added the weighted LM score and word insertion penalty. The LM weight and insertion penalties might have default values encoded in the lattices. You can override them on the command line. You might get the output you expect by using -htk-lm-scale 1 and -htk-wdpenalty 0, but that will probably not be the best result in terms of word error. > > I also notice that there are another command callled "nbest-lattice". > I try to use it as well but it seems it does not accept HTK lattice. > So could it be used to find nbest result. And what's the different > bettwen lattice-tool and nbest-lattice? How to decide which one should > be used? nbest-lattice takes nbest lists as INPUT and constructs a special type of lattice representing word posterior probabilities. So it is not what you need. > > Another question is that I read from manual page there is one option > called *-nbest-backtrace *which could preserve word-level timemarks > and scores. Is there similar option for lattice tool? What if I want > to keep those information while using lattice-tool? No, sorry. Andreas > > Thanks! > > Liz > > > > > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed... 
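A small worked example of the score combination described above (the numbers are made up for illustration): suppose a hypothesis has acoustic log score -3200.5, LM log score -45.2 and 10 words, and the lattice or command line specifies an LM scale of 12 and a word insertion penalty of -0.5. The value used for ranking is then

  -3200.5 + 12 * (-45.2) + 10 * (-0.5) = -3747.9

so hypotheses are ordered by this combined value, not by the raw acoustic or LM scores printed in the N-best entries, which is why the output can look unsorted if only one component score is inspected.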
URL: 

From alex.dan.tomescu at gmail.com Sat Jul 28 03:09:18 2012
From: alex.dan.tomescu at gmail.com (Alex Tomescu)
Date: Sat, 28 Jul 2012 13:09:18 +0300
Subject: [SRILM User List] Fwd: Batch no-sos and no-eos
In-Reply-To: References: Message-ID: 

Hi,

I need to make a language model from a set of 5000+ texts. The texts are separated into one sentence per line, so there are a lot of sentence boundary tokens which I need to get rid of.

I used make-batch-counts and merge-batch-counts to count the ngrams, and make-big-lm with -vocab -limit-vocab -no-sos -no-eos -prune, but still sentence boundaries were included.

I also tried make-batch-counts file_list | xargs -no-eos -no-sos, with the same results.

Removing '\n' from the text files resulted in "line 1: line too long".

I tried ngram-count with -no-eos -no-sos on one of the files and it worked, but on a batch it didn't seem to work.

Any ideas on what I should try next?

Thanks
--
Alexandru Tomescu, undergraduate Computer Science student at Polytechnic University of Bucharest

From tonyr at cantabresearch.com Sat Jul 28 04:16:20 2012
From: tonyr at cantabresearch.com (Tony Robinson)
Date: Sat, 28 Jul 2012 12:16:20 +0100
Subject: [SRILM User List] Fwd: Batch no-sos and no-eos
In-Reply-To: References: Message-ID: <5013CA04.8000907@cantabResearch.com>

Hi Alex,

<s> and </s> are not really "sentence boundary" tokens, even though that's what everyone calls them and that's how they are used most of the time. They are for the start and end of utterance contexts. So for your problem pick a suitably large chunk - let's say we decode a chapter at a time and have a <s> at the start and a </s> at the end and replace the rest with .

I'm back, so mail me if this doesn't make sense.

Tony

On 07/28/2012 11:09 AM, Alex Tomescu wrote:
> Hi
>
> I need to make a language model from a set of 5000+ texts. The texts
> are separated into one sentence per line so there are a lot of
> sentence boundary tokens which I need to get rid of.
>
> I used make-batch-counts and merge-batch-counts to count the ngrams,
> and make-big-lm with -vocab -limit-vocab -no-sos -no-eos -prune, but
> still sentence boundaries were included.
>
> I also tried make-batch-counts file_list | xargs -no-eos -no-sos, with
> the same results.
>
> Removing '\n' from the text files resulted in "line 1: line too long".
>
> I tried ngram-count with -no-eos -no-sos on one of the files and it
> worked, but on a batch it didn't seem to work.
>
> Any ideas on what I should try next?
>
> Thanks
> --
> Alexandru Tomescu, undergraduate Computer Science student at
> Polytechnic University of Bucharest
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user

--
Dr A J Robinson, Founder and Director of Cantab Research Limited.
St Johns Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK.
Company reg no 05697423 (England and Wales), VAT reg no 925606030.

From stolcke at icsi.berkeley.edu Sat Jul 28 09:46:05 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sat, 28 Jul 2012 09:46:05 -0700
Subject: [SRILM User List] Fwd: Batch no-sos and no-eos
In-Reply-To: References: Message-ID: <5014174D.1080701@icsi.berkeley.edu>

On 7/28/2012 3:09 AM, Alex Tomescu wrote:
> Hi
>
> I need to make a language model from a set of 5000+ texts. The texts
> are separated into one sentence per line so there are a lot of
> sentence boundary tokens which I need to get rid of.
> > I used make-batch-counts and merge-batch counts to count the ngrams, > and make-big-lm with -vocab -limit-vocab -no-sos -no-eos -prune, but > still sentence boundaries we're included. I don't see this behavior. With make-big-lm -no-sos -no-eos it's true that and appear in the unigram section of the LM (they are still part of the vocabulary, similar to other words that might occur in your vocab file but don't occur in your training data), but there are not higher-order order N-gram involving or in the resulting LM. The same is true if you run ngram-count -no-sos -no-eos, so the two ways of building the LM are consistent in this regard. Presently, -no-sos -no-eos just affect the way ngrams are generated from text. After counts are extracted, they don't affect any part of the LM building process. It might make sense for these options to also modify the default vocab membership or and . Having the tags in the vocab without N-grams should be fine for most LM uses, but I can see an argument for removing them. Is that the behavior you are looking for? Andreas > > I also tried make-batch-counts file_list | xargs -no-eos -no-sos, with > the same results. > > Removing '\n' from the text files resulted in "line 1: line too long". > > I tried ngram-count with -no-eos -no-sos on one of the files and it > worked, but on a batch it didn't seem to work. > > Any ideas on what I should try next ? > > Thanks > -- > Alexandru Tomescu, undergraduate Computer Science student at > Polytechnic University of Bucharest > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From alex.dan.tomescu at gmail.com Sun Jul 29 03:46:21 2012 From: alex.dan.tomescu at gmail.com (Alex Tomescu) Date: Sun, 29 Jul 2012 13:46:21 +0300 Subject: [SRILM User List] Fwd: Batch no-sos and no-eos In-Reply-To: <5014174D.1080701@icsi.berkeley.edu> References: <5014174D.1080701@icsi.berkeley.edu> Message-ID: Hello, > I don't see this behavior. With make-big-lm -no-sos -no-eos it's true that and appear in the unigram section of the LM (they are still part of the vocabulary, similar to other words that might occur in your vocab file but don't occur in your training data), but there are not higher-order order N-gram involving or in the resulting LM. These are the exact parameters I passed to make-big-lm, and still I looked through the LM and there are ngrams containing ("-0.0009011862 ") make-big-lm -name biglm -read merge-iter9-1.ngrams.gz -lm gut.lm -no-eos -no-sos -prune 1e-8 -vocab ../gut.vocab -limit-vocab using existing gtcounts warning: discount coeff 1 is out of range: 1.1758 warning: discount coeff 3 is out of range: 1.11643 warning: discount coeff 5 is out of range: 1.17202 warning: discount coeff 7 is out of range: 1.12503 + ngram-count -read - -read-with-mincounts -order 3 -gt1 biglm.gt1 -gt2 biglm.gt2 -gt3 biglm.gt3 -lm gut.lm -no-eos -no-sos -prune 1e-8 -vocab ../gut.vocab -limit-vocab -meta-tag __meta__ It's really weird because when I tried ngram-count on a single file (very similar to the one triggered by make-big-lm), eos and sos tokens were only included in the unigrams. > Presently, -no-sos -no-eos just affect the way ngrams are generated from text. After counts are extracted, they don't affect any part of the LM building process. It might make sense for these options to also modify the default vocab membership or and . 
Having the tags in the vocab without N-grams should be fine for most LM uses, but I can see an argument for removing them. Is that the behavior you are looking for? It's ok if they are included as unigrams. I am going to make some more tests and if I find the problem I will post it. For the moment I can work around this by making bigger paragraphs so that there are not so many eos and sos tags. Thank you, Alex On Sat, Jul 28, 2012 at 7:46 PM, Andreas Stolcke wrote: > > On 7/28/2012 3:09 AM, Alex Tomescu wrote: >> >> Hi >> >> I need to make a language model from a set of 5000+ texts. The texts >> are separated into one sentence per line so there are a lot of >> sentence boundary tokens which I need to get rid of. >> >> I used make-batch-counts and merge-batch counts to count the ngrams, >> and make-big-lm with -vocab -limit-vocab -no-sos -no-eos -prune, but >> still sentence boundaries we're included. > > I don't see this behavior. With make-big-lm -no-sos -no-eos it's true that and appear in the unigram section of the LM (they are still part of the vocabulary, similar to other words that might occur in your vocab file but don't occur in your training data), but there are not higher-order order N-gram involving or in the resulting LM. > > The same is true if you run ngram-count -no-sos -no-eos, so the two ways of building the LM are consistent in this regard. > > Presently, -no-sos -no-eos just affect the way ngrams are generated from text. After counts are extracted, they don't affect any part of the LM building process. It might make sense for these options to also modify the default vocab membership or and . Having the tags in the vocab without N-grams should be fine for most LM uses, but I can see an argument for removing them. Is that the behavior you are looking for? > > Andreas > > >> >> I also tried make-batch-counts file_list | xargs -no-eos -no-sos, with >> the same results. >> >> Removing '\n' from the text files resulted in "line 1: line too long". >> >> I tried ngram-count with -no-eos -no-sos on one of the files and it >> worked, but on a batch it didn't seem to work. >> >> Any ideas on what I should try next ? >> >> Thanks >> -- >> Alexandru Tomescu, undergraduate Computer Science student at >> Polytechnic University of Bucharest >> _______________________________________________ >> SRILM-User site list >> SRILM-User at speech.sri.com >> http://www.speech.sri.com/mailman/listinfo/srilm-user > > -- Alexandru Tomescu, undergraduate Computer Science student at Polytechnic University of Bucharest From stolcke at icsi.berkeley.edu Sun Jul 29 08:55:34 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sun, 29 Jul 2012 08:55:34 -0700 Subject: [SRILM User List] Fwd: Batch no-sos and no-eos In-Reply-To: References: <5014174D.1080701@icsi.berkeley.edu> Message-ID: <50155CF6.4080901@icsi.berkeley.edu> On 7/29/2012 3:46 AM, Alex Tomescu wrote: > Hello, > >> I don't see this behavior. With make-big-lm -no-sos -no-eos it's true that and appear in the unigram section of the LM (they are still part of the vocabulary, similar to other words that might occur in your vocab file but don't occur in your training data), but there are not higher-order order N-gram involving or in the resulting LM. 
> > These are the exact parameters I passed to make-big-lm, and still I > looked through the LM and there are ngrams containing > ("-0.0009011862 ") > > make-big-lm -name biglm -read merge-iter9-1.ngrams.gz -lm gut.lm > -no-eos -no-sos -prune 1e-8 -vocab ../gut.vocab -limit-vocab That just means that those ngrams are in the input count file (merge-iter9-1.ngrams.gz). You need to also include -no-eos -no-sos when generating the counts (e.g., with make-batch-counts or directly with ngram-count). Andreas From shahramk at gmail.com Mon Jul 30 21:46:17 2012 From: shahramk at gmail.com (Shahram) Date: Tue, 31 Jul 2012 14:46:17 +1000 Subject: [SRILM User List] installation problem Message-ID: Hi, I have a problem installing SRILM on my linux machine. When I install it with "NO-TCL=X" it works fine, however it seems it does not install ngram and ngram-count. I have tclsh installed on my machine. SRILM installation seems to need ltcl. I actually do not know much about tcl. Are tclsh and ltcl the same? If so, how can I make the SRILM installation use tclsh instead of ltcl? -- --- Regards Shahram -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Tue Jul 31 11:15:31 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 31 Jul 2012 11:15:31 -0700 Subject: [SRILM User List] installation problem In-Reply-To: References: Message-ID: <501820C3.4040103@icsi.berkeley.edu> On 7/30/2012 9:46 PM, Shahram wrote: > Hi, > > I have a problem installing SRILM on my linux machine. > When I install it with "NO-TCL=X" it works fine, however it seems it > does not install ngram and ngram-count. > I have tclsh installed on my machine. SRILM installation seems to need > ltcl. I actually do not know much about tcl. Are tclsh and ltcl the same? To remove the dependency on -ltcl you also need to set the variable TCL_LIBRARY= (to empty) . > If so, how can I make the SRILM installation use tclsh instead of ltcl? tclsh and -ltcl are for different purposes. One is a command shell, the other a library you link your programs with . However, if tclsh is installed on your system then chances are that somewhere in /usr/lib there is a version of -ltcl . Try ls /usr/lib/libtcl*.so and if you see something like /usr/lib/libtcl8.4.so then set TCL_LIBRARY=-ltcl8.4 (and leave NO_TCL= empty). Andreas From chenmengdx at gmail.com Thu Aug 2 02:30:56 2012 From: chenmengdx at gmail.com (Meng Chen) Date: Thu, 2 Aug 2012 17:30:56 +0800 Subject: [SRILM User List] Why modified Kneser-Ney much slower than Good-Turing using make-big-lm? Message-ID: Hi, I am training LM using *make-batch-counts*, *merge-batch-counts* and * make-big-lm*. I compared the modified Kneser-Ney and Good-Turing smoothing algorithm in *make-big-lm*, and found that the training speed is much slower by modified Kneser-Ney. I checked the debug information, and found that it run *make-kn-counts* and *merge-batch-counts*, which cost most of the time. I wonder if the extra two steps could run in *make-batch-counts*, so it could save much time. Thanks! Meng CHEN -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Thu Aug 2 09:40:30 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 02 Aug 2012 09:40:30 -0700 Subject: [SRILM User List] Why modified Kneser-Ney much slower than Good-Turing using make-big-lm? 
In-Reply-To: References: Message-ID: <501AAD7E.2070708@icsi.berkeley.edu> On 8/2/2012 2:30 AM, Meng Chen wrote: > Hi, I am training LM using *make-batch-counts*, *merge-batch-counts* > and *make-big-lm*. I compared the modified Kneser-Ney and Good-Turing > smoothing algorithm in *make-big-lm*, and found that the training > speed is much slower by modified Kneser-Ney. I checked the debug > information, and found that it run *make-kn-counts* and > *merge-batch-counts*, which cost most of the time. I wonder if the > extra two steps could run in *make-batch-counts*, so it could save > much time. KN is slower because it has to first compute the regular ngram counts, then, in a second pass, make-kn-counts, which takes the merged ngram counts as input. Because the counts have to be merged first (you are counting the ngram types, not the token frequencies) you need to do it in this order. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From chenmengdx at gmail.com Fri Aug 3 03:18:37 2012 From: chenmengdx at gmail.com (Meng Chen) Date: Fri, 3 Aug 2012 18:18:37 +0800 Subject: [SRILM User List] What's the limitation to memory in make-batch-counts ? Message-ID: Hi, in *make-batch-counts*, we need to set the batch-size in order to count faster. it says "For maximum performance, batch-size should be as large as possible without triggering paging". However, sometimes I found it would crash if I set it too large (eg. 500). So I want to ask if there is any limitation to batch-size. Suppose every text in file list is *a* MB, the memory of server is *b* MB,the batch-size should not be larger than *b/a*, is it right? Or some other limitations? Thanks! Meng CHEN -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Fri Aug 3 16:39:50 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 03 Aug 2012 16:39:50 -0700 Subject: [SRILM User List] What's the limitation to memory in make-batch-counts ? In-Reply-To: References: Message-ID: <501C6146.5040506@icsi.berkeley.edu> On 8/3/2012 3:18 AM, Meng Chen wrote: > Hi, in *make-batch-counts*, we need to set the batch-size in order to > count faster. it says "For maximum performance, batch-size should be > as large as possible without triggering paging". However, sometimes I > found it would crash if I set it too large (eg. 500). So I want to ask > if there is any limitation to batch-size. Suppose every text in file > list is *a* MB, the memory of server is *b* MB,the batch-size should > not be larger than *b/a*, is it right? Or some other limitations? make-batch-counts actually works sequentially, so you can devote all of a machine's memory to computing counts, unless you have other things running. If you want to parallelize the counting you have to devise your own method for that. Of course in general there other things running on a machine, and some systems start randomly killing processes when you exhaust their memory. I suspect that's what is happening in your case. There is no built-in limitation in make-batch-counts, other than the limits imposed by the system. Another reason your job might have crashed is that you are using 32bit binaries and you were hitting against the 2 or 4 GB limit inherent in 32bit memory addresses. Andreas -------------- next part -------------- An HTML attachment was scrubbed... 
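For reference, the batch-counting pipeline discussed in this thread looks roughly as follows. This is only a sketch: filelist.txt, counts and big.lm are placeholder names, the batch size of 50 and the cat filter are arbitrary choices, and the exact arguments are described in the SRILM training-scripts man page.

  make-batch-counts filelist.txt 50 cat counts -order 3
  merge-batch-counts counts
  make-big-lm -read counts/MERGED-COUNTS.gz -name biglm -order 3 -kndiscount -lm big.lm

merge-batch-counts reports the name of the final merged count file; that file (MERGED-COUNTS.gz above is just a stand-in) is what gets passed to make-big-lm -read.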
URL: From saraelec at yahoo.com Sun Aug 5 16:02:17 2012 From: saraelec at yahoo.com (sara) Date: Sun, 5 Aug 2012 16:02:17 -0700 (PDT) Subject: [SRILM User List] ngram command not found Message-ID: <1344207737.29305.YahooMailClassic@web162301.mail.bf1.yahoo.com> Hi, I complied SRILM in Linux and? got this message : "ngram command not found" . Please help me why i got this error and what I shoud do? Thanks, Sara -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Sun Aug 5 17:47:39 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sun, 05 Aug 2012 17:47:39 -0700 Subject: [SRILM User List] ngram command not found In-Reply-To: <1344207737.29305.YahooMailClassic@web162301.mail.bf1.yahoo.com> References: <1344207737.29305.YahooMailClassic@web162301.mail.bf1.yahoo.com> Message-ID: <501F142B.2010401@icsi.berkeley.edu> On 8/5/2012 4:02 PM, sara wrote: > Hi, > > I complied SRILM in Linux and got this message : "ngram command not > found" . Please help me why i got this error and what I shoud do? > Go through the first question in the FAQ item (A1) and check each possible problem described there. http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From saraelec at yahoo.com Mon Aug 6 12:13:51 2012 From: saraelec at yahoo.com (sara) Date: Mon, 6 Aug 2012 12:13:51 -0700 (PDT) Subject: [SRILM User List] complie on 32-bit system Message-ID: <1344280431.50069.YahooMailClassic@web162301.mail.bf1.yahoo.com> Hi, How Can I compile SRILM on 32-bit system? Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From saraelec at yahoo.com Mon Aug 6 17:35:23 2012 From: saraelec at yahoo.com (sara) Date: Mon, 6 Aug 2012 17:35:23 -0700 (PDT) Subject: [SRILM User List] clmain.cc:8:17: error Message-ID: <1344299723.30666.YahooMailClassic@web162303.mail.bf1.yahoo.com> Hi I compile SRILM in Linux but I got these errors: tclmain.cc:8:17: error: tcl.h: No such file or directory make[2]: *** [../obj/i686/tclmain.o] Error 1 make[2]: Leaving directory `/root/Desktop/srilm/misc/src' make[1]: *** [release-libraries] Error 1 make[1]: Leaving directory `/root/Desktop/srilm' make: *** [World] Error 2 Please help! -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Mon Aug 6 21:10:19 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Mon, 06 Aug 2012 21:10:19 -0700 Subject: [SRILM User List] clmain.cc:8:17: error In-Reply-To: <1344299723.30666.YahooMailClassic@web162303.mail.bf1.yahoo.com> References: <1344299723.30666.YahooMailClassic@web162303.mail.bf1.yahoo.com> Message-ID: <5020952B.4010508@icsi.berkeley.edu> On 8/6/2012 5:35 PM, sara wrote: > Hi I compile SRILM in Linux but I got these errors: > > tclmain.cc:8:17: error: tcl.h: No such file or directory > make[2]: *** [../obj/i686/tclmain.o] Error 1 > make[2]: Leaving directory `/root/Desktop/srilm/misc/src' > make[1]: *** [release-libraries] Error 1 > make[1]: Leaving directory `/root/Desktop/srilm' > make: *** [World] Error 2 > Look in the FAQ file http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html and search for "tcl" to find your answer. Andreas -------------- next part -------------- An HTML attachment was scrubbed... 
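For the tcl.h and -ltcl problems in this thread, the workaround pointed to by the FAQ is, roughly, to disable Tcl support before building. The following is only a sketch assembled from the options mentioned in these messages (adjust the machine type to your own platform): edit common/Makefile.machine.<machine-type> and set

  NO_TCL = X
  TCL_INCLUDE =
  TCL_LIBRARY =

then rebuild from the top-level SRILM directory with "make World".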
URL: From ayse.serbetci at hotmail.com Mon Aug 6 23:53:26 2012 From: ayse.serbetci at hotmail.com (=?windows-1254?B?QXn+ZSDeZXJiZXTnaQ==?=) Date: Tue, 7 Aug 2012 09:53:26 +0300 Subject: [SRILM User List] build problem, nothing under bin directory In-Reply-To: References: Message-ID: Hi, I am trying to build SRILM on cygwin installed on a Windows 7 environment. When I run the makefile I obtain .h and .cc files under /include, .a files under /lib/cygwin but nothing under /bin. My gcc version : 4.5.3 Machine type : CYGWIN_NT-6.1 Make file output is as follows. Any help is really appreciated. Thanks in advance, -- Ayse mkdir -p include lib bin make init make[1]: Entering directory `/home/aserbetci/srilm' for subdir in misc dstruct lm flm lattice utils; do \ (cd $subdir/src; make SRILM=/home/aserbetci/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= init) || exit 1; \ done make[2]: Entering directory `/home/aserbetci/srilm/misc/src' cd ..; /home/aserbetci/srilm/sbin/make-standard-directories make ../obj/cygwin/STAMP ../bin/cygwin/STAMP make[3]: Entering directory `/home/aserbetci/srilm/misc/src' make[3]: `../obj/cygwin/STAMP' is up to date. make[3]: `../bin/cygwin/STAMP' is up to date. make[3]: Leaving directory `/home/aserbetci/srilm/misc/src' make[2]: Leaving directory `/home/aserbetci/srilm/misc/src' make[2]: Entering directory `/home/aserbetci/srilm/dstruct/src' cd ..; /home/aserbetci/srilm/sbin/make-standard-directories make ../obj/cygwin/STAMP ../bin/cygwin/STAMP make[3]: Entering directory `/home/aserbetci/srilm/dstruct/src' make[3]: `../obj/cygwin/STAMP' is up to date. make[3]: `../bin/cygwin/STAMP' is up to date. make[3]: Leaving directory `/home/aserbetci/srilm/dstruct/src' make[2]: Leaving directory `/home/aserbetci/srilm/dstruct/src' make[2]: Entering directory `/home/aserbetci/srilm/lm/src' cd ..; /home/aserbetci/srilm/sbin/make-standard-directories make ../obj/cygwin/STAMP ../bin/cygwin/STAMP make[3]: Entering directory `/home/aserbetci/srilm/lm/src' make[3]: `../obj/cygwin/STAMP' is up to date. make[3]: `../bin/cygwin/STAMP' is up to date. make[3]: Leaving directory `/home/aserbetci/srilm/lm/src' make[2]: Leaving directory `/home/aserbetci/srilm/lm/src' make[2]: Entering directory `/home/aserbetci/srilm/flm/src' cd ..; /home/aserbetci/srilm/sbin/make-standard-directories make ../obj/cygwin/STAMP ../bin/cygwin/STAMP make[3]: Entering directory `/home/aserbetci/srilm/flm/src' make[3]: `../obj/cygwin/STAMP' is up to date. make[3]: `../bin/cygwin/STAMP' is up to date. make[3]: Leaving directory `/home/aserbetci/srilm/flm/src' make[2]: Leaving directory `/home/aserbetci/srilm/flm/src' make[2]: Entering directory `/home/aserbetci/srilm/lattice/src' cd ..; /home/aserbetci/srilm/sbin/make-standard-directories make ../obj/cygwin/STAMP ../bin/cygwin/STAMP make[3]: Entering directory `/home/aserbetci/srilm/lattice/src' make[3]: `../obj/cygwin/STAMP' is up to date. make[3]: `../bin/cygwin/STAMP' is up to date. make[3]: Leaving directory `/home/aserbetci/srilm/lattice/src' make[2]: Leaving directory `/home/aserbetci/srilm/lattice/src' make[2]: Entering directory `/home/aserbetci/srilm/utils/src' cd ..; /home/aserbetci/srilm/sbin/make-standard-directories make ../obj/cygwin/STAMP ../bin/cygwin/STAMP make[3]: Entering directory `/home/aserbetci/srilm/utils/src' make[3]: `../obj/cygwin/STAMP' is up to date. make[3]: `../bin/cygwin/STAMP' is up to date. 
make[3]: Leaving directory `/home/aserbetci/srilm/utils/src' make[2]: Leaving directory `/home/aserbetci/srilm/utils/src' make[1]: Leaving directory `/home/aserbetci/srilm' make release-headers make[1]: Entering directory `/home/aserbetci/srilm' for subdir in misc dstruct lm flm lattice utils; do \ (cd $subdir/src; make SRILM=/home/aserbetci/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= release-headers) || exit 1; \ done make[2]: Entering directory `/home/aserbetci/srilm/misc/src' make[2]: Nothing to be done for `release-headers'. make[2]: Leaving directory `/home/aserbetci/srilm/misc/src' make[2]: Entering directory `/home/aserbetci/srilm/dstruct/src' make[2]: Nothing to be done for `release-headers'. make[2]: Leaving directory `/home/aserbetci/srilm/dstruct/src' make[2]: Entering directory `/home/aserbetci/srilm/lm/src' make[2]: Nothing to be done for `release-headers'. make[2]: Leaving directory `/home/aserbetci/srilm/lm/src' make[2]: Entering directory `/home/aserbetci/srilm/flm/src' make[2]: Nothing to be done for `release-headers'. make[2]: Leaving directory `/home/aserbetci/srilm/flm/src' make[2]: Entering directory `/home/aserbetci/srilm/lattice/src' make[2]: Nothing to be done for `release-headers'. make[2]: Leaving directory `/home/aserbetci/srilm/lattice/src' make[2]: Entering directory `/home/aserbetci/srilm/utils/src' make[2]: Nothing to be done for `release-headers'. make[2]: Leaving directory `/home/aserbetci/srilm/utils/src' make[1]: Leaving directory `/home/aserbetci/srilm' make depend make[1]: Entering directory `/home/aserbetci/srilm' for subdir in misc dstruct lm flm lattice utils; do \ (cd $subdir/src; make SRILM=/home/aserbetci/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= depend) || exit 1; \ done make[2]: Entering directory `/home/aserbetci/srilm/misc/src' rm -f Dependencies.cygwin gcc -Wall -Wno-unused-variable -Wno-uninitialized -Wimplicit-int -I. -I../../include -MM ./option.c ./zio.c ./fcheck.c ./fake-rand48.c ./version.c ./ztest.c | sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -MM ./Debug.cc ./File.cc ./MStringTokUtil.cc ./tclmain.cc ./testFile.cc | sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin /home/aserbetci/srilm/sbin/generate-program-dependencies ../bin/cygwin ../obj/cygwin ".exe" ztest testFile | sed -e "s&\.o&.o&g" >> Dependencies.cygwin make[2]: Leaving directory `/home/aserbetci/srilm/misc/src' make[2]: Entering directory `/home/aserbetci/srilm/dstruct/src' rm -f Dependencies.cygwin gcc -Wall -Wno-unused-variable -Wno-uninitialized -Wimplicit-int -I. -I../../include -MM ./qsort.c ./BlockMalloc.c ./maxalloc.c | sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. 
-I../../include -MM ./MemStats.cc ./LHashTrie.cc ./SArrayTrie.cc ./Array.cc ./IntervalHeap.cc ./Map.cc ./SArray.cc ./LHash.cc ./Map2.cc ./Trie.cc ./CachedMem.cc ./testArray.cc ./testMap.cc ./benchHash.cc ./testHash.cc ./testSizes.cc ./testCachedMem.cc ./testBlockMalloc.cc ./testMap2.cc ./testTrie.cc | sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin /home/aserbetci/srilm/sbin/generate-program-dependencies ../bin/cygwin ../obj/cygwin ".exe" maxalloc testArray testMap benchHash testHash testSizes testCachedMem testBlockMalloc testMap2 testTrie | sed -e "s&\.o&.o&g" >> Dependencies.cygwin make[2]: Leaving directory `/home/aserbetci/srilm/dstruct/src' make[2]: Entering directory `/home/aserbetci/srilm/lm/src' rm -f Dependencies.cygwin gcc -Wall -Wno-unused-variable -Wno-uninitialized -Wimplicit-int -I. -I../../include -MM ./matherr.c | sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -MM ./Prob.cc ./Counts.cc ./XCount.cc ./Vocab.cc ./VocabMap.cc ./VocabMultiMap.cc ./VocabDistance.cc ./SubVocab.cc ./MultiwordVocab.cc ./TextStats.cc ./LM.cc ./LMClient.cc ./LMStats.cc ./RefList.cc ./Bleu.cc ./NBest.cc ./NBestSet.cc ./NgramLM.cc ./NgramStatsInt.cc ./NgramStatsShort.cc ./NgramStatsLong.cc ./NgramStatsLongLong.cc ./NgramStatsFloat.cc ./NgramStatsDouble.cc ./NgramStatsXCount.cc ./NgramCountLM.cc ./Discount.cc ./ClassNgram.cc ./SimpleClassNgram.cc ./DFNgram.cc ./SkipNgram.cc ./HiddenNgram.cc ./HiddenSNgram.cc ./VarNgram.cc ./DecipherNgram.cc ./TaggedVocab.cc ./TaggedNgram.cc ./TaggedNgramStats.cc ./StopNgram.cc ./StopNgramStats.cc ./MultiwordLM.cc ./NonzeroLM.cc ./BayesMix.cc ./LoglinearMix.cc ./AdaptiveMix.cc ./AdaptiveMarginals.cc ./CacheLM.cc ./DynamicLM.cc ./HMMofNgrams.cc ./WordAlign.cc ./WordLattice.cc ./WordMesh.cc ./simpleTrigram.cc ./NgramStats.cc ./Trellis.cc ./testBinaryCounts.cc ./testHash.cc ./testProb.cc ./testXCount.cc ./testParseFloat.cc ./testVocabDistance.cc ./testNgram.cc ./testNgramAlloc.cc ./testMultiReadLM.cc ./hoeffding.cc ./tolower.cc ./testLattice.cc ./testError.cc ./testNBest.cc ./testMix.cc ./testTaggedVocab.cc ./testVocab.cc ./ngram.cc ./ngram-count.cc ./ngram-merge.cc ./ngram-class.cc ./disambig.cc ./anti-ngram.cc ./nbest-lattice.cc ./nbest-mix.cc ./nbest-optimize.cc ./nbest-pron-score.cc ./segment.cc ./segment-nbest.cc ./hidden-ngram.cc ./multi-ngram.cc | sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin /home/aserbetci/srilm/sbin/generate-program-dependencies ../bin/cygwin ../obj/cygwin ".exe" testBinaryCounts testHash testProb testXCount testParseFloat testVocabDistance testNgram testNgramAlloc testMultiReadLM hoeffding tolower testLattice testError testNBest testMix testTaggedVocab testVocab ngram ngram-count ngram-merge ngram-class disambig anti-ngram nbest-lattice nbest-mix nbest-optimize nbest-pron-score segment segment-nbest hidden-ngram multi-ngram | sed -e "s&\.o&.o&g" >> Dependencies.cygwin make[2]: Leaving directory `/home/aserbetci/srilm/lm/src' make[2]: Entering directory `/home/aserbetci/srilm/flm/src' rm -f Dependencies.cygwin g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. 
-I../../include -MM ./FDiscount.cc ./FNgramStats.cc ./FNgramStatsInt.cc ./FNgramSpecs.cc ./FNgramSpecsInt.cc ./FactoredVocab.cc ./FNgramLM.cc ./ProductVocab.cc ./ProductNgram.cc ./wmatrix.cc ./pngram.cc ./fngram-count.cc ./fngram.cc | sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin /home/aserbetci/srilm/sbin/generate-program-dependencies ../bin/cygwin ../obj/cygwin ".exe" pngram fngram-count fngram | sed -e "s&\.o&.o&g" >> Dependencies.cygwin make[2]: Leaving directory `/home/aserbetci/srilm/flm/src' make[2]: Entering directory `/home/aserbetci/srilm/lattice/src' rm -f Dependencies.cygwin g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -MM ./Lattice.cc ./LatticeAlign.cc ./LatticeExpand.cc ./LatticeIndex.cc ./LatticeNBest.cc ./LatticeNgrams.cc ./LatticeReduce.cc ./HTKLattice.cc ./LatticeLM.cc ./LatticeDecode.cc ./testLattice.cc ./lattice-tool.cc | sed -e "s&^\([^ ]\)&../obj/cygwin"'$(OBJ_OPTION)'"/\1&g" -e "s&\.o&.o&g" >> Dependencies.cygwin /home/aserbetci/srilm/sbin/generate-program-dependencies ../bin/cygwin ../obj/cygwin ".exe" testLattice lattice-tool | sed -e "s&\.o&.o&g" >> Dependencies.cygwin make[2]: Leaving directory `/home/aserbetci/srilm/lattice/src' make[2]: Entering directory `/home/aserbetci/srilm/utils/src' rm -f Dependencies.cygwin /home/aserbetci/srilm/sbin/generate-program-dependencies ../bin/cygwin ../obj/cygwin ".exe" | sed -e "s&\.o&.o&g" >> Dependencies.cygwin make[2]: Leaving directory `/home/aserbetci/srilm/utils/src' make[1]: Leaving directory `/home/aserbetci/srilm' make release-libraries make[1]: Entering directory `/home/aserbetci/srilm' for subdir in misc dstruct lm flm lattice utils; do \ (cd $subdir/src; make SRILM=/home/aserbetci/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= release-libraries) || exit 1; \ done make[2]: Entering directory `/home/aserbetci/srilm/misc/src' make[2]: Nothing to be done for `release-libraries'. make[2]: Leaving directory `/home/aserbetci/srilm/misc/src' make[2]: Entering directory `/home/aserbetci/srilm/dstruct/src' make[2]: Nothing to be done for `release-libraries'. make[2]: Leaving directory `/home/aserbetci/srilm/dstruct/src' make[2]: Entering directory `/home/aserbetci/srilm/lm/src' make[2]: Nothing to be done for `release-libraries'. make[2]: Leaving directory `/home/aserbetci/srilm/lm/src' make[2]: Entering directory `/home/aserbetci/srilm/flm/src' make[2]: Nothing to be done for `release-libraries'. make[2]: Leaving directory `/home/aserbetci/srilm/flm/src' make[2]: Entering directory `/home/aserbetci/srilm/lattice/src' make[2]: Nothing to be done for `release-libraries'. make[2]: Leaving directory `/home/aserbetci/srilm/lattice/src' make[2]: Entering directory `/home/aserbetci/srilm/utils/src' make[2]: Nothing to be done for `release-libraries'. make[2]: Leaving directory `/home/aserbetci/srilm/utils/src' make[1]: Leaving directory `/home/aserbetci/srilm' make release-programs make[1]: Entering directory `/home/aserbetci/srilm' for subdir in misc dstruct lm flm lattice utils; do \ (cd $subdir/src; make SRILM=/home/aserbetci/srilm MACHINE_TYPE=cygwin OPTION= MAKE_PIC= release-programs) || exit 1; \ done make[2]: Entering directory `/home/aserbetci/srilm/misc/src' make[2]: Nothing to be done for `release-programs'. make[2]: Leaving directory `/home/aserbetci/srilm/misc/src' make[2]: Entering directory `/home/aserbetci/srilm/dstruct/src' g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. 
-I../../include -L../../lib/cygwin -g -O2 -o ../bin/cygwin/maxalloc.exe ../obj/cygwin/maxalloc.o ../obj/cygwin/libdstruct.a -lm ../../lib/cygwin/libmisc.a -ltcl84 -lm /usr/lib/gcc/i686-pc-cygwin/4.5.3/../../../../i686-pc-cygwin/bin/ld: cannot find -ltcl84 collect2: ld returned 1 exit status /home/aserbetci/srilm/common/Makefile.common.targets:108: recipe for target `../bin/cygwin/maxalloc.exe' failed make[2]: *** [../bin/cygwin/maxalloc.exe] Error 1 make[2]: Leaving directory `/home/aserbetci/srilm/dstruct/src' Makefile:105: recipe for target `release-programs' failed make[1]: *** [release-programs] Error 1 make[1]: Leaving directory `/home/aserbetci/srilm' Makefile:54: recipe for target `World' failed make: *** [World] Error 2 -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Tue Aug 7 09:31:44 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 07 Aug 2012 09:31:44 -0700 Subject: [SRILM User List] build problem, nothing under bin directory In-Reply-To: References: Message-ID: <502142F0.2070805@icsi.berkeley.edu> On 8/6/2012 11:53 PM, Ay?e ?erbet?i wrote: > > Hi, > > I am trying to build SRILM on cygwin installed on a Windows 7 environment. > > When I run the makefile I obtain .h and .cc files under /include, .a > files under /lib/cygwin but nothing under /bin. > > My gcc version : 4.5.3 > > Machine type : CYGWIN_NT-6.1 > Check the first question and list of remedies in the FAQ file! http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From shinichiro.hamada at gmail.com Tue Aug 7 10:32:43 2012 From: shinichiro.hamada at gmail.com (shinichiro.hamada) Date: Wed, 8 Aug 2012 02:32:43 +0900 Subject: [SRILM User List] WBDiscount backoff weights Message-ID: <545624FF362B4E4F9FEC494EFE119B27@f91> Hi. I did a small test described as below to understand SRILM behavior of WBDiscount backoff weights (bow), and got a question. The values of bows of " context", "context word1", "context word2" (2grams) are set to zero. Why? They are the prefix of " context word1" (or " context word2"), "context word1 ", "context word2 " respetively, so I think they are qualified to have bow values. I read the explanation of WBDiscount and "Warning5" in the ngram-discount manual (*1), but I couln't get it's answer. Any advices will help me very much. Thank you. 
(*1) ngram-discount manual http://www-speech.sri.com/projects/srilm/manpages/ngram-discount.7.html ---------------------------------------------------------------------- $ cat > smp.txt << EOF context word1 context word2 EOF $ ngram-count -order 3 -wbdiscount -text smp.txt -gtmin 0 -gt1min0 -gt2min 0 -gt3min 0 -lm lm.arpa $ cat lm.arpa \data\ ngram 1=5 ngram 2=5 ngram 3=4 \1-grams: -0.5228788 -99 -0.3222193 -0.5228788 context -0.07918124 -0.69897 word1 -0.146128 -0.69897 word2 -0.146128 \2-grams: -0.1760913 context 0 -0.60206 context word1 0 -0.60206 context word2 0 -0.30103 word1 -0.30103 word2 \3-grams: -0.60206 context word1 -0.60206 context word2 -0.30103 context word1 -0.30103 context word2 \end\ -- Shinichiro Hamada From stolcke at icsi.berkeley.edu Tue Aug 7 11:14:55 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 07 Aug 2012 11:14:55 -0700 Subject: [SRILM User List] WBDiscount backoff weights In-Reply-To: <545624FF362B4E4F9FEC494EFE119B27@f91> References: <545624FF362B4E4F9FEC494EFE119B27@f91> Message-ID: <50215B1F.4080403@icsi.berkeley.edu> On 8/7/2012 10:32 AM, shinichiro.hamada wrote: > Hi. > > I did a small test described as below to understand SRILM behavior > of WBDiscount backoff weights (bow), and got a question. > > The values of bows of " context", "context word1", "context > word2" (2grams) are set to zero. Why? > > They are the prefix of " context word1" (or " context word2"), > "context word1 ", "context word2 " respetively, so I think > they are qualified to have bow values. > > I read the explanation of WBDiscount and "Warning5" in the > ngram-discount manual (*1), but I couln't get it's answer. > > Any advices will help me very much. Thank you. Backoff log weight zero (= 1 in the probability domain) means that the bigram probs don't need to be modified when used for backoff purposes. This is because, in your example, the probability mass left over from the explicit trigrams is the same as the probability mass of the corresponding bigrams. And this, in turn, is because your trigrams -0.60206 context word1 -0.60206 context word2 have the same probabilities as the corresponding bigrams: -0.60206 context word1 0 -0.60206 context word2 0 So there is nothing mysterious going on, it just happens to follow from the bigram and trigrams in your data. You will not likely find this situation in realistic data sets. 
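As a minimal sketch of that arithmetic (in Python, assuming the standard ARPA backoff-weight formula bow(h) = (1 - sum of the explicit higher-order probabilities for history h) / (1 - sum of the same words' lower-order probabilities), with the quoted log10 value -0.60206, i.e. 0.25, plugged in):

p_tri = [10 ** -0.60206, 10 ** -0.60206]   # the two explicit trigram probabilities above (0.25 each)
p_bi  = [10 ** -0.60206, 10 ** -0.60206]   # the matching bigram probabilities (0.25 each)
bow = (1 - sum(p_tri)) / (1 - sum(p_bi))   # left-over trigram mass / left-over bigram mass
print(bow)                                 # 1.0, i.e. log10 bow = 0, the "0" printed in the 2-grams section

The same cancellation happens for the other two bigram histories in the question, which is why all three backoff weights come out as 0.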
Andreas > > (*1) ngram-discount manual > http://www-speech.sri.com/projects/srilm/manpages/ngram-discount.7.html > > > ---------------------------------------------------------------------- > $ cat > smp.txt << EOF > context word1 > context word2 > EOF > $ ngram-count -order 3 -wbdiscount -text smp.txt -gtmin 0 -gt1min0 -gt2min 0 > -gt3min 0 -lm lm.arpa > $ cat lm.arpa > > \data\ > ngram 1=5 > ngram 2=5 > ngram 3=4 > > \1-grams: > -0.5228788 > -99 -0.3222193 > -0.5228788 context -0.07918124 > -0.69897 word1 -0.146128 > -0.69897 word2 -0.146128 > > \2-grams: > -0.1760913 context 0 > -0.60206 context word1 0 > -0.60206 context word2 0 > -0.30103 word1 > -0.30103 word2 > > \3-grams: > -0.60206 context word1 > -0.60206 context word2 > -0.30103 context word1 > -0.30103 context word2 > > \end\ > > -- > Shinichiro Hamada > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From chenmengdx at gmail.com Wed Aug 8 03:31:03 2012 From: chenmengdx at gmail.com (Meng Chen) Date: Wed, 8 Aug 2012 18:31:03 +0800 Subject: [SRILM User List] Question about -prune-lowprobs and -text-has-weights Message-ID: Hi, the* -prune-lowprobs* option in* ngram* will "prune N-gram probabilities that are lower than the corresponding backed-off estimates". This option would be useful especially when the back-off-weight (bow) value is positive. However, I want to ask if I could simply replace the positive bow value with 0 instead of using prune-lowprobs. Are there any differences? Or replace simply is not correct? Another question: When training LM, we could use* -text-has-weights* option for the corpus with sentence frequency. I want to ask what we should do with the*duplicated sentences * in large corpus. Should I delete the duplicated sentences? Or should I calculate the sentence frequency first and use the -text-has-weights option instead? Or do nothing, just throw all the corpus into training? Thanks! Meng CHEN -------------- next part -------------- An HTML attachment was scrubbed... URL: From shinichiro.hamada at gmail.com Wed Aug 8 08:00:42 2012 From: shinichiro.hamada at gmail.com (shinichiro.hamada) Date: Thu, 9 Aug 2012 00:00:42 +0900 Subject: [SRILM User List] WBDiscount backoff weights In-Reply-To: <50215B1F.4080403@icsi.berkeley.edu> References: <545624FF362B4E4F9FEC494EFE119B27@f91> <50215B1F.4080403@icsi.berkeley.edu> Message-ID: <42E86E835B3F497FA4E68890F183E435@f91> Dear Mr. Stolcke, I understood very well owning to your detail explanation with concrete examples. Thank you for always being so kind. Best Regards, Shinichiro > -----Original Message----- > From: Andreas Stolcke [mailto:stolcke at icsi.berkeley.edu] > Sent: Wednesday, August 08, 2012 3:15 AM > To: shinichiro.hamada > Cc: srilm-user at speech.sri.com > Subject: Re: [SRILM User List] WBDiscount backoff weights > > On 8/7/2012 10:32 AM, shinichiro.hamada wrote: > > Hi. > > > > I did a small test described as below to understand SRILM > > behavior of WBDiscount backoff weights (bow), and got a question. > > > > The values of bows of " context", "context word1", "context > > word2" (2grams) are set to zero. Why? > > > > They are the prefix of " context word1" (or " context > > word2"), "context word1 ", "context word2 " respetively, > > so I think they are qualified to have bow values. > > > > I read the explanation of WBDiscount and "Warning5" in the > > ngram-discount manual (*1), but I couln't get it's answer. 
> > > > Any advices will help me very much. Thank you. > > Backoff log weight zero (= 1 in the probability domain) means that > the bigram probs don't need to be modified when used for backoff > purposes. > This is because, in your example, the probability mass left over > from the explicit trigrams is the same as the probability mass of > the corresponding bigrams. And this, in turn, is because your > trigrams > > -0.60206 context word1 > -0.60206 context word2 > > have the same probabilities as the corresponding bigrams: > > -0.60206 context word1 0 > -0.60206 context word2 0 > > So there is nothing mysterious going on, it just happens to follow > from the bigram and trigrams in your data. You will not likely > find this situation in realistic data sets. > > Andreas > > > > > > > > (*1) ngram-discount manual > > http://www-speech.sri.com/projects/srilm/manpages/ngram-discount. > > 7.html > > > > > > ----------------------------------------------------------------- > > $ cat > smp.txt << EOF > > context word1 > > context word2 > > EOF > > $ ngram-count -order 3 -wbdiscount -text smp.txt -gtmin 0 > > -gt1min0 -gt2min 0 -gt3min 0 -lm lm.arpa $ cat lm.arpa > > > > \data\ > > ngram 1=5 > > ngram 2=5 > > ngram 3=4 > > > > \1-grams: > > -0.5228788 > > -99 -0.3222193 > > -0.5228788 context -0.07918124 > > -0.69897 word1 -0.146128 > > -0.69897 word2 -0.146128 > > > > \2-grams: > > -0.1760913 context 0 > > -0.60206 context word1 0 > > -0.60206 context word2 0 > > -0.30103 word1 > > -0.30103 word2 > > > > \3-grams: > > -0.60206 context word1 > > -0.60206 context word2 > > -0.30103 context word1 > > -0.30103 context word2 > > > > \end\ > > > > -- > > Shinichiro Hamada From stolcke at icsi.berkeley.edu Wed Aug 8 11:57:27 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 08 Aug 2012 11:57:27 -0700 Subject: [SRILM User List] Question about -prune-lowprobs and -text-has-weights In-Reply-To: References: Message-ID: <5022B697.1080908@icsi.berkeley.edu> On 8/8/2012 3:31 AM, Meng Chen wrote: > Hi, the*-prune-lowprobs* option in*ngram* will "prune N-gram > probabilities that are lower than the corresponding backed-off > estimates". This option would be useful especially when the > back-off-weight (bow) value is positive. However, I want to ask if I > could simply replace the positive bow value with 0 instead of using > prune-lowprobs. Are there any differences? Or replace simply is not > correct? It's not correct. If you modify the backoff weight you end up with an LM that is no longer normalized (word probs for a given context don't sum to 1). > > Another question: > When training LM, we could use*-text-has-weights* option for the > corpus with sentence frequency. I want to ask what we should do with > the*duplicated sentences* in large corpus. Should I delete the > duplicated sentences? Or should I calculate the sentence frequency > first and use the -text-has-weights option instead? Or do nothing, > just throw all the corpus into training? You can do either. Have a duplicated sentence 1.0 a b c 1.0 a b c is equivalent to having the sentence once with added weights: 2.0 a b c Andreas -------------- next part -------------- An HTML attachment was scrubbed... 
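The normalization argument can be made concrete with a small Python sketch, assuming the standard ARPA backoff scheme p(w|h) = p_explicit(w|h) if the n-gram is listed, and bow(h) * p(w|h') otherwise; the two mass values below are made-up numbers chosen only for illustration:

# For every history h:  sum_listed p(w|h) + bow(h) * (1 - sum_listed p(w|h')) = 1
explicit_mass = 0.30   # hypothetical sum of the listed p(w|h)
lower_mass    = 0.50   # hypothetical sum of the same words' lower-order p(w|h')
bow = (1 - explicit_mass) / (1 - lower_mass)      # 1.4, i.e. a positive log10 backoff weight
print(explicit_mass + bow * (1 - lower_mass))     # 1.0: properly normalized
print(explicit_mass + 1.0 * (1 - lower_mass))     # 0.8: mass is lost if bow is simply forced to 1

-prune-lowprobs avoids this by removing the explicit entries that fall below their backed-off estimates and then, as far as I understand, recomputing the backoff weights so the sums return to 1.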
URL: From stolcke at icsi.berkeley.edu Wed Aug 8 22:09:35 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 08 Aug 2012 22:09:35 -0700 Subject: [SRILM User List] Predicting words In-Reply-To: <1342785895.7010.YahooMailNeo@web171304.mail.ir2.yahoo.com> References: <1342785895.7010.YahooMailNeo@web171304.mail.ir2.yahoo.com> Message-ID: <5023460F.5050301@icsi.berkeley.edu> On 7/20/2012 5:04 AM, Nouf Al-Harbi wrote: > Hello, > > I am new to language modeling and was hoping that someone can help me > with the following. > > I try to predict a word given an input sentence. For example, I would > like to get a word replacing the ... that has the > highest probability in sentences such as ' A man is ...' (e.g. sitting). > > I try to use disambig tool but I couldn't found any example illustrate > how to use it especially how how I can create the map file and what is > the type of this file ( e.g. txt, arpa, ...). Indeed you can use disambig, at least in theory to solve this problem. 1. prepare a map file of the form: a a man man ... [for all words occurring in your data] UNKNOWN_WORD word1 word2 .... [list all words in the vocabulary here] 2. train an LM of word sequences. 3. prepare disambig input of the form a man is sitting UNKNOWN_WORD You can also add known words to the right of UKNOWN_WORD if you have that information (see the note about -fw-only below). 4. run disambig disambig -map MAPFILE -lm LMFILE -text INPUTFILE If you want to use only the left context of the UNKNOWN_WORD use the -fw-only option. This is in theory. If your vocabulary is large it may be very slow and take too much memory. I haven't tried it, so let me know if it works for you. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From chenmengdx at gmail.com Thu Aug 16 04:07:13 2012 From: chenmengdx at gmail.com (Meng Chen) Date: Thu, 16 Aug 2012 19:07:13 +0800 Subject: [SRILM User List] How to interpolate big LMs? Message-ID: Hi, suppose I have trained three big LMs: LM1 LM2 and LM3, each of which has more than billions of ngrams. I wonder to know how to interpolate such big LMs together. I found that the ngram command in SRILM would load all the LMs in memory firstly, so it will reach the limitation of server's memory. In such situation, how can I get the interpolation of big LMs? Another question about training LM with large corpus. There are two methods: 1) I can pool all data to train a big LM0. 2) I can split the data into several parts, and train small LMs (eg. LM1 and LM2). Then interpolate them with average weight (eg. 0.5 X LM1 + 0.5 X LM2 ) to get the final LM3. All the cut-offs and smoothing algorithm are the same for both methods. So does LM3 the same with LM0? Thanks! Meng CHRN -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Thu Aug 16 11:06:55 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 16 Aug 2012 11:06:55 -0700 Subject: [SRILM User List] How to interpolate big LMs? In-Reply-To: References: Message-ID: <502D36BF.6080804@icsi.berkeley.edu> On 8/16/2012 4:07 AM, Meng Chen wrote: > Hi, suppose I have trained three big LMs: LM1 LM2 and LM3, each of > which has more than billions of ngrams. I wonder to know how to > interpolate such big LMs together. I found that the ngram command in > SRILM would load all the LMs in memory firstly, so it will reach the > limitation of server's memory. In such situation, how can I get the > interpolation of big LMs? 
> > Another question about training LM with large corpus. There are two > methods: > 1) I can pool all data to train a big LM0. > 2) I can split the data into several parts, and train small LMs (eg. > LM1 and LM2). Then interpolate them with average weight (eg. 0.5 X LM1 > + 0.5 X LM2 ) to get the final LM3. > All the cut-offs and smoothing algorithm are the same for both > methods. So does LM3 the same with LM0? > > I'm assuming you are merging ngram LMs into one big LM (-mix-lm etc. WITHOUT the -bayes option). In that case the LMs are merged destructively into the first LM, one by one. This means at any given time only the partially merged LM and the next LM to be merged are kept in memory. So when you're running ngram -lm LM1 -mix-lm LM2 -mix-lm2 LM3 it is NOT the case that LM1, LM2 and LM3 are in memory at the same time. Instead, the result of merging LM1 and LM2, plus LM3 need to fit into memory. Of course, depending on how much overlap in ngrams there is, that might be almost the same in terms of total memory. Try building your binaries with OPTION=_c (compact memory). Also, try using the latest beta version off the web site. It contains an optimized memory allocator that leads to significant memory savings. Finally, if all else fails, prune your large component LMs prior to merging. Andreas From stolcke at icsi.berkeley.edu Thu Aug 16 11:30:28 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Thu, 16 Aug 2012 11:30:28 -0700 Subject: [SRILM User List] How to interpolate big LMs? In-Reply-To: <502D36BF.6080804@icsi.berkeley.edu> References: <502D36BF.6080804@icsi.berkeley.edu> Message-ID: <502D3C44.3090903@icsi.berkeley.edu> On 8/16/2012 11:06 AM, Andreas Stolcke wrote: > > > Try building your binaries with OPTION=_c (compact memory). Also, try > using the latest beta version off the web site. It contains an > optimized memory allocator that leads to significant memory savings. > Finally, if all else fails, prune your large component LMs prior to > merging. Correction: the improved memory allocator is already in the 1.6.0 release, which is the current stable release. But do make sure you have that version, and not some older one. Andreas From kcananda at gmail.com Thu Aug 16 21:11:52 2012 From: kcananda at gmail.com (Ananda K.C.) Date: Fri, 17 Aug 2012 09:56:52 +0545 Subject: [SRILM User List] (no subject) Message-ID: Dear all, I am doing dissertation of my Master's degree in computer science.I want to calculate the bigram and trigram probability table as in attachment.,from back off N-gram language models in ARPA format. Also when i use this command "ngram-count -order 3 -read /home/ananda/Desktop/work/countoutput.txt -vocab /home/ananda/Desktop/work/corpusvocab.txt -lm /home/ananda/Desktop/work/anandamodeling",which discounting is use for backoff smothing. I am new in the language modeling and thanks in advance. Regards, Ananda K.C. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: trigram probability table.jpg Type: image/jpeg Size: 19638 bytes Desc: not available URL: From kcananda at gmail.com Thu Aug 16 21:14:05 2012 From: kcananda at gmail.com (Ananda K.C.) 
Date: Fri, 17 Aug 2012 09:59:05 +0545 Subject: [SRILM User List] bigram and trigram probability table Message-ID: Dear all, I am doing dissertation of my Master's degree in computer science.I want to calculate the bigram and trigram probability table as in attachment,from back off N-gram language models in ARPA format. Also when i use this command "ngram-count -order 3 -read /home/ananda/Desktop/work/countoutput.txt -vocab /home/ananda/Desktop/work/corpusvocab.txt -lm /home/ananda/Desktop/work/anandamodeling",which discounting is use for backoff smothing. I am new in the language modeling and thanks in advance. Regards, Ananda K.C. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: trigram probability table.jpg Type: image/jpeg Size: 19638 bytes Desc: not available URL: From shahramk at gmail.com Thu Aug 16 21:24:48 2012 From: shahramk at gmail.com (Shahram) Date: Fri, 17 Aug 2012 14:24:48 +1000 Subject: [SRILM User List] Topic Dependent Audio Date set Message-ID: Hi all, I am looking for an audio data set for my thesis in the area of topic dependent spoken term detection and I need to create topic dependent language models. Does any one know any audio data set with its textual transcription which is tagged by topics? Topics should preferably categorized as Politics, Sports, ... -- --- Regards Shahram Kalantari -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Fri Aug 17 11:09:11 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 17 Aug 2012 11:09:11 -0700 Subject: [SRILM User List] bigram and trigram probability table In-Reply-To: References: Message-ID: <502E88C7.9010708@icsi.berkeley.edu> Ananda, the easiest way to have the toolkit compute your bigram and trigram probabilities once you have the model trained is: ngram -lm /home/ananda/Desktop/work/anandamodeling -debug 2 -counts NGRAMS where NGRAMS is a file you prepare that lists all the bigrams and trigrams you need, followed by a "1". For example: i i 1 i want 1 i to 1 want want 1 to to 1 etc. Andreas On 8/16/2012 9:14 PM, Ananda K.C. wrote: > Dear all, > > I am doing dissertation of my Master's degree in computer science.I > want to calculate the bigram and trigram probability table as in > attachment,from back off N-gram language models in ARPA format. > > Also when i use this command "ngram-count -order 3 -read > /home/ananda/Desktop/work/countoutput.txt -vocab > /home/ananda/Desktop/work/corpusvocab.txt -lm > /home/ananda/Desktop/work/anandamodeling",which discounting is use for > backoff smothing. > > I am new in the language modeling and thanks in advance. > > > Regards, > Ananda K.C. > > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From kcananda at gmail.com Mon Aug 20 20:08:23 2012 From: kcananda at gmail.com (Ananda K.C.) Date: Tue, 21 Aug 2012 08:53:23 +0545 Subject: [SRILM User List] bigram and trigram probability table In-Reply-To: <502E88C7.9010708@icsi.berkeley.edu> References: <502E88C7.9010708@icsi.berkeley.edu> Message-ID: Dear Andreas, Thanks for your help. 
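As a rough sketch of how the output of the ngram command quoted below could be turned into an actual probability table (in Python; probs.txt is an illustrative name for a file obtained by redirecting the ngram output, and the regular expression assumes probability lines of the form "p( He | <s> ) = [2gram] 0.0348584 [ -1.45769 ]" as printed by ngram -debug 2):

import re
from collections import defaultdict

# e.g.  ngram -lm anandamodeling -debug 2 -counts NGRAMS > probs.txt
prob_line = re.compile(r'p\( (\S+) \| (.*?)\)\s*=\s*\[\dgram\] (\S+)')

table = defaultdict(dict)          # table[history][word] = probability
with open('probs.txt') as f:
    for line in f:
        m = prob_line.search(line)
        if m:
            word, history, prob = m.group(1), m.group(2).strip(), float(m.group(3))
            # note: for longer histories ngram may abbreviate the context with "..."
            table[history][word] = prob

for history in sorted(table):
    print(history, table[history])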
On Fri, Aug 17, 2012 at 11:54 PM, Andreas Stolcke wrote: > Ananda, > > the easiest way to have the toolkit compute your bigram and trigram > probabilities once you have the model trained is: > > ngram -lm /home/ananda/Desktop/work/anandamodeling -debug 2 -counts NGRAMS > > where NGRAMS is a file you prepare that lists all the bigrams and trigrams > you need, followed by a "1". > For example: > > i i 1 > i want 1 > i to 1 > want want 1 > to to 1 > etc. > > Andreas > > > > On 8/16/2012 9:14 PM, Ananda K.C. wrote: > > Dear all, > > I am doing dissertation of my Master's degree in computer science.I > want to calculate the bigram and trigram probability table as in > attachment,from back off N-gram language models in ARPA format. > > Also when i use this command "ngram-count -order 3 -read > /home/ananda/Desktop/work/countoutput.txt -vocab > /home/ananda/Desktop/work/corpusvocab.txt -lm > /home/ananda/Desktop/work/anandamodeling",which discounting is use for > backoff smothing. > > I am new in the language modeling and thanks in advance. > > > Regards, > Ananda K.C. > > > > _______________________________________________ > SRILM-User site listSRILM-User at speech.sri.comhttp://www.speech.sri.com/mailman/listinfo/srilm-user > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lluis.formiga at upc.edu Tue Aug 21 14:43:05 2012 From: lluis.formiga at upc.edu (=?iso-8859-1?Q?Llu=EDs_Formiga_i_Fanals?=) Date: Tue, 21 Aug 2012 23:43:05 +0200 Subject: [SRILM User List] Does keep-unk work with lattice-tool and htk format? In-Reply-To: <4FBC29CD.3010601@icsi.berkeley.edu> References: <4FB9B671.9080604@mit.edu> <4FBA8AE3.2070709@icsi.berkeley.edu> <9D363DB8-11DF-4534-AEB5-058E96E3A74C@upc.edu> <4FBC29CD.3010601@icsi.berkeley.edu> Message-ID: <7507354C-E6F2-4F91-8B9D-D75B1118023D@upc.edu> Hi Andreas, Sorry to bother you with this old issue. The two-step lattice-tool process worked perfectly. First the rescoring and second the conversion to CN. But, unfortunately I have seen a few unks while rescoring the lattice (not as many as writing the mesh). The command I use to rescore is: lattice-tool -lm ../../lm/interpolated-lm.en -in-lattice wordlattice0.slf -read-htk -out-lattice out.slf -write-htk -keep-unk -print-sent-tags -htk-logbase 2.71828 And I find lines like these: (Whithin these lines the tag should be queit) J=26 S=19 E=24 W=qu a=0 l=-13.8261 J=27 S=19 E=25 W=que a=0 l=-11.4986 J=28 S=19 E=26 W= a=0 l=-2.76367 J=29 S=19 E=27 W=quest a=0 l=-10.831 J=30 S=19 E=28 W=quiet a=0 l=-10.57 J=31 S=19 E=29 W=quit a=0 l=-10.4455 J=32 S=20 E=21 W=row a=0 l=-10.1076 J=33 S=21 E=24 W=qu a=0 l=-14.9448 J=34 S=21 E=25 W=que a=0 l=-12.6173 J=35 S=21 E=26 W= a=0 l=-3.88236 J=36 S=21 E=27 W=quest a=0 l=-11.9497 J=37 S=21 E=28 W=quiet a=0 l=-11.6887 J=38 S=21 E=29 W=quit a=0 l=-11.0153 J=39 S=22 E=19 W=arrow a=0 l=-12.6258 I have to say that I use the rescoring to give probabilities to the archs from misspelling corrections. So I do not have any acoustic scores. (I set all them equal). Regards, Llu?s El 23/05/2012, a les 2:05, Andreas Stolcke va escriure: > On 5/22/2012 10:56 AM, Llu?s Formiga i Fanals wrote: >> >> Hi, >> >> I was trying to execute the following command: >> >> >> lattice-tool -in-lattice-list lattice_lists.txt -read-htk -lm >> /veu4/usuaris24/lluisf/EMS/misspelling2012/lm/interpolated-lm.en >> -write-mesh-dir out -keep-unk >> >> but I find that unks ("") are still on the written CN (-write-mesh). >> >> Does -keep-unk option work only for lattices output? 
Am I doing something wrong? > No, the code is working as intended. > > The option is described as > -keep-unk > Treat out-of-vocabulary words as but preserve their labels in lattice output. > > What you are outputting is confusion networks, not lattices. In the CN building process, lattice nodes that are mapped to are treated as equivalent, and the word information is lost in the process. > > I would suggest that you simple do your lattice rescoring with -keep-unk, output the rescored lattices, and then run lattice-tool a second time without -keep-unk and without the -vocab option, so all word labels are preserved (all words are implicitly added to the vocabulary). > > Andreas > > >> >> Thanks, >> >> Llu?s >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.jpg Type: image/jpeg Size: 8771 bytes Desc: not available URL: From stolcke at icsi.berkeley.edu Fri Aug 24 00:07:40 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 24 Aug 2012 00:07:40 -0700 Subject: [SRILM User List] Does keep-unk work with lattice-tool and htk format? In-Reply-To: <7507354C-E6F2-4F91-8B9D-D75B1118023D@upc.edu> References: <4FB9B671.9080604@mit.edu> <4FBA8AE3.2070709@icsi.berkeley.edu> <9D363DB8-11DF-4534-AEB5-058E96E3A74C@upc.edu> <4FBC29CD.3010601@icsi.berkeley.edu> <7507354C-E6F2-4F91-8B9D-D75B1118023D@upc.edu> Message-ID: <5037283C.9070005@icsi.berkeley.edu> Congratulations, you found a bug! The patch attached to this message (to HTKLattice.cc) should fix this problem. Andreas On 8/21/2012 2:43 PM, Llu?s Formiga i Fanals wrote: > Hi Andreas, > > Sorry to bother you with this old issue. > > The two-step lattice-tool process worked perfectly. First the > rescoring and second the conversion to CN. > > But, unfortunately I have seen a few unks while rescoring the lattice > (not as many as writing the mesh). > > The command I use to rescore is: > > lattice-tool -lm ../../lm/interpolated-lm.en -in-lattice > wordlattice0.slf -read-htk -out-lattice out.slf-write-htk -keep-unk > -print-sent-tags -htk-logbase 2.71828 > > And I find lines like these: (Whithin these lines the tag should > be queit) > > J=26 S=19 E=24 W=qu a=0 l=-13.8261 J=27 S=19 E=25 W=que a=0 l=-11.4986 > J=28 S=19 E=26 W= a=0 l=-2.76367 J=29 S=19 E=27 W=quest a=0 > l=-10.831 J=30 S=19 E=28 W=quiet a=0 l=-10.57 J=31 S=19 E=29 W=quit > a=0 l=-10.4455 J=32 S=20 E=21 W=row a=0 l=-10.1076 J=33 S=21 E=24 W=qu > a=0 l=-14.9448 J=34 S=21 E=25 W=que a=0 l=-12.6173 J=35 S=21 E=26 > W= a=0 l=-3.88236 J=36 S=21 E=27 W=quest a=0 l=-11.9497 J=37 S=21 > E=28 W=quiet a=0 l=-11.6887 J=38 S=21 E=29 W=quit a=0 l=-11.0153 J=39 > S=22 E=19 W=arrow a=0 l=-12.6258 > > I have to say that I use the rescoring to give probabilities to the > archs from misspelling corrections. So I do not have any acoustic > scores. (I set all them equal). > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- *** lattice/src/HTKLattice.cc 3 Aug 2012 01:11:34 -0000 1.60 --- lattice/src/HTKLattice.cc 24 Aug 2012 07:02:40 -0000 *************** *** 1769,1776 **** toNode->word == vocab.seIndex()) || toNode->word == Vocab_None) ? HTK_null_word : ! (node->htkinfo && node->htkinfo->wordLabel ? ! node->htkinfo->wordLabel : vocab.getWord(toNode->word)), htkheader.useQuotes); } --- 1769,1776 ---- toNode->word == vocab.seIndex()) || toNode->word == Vocab_None) ? HTK_null_word : ! 
(toNode->htkinfo && toNode->htkinfo->wordLabel ? ! toNode->htkinfo->wordLabel : vocab.getWord(toNode->word)), htkheader.useQuotes); } From wrested at hotmail.de Tue Aug 28 05:06:36 2012 From: wrested at hotmail.de (hic et nunc) Date: Tue, 28 Aug 2012 12:06:36 +0000 Subject: [SRILM User List] (no subject) Message-ID: hello. i'm a newbie of srilm toolkit. when i used wb, kn, or mkn smoothing methods for lm making, i realized that some of ngrams are not in lm file (albeit they exists in count file).i checked for ngrams (4-5-6 ordered) and saw that, 3 and 3+grams which have 1 count are not included in lm file. is it possible to ignore this feature in srilm? if yes, could you tell me which part of the code should be changed? thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Tue Aug 28 10:32:05 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 28 Aug 2012 10:32:05 -0700 Subject: [SRILM User List] (no subject) In-Reply-To: References: Message-ID: <503D0095.9050201@icsi.berkeley.edu> On 8/28/2012 5:06 AM, hic et nunc wrote: > hello. i'm a newbie of srilm toolkit. > when i used wb, kn, or mkn smoothing methods for lm making, i > realized that some of ngrams are not in lm file (albeit they exists in > count file). > i checked for ngrams (4-5-6 ordered) and saw that, 3 and 3+grams which > have 1 count are not included in lm file. > is it possible to ignore this feature in srilm? if yes, could you tell > me which part of the code should be changed? This should be a FAQ. The answer to your question is at http://www.speech.sri.com/pipermail/srilm-user/2012q3/001276.html . Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From federico.sangati at gmail.com Sun Sep 2 03:59:16 2012 From: federico.sangati at gmail.com (Federico Sangati) Date: Sun, 2 Sep 2012 11:59:16 +0100 Subject: [SRILM User List] Predicting words Message-ID: Hi, Regarding next word prediction, I have tried the solution suggested by Andreas, but it doesn't seem to work: it predicts the same word in different contexts, and it always assumes that the prefix starts from the beginning of the sentence (see below). MAPFILE: shock shock 1961 1961 ? [same for all words occurring in vocabulary] UNK_NEXT_WORD maturing analyzing attended ? [list of all words occurring in vocabulary] INPUTFILE: No , UNK_NEXT_WORD No , UNK_NEXT_WORD But while , UNK_NEXT_WORD But while , UNK_NEXT_WORD The 49 stock specialist UNK_NEXT_WORD OUTPUTFILE: No , talent No , talent But while , talent But while , talent The 49 stock specialist talent Btw, I'm wondering why there is no way to use 'ngram' for this: it has this nice '-gen-prefixes file' option which is almost what we need, except that instead of a random word sequence conditioned on the prefix we need the most probable one (or just the most probable following word given the prefix for what it matters). It would be nice to know if there is any solution for this. Best, Federico Sangati University of Edinburgh > On Wed Aug 8 22:09:35 PDT 2012 Andreas Stolcke wrote: > Indeed you can use disambig, at least in theory to solve this problem. > > 1. prepare a map file of the form: > > a a > man man > ... [for all words occurring in your data] > UNKNOWN_WORD word1 word2 .... [list all words in the vocabulary > here] > > 2. train an LM of word sequences. > > 3. 
prepare disambig input of the form > > a man is sitting UNKNOWN_WORD > > You can also add known words to the right of UKNOWN_WORD if you have > that information (see the note about -fw-only below). > > 4. run disambig > > disambig -map MAPFILE -lm LMFILE -text INPUTFILE > > If you want to use only the left context of the UNKNOWN_WORD use the > -fw-only option. > > This is in theory. If your vocabulary is large it may be very slow and > take too much memory. I haven't tried it, so let me know if it works > for you. > > Andreas >> On 7/20/2012 5:04 AM, Nouf Al-Harbi wrote: >> Hello, >> I am new to language modeling and was hoping that someone can help me with the following. >> I try to predict a word given an input sentence. For example, I would like to get a word replacing the ... that has the highest probability in sentences such as ' A man is ...' (e.g. sitting). >> I try to use disambig tool but I couldn't found any example illustrate how to use it especially how how I can create the map file and what is the type of this file ( e.g. txt, arpa, ...). From wrested at hotmail.de Sun Sep 2 22:10:39 2012 From: wrested at hotmail.de (hic et nunc) Date: Mon, 3 Sep 2012 05:10:39 +0000 Subject: [SRILM User List] (no subject) In-Reply-To: References: Message-ID: hello again. i have a new question about lm ngram probs. as you know well, in lm file, the log probs are calculated like this: log [(count[n-gram]*d/count[(n-1)-gram] - count[(n-1)-gram_]] sometimes 1 is added to denominator, but sometimes not. what is the reason of this? thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Mon Sep 3 05:37:16 2012 From: njs at pobox.com (Nathaniel Smith) Date: Mon, 3 Sep 2012 13:37:16 +0100 Subject: [SRILM User List] Predicting words In-Reply-To: References: Message-ID: FYI, for others on the list and the archives-- After talking to Federico offline, I think he ended up solving his problem by using the Python bindings I wrote a while back to query the ngram model directly. Since they might be useful to others I went ahead and uploaded them to github as well: https://github.com/njsmith/pysrilm Download snapshot: https://github.com/njsmith/pysrilm/zipball/master -- Nathaniel Smith University of Edinburgh On Sun, Sep 2, 2012 at 11:59 AM, Federico Sangati wrote: > Hi, > > Regarding next word prediction, I have tried the solution suggested by Andreas, but it doesn't seem to work: it predicts the same word in different contexts, and it always assumes that the prefix starts from the beginning of the sentence (see below). > > MAPFILE: > shock shock > 1961 1961 > ? [same for all words occurring in vocabulary] > UNK_NEXT_WORD maturing analyzing attended ? [list of all words occurring in vocabulary] > > INPUTFILE: > No , UNK_NEXT_WORD > No , UNK_NEXT_WORD > But while , UNK_NEXT_WORD > But while , UNK_NEXT_WORD > The 49 stock specialist UNK_NEXT_WORD > > OUTPUTFILE: > No , talent > No , talent > But while , talent > But while , talent > The 49 stock specialist talent > > Btw, I'm wondering why there is no way to use 'ngram' for this: it has this nice '-gen-prefixes file' option which is almost what we need, except that instead of a random word sequence conditioned on the prefix we need the most probable one (or just the most probable following word given the prefix for what it matters). > It would be nice to know if there is any solution for this. 
> > Best, > Federico Sangati > University of Edinburgh > > >> On Wed Aug 8 22:09:35 PDT 2012 Andreas Stolcke wrote: >> Indeed you can use disambig, at least in theory to solve this problem. >> >> 1. prepare a map file of the form: >> >> a a >> man man >> ... [for all words occurring in your data] >> UNKNOWN_WORD word1 word2 .... [list all words in the vocabulary >> here] >> >> 2. train an LM of word sequences. >> >> 3. prepare disambig input of the form >> >> a man is sitting UNKNOWN_WORD >> >> You can also add known words to the right of UKNOWN_WORD if you have >> that information (see the note about -fw-only below). >> >> 4. run disambig >> >> disambig -map MAPFILE -lm LMFILE -text INPUTFILE >> >> If you want to use only the left context of the UNKNOWN_WORD use the >> -fw-only option. >> >> This is in theory. If your vocabulary is large it may be very slow and >> take too much memory. I haven't tried it, so let me know if it works >> for you. >> >> Andreas > >>> On 7/20/2012 5:04 AM, Nouf Al-Harbi wrote: >>> Hello, >>> I am new to language modeling and was hoping that someone can help me with the following. >>> I try to predict a word given an input sentence. For example, I would like to get a word replacing the ... that has the highest probability in sentences such as ' A man is ...' (e.g. sitting). >>> I try to use disambig tool but I couldn't found any example illustrate how to use it especially how how I can create the map file and what is the type of this file ( e.g. txt, arpa, ...). > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From stolcke at icsi.berkeley.edu Tue Sep 4 00:46:32 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 04 Sep 2012 00:46:32 -0700 Subject: [SRILM User List] (no subject) In-Reply-To: References: Message-ID: <5045B1D8.50803@icsi.berkeley.edu> On 9/2/2012 10:10 PM, hic et nunc wrote: > hello again. i have a new question about lm ngram probs. > as you know well, in lm file, the log probs are calculated like this: > log [(count[n-gram]*d/count[(n-1)-gram] - count[(n-1)-gram_]] > sometimes 1 is added to denominator, but sometimes not. what is the > reason of this? One is added to the denominator only a last resort when the smoothing results in n-gram probabilities that sum to 1. The following comment in NgramLM.cc explains why: > /* > * This is a hack credited to Doug Paul (by Roni Rosenfeld in > * his CMU tools). It may happen that no probability mass > * is left after totalling all the explicit probs, typically > * because the discount coefficients were out of range and > * forced to 1.0. Unless we have seen all vocabulary words in > * this context, to arrive at some non-zero backoff mass, > * we try incrementing the denominator in the estimator by 1. > * Another hack: If the discounting method uses interpolation > * we first try disabling that because interpolation removes > * probability mass. > */ This happens occasionally with GT smoothing due to degenerate count-of-counts statistics. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Tue Sep 4 17:10:13 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 04 Sep 2012 17:10:13 -0700 Subject: [SRILM User List] Predicting words In-Reply-To: References: Message-ID: <50469865.2010309@icsi.berkeley.edu> I suspect there were some problems with the construction of the map file. 
For one thing, when you have a word that is also a valid numeric string (like the second line in your example) you cannot leave out the explicit mapping probability. Also, it turns out that it is much more convenient to use the disambig -classes option instead of -map to supply the mapping information (this allows you to give the mapping one-word-at-a-time for the "unknown" token). Anyway, here is a short example that demonstrates that my instructions worked in principle ;-). It uses the trigram LM supplied with SRILM. # construct the map file in classes format ngram -order 1 -lm $SRILM/lm/test/tests/ngram-count-gt/swbd.3bo.gz -write-vocab - | \ gawk '{ print $1, 1, $1; print $1, 1, "UNKNOWN-WORD" }' lm.vocab > test.mapfile # fill in the blanks (uses both left and right word context). Note -order 2 is default so specify -order 3 disambig -order 3 -classes test.mapfile -lm $SRILM/lm/test/tests/ngram-count-gt/swbd.3bo.gz -text - INPUT: what a great UNKNOWN-WORD OUTPUT: what a great time INPUT: that is the stupidest UNKNOWN-WORD i've heard OUTPUT: that is the stupidest thing i've heard Seems to work ;-) Andreas On 9/2/2012 3:59 AM, Federico Sangati wrote: > Hi, > > Regarding next word prediction, I have tried the solution suggested by Andreas, but it doesn't seem to work: it predicts the same word in different contexts, and it always assumes that the prefix starts from the beginning of the sentence (see below). > > MAPFILE: > shock shock > 1961 1961 > ? [same for all words occurring in vocabulary] > UNK_NEXT_WORD maturing analyzing attended ? [list of all words occurring in vocabulary] > > INPUTFILE: > No , UNK_NEXT_WORD > No , UNK_NEXT_WORD > But while , UNK_NEXT_WORD > But while , UNK_NEXT_WORD > > OUTPUTFILE: > No , talent > No , talent > But while , talent > But while , talent > > > Btw, I'm wondering why there is no way to use 'ngram' for this: it has this nice '-gen-prefixes file' option which is almost what we need, except that instead of a random word sequence conditioned on the prefix we need the most probable one (or just the most probable following word given the prefix for what it matters). > It would be nice to know if there is any solution for this. > > Best, > Federico Sangati > University of Edinburgh > > >> On Wed Aug 8 22:09:35 PDT 2012 Andreas Stolcke wrote: >> Indeed you can use disambig, at least in theory to solve this problem. >> >> 1. prepare a map file of the form: >> >> a a >> man man >> ... [for all words occurring in your data] >> UNKNOWN_WORD word1 word2 .... [list all words in the vocabulary >> here] >> >> 2. train an LM of word sequences. >> >> 3. prepare disambig input of the form >> >> a man is sitting UNKNOWN_WORD >> >> You can also add known words to the right of UKNOWN_WORD if you have >> that information (see the note about -fw-only below). >> >> 4. run disambig >> >> disambig -map MAPFILE -lm LMFILE -text INPUTFILE >> >> If you want to use only the left context of the UNKNOWN_WORD use the >> -fw-only option. >> >> This is in theory. If your vocabulary is large it may be very slow and >> take too much memory. I haven't tried it, so let me know if it works >> for you. >> >> Andreas >>> On 7/20/2012 5:04 AM, Nouf Al-Harbi wrote: >>> Hello, >>> I am new to language modeling and was hoping that someone can help me with the following. >>> I try to predict a word given an input sentence. For example, I would like to get a word replacing the ... that has the highest probability in sentences such as ' A man is ...' (e.g. sitting). 
>>> I try to use disambig tool but I couldn't found any example illustrate how to use it especially how how I can create the map file and what is the type of this file ( e.g. txt, arpa, ...). From stolcke at icsi.berkeley.edu Tue Sep 4 18:55:36 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 04 Sep 2012 18:55:36 -0700 Subject: [SRILM User List] Predicting words In-Reply-To: Your message of Tue, 04 Sep 2012 17:10:13 -0700. <50469865.2010309@icsi.berkeley.edu> Message-ID: <201209050155.q851ta91013037@fruitcake.ICSI.Berkeley.EDU> In message <50469865.2010309 at icsi.berkeley.edu>I wrote: > > Anyway, here is a short example that demonstrates that my instructions > worked in principle ;-). > It uses the trigram LM supplied with SRILM. > > # construct the map file in classes format > ngram -order 1 -lm $SRILM/lm/test/tests/ngram-count-gt/swbd.3bo.gz > -write-vocab - | \ > gawk '{ print $1, 1, $1; print $1, 1, "UNKNOWN-WORD" }' lm.vocab > > test.mapfile Copy-and-paste error. The above command should be ngram -order 1 -lm $SRILM/lm/test/tests/ngram-count-gt/swbd.3bo.gz -write-vocab - | \ gawk '{ print $1, 1, $1; print $1, 1, "UNKNOWN-WORD" }' > test.mapfile Andreas From chenmengdx at gmail.com Wed Sep 5 03:06:51 2012 From: chenmengdx at gmail.com (Meng Chen) Date: Wed, 5 Sep 2012 18:06:51 +0800 Subject: [SRILM User List] Question about select-vocab Message-ID: Hi, I am using the *select-vocab* command to choose vocabulary from corpus A and B in a Chinese speech recognition task, the command is as follows: *select-vocab -heldout dev A B > vocab_with_weight* Then I saw the prompts below: *Iter 0: lambdas = (0.5 0.5)* *Iter 1: lambdas = (0.443075 0.556925) log P(held-out) = -374805.0047 PPL = 6937.8495* *Iter 2: lambdas = (0.399799 0.600201) log P(held-out) = -374319.5890 PPL = 6858.8301* *Iter 3: lambdas = (0.366822 0.633178) log P(held-out) = -374032.9165 PPL = 6812.5869* *Iter 4: lambdas = (0.341533 0.658467) log P(held-out) = -373860.8231 PPL = 6784.9764* I want to ask what's the meaning of PPL. Does the command train a LM with corpus A and B first, then calculate the PPL of heldout data with the LM? If corpus A and B are 10GB each, how much the heldout data should be at least in order to choose a reasonable vocabulary? Thanks! Meng CHEN -------------- next part -------------- An HTML attachment was scrubbed... URL: From venkataraman.anand at gmail.com Wed Sep 5 13:05:04 2012 From: venkataraman.anand at gmail.com (Anand Venkataraman) Date: Wed, 5 Sep 2012 13:05:04 -0700 Subject: [SRILM User List] Question about select-vocab Message-ID: I realized I was off the list and just rejoined (thanks Andreas). Meng - In response to your questions about select-vocab: 1. Yes, you're right about the PPL. The program trains separate unigram LMs for the given corpora (A & B) and the diagnostic output prints the PPL of the held-out set according to the _best_ word-level mixture of A.1bo and B.1bo. 2. Hard to say how big the held-out set ought to be for given A and B sizes. My only suggestion is to ensure that the held-out set contains a representative sample of words that you expect to see in the domain. If in doubt, you can always extract the domain vocabulary and ensure that the held-out set covers the top N% (by freq) of the domain words (for some suitable N) Hope this helps. & -------------- next part -------------- An HTML attachment was scrubbed... 
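A brief note on the PPL column, under the usual convention that PPL = 10 ** (-logP / N) where N is the number of scored tokens (exactly which tokens are counted is not shown in the output): the iteration figures above are internally consistent, and one can back out the approximate held-out size, e.g. from Iter 4:

from math import log10
logP, ppl = -373860.8231, 6784.9764    # the Iter 4 figures quoted above
N = -logP / log10(ppl)
print(N)                               # ~ 9.76e4, i.e. a held-out set of roughly 100k scored tokens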
URL: From stolcke at icsi.berkeley.edu Wed Sep 5 13:36:29 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 05 Sep 2012 13:36:29 -0700 Subject: [SRILM User List] Question about select-vocab In-Reply-To: References: Message-ID: <5047B7CD.1060406@icsi.berkeley.edu> On 9/5/2012 1:05 PM, Anand Venkataraman wrote: > I realized I was off the list and just rejoined (thanks Andreas). > > Meng - In response to your questions about select-vocab: > > 1. Yes, you're right about the PPL. The program trains separate > unigram LMs for the given corpora (A & B) and the diagnostic > output prints the PPL of the held-out set according to the _best_ > word-level mixture of A.1bo and B.1bo. > 2. Hard to say how big the held-out set ought to be for given A and B > sizes. My only suggestion is to ensure that the held-out set > contains a representative sample of words that you expect to see > in the domain. If in doubt, you can always extract the domain > vocabulary and ensure that the held-out set covers the top N% (by > freq) of the domain words (for some suitable N) > > Hope this helps. > > & > Thanks Anand. Good to have you back on the list. Meng: in case this wasn't clear, "PPL" is short for "perplexity". Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From kcananda at gmail.com Thu Sep 6 06:58:56 2012 From: kcananda at gmail.com (Ananda K.C.) Date: Thu, 6 Sep 2012 19:43:56 +0545 Subject: [SRILM User List] Regarding ngram Message-ID: hi, how to print the output probability calculation of the command "ngram -lm /home/ananda/Desktop/reporting/probability -debug 2 -counts /home/ananda/Desktop/reporting/countoutput.txt" in a file. Anadna -------------- next part -------------- An HTML attachment was scrubbed... URL: From kcananda at gmail.com Sat Sep 15 08:01:12 2012 From: kcananda at gmail.com (Ananda K.C.) Date: Sat, 15 Sep 2012 20:46:12 +0545 Subject: [SRILM User List] Regarding backoff using Message-ID: hi all of you, I have send my test file containing corpus,vocab,and final output bigram probability.Also i have send you all the command in command file. My main problem is when we use Backoff with Good Turing discounting.Then p( He | ) = [2gram] 0.0348584 [ -1.45769 ] p( I | ) = [2gram] 0.0348584 [ -1.45769 ] p( this | ) = [2gram] 0.0348584 [ -1.45769 *2 ] is only find out. But it should find the probabilty with all the words in the vocabulary,if bigram count is zero then it should move towards unigram count to assign some probabilty to bigram. like p( am | ) p(going| ) p( kath | ) and so on with all the word in the vocabulary,which is not calculated. Since we know that when the bigram count is zero ,we should get probability from unigram count.May be i have done some mistake in commands. Please help me to solve my problem. regards, Ananda -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- this is ananda this is bhawana I am going to kath He is going to kath -------------- next part -------------- A non-text attachment was scrubbed... 
Name: command Type: application/octet-stream Size: 384 bytes Desc: not available URL: -------------- next part -------------- 4 this 2 I 1 He 1 this 2 this is 2 is 3 is ananda 1 is bhawana 1 is going 1 ananda 1 ananda 1 4 bhawana 1 bhawana 1 I 1 I am 1 am 1 am going 1 going 2 going to 2 to 2 to kath 2 kath 2 kath 2 He 1 He is 1 -------------- next part -------------- p( | ) = [1gram] 0.2 [ -0.69897 *4 ] p( | ) = [1gram] 0 [ -inf *4 ] p( He | ) = [1gram] 0.05 [ -1.30103 ] p( I | ) = [1gram] 0.05 [ -1.30103 ] p( am | ) = [1gram] 0.05 [ -1.30103 ] p( ananda | ) = [1gram] 0.05 [ -1.30103 ] p( bhawana | ) = [1gram] 0.05 [ -1.30103 ] p( going | ) = [1gram] 0.1 [ -1 *2 ] p( is | ) = [1gram] 0.15 [ -0.823909 *3 ] p( kath | ) = [1gram] 0.1 [ -1 *2 ] p( this | ) = [1gram] 0.1 [ -1 *2 ] p( to | ) = [1gram] 0.1 [ -1 *2 ] p( He | ) = [2gram] 0.2 [ -0.69897 ] p( I | ) = [2gram] 0.2 [ -0.69897 ] p( this | ) = [2gram] 0.4 [ -0.39794 *2 ] p( is | He ) = [2gram] 0.5 [ -0.30103 ] p( am | I ) = [2gram] 0.5 [ -0.30103 ] p( going | am ) = [2gram] 0.5 [ -0.30103 ] p( | ananda ) = [2gram] 0.5 [ -0.30103 ] p( | bhawana ) = [2gram] 0.5 [ -0.30103 ] p( to | going ) = [2gram] 0.666667 [ -0.176091 *2 ] p( ananda | is ) = [2gram] 0.25 [ -0.60206 ] p( bhawana | is ) = [2gram] 0.25 [ -0.60206 ] p( going | is ) = [2gram] 0.25 [ -0.60206 ] p( | kath ) = [2gram] 0.666667 [ -0.176091 *2 ] p( is | this ) = [2gram] 0.666667 [ -0.176091 *2 ] p( kath | to ) = [2gram] 0.666667 [ -0.176091 *2 ] 8 sentences, 36 words, 0 OOVs 4 zeroprobs, logprob= -26.6866 ppl= 4.64693 ppl1= 6.82272 file /home/ananda/Desktop/countout.txt: 8 sentences, 36 words, 0 OOVs 4 zeroprobs, logprob= -26.6866 ppl= 4.64693 ppl1= 6.82272 -------------- next part -------------- \data\ ngram 1=12 ngram 2=15 \1-grams: -0.69897 -99 -0.60206 -1.30103 He -0.2304489 -1.30103 I -0.2787536 -1.30103 am -0.2552725 -1.30103 ananda -0.20412 -1.30103 bhawana -0.20412 -1 going -0.4313638 -0.8239087 is -0.5051499 -1 kath -0.3802113 -1 this -0.4065402 -1 to -0.4313638 \2-grams: -0.69897 He -0.69897 I -0.39794 this -0.30103 He is -0.30103 I am -0.30103 am going -0.30103 ananda -0.30103 bhawana -0.1760913 going to -0.60206 is ananda -0.60206 is bhawana -0.60206 is going -0.1760913 kath -0.1760913 this is -0.1760913 to kath \end\ -------------- next part -------------- this is ananda bhawana I am going to kath He From julia_hancke at yahoo.com Sat Sep 22 17:50:44 2012 From: julia_hancke at yahoo.com (Julia Hancke) Date: Sat, 22 Sep 2012 17:50:44 -0700 (PDT) Subject: [SRILM User List] Hi! Message-ID: <1348361444.18417.BPMail_high_noncarrier@web113519.mail.gq1.yahoo.com> http://grange-aux-ormes.com/work.at.home.online.php?owmarket=9yov0