From Dmitriy.Dligach at colorado.edu  Tue Apr  7 08:30:53 2009
From: Dmitriy.Dligach at colorado.edu (Dmitriy Dligach)
Date: Tue, 07 Apr 2009 09:30:53 -0600
Subject: OOV words
Message-ID: <20090407093053.114155kecp61tt44@opsmail.colorado.edu>

Hello,

First of all I wanted to thank the creators of SRILM -- I find this  
tool extremely useful in my research.

Second, I have a question about out-of-vocabulary (OOV) words. I train
a language model on a collection of English newswire text:

ngram-count -text all.txt -lm all.lm -order 5

and then compute probabilities:

ngram -lm all.lm -ppl test.txt -debug 1

There happen to be some sentences in foreign languages in my test.txt
file. I'd expect them to receive very low probabilities because the
model was trained on strictly English text. Instead, they receive very
high probabilities.

Could this have something to do with the way SRILM handles OOV words?

Dima


From christophe.hauser at irisa.fr  Wed Apr  8 09:51:32 2009
From: christophe.hauser at irisa.fr (Christophe Hauser)
Date: Wed, 8 Apr 2009 18:51:32 +0200
Subject: [christophe.hauser@irisa.fr: Jelinek Mercer Smoothing]
In-Reply-To: <49D25248.7020209@speech.sri.com>
References: <20090331100957.GA18372@sovkipeu.irisa.fr> <49D25248.7020209@speech.sri.com>
Message-ID: <20090408165132.GE3589@sovkipeu.irisa.fr>

On Tue, Mar 31, 2009 at 10:26:32AM -0700, Andreas Stolcke wrote:

> An example of the count-lm training procedure is given by  
> $SRILM/test/tests/ngram-count-lm/run-test .
>
> Andreas

Hello,

I am trying to reproduce some experiments using SRILM.
I would like to apply Jelinek-Mercer smoothing, but the perplexity results
I get are very odd: much higher than the results with no smoothing at
all.

Here is what I did :

ngram-count -text training -lm lm -order $order -write-vocab vocab \
    -write cfile

ngram -ppl test -lm lm -order $order -vocab vocab -unk

file test: 1 sentences, 964 words, 41 OOVs
0 zeroprobs, logprob= -1445.86 ppl= 36.7102 ppl1= 36.8538

Then, if I use Jelinek Mercer smoothing

cat >countlm <<EOF
countmodulus 1
mixweights 0
.5 .5 .5 
counts cfile
EOF

ngram -count-lm -lm lmsmooth -order $order -ppl test -unk -vocab vocab

file test: 1 sentences, 964 words, 41 OOVs
0 zeroprobs, logprob= -2948.76 ppl= 1553.45 ppl1= 1565.86

The smoothed model's perplexity on the test set is very high. Did I do
something wrong?
I expected to get something around 5 bits/symbol on this test.

Also, I am not sure how to interpret the perplexity results reported by
SRILM: since OOVs and zeroprobs are discarded, a model estimated without
smoothing gives a finite perplexity where it is actually infinite. This
confuses me, especially when smoothing techniques are used: how can one
accurately measure the benefit of smoothing?
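
The two summary lines reported above can be checked by hand. What follows is a minimal sketch, assuming that ppl divides the log probability by (words - OOVs - zeroprobs + sentences) and that ppl1 leaves out the sentence count; this assumption reproduces the reported figures to within rounding. The function name is illustrative only.

```python
import math

def srilm_style_ppl(logprob, words, oovs, zeroprobs, sentences):
    """Return (ppl, ppl1) from the totals on an `ngram -ppl` summary line.

    Assumed normalization: OOVs and zeroprobs are excluded from the token
    count; ppl also counts one end-of-sentence token per sentence.
    """
    scored = words - oovs - zeroprobs
    ppl = 10 ** (-logprob / (scored + sentences))
    ppl1 = 10 ** (-logprob / scored)
    return ppl, ppl1

# Totals copied from the two runs reported above.
backoff = srilm_style_ppl(-1445.86, words=964, oovs=41, zeroprobs=0, sentences=1)
count_lm = srilm_style_ppl(-2948.76, words=964, oovs=41, zeroprobs=0, sentences=1)

print("backoff LM: ppl=%.4f ppl1=%.4f" % backoff)
print("count LM:   ppl=%.2f ppl1=%.2f" % count_lm)
# Bits per scored token, for comparison with the expected ~5 bits/symbol.
print("bits: %.2f vs %.2f" % (math.log2(backoff[0]), math.log2(count_lm[0])))
```

Under this normalization the 41 OOVs drop out of both the log probability and the token count, so they never push the perplexity towards infinity.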

PS : does -gtmin 0 -gtmax 0 totally disable discounting ?

Many thanks.

Kind regards,
-- 
Christophe


From i_am_behrang at yahoo.com  Mon Apr 20 20:44:42 2009
From: i_am_behrang at yahoo.com (Behrang Mohit)
Date: Mon, 20 Apr 2009 20:44:42 -0700 (PDT)
Subject: training with different weights
Message-ID: <816341.11192.qm@web110304.mail.gq1.yahoo.com>


Hi,

Is there an option to give weights to certain training instances (sentences)?  For example if I have some sentences that are more relevant to my translation domain and I want them to influence the LM 4 times more than the rest of the data.  

Currently I'm doing that by just repeating those important sentences in the training corpus.  This way the training takes much longer.  Is there an alternative  way to do this?

Also, I was wondering why there is such a slowdown.
My guess is that the repetition dramatically changes the number of n-grams (mainly trigrams): many of the infrequent bigrams and trigrams that are filtered out of the baseline model will be kept in the new model.  Is that right?

Thanks
Behrang



From sylvain.raybaud at crans.org  Wed Apr 22 02:44:23 2009
From: sylvain.raybaud at crans.org (Sylvain Raybaud)
Date: Wed, 22 Apr 2009 11:44:23 +0200
Subject: problems getting srilm
Message-ID: <200904221144.23250.sylvain.raybaud@crans.org>

Dear list,

  I need to use SRILM toolkit for my PhD, but I have been unsuccessful at 
downloading it... I have tried several times, after I fill in the form the 
download starts but is incredibly slow (around 4kBps), and stops after a 
while (firefox gives me an error like "the connection to the server was 
reset"). If I pass the download link to wget I only get an html code 
complaining about empty form fields (this was to be expected). Did I miss 
something? Thanks

regards,

-- 
Sylvain Raybaud


From stolcke at speech.sri.com  Sun May 10 11:09:29 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 10 May 2009 11:09:29 PDT
Subject: SRILM 1.5.8 released
Message-ID: <200905101809.n4AI9T613648@ns2>


The latest version of SRILM is now available from 
http://www.speech.sri.com/projects/srilm/download.html .

A list of changes appears below.

Enjoy,

Andreas


1.5.8   10 May 2009

        Functionality:

        * merge-batch-counts -float-counts option for merging of fractional
        counts.

        * compare-sclite now includes statistical significance computation
        based on a matched-pair Sign test.

        * Added a Perl tool to compute the cumulative binomial distribution,
        contributed by Brett Kessler and David Gelbart.

        * Don't output LM server banner message for ngram -use-server -debug 0.

        * The LM::generateSentence() function now takes an optional argument to
        specify a sentence prefix that is used to condition subsequent
        word generation (suggested by Alexy Khrabrov).  The default is to
        condition on <s> as before, or an empty context if no start-of-sentence
        tag is defined.

        * A new option ngram -gen-prefixes to read conditioning prefixes
        from a file, and generate random sentences based on them.

        * New options in nbest-optimize that modify -print-hyps output so that
        only unique hypotheses are included (-print-unique-hyps), and to print
        the original ranks of hypotheses (-print-old-ranks) (from Jing Zheng).

        * The -version option reports whether support for compressed files
        is available.

        * Added merge-batch-counts -l option to control how many files to merge
        in each iteration.

        Bug fixes:

        * ngram-count, NgramLM: disable the Doug Paul smoothing hack (add one
        to denominator when smoothing results in 0 backoff mass) in contexts
        where the entire vocabulary has been observed.

        * nbest-optimize fixes to the -minimum-bleu-reference functionality
        (from Jing Zheng).

        * Fixed nbest-optimize bug that was causing incorrect log output with
        gcc 4.x.

        * Output vocabulary index map in binary ngram count and LM format
        in numerical index order.  This avoids a performance bug whereby
        reading the data structures back into _c binary version could take
        a long time due to inefficient insertion order.

        * Fix ngram -counts with -use-server (from Ergun Bicici).

        * Fixed memory allocation bug in FLM tag vocabulary handling that could
        lead to crash when interpolating several FLMs.

        * Rewrote make-batch-counts scripts to
          - avoid problems with limits on command line length
          - support systems that don't have compressed file I/O.

        * Modified merge-batch-counts script to
          - ensure that unmerged files are always merged in the next iteration,
          to avoid file size imbalance (suggested by Alex Marin)
          - support systems that don't have compressed file I/O.

        * Fixed a portability issue with Intel icc version 7.0.

        * compute-sclite fixed to invoke csrfilt.sh script with -t option.
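
The matched-pair Sign test and the cumulative binomial distribution mentioned under Functionality fit together roughly as follows. This is a minimal sketch of a textbook two-sided sign test, not the code shipped with compare-sclite; the function names and the example win/loss counts are made up for illustration.

```python
from math import comb

def binomial_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def sign_test(wins_a, wins_b):
    """Two-sided sign test p-value; ties are assumed to be dropped beforehand."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    tail = binomial_cdf(min(wins_a, wins_b), n)
    return min(1.0, 2.0 * tail)

# Hypothetical outcome: system A has fewer errors on 8 utterances, system B on 2.
p_value = sign_test(8, 2)
print("p = %.6f" % p_value)
```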


From stolcke at speech.sri.com  Wed May 13 11:06:06 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 13 May 2009 11:06:06 -0700
Subject: ngram-discount
In-Reply-To: <200905092008369537576@gmail.com>
References: <200905092008369537576@gmail.com>
Message-ID: <4A0B0C0E.8020504@speech.sri.com>

If the model is smoothed (the default), zeroprobs typically occur for
out-of-vocabulary words.
You need to train a model that assigns probability to the unknown word
(<unk>).
Use the ngram-count -unk option (you also need to specify a predefined
vocabulary so that there are OOV words in your training data from which
a probability estimate can be obtained). Then use ngram -unk to test the LM.
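
The effect of -unk can be illustrated with a toy unigram model. This is only a sketch of the idea, assuming that without an <unk> estimate OOV tokens are counted but left out of the log probability, and that with one they are mapped to <unk> and scored. The vocabulary, corpus, and function names are invented for the example.

```python
import math
from collections import Counter

def train_unigram(tokens, vocab, use_unk):
    """Maximum-likelihood unigram model; OOV training tokens become <unk> if use_unk."""
    mapped = [t if t in vocab else "<unk>" for t in tokens]
    if not use_unk:
        mapped = [t for t in mapped if t != "<unk>"]
    counts = Counter(mapped)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def score(sentence, model):
    """Return (log10 prob, number of skipped OOV tokens)."""
    logprob, oovs = 0.0, 0
    for tok in sentence:
        word = tok if tok in model else "<unk>"
        if word not in model:
            oovs += 1  # no estimate available: token is skipped, not penalized
            continue
        logprob += math.log10(model[word])
    return logprob, oovs

train = "the cat sat on the mat a dog sat on a rug".split()
vocab = {"the", "cat", "sat", "on", "a"}  # predefined, smaller than the training set
foreign = "le chat dort".split()

closed = train_unigram(train, vocab, use_unk=False)
open_ = train_unigram(train, vocab, use_unk=True)

print(score(foreign, closed))  # every token skipped
print(score(foreign, open_))   # every token scored as <unk>
```

With the closed model the foreign sentence contributes nothing to the log probability, so it can look more probable than an in-vocabulary sentence.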

Hope this helps,

Andreas


Wang wrote:
> hi,
> when I used SRILM, I found zeroprobs for some n-grams. Why do
> zeroprobs turn up?
> I used a bigram model. When I calculate p(w2|w1), if C(w1w2)=0, the
> prob backs off to the unigram: alpha*P(w2);
> and if C(w2)=0 (maybe it is out-of-vocabulary), we can back off to
> the zerogram, like a uniform distribution; or we use Good-Turing discounting,
> which leaves some discounted mass that can be used for this zero-count word. So I
> think zeroprobs should not turn up.
> Do I understand it right?
> Or is the unigram calculated by maximum likelihood directly, like
> p(w2)=C(w2)/(all counts)?
> If so, why is it not calculated with Good-Turing discounting, like
> p*(w2)=C*(w2)/(all counts)? (C*(w2) is calculated by Good-Turing.)
> Thank you very much.
> Sincerely yours,
> Wang
> 2009-05-09


From christophe.hauser at irisa.fr  Wed May 20 09:34:10 2009
From: christophe.hauser at irisa.fr (Christophe Hauser)
Date: Wed, 20 May 2009 18:34:10 +0200
Subject: Odd jelinek mercer results
Message-ID: <20090520163410.GA24057@sovkipeu.irisa.fr>

Hello,

I get really odd results using Jelinek Mercer smoothing.

In the following simple example, I get the best results with no smoothing at all (1.12).
With smoothing, setting all parameters to 1 gives better performance (1.24) than optimizing the parameters on the test set (2.4). According to Chen & Goodman, setting all parameters to 1 means there is no smoothing at all.
I get similar results with every other dataset I've tried.

Am I doing something wrong?


training : A B C A B C A B C
test : A B C A B C A B C D

# write vocabulary 
cat $test $training > everything
ngram-count -text everything -no-eos -no-sos -write-vocab vocab \
    -order $order

# write count file
ngram-count -debug 2 -text $training -lm lm -order $order -write cfile \
    -vocab vocab -gt1max 0 -gt2max 0 -gt3max 0 -no-eos -no-sos

cat >countlm <<EOF
countmodulus 1
mixweights 1
1 1 1
1 1 1
counts cfile
EOF

# Optimize smoothing parameters > lmsmooth
ngram-count -debug 2 -text $test -count-lm -init-lm countlm -lm lmsmooth \
    -order $order -vocab vocab -no-eos -no-sos -gt1max 0 -gt2max 0 -gt3max 0

lmsmooth:
order 3 
mixweights 1
1 1 1
0 0 0.674508
countmodulus 1
vocabsize 5
totalcount 9
counts cfile


# Evaluate perplexity using lmsmooth model 
ngram -debug 2 -count-lm -lm lmsmooth -order $order -ppl $test \
    -write-lm lm2 -vocab vocab -no-eos -no-sos

A B C A B C A B C D
    p( A |  )   = [9,0,3] 0.2 [ -0.69897 ]
    p( B | A ...)   = [9,0,3,0,3] 0.2 [ -0.69897 ]
    p( C | B ...)   = [9,0,3,0,3,0.674508,3] 0.739606 [ -0.130999 ]
    p( A | C ...)   = [9,0,3,0,2,0.674508,2] 0.51477 [ -0.288386 ]
    p( B | A ...)   = [9,0,3,0,3,0.674508,2] 0.739606 [ -0.130999 ]
    p( C | B ...)   = [9,0,3,0,3,0.674508,3] 0.739606 [ -0.130999 ]
    p( A | C ...)   = [9,0,3,0,2,0.674508,2] 0.51477 [ -0.288386 ]
    p( B | A ...)   = [9,0,3,0,3,0.674508,2] 0.739606 [ -0.130999 ]
    p( C | B ...)   = [9,0,3,0,3,0.674508,3] 0.739606 [ -0.130999 ]
    p( D | C ...)   = [9,0,0,0,0,0.674508,0] 0.0650984 [ -1.18643 ]
0 sentences, 10 words, 0 OOVs
0 zeroprobs, logprob= -3.81614 ppl= 2.40776 ppl1= 2.40776
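
The probabilities in this trace can be reproduced by hand. A minimal sketch, assuming the count LM interpolates each order recursively as p_n = lambda_n * c(h,w)/c(h) + (1 - lambda_n) * p_(n-1), starting from a uniform 1/vocabsize. The helper names are illustrative, and the fallback for unseen contexts is a guess that this example never exercises.

```python
import math
from collections import Counter

def ngram_counts(tokens, max_order):
    """Count all n-grams up to max_order, without sentence boundary tags."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def jm_prob(word, history, counts, total, lambdas, vocab_size):
    """Interpolate from the uniform distribution up to the highest usable order."""
    prob = 1.0 / vocab_size
    max_n = min(len(lambdas), len(history) + 1)
    for n in range(1, max_n + 1):
        context = tuple(history[len(history) - (n - 1):])
        denom = total if n == 1 else counts[context]
        if denom == 0:
            break  # unseen context: keep the lower-order estimate (assumption)
        ml = counts[context + (word,)] / denom
        prob = lambdas[n - 1] * ml + (1.0 - lambdas[n - 1]) * prob
    return prob

train = "A B C A B C A B C".split()
test = "A B C A B C A B C D".split()
counts = ngram_counts(train, 3)
lambdas = [0.0, 0.0, 0.674508]  # weights that appear in the trace above

probs = [jm_prob(w, test[:i], counts, len(train), lambdas, vocab_size=5)
         for i, w in enumerate(test)]
logprob = sum(math.log10(p) for p in probs)
ppl = 10 ** (-logprob / len(test))
print(["%.6g" % p for p in probs])
print("logprob=%.5f ppl=%.5f" % (logprob, ppl))
```

With lambdas = [1.0, 1.0, 1.0] the same function returns 0 for D, which corresponds to the zeroprob in the run with manual parameters.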

# Evaluate perplexity using manual parameters
ngram -debug 2 -count-lm -lm countlm -order $order -ppl $test \
    -write-lm lm2 -vocab vocab -no-eos -no-sos

A B C A B C A B C D
    p( A |  )   = [9,1,3] 0.333333 [ -0.477121 ]
    p( B | A ...)   = [9,1,3,1,3] 1 [ 0 ]
    p( C | B ...)   = [9,1,3,1,3,1,3] 1 [ 0 ]
    p( A | C ...)   = [9,1,3,1,2,1,2] 0.666667 [ -0.176091 ]
    p( B | A ...)   = [9,1,3,1,3,1,2] 1 [ 0 ]
    p( C | B ...)   = [9,1,3,1,3,1,3] 1 [ 0 ]
    p( A | C ...)   = [9,1,3,1,2,1,2] 0.666667 [ -0.176091 ]
    p( B | A ...)   = [9,1,3,1,3,1,2] 1 [ 0 ]
    p( C | B ...)   = [9,1,3,1,3,1,3] 1 [ 0 ]
    p( D | C ...)   = [9,1,0,1,0,1,0] 0 [ -inf ]
0 sentences, 10 words, 0 OOVs
1 zeroprobs, logprob= -0.829304 ppl= 1.23636 ppl1= 1.23636


# Evaluate the perplexity with no smoothing at all
ngram -debug 2 -ppl $test -lm lm -order $order -vocab vocab -no-eos -no-sos

A B C A B C A B C D
    p( A |  )   = [1gram] 0.333333 [ -0.477121 ]
    p( B | A ...)   = [2gram] 1 [ 0 ]
    p( C | B ...)   = [3gram] 1 [ 0 ]
    p( A | C ...)   = [3gram] 1 [ 0 ]
    p( B | A ...)   = [3gram] 1 [ 0 ]
    p( C | B ...)   = [3gram] 1 [ 0 ]
    p( A | C ...)   = [3gram] 1 [ 0 ]
    p( B | A ...)   = [3gram] 1 [ 0 ]
    p( C | B ...)   = [3gram] 1 [ 0 ]
    p( D | C ...)   = [1gram] 0 [ -inf ]
0 sentences, 10 words, 0 OOVs
1 zeroprobs, logprob= -0.477121 ppl= 1.12983 ppl1= 1.12983


Kind regards,
-- 
Christophe


From stolcke at speech.sri.com  Sun Jun  7 09:49:18 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 07 Jun 2009 09:49:18 -0700
Subject: SRILM 1.5.8 released
In-Reply-To: <444174749.03914@ustc.edu.cn>
References: <444174749.03914@ustc.edu.cn>
Message-ID: <4A2BEF8E.6010506@speech.sri.com>

yingyul at mail.ustc.edu.cn wrote:
> Dear Sir,
>   I have installed the SRILM 1.5.8 in ubuntu9.0.4 with i686 architecture. I will
> describe the installation process in detail. It may be useful for you. 
>   If you understand chinese , you can visit the chinese website which contain the
> Process Description of installing the SRILM 1.5.8 in ubuntu9.0.4 with i686
> architecture.
> http://www.52nlp.cn/ubuntu-64-bit-system-srilm-configuration
>
> First of all, I also installed the following freely available tools: 
> (1)A template-capable ANSI-C/C++ compiler: gcc 4.3, g++ 4.4.3.3
> (2)GNU make 3.81: It is used to control compilation and installation of the SRILM
> 1.5.8.
> (3)GNU gawk
> (4)GNU gzip
> (5)bzip2
> (6)p7zip
> (7)Tcl 
> (8)csh
>
> secondly, I will describe the installation process in detail:
> 1.	Creating the installation directory and extracting the compressed package to
> the directory.My installation directory is /home/user/srilm.
> 2.	Modifying Makefile file(in the directory:/home/user/srilm)
>   (1).searching the line: "# SRILM = /home/speech/stolcke/project/srilm/devel
> ". Input the actual installation path of srilm. The line was revised to:
> "SRILM=/home/user/srilm"
>   (2).searching the line:" MACHINE_TYPE := $(shell
> $(SRILM)/sbin/machine-type)". Input the actual machine type. The line was revised
> to:"MACHINE_TYPE := i686-m64". The line tells "Makefile" to find the settings of
> ubuntu9.0.4 with i686 architecture in the path:
> /home/user/srilm/common/Makefile.machine.i686-m64.
> 3.	Modifying Makefile.machine.i686-m64 file(in the
> directory:/home/user/srilm/common/Makefile.machine.i686-m64)
> searching the line:"GAWK = /usr/bin/awk"
> The line was revised to:"GAWK = /usr/bin/gawk"
>   
Interesting.  I assumed that Linux systems have both /usr/bin/awk and 
/usr/bin/gawk and that they are the same program. If there is a reason 
to use /usr/bin/gawk I will change the default configuration for 
i686 and i686-m64.
> Thirdly, modifying system environment variables.
> input the command: sudo gedit /etc/profile
> finding the lines:
> if [ "$PS1" ]; then
>   if [ "$BASH" ]; then
>     PS1='\u@\h:\w\$ '
>     if [ -f /etc/bash.bashrc ]; then
>       . /etc/bash.bashrc
>     fi
>   else
>     if [ "`id -u`" -eq 0 ]; then
>       PS1='# '
>     else
>       PS1='$ '
>     fi
>   fi
> fi
> Below these lines, input the setting: export
> PATH="$PATH:/home/user/srilm/bin/i686-m64:/home/user/srilm/bin"
>   
But this could also be done in the user's ~/.profile, right ?  Modifying 
/etc/profile affects all users and requires root authority.

> Finally.Installing and testing the SRILM 1.5.8
> 1.Input the following commands:
>   cd /home/user/srilm
>   make World
> 2.Input the following commands:
>   cd test
>   make all
>
> OK!!
>   
I'm glad it worked smoothly.  I get quite a few reports (of course 
without any useful details!) of problems installing SRILM on ubuntu,  so 
it's good to confirm that the installation instructions do in fact work.

Andreas

> Regards,
>  
>                Yulong Ying
>
>
>   
>> From: Sanjay Chatterji <sanjaychatter at gmail.com>
>> Reply-To: 
>> To: Andreas Stolcke <stolcke at speech.sri.com>
>> Subject: Re: SRILM 1.5.8 released
>> Date:Wed, 3 Jun 2009 15:44:59 +0530
>>
>> Dear Sir,
>> I tried to install the SRILM 1.5.8 in fedora 6 with i686 architecture. But
>> it is giving an error and
>> ngram
>> ngram-class
>> ngram-count
>> ngram-merge
>> etc. are not created.
>>
>> gcc version is gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13)
>> tcl is installed
>> Please suggest,
>> Regards,
>> Sanjay
>>
>>
>>     


From stolcke at speech.sri.com  Sun Jun  7 10:07:24 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 07 Jun 2009 10:07:24 PDT
Subject: Testing srilm-user
Message-ID: <200906071707.n57H7O804280@ns2>


Please ignore.

--Andreas


From i_am_behrang at yahoo.com  Mon Jun  8 08:57:30 2009
From: i_am_behrang at yahoo.com (i_am_behrang at yahoo.com)
Date: Mon, 8 Jun 2009 08:57:30 -0700 (PDT)
Subject: Building adapted language models
Message-ID: <297533.48294.qm@web110315.mail.gq1.yahoo.com>


Hi,

Is there an option to give weights to certain training instances (sentences)? For example if I have some sentences that are more relevant to my translation domain and I want them to influence the LM 4 times more than the rest of the data.

I've done this by repeating the more relevant training instances, which makes the model training quite slow. Is there an alternative way in SRILM?

Thanks
Behrang


From stolcke at speech.sri.com  Wed Jun 10 22:36:53 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 10 Jun 2009 22:36:53 PDT
Subject: Building adapted language models 
In-Reply-To: Your message of Mon, 08 Jun 2009 08:57:30 -0700.
             <297533.48294.qm@web110315.mail.gq1.yahoo.com> 
Message-ID: <200906110536.n5B5arZ06507@ns2>


In message <297533.48294.qm at web110315.mail.gq1.yahoo.com>you wrote:
> 
> Hi,
> 
> Is there an option to give weights to certain training instances (sentences)?
> For example if I have some sentences that are more relevant to my translation
> domain and I want them to influence the LM 4 times more than the rest of the
> data.
> 
> I've done this by repeating the more relevant training instances, which makes
> the model training quite slow. Is there an alternative way in SRILM?
> 
You can weight the counts, pool them, and train a single LM.  The
internal methods that perform sentence-level count generation actually
have an argument to scale the counts by a number, but this functionality
was not accessible at the command line.   So I added an option ngram-count
-text-has-weights that tells ngram-count that the first field in each
line is a count scaling factor (the number has to be an integer, but can
be a floating point number if -float-counts is enabled).  This is
available in the 1.5.9-beta version that you can download now.
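
The idea behind -text-has-weights can be sketched outside SRILM. The snippet below is an illustration only, assuming each line starts with a numeric weight followed by the sentence, as described above. It is not SRILM's implementation, it adds no sentence boundary tags, and the function name and sample data are made up.

```python
from collections import Counter

def weighted_ngram_counts(lines, order):
    """Accumulate n-gram counts, scaling each sentence by its leading weight."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        weight, words = float(fields[0]), fields[1:]
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += weight
    return counts

# In-domain sentence weighted 4x instead of being repeated four times.
weighted = ["4 book a flight", "1 read a book"]
repeated = ["1 book a flight"] * 4 + ["1 read a book"]

assert weighted_ngram_counts(weighted, 2) == weighted_ngram_counts(repeated, 2)
print(weighted_ngram_counts(weighted, 2)[("a",)])
```

Each sentence is read once, so the input does not grow with the weight, which is the cost that repeating sentences incurs.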

Or you can train separate LMs for different subsets of data (this only makes
sense if the unit of weighting is larger than a sentence, e.g., a data source
or corpus), and then interpolate (mix) the probability estimates with weights.
LM interpolation is described with the "-mix-lm" option in ngram(1).
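
Linear interpolation of two models amounts to a weighted sum of their probabilities for the same word and context. A minimal sketch with made-up unigram tables follows; the weight name lam and the probabilities are hypothetical, and real models would condition on context.

```python
def interpolate(p_main, p_mix, lam):
    """Mix two probability tables: lam * p_main + (1 - lam) * p_mix."""
    words = set(p_main) | set(p_mix)
    return {w: lam * p_main.get(w, 0.0) + (1.0 - lam) * p_mix.get(w, 0.0)
            for w in words}

in_domain = {"flight": 0.5, "book": 0.3, "the": 0.2}
general = {"flight": 0.1, "book": 0.1, "the": 0.6, "cat": 0.2}

mixed = interpolate(in_domain, general, lam=0.8)
print(sorted(mixed.items()))
```

Because both inputs sum to one, the mixture does too, and a word seen only in the general model ("cat") still receives some probability.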

Andreas


From elias.majic at gmail.com  Sat Jun 13 11:42:00 2009
From: elias.majic at gmail.com (Elias Majic)
Date: Sat, 13 Jun 2009 14:42:00 -0400
Subject: Google Web N-gram
Message-ID: <5a87a1470906131142x146df1c0qba8d5a9f229c7aeb@mail.gmail.com>

Hello,

First off, to save you from having to read everything below: suppose I used
make-google-ngrams to store the N-gram counts of a small text corpus on disk
in Google's format.  How do I then convert this to ARPA format with SRILM?

I have read the Google Web N-gram section in the F.A.Q, I read all the
emails with the search   term google in it and I read all the relevant man
pages as well as looked at relevant run-tests without success.

My goal is to make an ARPA-format language model from the N-gram counts
inside the Google Web N-gram corpus.  I realize it's too large to load into
memory, as discussed in the documentation, so, as one of the emails on the
list suggested, I pruned out most of the junk or non-dictionary words,
merged different cases, and fixed the config files.  I have now reduced the
data quite significantly but am unable to figure out how to convert it to
ARPA format.  Below is what I tried:

1.ngram -order 5 -count-lm -lm google.countlm -write-lm arpaLM

This did not work. It produced the same duplicate file of google.countlm

2. I noticed in the man pages that using the option -expand-classes forces
the output to be a single ngram model in ARPA format. So I tried:
ngram -order 5 -count-lm -lm google.countlm -expand-classes 5 \
    -write-lm arpaLM
Nothing happened but the output:
HMM, NgramCountLM, AdaptiveMix, Decipher, tagged, factored, DF, hidden
N-gram, hidden-S, class N-gram, skip N-gram and stop-word N-gram models are
mutually exclusive

3. I thought maybe using -mix-lm would result in an ARPA model, as the man
pages also say this happens with -mix-lm. I realized this was unlikely
to work as I am combining the same LMs, but tried regardless.
ngram -order 5 -count-lm -lm google.countlm -expand-classes 5 \
    -mix-lm google.countlm -write-lm arpaLM
Output was the same as google.countlm

I tried other things like using ngram-count and running the lm-scripts but
no dice.  One of the relevant posts on the list is linked below:

http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/2007-April/8.html
The URL above mentions:

>> Could you give me an *example* about building google 3-gram LM file
>> ,please?
>>
>Again, this will require using the  option with some tricks
>that are not documented
>as yet. Please be patient (or read all the manual pages carefully to
>figure it out yourself.)

Has any documentation been written regarding this? Did the trick involve using
-mix-lm or -expand-classes to force ARPA format?

I figure worst case I do it manually but am sure there is something in SRILM
that I am missing.

Thanks
Elias

From stolcke at speech.sri.com  Mon Jun 15 11:34:03 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 15 Jun 2009 11:34:03 -0700
Subject: Google Web N-gram
In-Reply-To: <5a87a1470906131142x146df1c0qba8d5a9f229c7aeb@mail.gmail.com>
References: <5a87a1470906131142x146df1c0qba8d5a9f229c7aeb@mail.gmail.com>
Message-ID: <4A36941B.30901@speech.sri.com>

Elias Majic wrote:
> Hello,
>
> First off, to save you from having to read the below, suppose I used 
> make-google-ngrams to store a small corpus of text's N-gram counts on 
> disk in googles format.  How do I then convert this to ARPA format 
> with SRILM?
You don't.  There is no reason to convert a standard ngram count file 
into google format for building an ARPA LM.
Converting the counts into a different format won't help you deal with 
any memory issues.
SRILM currently is just not set up to estimate ARPA LMs of the size 
implied by the google corpus.
That's why we created the count-LM approach, that can make use of the 
google ngram files directly.
The estimation process is described in the FAQ, as you know.

If you want to build very large backoff LMs there are a few other LM 
tools out there that are explicitly targeted at  large data sets.   Try 
googling "MSRLM"  and "IRSTLM".    I doubt that even if you were able to 
build a traditional ARPA LM from all the google ngrams it would do you 
much good -- it would take way too long to load into memory, even if 
only a subset were used.  That's why MSRLM, for example, uses a 
server-based approach.

Andreas

>
> I have read the Google Web N-gram section in the F.A.Q, I read all the 
> emails with the search   term google in it and I read all the relevant 
> man pages as well as looked at relevant run-tests without success.
>
> My goal is to make an arpa format language model from the N-gram 
> counts inside the Google Web N-gram corpus.  I realize its too large 
> to load into memory as discussed in the documentation, so as per one 
> of the emails in the list suggested, I pruned out most of the junk or 
> non dictionary words and merged different cases and fixed the config 
> files.  So now I reduced the data quite significantly and am unable to 
> figure out how to convert it to arpa format.  Below is what I tried:
>
> 1.ngram -order 5 -count-lm -lm google.countlm -write-lm arpaLM
>
> This did not work. It produced the same duplicate file of google.countlm
>
> 2. I noticed in the man pages that using the command -expand-classes 
> forced the output to be a single ngram model in ARPA format. So I tried:
> ngram -order 5 -count-lm -lm google.countlm -expand-classes 5 
> -write-lm arpaLM
> Nothing happened but the output:
> HMM, NgramCountLM, AdaptiveMix, Decipher, tagged, factored, DF, hidden 
> N-gram, hidden-S, class N-gram, skip N-gram and stop-word N-gram 
> models are mutually exclusive
>
> 3.I thought maybe using mix-lm would result in an arpa model as it 
> also says in the man pages this would occur with mix-lm. I realize 
> this was unlikely to work as I am combining the same lm's but tried 
> regardless.
> ngram -order 5 -count-lm -lm google.countlm -expand-classes 5 -mix-lm 
> google.countlm -write-lm arpaLM
> Output was the same as google.countlm
>
> I tried other things like using ngram-count and running the lm-scripts 
> but no dice.  One of the relevant posts in the forum I posted below:
>
> http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/2007-April/8.html
> The URL above mentions:
> *
> />> Could you give me an *example* about bulilding google 3-gram LM file
> >> ,please?
> >>  
> >Again, this will require using the  option with some tricks
> >that are not documents
> >as yet. Please be patient (or read all the manual pages carefully to
> >figure it our yourself.)/*
> *
> *Has any documentations been made regarding this? Did the trick infer 
> using mix-lm or expand-classes to force arpa format? 
>
> I figure worst case I do it manually but am sure there is something in 
> SRILM that I am missing.
>
> Thanks
> Elias


From aria.rastrou at gmail.com  Mon Jun 22 22:50:29 2009
From: aria.rastrou at gmail.com (Ariya Rastrow)
Date: Tue, 23 Jun 2009 01:50:29 -0400
Subject: problem with linking?
Message-ID: <4205a1540906222250y6b0aaf4cmaef43c4c5de3fbb6@mail.gmail.com>

Hi there,
  I have a question regarding compiling/linking C++ code that uses SRILM
classes. My code compiles, but when I try to link it I get a whole bunch
of errors. I have read something about this linking problem and the fact
that I have to use exactly the same flags that were used when building and
linking SRILM. But even after doing that I still get the following
errors. Any help would be appreciated.

DfsNgram.o: In function `DfsNgram::computeBOWs(int)':
DfsNgram.cpp:(.text+0x15): undefined reference to
`Ngram::computeBOWs(unsigned int)'
DfsNgram.o: In function `DfsNgram::~DfsNgram()':
DfsNgram.cpp:(.text+0x9c): undefined reference to `Ngram::~Ngram()'
DfsNgram.o: In function `DfsNgram::DfsNgram(Vocab&, unsigned int)':
DfsNgram.cpp:(.text+0xd5): undefined reference to `Ngram::Ngram(Vocab&,
unsigned int)'
DfsNgram.o: In function `DfsNgram::DfsNgram(Vocab&, unsigned int)':
DfsNgram.cpp:(.text+0xf5): undefined reference to `Ngram::Ngram(Vocab&,
unsigned int)'
DfsNgram.o: In function `DfsNgram::findBOnode(unsigned int*)':
DfsNgram.cpp:(.text+0x17d): undefined reference to `_Map::foundP'
DfsNgram.cpp:(.text+0x186): undefined reference to `_Map::foundP'
DfsNgram.cpp:(.text+0x195): undefined reference to `_Map::foundP'
DfsNgram.cpp:(.text+0x22e): undefined reference to `_Map::foundP'
DfsNgram.cpp:(.text+0x241): undefined reference to `_Map::foundP'
DfsNgram.o: In function `DfsNgram::~DfsNgram()':
DfsNgram.cpp:(.text+0xb8): undefined reference to `Ngram::~Ngram()'
DfsNgram.o: In function `DfsNgram::~DfsNgram()':
DfsNgram.cpp:(.text+0xc8): undefined reference to `Ngram::~Ngram()'
DfsNgram.o: In function `LM::followIter(unsigned int const*)':
DfsNgram.cpp:(.text._ZN2LM10followIterEPKj[LM::followIter(unsigned int
const*)]+0x30): undefined reference to `_LM_FollowIter::_LM_FollowIter(LM&,
unsigned int const*)'
DfsNgram.o: In function `Trie<unsigned int, BOnode>::findTrie(unsigned int
const*, bool&) const':
DfsNgram.cpp:(.text._ZNK4TrieIj6BOnodeE8findTrieEPKjRb[Trie<unsigned int,
BOnode>::findTrie(unsigned int const*, bool&) const]+0xb2): undefined
reference to `_Map::foundP'
DfsNgram.cpp:(.text._ZNK4TrieIj6BOnodeE8findTrieEPKjRb[Trie<unsigned int,
BOnode>::findTrie(unsigned int const*, bool&) const]+0x108): undefined
reference to `_Map::foundP'
DfsNgram.cpp:(.text._ZNK4TrieIj6BOnodeE8findTrieEPKjRb[Trie<unsigned int,
BOnode>::findTrie(unsigned int const*, bool&) const]+0x207): undefined
reference to `_Map::foundP'
DfsNgram.cpp:(.text._ZNK4TrieIj6BOnodeE8findTrieEPKjRb[Trie<unsigned int,
BOnode>::findTrie(unsigned int const*, bool&) const]+0x28c): undefined
reference to `_Map::foundP'
DfsNgram.cpp:(.text._ZNK4TrieIj6BOnodeE8findTrieEPKjRb[Trie<unsigned int,
BOnode>::findTrie(unsigned int const*, bool&) const]+0x39d): undefined
reference to `_Map::foundP'
DfsNgram.o:DfsNgram.cpp:(.text._ZNK4TrieIj6BOnodeE8findTrieEPKjRb[Trie<unsigned
int, BOnode>::findTrie(unsigned int const*, bool&) const]+0x4b1): more
undefined references to `_Map::foundP' follow
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x38): undefined
reference to `Ngram::wordProb(unsigned int, unsigned int const*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x40): undefined
reference to `LM::wordProb(char const*, char const* const*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x48): undefined
reference to `LM::wordProbRecompute(unsigned int, unsigned int const*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x50): undefined
reference to `LM::sentenceProb(unsigned int const*, TextStats&)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x58): undefined
reference to `LM::sentenceProb(char const* const*, TextStats&)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x60): undefined
reference to `LM::contextProb(unsigned int const*, unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x68): undefined
reference to `LM::countsProb(NgramStats&, TextStats&, unsigned int, bool)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x70): undefined
reference to `LM::pplCountsFile(File&, unsigned int, TextStats&, char
const*, bool)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x78): undefined
reference to `LM::pplFile(File&, TextStats&, char const*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x80): undefined
reference to `LM::rescoreFile(File&, double, double, LM&, double, double,
char const*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x88): undefined
reference to `LM::probServer(unsigned int, unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x90): undefined
reference to `LM::setState(char const*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x98): undefined
reference to `LM::wordProbSum(unsigned int const*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xa0): undefined
reference to `LM::generateWord(unsigned int const*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xa8): undefined
reference to `LM::generateSentence(unsigned int, unsigned int*, unsigned
int*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xb0): undefined
reference to `LM::generateSentence(unsigned int, char const**, char
const**)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xc8): undefined
reference to `Ngram::contextID(unsigned int, unsigned int const*, unsigned
int&)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xd0): undefined
reference to `Ngram::contextBOW(unsigned int const*, unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xd8): undefined
reference to `LM::addUnkWords()'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xe0): undefined
reference to `LM::isNonWord(unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xe8): undefined
reference to `Ngram::read(File&, bool)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xf0): undefined
reference to `Ngram::write(File&)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xf8): undefined
reference to `LM::writeBinary(File&)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x118): undefined
reference to `Ngram::memStats(MemStats&)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x120): undefined
reference to `LM::removeNoise(unsigned int*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x128): undefined
reference to `Ngram::writeWithOrder(File&, unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x140): undefined
reference to `Ngram::estimate(NgramStats&, unsigned long*, unsigned long*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x148): undefined
reference to `Ngram::estimate(NgramStats&, Discount**)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x150): undefined
reference to `Ngram::estimate(NgramCounts<double>&, Discount**)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x158): undefined
reference to `Ngram::mixProbs(Ngram&, double)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x160): undefined
reference to `Ngram::mixProbs(Ngram&, Ngram&, double)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x168): undefined
reference to `Ngram::recomputeBOWs()'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x170): undefined
reference to `Ngram::pruneProbs(double, unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x178): undefined
reference to `Ngram::pruneLowProbs(unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x180): undefined
reference to `Ngram::rescoreProbs(LM&)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x188): undefined
reference to `Ngram::numNgrams(unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x190): undefined
reference to `Ngram::wordProbBO(unsigned int, unsigned int const*, unsigned
int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x198): undefined
reference to `Ngram::vocabSize()'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x1a0): undefined
reference to `Ngram::fixupProbs()'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x1a8): undefined
reference to `Ngram::distributeProb(double, unsigned int*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x1b0): undefined
reference to `Ngram::computeBOW(BOnode*, unsigned int const*, unsigned int,
double&, double&)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x1b8): undefined
reference to `Ngram::computeBOWs(unsigned int)'
DfsNgram.o:(.rodata._ZTI8DfsNgram[typeinfo for DfsNgram]+0x10): undefined
reference to `typeinfo for Ngram'
DiscLMUpdate.o: In function `DiscLMUpdate::_REchange(unsigned int&, unsigned
int*, float&, DfsNgram*, DfsNgram*)':
DiscLMUpdate.cpp:(.text+0x9f): undefined reference to `LogP_Zero'
DiscLMUpdate.o: In function `DiscLMUpdate::_ApplyUpdate(unsigned int,
bool)':
DiscLMUpdate.cpp:(.text+0x433): undefined reference to `_Map::foundP'
DiscLMUpdate.cpp:(.text+0x47a): undefined reference to `Vocab::copy(unsigned
int*, unsigned int const*)'
DiscLMUpdate.cpp:(.text+0x482): undefined reference to
`Vocab::reverse(unsigned int*)'
DiscLMUpdate.o: In function
`DiscLMUpdate::DiscLMUpdate(std::basic_string<char, std::char_traits<char>,
std::allocator<char> >, int)':
DiscLMUpdate.cpp:(.text+0x608): undefined reference to
`Vocab::Vocab(unsigned int, unsigned int)'
DiscLMUpdate.cpp:(.text+0x642): undefined reference to `File::File(char
const*, char const*, int)'
DiscLMUpdate.cpp:(.text+0x695): undefined reference to `File::close()'
DiscLMUpdate.cpp:(.text+0x69d): undefined reference to `File::~File()'
DiscLMUpdate.cpp:(.text+0x6d9): undefined reference to `File::~File()'
DiscLMUpdate.o: In function
`DiscLMUpdate::DiscLMUpdate(std::basic_string<char, std::char_traits<char>,
std::allocator<char> >, int)':
DiscLMUpdate.cpp:(.text+0x748): undefined reference to
`Vocab::Vocab(unsigned int, unsigned int)'
DiscLMUpdate.cpp:(.text+0x782): undefined reference to `File::File(char
const*, char const*, int)'
DiscLMUpdate.cpp:(.text+0x7d5): undefined reference to `File::close()'
DiscLMUpdate.cpp:(.text+0x7dd): undefined reference to `File::~File()'
DiscLMUpdate.cpp:(.text+0x819): undefined reference to `File::~File()'
DiscLMUpdate.o: In function
`DiscLMUpdate::ReadUpdates(std::tr1::unordered_map<std::basic_string<char,
std::char_traits<char>, std::allocator<char> >, double,
std::tr1::hash<std::basic_string<char, std::char_traits<char>,
std::allocator<char> > >, std::equal_to<std::basic_string<char,
std::char_traits<char>, std::allocator<char> > >,
std::allocator<std::pair<std::basic_string<char, std::char_traits<char>,
std::allocator<char> > const, double> >, false>*, bool, bool)':
DiscLMUpdate.cpp:(.text+0xcea): undefined reference to
`Vocab::parseWords(char*, char const**, unsigned int)'
DiscLMUpdate.cpp:(.text+0xd34): undefined reference to `Vocab::copy(unsigned
int*, unsigned int const*)'
DiscLMUpdate.cpp:(.text+0xd3c): undefined reference to
`Vocab::reverse(unsigned int*)'
DiscLMUpdate.cpp:(.text+0xd6d): undefined reference to
`Ngram::findProb(unsigned int, unsigned int const*)'
DiscLMUpdate.cpp:(.text+0xd7a): undefined reference to
`Vocab::reverse(unsigned int*)'
DiscLMUpdate.cpp:(.text+0xd9b): undefined reference to `Vocab::copy(unsigned
int*, unsigned int const*)'
DiscLMUpdate.cpp:(.text+0xe15): undefined reference to `_Map::foundP'
DiscLMUpdate.cpp:(.text+0xe1c): undefined reference to `_Map::foundP'
DiscLMUpdate.cpp:(.text+0xf79): undefined reference to `_Map::foundP'
DiscLMUpdate.cpp:(.text+0x10de): undefined reference to `_Map::foundP'
DiscLMUpdate.cpp:(.text+0x10fd): undefined reference to `_Map::foundP'
DiscLMUpdate.o:DiscLMUpdate.cpp:(.text._ZNK4TrieIjfE8findTrieEPKjRb[Trie<unsigned
int, float>::findTrie(unsigned int const*, bool&) const]+0xaf): more
undefined references to `_Map::foundP' follow
DiscTrain.o: In function `__static_initialization_and_destruction_0(int,
int)':
DiscTrain.cpp:(.text+0xb2): undefined reference to `Vocab::Vocab(unsigned
int, unsigned int)'
DiscTrain.cpp:(.text+0xda): undefined reference to `Ngram::Ngram(Vocab&,
unsigned int)'
collect2: ld returned 1 exit status
make: *** [all] Error 1

---
Ariya Rastrow
PhD Candidate,
Center for Language and Speech Processing (CLSP)
Johns Hopkins University