From Dmitriy.Dligach at colorado.edu  Tue Apr  7 08:30:53 2009
From: Dmitriy.Dligach at colorado.edu (Dmitriy Dligach)
Date: Tue, 07 Apr 2009 09:30:53 -0600
Subject: OOV words
Message-ID: <20090407093053.114155kecp61tt44@opsmail.colorado.edu>

Hello,

First of all, I wanted to thank the creators of SRILM -- I find this tool
extremely useful in my research.

Second, I have a question about out-of-vocabulary (OOV) words. I train a
language model on a collection of English newswire text:

ngram-count -text all.txt -lm all.lm -order 5

and then compute probabilities:

ngram -lm all.lm -ppl test.txt -debug 1

There happen to be some sentences in foreign languages in my test.txt
file. I'd expect them to receive very low probabilities because the model
was trained on strictly English text. Instead, they receive very high
probabilities. Could this have something to do with the way SRILM handles
OOV words?

Dima

From christophe.hauser at irisa.fr  Wed Apr  8 09:51:32 2009
From: christophe.hauser at irisa.fr (Christophe Hauser)
Date: Wed, 8 Apr 2009 18:51:32 +0200
Subject: [christophe.hauser@irisa.fr: Jelinek Mercer Smoothing]
In-Reply-To: <49D25248.7020209@speech.sri.com>
References: <20090331100957.GA18372@sovkipeu.irisa.fr>
	<49D25248.7020209@speech.sri.com>
Message-ID: <20090408165132.GE3589@sovkipeu.irisa.fr>

On Tue, Mar 31, 2009 at 10:26:32AM -0700, Andreas Stolcke wrote:
> An example of the count-lm training procedure is given by
> $SRILM/test/tests/ngram-count-lm/run-test .
>
> Andreas

Hello,

I am trying to reproduce some experiments using SRILM. I would like to
apply Jelinek-Mercer smoothing, but the perplexity results I get are very
odd: far higher than the results with no smoothing at all.
Here is what I did:

ngram-count -text training -lm lm -order $order -write-vocab vocab -write cfile
ngram -ppl test -lm lm -order $order -vocab vocab -unk

file test: 1 sentences, 964 words, 41 OOVs
0 zeroprobs, logprob= -1445.86 ppl= 36.7102 ppl1= 36.8538

Then, if I use Jelinek-Mercer smoothing

cat >countlm <

Hi,

Is there an option to give weights to certain training instances
(sentences)? For example, I have some sentences that are more relevant to
my translation domain, and I want them to influence the LM 4 times more
than the rest of the data.

Currently I'm doing that by just repeating those important sentences in
the training corpus, which makes training take much longer. Is there an
alternative way to do this?

I was also wondering why there is such a slowdown. My guess is that the
repetition dramatically changes the number of n-grams (mainly trigrams):
many of the infrequent bigrams and trigrams that are filtered out in the
baseline model will be kept in the new model. Is that right?

Thanks
Behrang

From sylvain.raybaud at crans.org  Wed Apr 22 02:44:23 2009
From: sylvain.raybaud at crans.org (Sylvain Raybaud)
Date: Wed, 22 Apr 2009 11:44:23 +0200
Subject: problems getting srilm
Message-ID: <200904221144.23250.sylvain.raybaud@crans.org>

Dear list,

I need to use the SRILM toolkit for my PhD, but I have been unable to
download it. I have tried several times: after I fill in the form the
download starts but is incredibly slow (around 4kBps), and it stops after
a while (Firefox gives me an error like "the connection to the server was
reset"). If I pass the download link to wget I only get an HTML page
complaining about empty form fields (this was to be expected). Did I miss
something?
Thanks,
regards,
-- 
Sylvain Raybaud

From stolcke at speech.sri.com  Sun May 10 11:09:29 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 10 May 2009 11:09:29 PDT
Subject: SRILM 1.5.8 released
Message-ID: <200905101809.n4AI9T613648@ns2>

The latest version of SRILM is now available from
http://www.speech.sri.com/projects/srilm/download.html .
A list of changes appears below.

Enjoy,

Andreas

1.5.8	10 May 2009

Functionality:

* merge-batch-counts -float-counts option for merging of fractional
  counts.
* compare-sclite now includes statistical significance computation based
  on a matched-pair Sign test.
* Added a Perl tool to compute the cumulative binomial distribution,
  contributed by Brett Kessler and David Gelbart.
* Don't output LM server banner message for ngram -use-server -debug 0.
* The LM::generateSentence() function now takes an optional argument to
  specify a sentence prefix that is used to condition subsequent word
  generation (suggested by Alexy Khrabrov). The default is to condition
  on <s> as before, or an empty context if no start-of-sentence tag is
  defined.
* A new option ngram -gen-prefixes to read conditioning prefixes from a
  file, and generate random sentences based on them.
* New options in nbest-optimize that modify -print-hyps output so that
  only unique hypotheses are included (-print-unique-hyps), and to print
  the original ranks of hypotheses (-print-old-ranks) (from Jing Zheng).
* The -version option reports whether support for compressed files is
  available.
* Added merge-batch-counts -l option to control how many files to merge
  in each iteration.

Bug fixes:

* ngram-count, NgramLM: disable the Doug Paul smoothing hack (add one to
  denominator when smoothing results in 0 backoff mass) in contexts where
  the entire vocabulary has been observed.
* nbest-optimize fixes to the -minimum-bleu-reference functionality (from
  Jing Zheng).
* Fixed nbest-optimize bug that was causing incorrect log output with
  gcc 4.x.
* Output vocabulary index map in binary ngram count and LM format in
  numerical index order. This avoids a performance bug whereby reading
  the data structures back into _c binary version could take a long time
  due to inefficient insertion order.
* Fix ngram -counts with -use-server (from Ergun Bicici).
* Fixed memory allocation bug in FLM tag vocabulary handling that could
  lead to crash when interpolating several FLMs.
* Rewrote make-batch-counts scripts to
  - avoid problems with limits on command line length
  - support systems that don't have compressed file I/O.
* Modified merge-batch-counts script to
  - ensure that unmerged files are always merged in the next iteration,
    to avoid file size imbalance (suggested by Alex Marin)
  - support systems that don't have compressed file I/O.
* Fixed a portability issue with Intel icc version 7.0.
* compute-sclite fixed to invoke csrfilt.sh script with -t option.

From stolcke at speech.sri.com  Wed May 13 11:06:06 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 13 May 2009 11:06:06 -0700
Subject: ngram-discount
In-Reply-To: <200905092008369537576@gmail.com>
References: <200905092008369537576@gmail.com>
Message-ID: <4A0B0C0E.8020504@speech.sri.com>

If the model is smoothed (the default), zeroprobs typically occur for
out-of-vocabulary words. You need to train a model that assigns
probability to the unknown word (<unk>). Use the ngram-count -unk option
(you also need to specify a predefined vocabulary so that there are OOV
words in your training data that you can get a probability estimate
from). Then use ngram -unk to test the LM.

Hope this helps,

Andreas

??? wrote:
> hi,
> when I used the srilm, I found the zeroprobs of n-gram. So why will
> zeroprobs turn up?
> I used the bigram. so when I calculated p(w2|w1), if C(w1w2)=0, the
> prob backoff to unigram:alpha*P(w2);
> and if C(w2)=0 (maybe it is out-of-vocabulary),we can backoff to
> zerogram,like uniform distribution; or we use good-turing discount,
> we have some discounts which can be used to this zero count word. so I
> think zeroprobs will not turn up.
> Do I understand it right?
> or the unigram is calculated by maximum likelihood directly,like
> p(w2)=C(w2)/(all counts)?
> so why not be calculated by good-turing discount,like
> p*(w2)=C*(w2)/(all counts). (C*(w2) is calculated by good-turing).
> Thank you very much.
> Sincerely yours,
> Wang
> 2009-05-09
> ------------------------------------------------------------------------
> ???

From christophe.hauser at irisa.fr  Wed May 20 09:34:10 2009
From: christophe.hauser at irisa.fr (Christophe Hauser)
Date: Wed, 20 May 2009 18:34:10 +0200
Subject: Odd jelinek mercer results
Message-ID: <20090520163410.GA24057@sovkipeu.irisa.fr>

Hello,

I get really odd results using Jelinek-Mercer smoothing.

In the following simple example, I get the best results with no smoothing
at all (1.12). With smoothing, setting all parameters to 1 gives better
performance (1.24) than optimizing the parameters on the test set (2.4).
According to Chen & Goodman, setting all parameters to 1 means there is
no smoothing at all.

I get similar results with every other dataset I've tried. Am I doing
something wrong?
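
The per-word probabilities in the debug output of this message can be
checked by hand. The Python below is only a sketch under stated
assumptions: it reads each bracketed vector as [totalcount, lambda1, c1,
lambda2, c2, lambda3, c3] (an interpretation inferred from the numbers,
not taken from the SRILM documentation), takes each order's relative
frequency as the n-gram count divided by the count of its context, and
starts from a uniform estimate over the 5-word vocabulary. The names
jm_prob and perplexity are hypothetical helpers, not SRILM functions.

```python
def jm_prob(ngram_counts, context_counts, lambdas, vocab_size):
    """Jelinek-Mercer (recursively interpolated) probability of one word.

    ngram_counts[k]   : count of the (k+1)-gram ending in the word
    context_counts[k] : count of that n-gram's context (total tokens for k = 0)
    lambdas[k]        : weight given to the order-(k+1) relative frequency
    """
    p = 1.0 / vocab_size  # order-0 estimate: uniform over the vocabulary
    for count, context, lam in zip(ngram_counts, context_counts, lambdas):
        if context == 0:
            continue  # assumption: an unseen context keeps the lower-order estimate
        p = lam * count / context + (1.0 - lam) * p
    return p


def perplexity(logprob10, scored_tokens):
    """Perplexity from a base-10 log probability summed over scored_tokens."""
    return 10.0 ** (-logprob10 / scored_tokens)


lambdas = [0.0, 0.0, 0.674508]  # weights shown in the lmsmooth debug vectors
total, vocab = 9, 5             # totalcount and vocabsize from the lmsmooth file

# p(C | A B): C occurs 3 times, "B C" 3 times after B (3), "A B C" 3 times after "A B" (3)
p_c = jm_prob([3, 3, 3], [total, 3, 3], lambdas, vocab)
# p(A | B C): "C A" occurs 2 times after C (3), "B C A" 2 times after "B C" (3)
p_a = jm_prob([3, 2, 2], [total, 3, 3], lambdas, vocab)
# p(D | B C): D never occurs in the training text
p_d = jm_prob([0, 0, 0], [total, 3, 3], lambdas, vocab)

print(round(p_c, 6), round(p_a, 6), round(p_d, 7))
print(round(perplexity(-3.81614, 10), 4))
```

Under these assumptions the sketch agrees with the 0.739606, 0.51477 and
0.0650984 reported for the lmsmooth run. The same arithmetic suggests
why the three perplexities are not directly comparable: 10^(0.829304/9)
and 10^(0.477121/9) match the reported 1.23636 and 1.12983 only with 9
tokens in the denominator, so the zero-probability word D appears to be
left out of those two scores, while the smoothed model is scored on all
10 words.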

training : A B C A B C A B C
test     : A B C A B C A B C D

# write vocabulary
cat $test $training > everything
ngram-count -text everything -no-eos -no-sos -write-vocab vocab -order $order

# write count file
ngram-count -debug 2 -text $training -lm lm -order $order -write cfile -vocab vocab -gt1max 0 -gt2max 0 -gt3max 0 -no-eos -no-sos

cat >countlm < lmsmooth
ngram-count -debug 2 -text $test -count-lm -init-lm countlm -lm lmsmooth -order $order -vocab vocab -no-eos -no-sos -gt1max 0 -gt2max 0 -gt3max 0

lmsmooth :
order 3
mixweights 1
 1 1 1
 0 0 0.674508
countmodulus 1
vocabsize 5
totalcount 9
counts cfile

# Evaluate perplexity using lmsmooth model
ngram -debug 2 -count-lm -lm lmsmooth -order $order -ppl $test -write-lm lm2 -vocab vocab -no-eos -no-sos

A B C A B C A B C D
	p( A | ) 	= [9,0,3] 0.2 [ -0.69897 ]
	p( B | A ...) 	= [9,0,3,0,3] 0.2 [ -0.69897 ]
	p( C | B ...) 	= [9,0,3,0,3,0.674508,3] 0.739606 [ -0.130999 ]
	p( A | C ...) 	= [9,0,3,0,2,0.674508,2] 0.51477 [ -0.288386 ]
	p( B | A ...) 	= [9,0,3,0,3,0.674508,2] 0.739606 [ -0.130999 ]
	p( C | B ...) 	= [9,0,3,0,3,0.674508,3] 0.739606 [ -0.130999 ]
	p( A | C ...) 	= [9,0,3,0,2,0.674508,2] 0.51477 [ -0.288386 ]
	p( B | A ...) 	= [9,0,3,0,3,0.674508,2] 0.739606 [ -0.130999 ]
	p( C | B ...) 	= [9,0,3,0,3,0.674508,3] 0.739606 [ -0.130999 ]
	p( D | C ...) 	= [9,0,0,0,0,0.674508,0] 0.0650984 [ -1.18643 ]
0 sentences, 10 words, 0 OOVs
0 zeroprobs, logprob= -3.81614 ppl= 2.40776 ppl1= 2.40776

# Evaluate perplexity using manual parameters
ngram -debug 2 -count-lm -lm countlm -order $order -ppl $test -write-lm lm2 -vocab vocab -no-eos -no-sos

A B C A B C A B C D
	p( A | ) 	= [9,1,3] 0.333333 [ -0.477121 ]
	p( B | A ...) 	= [9,1,3,1,3] 1 [ 0 ]
	p( C | B ...) 	= [9,1,3,1,3,1,3] 1 [ 0 ]
	p( A | C ...) 	= [9,1,3,1,2,1,2] 0.666667 [ -0.176091 ]
	p( B | A ...) 	= [9,1,3,1,3,1,2] 1 [ 0 ]
	p( C | B ...) 	= [9,1,3,1,3,1,3] 1 [ 0 ]
	p( A | C ...) 	= [9,1,3,1,2,1,2] 0.666667 [ -0.176091 ]
	p( B | A ...) 	= [9,1,3,1,3,1,2] 1 [ 0 ]
	p( C | B ...) 	= [9,1,3,1,3,1,3] 1 [ 0 ]
	p( D | C ...) 	= [9,1,0,1,0,1,0] 0 [ -inf ]
0 sentences, 10 words, 0 OOVs
1 zeroprobs, logprob= -0.829304 ppl= 1.23636 ppl1= 1.23636

# Evaluate the perplexity with no smoothing at all
ngram -debug 2 -ppl $test -lm lm -order $order -vocab vocab -no-eos -no-sos

A B C A B C A B C D
	p( A | ) 	= [1gram] 0.333333 [ -0.477121 ]
	p( B | A ...) 	= [2gram] 1 [ 0 ]
	p( C | B ...) 	= [3gram] 1 [ 0 ]
	p( A | C ...) 	= [3gram] 1 [ 0 ]
	p( B | A ...) 	= [3gram] 1 [ 0 ]
	p( C | B ...) 	= [3gram] 1 [ 0 ]
	p( A | C ...) 	= [3gram] 1 [ 0 ]
	p( B | A ...) 	= [3gram] 1 [ 0 ]
	p( C | B ...) 	= [3gram] 1 [ 0 ]
	p( D | C ...) 	= [1gram] 0 [ -inf ]
0 sentences, 10 words, 0 OOVs
1 zeroprobs, logprob= -0.477121 ppl= 1.12983 ppl1= 1.12983

Kind regards,
-- 
Christophe

From stolcke at speech.sri.com  Sun Jun  7 09:49:18 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 07 Jun 2009 09:49:18 -0700
Subject: SRILM 1.5.8 released
In-Reply-To: <444174749.03914@ustc.edu.cn>
References: <444174749.03914@ustc.edu.cn>
Message-ID: <4A2BEF8E.6010506@speech.sri.com>

yingyul at mail.ustc.edu.cn wrote:
> Dear Sir,
> I have installed the SRILM 1.5.8 in ubuntu9.0.4 with i686 architecture. I will
> describe the installation process in detail. It may be useful for you.
> If you understand chinese , you can visit the chinese website which contain the
> Process Description of installing the SRILM 1.5.8 in ubuntu9.0.4 with i686
> architecture.
> http://www.52nlp.cn/ubuntu-64-bit-system-srilm-configuration
>
> First of all, I also installed the following freely available tools:
> (1)A template-capable ANSI-C/C++ compiler: gcc 4.3, g++ 4.4.3.3
> (2)GNU make 3.81: It is used to control compilation and installation of the SRILM
> 1.5.8.
> (3)GNU gawk
> (4)GNU gzip
> (5)bzip2
> (6)p7zip
> (7)Tcl
> (8)csh
>
> secondly, I will describe the installation process in detail:
> 1. Creating the installation directory and extracting the compressed package to
> the directory.My installation directory is /home/user/srilm.
> 2. Modifying Makefile file(in the directory:/home/user/srilm)
> (1).searching the line: "# SRILM = /home/speech/stolcke/project/srilm/devel
> ". Input the actual installation path of srilm. The line was revised to:
> "SRILM=/home/user/srilm"
>    (2).searching the line:" MACHINE_TYPE := $(shell
> $(SRILM)/sbin/machine-type)". Input the actual machine type. The line was revised
> to:"MACHINE_TYPE := i686-m64". The line tell "Makefile" to find the Settings of
> ubuntu9.0.4 with i686 architecture in the path:
> /home/user/srilm/common/Makefile.machine.i686-m64.
> 3. Modifying Makefile.machine.i686-m64 file(in the
> directory:/home/user/srilm/common/Makefile.machine.i686-m64)
> searching the line:"GAWK = /usr/bin/awk"
> The line was revised to:"GAWK = /usr/bin/gawk"
>

Interesting. I assumed that Linux systems have both /usr/bin/awk and
/usr/bin/gawk and they seem to be the same program. If there is a reason
to use /usr/bin/gawk I can change the default configuration for i686 and
i686-m64.

> Thirdly,Modifying system environment variables.
> input the command: sudo gedit /etc/profile
> finding the lines:
> if [ "$PS1" ]; then
>   if [ "$BASH" ]; then
>     PS1='\u@\h:\w\$ '
>     if [ -f /etc/bash.bashrc ]; then
>       . /etc/bash.bashrc
>     fi
>   else
>     if [ "`id -u`" -eq 0 ]; then
>       PS1='# '
>     else
>       PS1='$ '
>     fi
>   fi
> fi
> Below these lines,input the setting: "export
> PATH="$PATH:/home/user/srilm/bin/i686-m64:/home/user/srilm/bin""
>

But this could also be done in the user's ~/.profile, right?
Modifying /etc/profile affects all users and requires root authority.

> Finally.Installing and testing the SRILM 1.5.8
> 1.Input the following commands:
> cd /home/user/srilm
> make World
> 2.Input the following commands:
> cd test
> make all
>
> OK!!
>

I'm glad it worked smoothly.
I get quite a few reports (of course without any useful details!) of
problems installing SRILM on Ubuntu, so it's good to confirm that the
installation instructions do in fact work.

Andreas

> Regards,
>
> Yulong Ying
>
>
>
>> From: Sanjay Chatterji
>> Reply-To:
>> To: Andreas Stolcke
>> Subject: Re: SRILM 1.5.8 released
>> Date:Wed, 3 Jun 2009 15:44:59 +0530
>>
>> Dear Sir,
>> I tried to install the SRILM 1.5.8 in fedora 6 with i686 architecture. But
>> it is giving an error and
>> ngram
>> ngram-class
>> ngram-count
>> ngram-merge
>> etc. are not created.
>>
>> gcc version is gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13)
>> tcl is installed
>> Please suggest,
>> Regards,
>> Sanjay
>>
>>
>>

From stolcke at speech.sri.com  Sun Jun  7 10:07:24 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 07 Jun 2009 10:07:24 PDT
Subject: Testing srilm-user
Message-ID: <200906071707.n57H7O804280@ns2>

Please ignore.

--Andreas

From i_am_behrang at yahoo.com  Mon Jun  8 08:57:30 2009
From: i_am_behrang at yahoo.com (i_am_behrang at yahoo.com)
Date: Mon, 8 Jun 2009 08:57:30 -0700 (PDT)
Subject: Building adapted language models
Message-ID: <297533.48294.qm@web110315.mail.gq1.yahoo.com>

Hi,

Is there an option to give weights to certain training instances
(sentences)? For example, I have some sentences that are more relevant to
my translation domain, and I want them to influence the LM 4 times more
than the rest of the data.

I've done this by repeating the more relevant training instances, which
makes the model training quite slow. Is there an alternative way in SRILM?

Thanks
Behrang

From stolcke at speech.sri.com  Wed Jun 10 22:36:53 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 10 Jun 2009 22:36:53 PDT
Subject: Building adapted language models
In-Reply-To: Your message of Mon, 08 Jun 2009 08:57:30 -0700.
	<297533.48294.qm@web110315.mail.gq1.yahoo.com>
Message-ID: <200906110536.n5B5arZ06507@ns2>

In message <297533.48294.qm at web110315.mail.gq1.yahoo.com> you wrote:
>
> Hi,
>
> Is there an option to give weights to certain training instances (sentences)?
> For example if I have some sentences that are more relevant to my translation
> domain and I want them to influence the LM 4 times more than the rest of the
> data.
>
> I've done this by repeating the more relevant training instances, which makes
> the model training quite slow. Is there an alternative way in SRILM?
>

You can weight the counts, pool them, and train a single LM. The internal
methods that perform sentence-level count generation actually have an
argument to scale the counts by a number, but this functionality was not
accessible at the command line. So I added an option

	ngram-count -text-has-weights

that tells ngram-count that the first field in each line is a count
scaling factor (the number has to be an integer, but can be a floating
point number if -float-counts is enabled). This is available in the
1.5.9-beta version that you can download now.

Or you can train separate LMs for different subsets of data (this only
makes sense if the unit of weighting is larger than a sentence, e.g., a
data source or corpus), and then interpolate (mix) the probability
estimates with weights. LM interpolation is described with the "-mix-lm"
option in ngram(1).

Andreas

From elias.majic at gmail.com  Sat Jun 13 11:42:00 2009
From: elias.majic at gmail.com (Elias Majic)
Date: Sat, 13 Jun 2009 14:42:00 -0400
Subject: Google Web N-gram
Message-ID: <5a87a1470906131142x146df1c0qba8d5a9f229c7aeb@mail.gmail.com>

Hello,

First off, to save you from having to read the below, suppose I used
make-google-ngrams to store a small corpus of text's N-gram counts on
disk in Google's format. How do I then convert this to ARPA format with
SRILM?

I have read the Google Web N-gram section in the F.A.Q, read all the
emails matching the search term "google", read all the relevant man
pages, and looked at the relevant run-tests, without success.

My goal is to make an ARPA-format language model from the N-gram counts
in the Google Web N-gram corpus. I realize it's too large to load into
memory, as discussed in the documentation, so as one of the emails on the
list suggested, I pruned out most of the junk or non-dictionary words,
merged different cases, and fixed the config files. I have now reduced
the data quite significantly but am unable to figure out how to convert
it to ARPA format. Below is what I tried:

1. ngram -order 5 -count-lm -lm google.countlm -write-lm arpaLM

This did not work. It just produced a duplicate of google.countlm.

2. I noticed in the man pages that the option -expand-classes forces the
output to be a single ngram model in ARPA format. So I tried:

ngram -order 5 -count-lm -lm google.countlm -expand-classes 5 -write-lm arpaLM

Nothing happened except for this output:

HMM, NgramCountLM, AdaptiveMix, Decipher, tagged, factored, DF, hidden
N-gram, hidden-S, class N-gram, skip N-gram and stop-word N-gram models
are mutually exclusive

3. I thought maybe using -mix-lm would result in an ARPA model, as the man
pages also say this would occur with -mix-lm. I realize this was unlikely
to work as I am combining the same LMs, but I tried regardless.

ngram -order 5 -count-lm -lm google.countlm -expand-classes 5 -mix-lm google.countlm -write-lm arpaLM

Output was the same as google.countlm.

I tried other things like using ngram-count and running the lm-scripts,
but no dice. One of the relevant posts on the list is below:

http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/2007-April/8.html

The URL above mentions:

>> Could you give me an *example* about bulilding google 3-gram LM file
>> ,please?
>>
>Again, this will require using the option with some tricks
>that are not documents
>as yet. Please be patient (or read all the manual pages carefully to
>figure it our yourself.)

Has any documentation been written on this? Did the trick involve using
-mix-lm or -expand-classes to force ARPA format?

I figure worst case I do it manually, but I am sure there is something in
SRILM that I am missing.

Thanks
Elias

From stolcke at speech.sri.com  Mon Jun 15 11:34:03 2009
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 15 Jun 2009 11:34:03 -0700
Subject: Google Web N-gram
In-Reply-To: <5a87a1470906131142x146df1c0qba8d5a9f229c7aeb@mail.gmail.com>
References: <5a87a1470906131142x146df1c0qba8d5a9f229c7aeb@mail.gmail.com>
Message-ID: <4A36941B.30901@speech.sri.com>

Elias Majic wrote:
> Hello,
>
> First off, to save you from having to read the below, suppose I used
> make-google-ngrams to store a small corpus of text's N-gram counts on
> disk in googles format. How do I then convert this to ARPA format
> with SRILM?

You don't. There is no reason to convert a standard ngram count file into
google format for building an ARPA LM. Converting the counts into a
different format won't help you deal with any memory issues.

SRILM currently is just not set up to estimate ARPA LMs of the size
implied by the google corpus. That's why we created the count-LM
approach, which can make use of the google ngram files directly. The
estimation process is described in the FAQ, as you know.

If you want to build very large backoff LMs there are a few other LM
tools out there that are explicitly targeted at large data sets. Try
googling "MSRLM" and "IRSTLM".

I doubt that even if you were able to build a traditional ARPA LM from
all the google ngrams it would do you much good -- it would take way too
long to load into memory, even if only a subset were used.
That's why MSRLM, for example, uses a server-based approach.

Andreas

>
> I have read the Google Web N-gram section in the F.A.Q, I read all the
> emails with the search term google in it and I read all the relevant
> man pages as well as looked at relevant run-tests without success.
>
> My goal is to make an arpa format language model from the N-gram
> counts inside the Google Web N-gram corpus. I realize its too large
> to load into memory as discussed in the documentation, so as per one
> of the emails in the list suggested, I pruned out most of the junk or
> non dictionary words and merged different cases and fixed the config
> files. So now I reduced the data quite significantly and am unable to
> figure out how to convert it to arpa format. Below is what I tried:
>
> 1.ngram -order 5 -count-lm -lm google.countlm -write-lm arpaLM
>
> This did not work. It produced the same duplicate file of google.countlm
>
> 2. I noticed in the man pages that using the command -expand-classes
> forced the output to be a single ngram model in ARPA format. So I tried:
> ngram -order 5 -count-lm -lm google.countlm -expand-classes 5
> -write-lm arpaLM
> Nothing happened but the output:
> HMM, NgramCountLM, AdaptiveMix, Decipher, tagged, factored, DF, hidden
> N-gram, hidden-S, class N-gram, skip N-gram and stop-word N-gram
> models are mutually exclusive
>
> 3.I thought maybe using mix-lm would result in an arpa model as it
> also says in the man pages this would occur with mix-lm. I realize
> this was unlikely to work as I am combining the same lm's but tried
> regardless.
> ngram -order 5 -count-lm -lm google.countlm -expand-classes 5 -mix-lm
> google.countlm -write-lm arpaLM
> Output was the same as google.countlm
>
> I tried other things like using ngram-count and running the lm-scripts
> but no dice. One of the relevant posts in the forum I posted below:
>
> http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/2007-April/8.html
> The URL above mentions:
>
> >> Could you give me an *example* about bulilding google 3-gram LM file
> >> ,please?
> >>
> >Again, this will require using the option with some tricks
> >that are not documents
> >as yet. Please be patient (or read all the manual pages carefully to
> >figure it our yourself.)
>
> Has any documentations been made regarding this? Did the trick infer
> using mix-lm or expand-classes to force arpa format?
>
> I figure worst case I do it manually but am sure there is something in
> SRILM that I am missing.
>
> Thanks
> Elias

From aria.rastrou at gmail.com  Mon Jun 22 22:50:29 2009
From: aria.rastrou at gmail.com (Ariya Rastrow)
Date: Tue, 23 Jun 2009 01:50:29 -0400
Subject: problem with linking?
Message-ID: <4205a1540906222250y6b0aaf4cmaef43c4c5de3fbb6@mail.gmail.com>

Hi there,

I have a question regarding compiling and linking C++ code that uses
SRILM classes. My code compiles, but when I try to link it I get a whole
bunch of errors. I have read something about this linking problem and the
fact that I have to use exactly the same flags that were used when
building and linking SRILM. But even after doing that I still get the
following errors. Any help would be appreciated.

DfsNgram.o: In function `DfsNgram::computeBOWs(int)':
DfsNgram.cpp:(.text+0x15): undefined reference to `Ngram::computeBOWs(unsigned int)'
DfsNgram.o: In function `DfsNgram::~DfsNgram()':
DfsNgram.cpp:(.text+0x9c): undefined reference to `Ngram::~Ngram()'
DfsNgram.o: In function `DfsNgram::DfsNgram(Vocab&, unsigned int)':
DfsNgram.cpp:(.text+0xd5): undefined reference to `Ngram::Ngram(Vocab&, unsigned int)'
DfsNgram.o: In function `DfsNgram::DfsNgram(Vocab&, unsigned int)':
DfsNgram.cpp:(.text+0xf5): undefined reference to `Ngram::Ngram(Vocab&, unsigned int)'
DfsNgram.o: In function `DfsNgram::findBOnode(unsigned int*)':
DfsNgram.cpp:(.text+0x17d): undefined reference to `_Map::foundP'
DfsNgram.cpp:(.text+0x186): undefined reference to `_Map::foundP'
DfsNgram.cpp:(.text+0x195): undefined reference to `_Map::foundP'
DfsNgram.cpp:(.text+0x22e): undefined reference to `_Map::foundP'
DfsNgram.cpp:(.text+0x241): undefined reference to `_Map::foundP'
DfsNgram.o: In function `DfsNgram::~DfsNgram()':
DfsNgram.cpp:(.text+0xb8): undefined reference to `Ngram::~Ngram()'
DfsNgram.o: In function `DfsNgram::~DfsNgram()':
DfsNgram.cpp:(.text+0xc8): undefined reference to `Ngram::~Ngram()'
DfsNgram.o: In function `LM::followIter(unsigned int const*)':
DfsNgram.cpp:(.text._ZN2LM10followIterEPKj[LM::followIter(unsigned int const*)]+0x30): undefined reference to `_LM_FollowIter::_LM_FollowIter(LM&, unsigned int const*)'
DfsNgram.o: In function `Trie::findTrie(unsigned int const*, bool&) const':
DfsNgram.cpp:(.text._ZNK4TrieIj6BOnodeE8findTrieEPKjRb[Trie::findTrie(unsigned int const*, bool&) const]+0xb2): undefined reference to `_Map::foundP'
DfsNgram.cpp:(.text._ZNK4TrieIj6BOnodeE8findTrieEPKjRb[Trie::findTrie(unsigned int const*, bool&) const]+0x108): undefined reference to `_Map::foundP'
DfsNgram.cpp:(.text._ZNK4TrieIj6BOnodeE8findTrieEPKjRb[Trie::findTrie(unsigned int const*, bool&) const]+0x207): undefined reference to `_Map::foundP'
DfsNgram.cpp:(.text._ZNK4TrieIj6BOnodeE8findTrieEPKjRb[Trie::findTrie(unsigned int const*, bool&) const]+0x28c): undefined reference to `_Map::foundP'
DfsNgram.cpp:(.text._ZNK4TrieIj6BOnodeE8findTrieEPKjRb[Trie::findTrie(unsigned int const*, bool&) const]+0x39d): undefined reference to `_Map::foundP'
DfsNgram.o:DfsNgram.cpp:(.text._ZNK4TrieIj6BOnodeE8findTrieEPKjRb[Trie::findTrie(unsigned int const*, bool&) const]+0x4b1): more undefined references to `_Map::foundP' follow
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x38): undefined reference to `Ngram::wordProb(unsigned int, unsigned int const*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x40): undefined reference to `LM::wordProb(char const*, char const* const*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x48): undefined reference to `LM::wordProbRecompute(unsigned int, unsigned int const*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x50): undefined reference to `LM::sentenceProb(unsigned int const*, TextStats&)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x58): undefined reference to `LM::sentenceProb(char const* const*, TextStats&)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x60): undefined reference to `LM::contextProb(unsigned int const*, unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x68): undefined reference to `LM::countsProb(NgramStats&, TextStats&, unsigned int, bool)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x70): undefined reference to `LM::pplCountsFile(File&, unsigned int, TextStats&, char const*, bool)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x78): undefined reference to `LM::pplFile(File&, TextStats&, char const*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x80): undefined reference to `LM::rescoreFile(File&, double, double, LM&, double, double, char const*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x88): undefined reference to `LM::probServer(unsigned int, unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x90): undefined reference to `LM::setState(char const*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x98): undefined reference to `LM::wordProbSum(unsigned int const*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xa0): undefined reference to `LM::generateWord(unsigned int const*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xa8): undefined reference to `LM::generateSentence(unsigned int, unsigned int*, unsigned int*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xb0): undefined reference to `LM::generateSentence(unsigned int, char const**, char const**)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xc8): undefined reference to `Ngram::contextID(unsigned int, unsigned int const*, unsigned int&)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xd0): undefined reference to `Ngram::contextBOW(unsigned int const*, unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xd8): undefined reference to `LM::addUnkWords()'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xe0): undefined reference to `LM::isNonWord(unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xe8): undefined reference to `Ngram::read(File&, bool)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xf0): undefined reference to `Ngram::write(File&)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0xf8): undefined reference to `LM::writeBinary(File&)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x118): undefined reference to `Ngram::memStats(MemStats&)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x120): undefined reference to `LM::removeNoise(unsigned int*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x128): undefined reference to `Ngram::writeWithOrder(File&, unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x140): undefined reference to `Ngram::estimate(NgramStats&, unsigned long*, unsigned long*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x148): undefined reference to `Ngram::estimate(NgramStats&, Discount**)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x150): undefined reference to `Ngram::estimate(NgramCounts&, Discount**)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x158): undefined reference to `Ngram::mixProbs(Ngram&, double)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x160): undefined reference to `Ngram::mixProbs(Ngram&, Ngram&, double)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x168): undefined reference to `Ngram::recomputeBOWs()'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x170): undefined reference to `Ngram::pruneProbs(double, unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x178): undefined reference to `Ngram::pruneLowProbs(unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x180): undefined reference to `Ngram::rescoreProbs(LM&)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x188): undefined reference to `Ngram::numNgrams(unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x190): undefined reference to `Ngram::wordProbBO(unsigned int, unsigned int const*, unsigned int)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x198): undefined reference to `Ngram::vocabSize()'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x1a0): undefined reference to `Ngram::fixupProbs()'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x1a8): undefined reference to `Ngram::distributeProb(double, unsigned int*)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x1b0): undefined reference to `Ngram::computeBOW(BOnode*, unsigned int const*, unsigned int, double&, double&)'
DfsNgram.o:(.rodata._ZTV8DfsNgram[vtable for DfsNgram]+0x1b8): undefined reference to `Ngram::computeBOWs(unsigned int)'
DfsNgram.o:(.rodata._ZTI8DfsNgram[typeinfo for DfsNgram]+0x10): undefined reference to `typeinfo for Ngram' DiscLMUpdate.o: In function `DiscLMUpdate::_REchange(unsigned int&, unsigned int*, float&, DfsNgram*, DfsNgram*)': DiscLMUpdate.cpp:(.text+0x9f): undefined reference to `LogP_Zero' DiscLMUpdate.o: In function `DiscLMUpdate::_ApplyUpdate(unsigned int, bool)': DiscLMUpdate.cpp:(.text+0x433): undefined reference to `_Map::foundP' DiscLMUpdate.cpp:(.text+0x47a): undefined reference to `Vocab::copy(unsigned int*, unsigned int const*)' DiscLMUpdate.cpp:(.text+0x482): undefined reference to `Vocab::reverse(unsigned int*)' DiscLMUpdate.o: In function `DiscLMUpdate::DiscLMUpdate(std::basic_string, std::allocator >, int)': DiscLMUpdate.cpp:(.text+0x608): undefined reference to `Vocab::Vocab(unsigned int, unsigned int)' DiscLMUpdate.cpp:(.text+0x642): undefined reference to `File::File(char const*, char const*, int)' DiscLMUpdate.cpp:(.text+0x695): undefined reference to `File::close()' DiscLMUpdate.cpp:(.text+0x69d): undefined reference to `File::~File()' DiscLMUpdate.cpp:(.text+0x6d9): undefined reference to `File::~File()' DiscLMUpdate.o: In function `DiscLMUpdate::DiscLMUpdate(std::basic_string, std::allocator >, int)': DiscLMUpdate.cpp:(.text+0x748): undefined reference to `Vocab::Vocab(unsigned int, unsigned int)' DiscLMUpdate.cpp:(.text+0x782): undefined reference to `File::File(char const*, char const*, int)' DiscLMUpdate.cpp:(.text+0x7d5): undefined reference to `File::close()' DiscLMUpdate.cpp:(.text+0x7dd): undefined reference to `File::~File()' DiscLMUpdate.cpp:(.text+0x819): undefined reference to `File::~File()' DiscLMUpdate.o: In function `DiscLMUpdate::ReadUpdates(std::tr1::unordered_map, std::allocator >, double, std::tr1::hash, std::allocator > >, std::equal_to, std::allocator > >, std::allocator, std::allocator > const, double> >, false>*, bool, bool)': DiscLMUpdate.cpp:(.text+0xcea): undefined reference to `Vocab::parseWords(char*, char const**, 
unsigned int)' DiscLMUpdate.cpp:(.text+0xd34): undefined reference to `Vocab::copy(unsigned int*, unsigned int const*)' DiscLMUpdate.cpp:(.text+0xd3c): undefined reference to `Vocab::reverse(unsigned int*)' DiscLMUpdate.cpp:(.text+0xd6d): undefined reference to `Ngram::findProb(unsigned int, unsigned int const*)' DiscLMUpdate.cpp:(.text+0xd7a): undefined reference to `Vocab::reverse(unsigned int*)' DiscLMUpdate.cpp:(.text+0xd9b): undefined reference to `Vocab::copy(unsigned int*, unsigned int const*)' DiscLMUpdate.cpp:(.text+0xe15): undefined reference to `_Map::foundP' DiscLMUpdate.cpp:(.text+0xe1c): undefined reference to `_Map::foundP' DiscLMUpdate.cpp:(.text+0xf79): undefined reference to `_Map::foundP' DiscLMUpdate.cpp:(.text+0x10de): undefined reference to `_Map::foundP' DiscLMUpdate.cpp:(.text+0x10fd): undefined reference to `_Map::foundP' DiscLMUpdate.o:DiscLMUpdate.cpp:(.text._ZNK4TrieIjfE8findTrieEPKjRb[Trie::findTrie(unsigned int const*, bool&) const]+0xaf): more undefined references to `_Map::foundP' follow DiscTrain.o: In function `__static_initialization_and_destruction_0(int, int)': DiscTrain.cpp:(.text+0xb2): undefined reference to `Vocab::Vocab(unsigned int, unsigned int)' DiscTrain.cpp:(.text+0xda): undefined reference to `Ngram::Ngram(Vocab&, unsigned int)' collect2: ld returned 1 exit status make: *** [all] Error 1 --- Ariya Rastrow PhD Candidate, Center for Language and Speech Processing(CLSP) Johns Hopkins University -------------- next part -------------- An HTML attachment was scrubbed... URL: