From ghaffar1 at cs.sfu.ca Tue Apr 4 15:10:49 2006
From: ghaffar1 at cs.sfu.ca (GholamReza Haffari)
Date: Tue, 04 Apr 2006 15:10:49 -0700
Subject: pls help (urgent)
Message-ID: <200604042210.k34MAnO2008779@rm-rstar.sfu.ca>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lm.cc
Type: application/octet-stream
Size: 1095 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: makefile
Type: application/octet-stream
Size: 193 bytes
Desc: not available
URL:

From patryale at iro.umontreal.ca Tue Apr 4 20:15:21 2006
From: patryale at iro.umontreal.ca (Alexandre Patry)
Date: Tue, 04 Apr 2006 23:15:21 -0400
Subject: pls help (urgent)
In-Reply-To: <200604042210.k34MAnO2008779@rm-rstar.sfu.ca>
References: <200604042210.k34MAnO2008779@rm-rstar.sfu.ca>
Message-ID: <1144206921.10174.7.camel@localhost.localdomain>

Hi,

you have specified the path to the libraries, but you did not specify that
the compiler should link against them. In your makefile, if you change the
line:

libs=-L$(libdir)

to:

libs=-L$(libdir) -loolm -lmisc -ldstruct

it should work.

Good luck,

Alexandre

On Tuesday, 4 April 2006 at 15:10 -0700, GholamReza Haffari wrote:
> Hi there,
>
> Currently I am trying to get the srilm working but I have a problem: when I
> use "make" to compile and build the attached sample file, it gives me some
> error messages. It seems to me that the libs of the sri toolkit may not be
> installed properly on my machine. Currently the lib directory contains the
> following files and directory:
>
> i386-solaris_m
> i686\libdstruct.a
> i686\liblattice.a
> i686\libmisc.a
> i686\liboolm.a
>
> Is everything fine? Would you please help me to find out where the problem
> is?
> thanks,
> -Reza
>
> PS.
> For reference, I have copied the error messages here:
>
> /tmp/cc7U0zaO.o(.text+0xc3): In function `main':
> : undefined reference to `File::File(char const*, char const*, int)'
> /tmp/cc7U0zaO.o(.text+0xdf): In function `main':
> : undefined reference to `File::File(char const*, char const*, int)'
> /tmp/cc7U0zaO.o(.text+0xf5): In function `main':
> : undefined reference to `File::getline()'
> /tmp/cc7U0zaO.o(.text+0x11d): In function `main':
> : undefined reference to `Vocab::parseWords(char*, char const**, unsigned

From ghaffar1 at cs.sfu.ca Wed Apr 5 10:58:07 2006
From: ghaffar1 at cs.sfu.ca (GholamReza Haffari)
Date: Wed, 05 Apr 2006 10:58:07 -0700
Subject: sample program
Message-ID: <200604051758.k35Hw71E028663@rm-rstar.sfu.ca>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From Antoine.Ghaoui at jinny.ie Mon Apr 24 03:04:16 2006
From: Antoine.Ghaoui at jinny.ie (Antoine Ghaoui)
Date: Mon, 24 Apr 2006 13:04:16 +0300
Subject: Info on FLM format
Message-ID: <063d01c66786$72f0a550$16c864c1@Italy1>

Hello,

can you please tell me where I can find the formats of the files for the
FLM and how to use SRILM to implement FLM?

Thanks

Antoine

From amittai at mit.edu Tue Apr 25 16:57:58 2006
From: amittai at mit.edu (amittai e axelrod)
Date: Wed, 26 Apr 2006 00:57:58 +0100
Subject: Info on FLM format
In-Reply-To: <063d01c66786$72f0a550$16c864c1@Italy1>
References: <063d01c66786$72f0a550$16c864c1@Italy1>
Message-ID: <5734eadd0604251657g16b60a58pee9b391637ffd0d6@mail.gmail.com>

On 4/24/06, Antoine Ghaoui wrote:
> can you please tell me where I can find the formats of the files for the FLM
> and how to use SRILM to implement FLM?

Hi--

A good place to start is Chapter 5 of the report from the 2002 Johns
Hopkins summer workshop where the FLM tools were implemented.
The report is here:
www.clsp.jhu.edu/ws2002/groups/arabic/arabic-final.pdf

All versions of SRILM after v1.4 include these FLM tools; just look in the
flm/ directory. You may want to start with the "fngram" and "fngram-count"
tools.

~amittai

From ioparin at yahoo.co.uk Thu May 11 08:12:33 2006
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Thu, 11 May 2006 16:12:33 +0100 (BST)
Subject: GT coeffs in -make-big-lm
Message-ID: <20060511151233.11799.qmail@web86908.mail.ukl.yahoo.com>

Hi!

When I trained a very large model (corpus size approx. 600 million tokens),
I noticed a feature that looks a bit odd. Since the LM is going to be huge,
I'm using the make-big-lm script to compute 4 partial LMs in a distributed
way and then merge them into the resulting one.

After I started the 4 make-big-lm tasks, the GT coefficients for the first
one were written to the home directory (and it takes some time to realize
that something may be wrong, since this output is not mentioned in the
manual), and the other running tasks just used those, presuming the GT
pre-computation had been done in advance. It should not seriously damage a
large model, but it's good to be as precise as possible. So I had to delete
the GT files manually after each consecutive (no longer simultaneous)
make-big-lm run, presuming that the n-gram merge would correctly
renormalize the probabilities. Is that correct, or should I rather compute
the GT coefficients from the whole .ngram file, save them in the home
directory and use them for each partial make-big-lm run?

best regards,
Ilya

From stolcke at speech.sri.com Thu May 11 19:55:52 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 11 May 2006 19:55:52 PDT
Subject: GT coeffs in -make-big-lm
In-Reply-To: Your message of Thu, 11 May 2006 16:12:33 +0100.
        <20060511151233.11799.qmail@web86908.mail.ukl.yahoo.com>
Message-ID: <200605120255.k4C2tqW12176@huge>

> Hi!
>
> When I trained a very large model (corpus size approx. 600 million tokens),
> I noticed a feature that looks a bit odd. Since the LM is going to be huge,
> I'm using the make-big-lm script to compute 4 partial LMs in a distributed
> way and then merge them into the resulting one.
> After I started the 4 make-big-lm tasks, the GT coefficients for the first
> one were written to the home directory (and it takes some time to realize
> that something may be wrong, since this output is not mentioned in the
> manual), and the other running tasks just used those, presuming the GT
> pre-computation had been done in advance. It should not seriously damage a
> large model, but it's good to be as precise as possible. So I had to delete
> the GT files manually after each consecutive (no longer simultaneous)
> make-big-lm run, presuming that the n-gram merge would correctly
> renormalize the probabilities. Is that correct, or should I rather compute
> the GT coefficients from the whole .ngram file, save them in the home
> directory and use them for each partial make-big-lm run?

It is true that make-big-lm saves the statistics needed for count smoothing
in files, so that if you rerun the script they are not recomputed (since
this step is potentially expensive). I'm sorry this is not documented well.

However, the filenames are keyed to the value of the "-name" option, so if
you want to do several runs in the same directory, just specify a separate
-name parameter in each case.

--Andreas

From ioparin at yahoo.co.uk Sun May 21 13:34:22 2006
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Sun, 21 May 2006 21:34:22 +0100 (BST)
Subject: [SRILM]: FLM
Message-ID: <20060521203422.79256.qmail@web86914.mail.ukl.yahoo.com>

Hello,

I've recently been playing with factored language models for the Czech
language. The FLM module works perfectly with small subcorpora.
However, when I try to train the model even on my heldout data (60 million
tokens), it takes a huge amount of time to train the model (it has been
running for two days now). Memory problems can be expected as well. So
there is almost no point in trying to train the LM on my training data
(550 million tokens). The question is: does anybody have experience in
training FLMs on huge corpora: parallelizing tasks and so on? There is no
direct way as with normal models (the ngram-merge and make-big-lm
features), but are there some indirect ones?

thanks in advance,
ilya

From bertoldi at itc.it Mon May 22 06:44:24 2006
From: bertoldi at itc.it (Nicola Bertoldi)
Date: Mon, 22 May 2006 15:44:24 +0200
Subject: [SRILM]: lattice-tool: problems while reading word-mesh
Message-ID: <4471C038.8000504@itc.it>

Hello,
I have recently started to use lattice-tool
and ran into my first problems when reading word-meshes.

In particular, if I run these 2 commands
(i.e. first create a word-mesh and then read it)

lattice-tool -read-htk -in-lattice input.slf -write-mesh output.cn
lattice-tool -read-mesh -in-lattice output.cn

I get this error message:

lattice-tool: /hardmnt/voxgate/ssi/HermesTools/srilm/include/LHash.cc:251: Boolean LHash::locate(KeyT, unsigned int&) const [with KeyT = NodeIndex, DataT = LatticeNode]: Assertion `!Map_noKeyP(key)' failed.
Abort

Who can help me?
best regards and thanks in advance,
Nicola

From stolcke at speech.sri.com Mon May 22 11:02:01 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 22 May 2006 11:02:01 -0700
Subject: [SRILM]: lattice-tool: problems while reading word-mesh
In-Reply-To: <4471C038.8000504@itc.it>
References: <4471C038.8000504@itc.it>
Message-ID: <4471FC99.5040805@speech.sri.com>

Nicola Bertoldi wrote:
> Hello,
> I have recently started to use lattice-tool
> and ran into my first problems when reading word-meshes.
>
> In particular, if I run these 2 commands
> (i.e. first create a word-mesh and then read it)
> lattice-tool -read-htk -in-lattice input.slf -write-mesh output.cn
> lattice-tool -read-mesh -in-lattice output.cn
>
> I get this error message:
> lattice-tool:
> /hardmnt/voxgate/ssi/HermesTools/srilm/include/LHash.cc:251: Boolean
> LHash::locate(KeyT, unsigned int&) const [with KeyT =
> NodeIndex, DataT = LatticeNode]: Assertion `!Map_noKeyP(key)' failed.
> Abort
>
>
> Who can help me?
>

This looks like a known bug in SRILM 1.4.6. Please try getting the 1.5.0
beta version; that should fix it.

--Andreas

From bertoldi at itc.it Tue May 30 08:24:25 2006
From: bertoldi at itc.it (Nicola Bertoldi)
Date: Tue, 30 May 2006 17:24:25 +0200
Subject: Lattice-Tool: problems with pruning
Message-ID: <447C63A9.4010602@itc.it>

While pruning a lattice with respect to posterior probabilities with this
command:

lattice-tool -in-lattice lattice -read-htk -out-lattice - -write-htk -posterior-prune 1.0e-1

I got this error:

Lattice::computeForwardBackward: warning: called with unreachable nodes

If I decrease the pruning threshold, this error disappears.

Who can help me?

best regards
Nicola

From ioparin at yahoo.co.uk Wed May 31 03:41:51 2006
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Wed, 31 May 2006 11:41:51 +0100 (BST)
Subject: [SRILM]: -debug 2 info
Message-ID: <20060531104152.63656.qmail@web86903.mail.ukl.yahoo.com>

Hi!
When I calculate the perplexity of my POS-based class model (a word can
belong to many classes; I create the class-definition file myself from
POS-tagged data), with "-debug 2" I get output I cannot fully understand.
For testing purposes I measure ppl on the same data I trained the class
model on (i.e. there should not be any OOVs). However, in the debug output,
for every N-gram there is a string of the format

P(w| w...) = [OOV][n-gram][n-gram]...[OOV][n-gram][n-gram]...

As far as I understand, the [n-gram]s refer to different combinations of
assigning words to classes. But why do those [OOV]s appear (and they appear
at equal intervals between strings of [n-gram]s for each word)? I have only
one guess: since [OOVs] are only missing for the last (| ...) n-gram, those
[OOV] may correspond to a check whether a word is present in the implicit
stop-word vocabulary or something...

It would be great if anybody could comment on that.

best regards,
Ilya

From ioparin at yahoo.co.uk Sun Jun 11 06:05:10 2006
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Sun, 11 Jun 2006 14:05:10 +0100 (BST)
Subject: GT coefficients
Message-ID: <20060611130510.81544.qmail@web25401.mail.ukl.yahoo.com>

Hello!

If I compute the GT coefficients in advance and then feed the GT files
(generated by make-gt-discounts) to ngram-count or make-big-lm, I get
warnings of the kind

file.gt1: line 9: warning: discount coefficient 1 = 0.0
file.gt1: line 9: warning: discount coefficient 2 = 0.0
...

and so on for all the gt parameters. The files themselves are fine and do
not contain any zeroes. The number after "line" corresponds to the last
line of the GT file.
The model I get this way differs from the one I get when I just use
ngram-count without loading GT coefficients, with the same gtmin and gtmax
values (it has far fewer bigrams and trigrams).

Could anybody tell me why this happens?

best regards,
Ilya

From stolcke at speech.sri.com Thu Jun 29 18:27:32 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 29 Jun 2006 18:27:32 PDT
Subject: SRILM bug-fix
Message-ID: <200606300127.k5U1RWg0005724@choro.speech.sri.com>

Recent versions of SRILM have a bug in the option handling of ngram,
hidden-ngram, and lattice-tool concerning interpolated LMs with more than
6 components. The bug is triggered by the use of -mix-lm[789] in
conjunction with the -bayes option.

This will be fixed in the next release, but that might take a while, so I'm
including a patch below.

This bug was found by Richard Zens of RWTH Aachen.

--Andreas

*** /tmp/T005lmct Thu Jun 29 06:02:25 2006
--- lm/src/ngram.cc Thu Jun 29 05:59:48 2006
***************
*** 738,744 ****
                                  mixLambda6);
      }
      if (mixFile7) {
!         useLM = makeMixLM(mixFile6, *vocab, classVocab, order, useLM,
                                  mixLambda7,
                                  mixLambda + mixLambda1 + mixLambda2 +
                                  mixLambda3 + mixLambda4 + mixLambda5 +
--- 738,744 ----
                                  mixLambda6);
      }
      if (mixFile7) {
!         useLM = makeMixLM(mixFile7, *vocab, classVocab, order, useLM,
                                  mixLambda7,
                                  mixLambda + mixLambda1 + mixLambda2 +
                                  mixLambda3 + mixLambda4 + mixLambda5 +
***************
*** 745,751 ****
                                  mixLambda6 + mixLambda7);
      }
      if (mixFile8) {
!         useLM = makeMixLM(mixFile6, *vocab, classVocab, order, useLM,
                                  mixLambda8,
                                  mixLambda + mixLambda1 + mixLambda2 +
                                  mixLambda3 + mixLambda4 + mixLambda5 +
--- 745,751 ----
                                  mixLambda6 + mixLambda7);
      }
      if (mixFile8) {
!         useLM = makeMixLM(mixFile8, *vocab, classVocab, order, useLM,
                                  mixLambda8,
                                  mixLambda + mixLambda1 + mixLambda2 +
                                  mixLambda3 + mixLambda4 + mixLambda5 +
***************
*** 752,758 ****
                                  mixLambda6 + mixLambda7 + mixLambda8);
      }
      if (mixFile9) {
!         useLM = makeMixLM(mixFile6, *vocab, classVocab, order, useLM,
                                  mixLambda9, 1.0);
      }
  }
--- 752,758 ----
                                  mixLambda6 + mixLambda7 + mixLambda8);
      }
      if (mixFile9) {
!         useLM = makeMixLM(mixFile9, *vocab, classVocab, order, useLM,
                                  mixLambda9, 1.0);
      }
  }
*** /tmp/T005lmct Thu Jun 29 06:02:25 2006
--- lm/src/hidden-ngram.cc Thu Jun 29 06:01:12 2006
***************
*** 1178,1184 ****
                                  mixLambda6);
      }
      if (mixFile7) {
!         useLM = makeMixLM(mixFile6, *vocab, classVocab, order, useLM,
                                  mixLambda7,
                                  mixLambda + mixLambda1 + mixLambda2 +
                                  mixLambda3 + mixLambda4 + mixLambda5 +
--- 1178,1184 ----
                                  mixLambda6);
      }
      if (mixFile7) {
!         useLM = makeMixLM(mixFile7, *vocab, classVocab, order, useLM,
                                  mixLambda7,
                                  mixLambda + mixLambda1 + mixLambda2 +
                                  mixLambda3 + mixLambda4 + mixLambda5 +
***************
*** 1185,1191 ****
                                  mixLambda6 + mixLambda7);
      }
      if (mixFile8) {
!         useLM = makeMixLM(mixFile6, *vocab, classVocab, order, useLM,
                                  mixLambda8,
                                  mixLambda + mixLambda1 + mixLambda2 +
                                  mixLambda3 + mixLambda4 + mixLambda5 +
--- 1185,1191 ----
                                  mixLambda6 + mixLambda7);
      }
      if (mixFile8) {
!         useLM = makeMixLM(mixFile8, *vocab, classVocab, order, useLM,
                                  mixLambda8,
                                  mixLambda + mixLambda1 + mixLambda2 +
                                  mixLambda3 + mixLambda4 + mixLambda5 +
***************
*** 1192,1198 ****
                                  mixLambda6 + mixLambda7 + mixLambda8);
      }
      if (mixFile9) {
!         useLM = makeMixLM(mixFile6, *vocab, classVocab, order, useLM,
                                  mixLambda9, 1.0);
      }
  }
--- 1192,1198 ----
                                  mixLambda6 + mixLambda7 + mixLambda8);
      }
      if (mixFile9) {
!         useLM = makeMixLM(mixFile9, *vocab, classVocab, order, useLM,
                                  mixLambda9, 1.0);
      }
  }
*** /tmp/T005lmct Thu Jun 29 06:02:25 2006
--- lattice/src/lattice-tool.cc Thu Jun 29 06:01:53 2006
***************
*** 1128,1134 ****
                                  mixLambda6);
      }
      if (mixFile7) {
!         useLM = makeMixLM(mixFile6, *vocab, classVocab, order, useLM,
                                  mixLambda7,
                                  mixLambda + mixLambda1 + mixLambda2 +
                                  mixLambda3 + mixLambda4 + mixLambda5 +
--- 1128,1134 ----
                                  mixLambda6);
      }
      if (mixFile7) {
!         useLM = makeMixLM(mixFile7, *vocab, classVocab, order, useLM,
                                  mixLambda7,
                                  mixLambda + mixLambda1 + mixLambda2 +
                                  mixLambda3 + mixLambda4 + mixLambda5 +
***************
*** 1135,1141 ****
                                  mixLambda6 + mixLambda7);
      }
      if (mixFile8) {
!         useLM = makeMixLM(mixFile6, *vocab, classVocab, order, useLM,
                                  mixLambda8,
                                  mixLambda + mixLambda1 + mixLambda2 +
                                  mixLambda3 + mixLambda4 + mixLambda5 +
--- 1135,1141 ----
                                  mixLambda6 + mixLambda7);
      }
      if (mixFile8) {
!         useLM = makeMixLM(mixFile8, *vocab, classVocab, order, useLM,
                                  mixLambda8,
                                  mixLambda + mixLambda1 + mixLambda2 +
                                  mixLambda3 + mixLambda4 + mixLambda5 +
***************
*** 1142,1148 ****
                                  mixLambda6 + mixLambda7 + mixLambda8);
      }
      if (mixFile9) {
!         useLM = makeMixLM(mixFile6, *vocab, classVocab, order, useLM,
                                  mixLambda9, 1.0);
      }
  }
--- 1142,1148 ----
                                  mixLambda6 + mixLambda7 + mixLambda8);
      }
      if (mixFile9) {
!         useLM = makeMixLM(mixFile9, *vocab, classVocab, order, useLM,
                                  mixLambda9, 1.0);
      }
  }
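
Editor's sketch (illustration only, not SRILM code): the patch above
corrects copy-paste slips where mixFile6 was passed in the blocks for
mixFile7, mixFile8 and mixFile9. One way to make that kind of slip harder
is to drive the repeated blocks from a table, so that each file is always
paired with its own weight. MixComponent, MixCall and planMixCalls are
hypothetical names introduced here. The sketch only records the arguments
that a makeMixLM-style call would receive, and it passes the running sum of
weights for every component, whereas the patched code passes 1.0 for the
last one.

```cpp
#include <string>
#include <vector>

// Hypothetical: one mixture component, i.e. an LM file and its weight.
// file is null when the corresponding option was not given.
struct MixComponent {
    const char *file;
    double lambda;
};

// Hypothetical: the arguments one makeMixLM-style call would receive.
struct MixCall {
    std::string file;
    double lambda;      // weight of this component
    double lambdaSum;   // running sum of weights up to and including it
};

// Walk the table once. The file and its weight come from the same entry,
// so they cannot get out of step the way separately numbered variables can.
std::vector<MixCall> planMixCalls(double mainLambda,
                                  const std::vector<MixComponent> &components)
{
    std::vector<MixCall> calls;
    double sum = mainLambda;
    for (const MixComponent &c : components) {
        sum += c.lambda;                 // unused slots contribute their weight as given
        if (c.file != nullptr) {
            calls.push_back(MixCall{std::string(c.file), c.lambda, sum});
        }
    }
    return calls;
}
```

With three table entries of which the middle one has no file, the function
yields two planned calls, each carrying the file name and weight from the
same entry.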