From souto at weenie.inesc.pt Tue May 22 05:26:20 2001
From: souto at weenie.inesc.pt (Nuno Souto)
Date: Tue, 22 May 2001 13:26:20 +0100
Subject: Memory problems
Message-ID: <3B0A5AEC.8CD4D20B@weenie.inesc.pt>

Hi!

I'm trying to use the SRILM toolkit to create a language model. As I'm using
large amounts of text and I'm creating 4-gram language models, the
counts files (created using the make-batch-counts/merge-batch-counts
scripts) get too big: 1.3 GB gzipped. When I try to create the language
model using ngram-count or make-big-lm, the programs abort because there
isn't enough memory (500 MB). How can I solve this problem?

Regards

Souto

From stolcke at speech.sri.com Tue May 22 11:19:21 2001
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 22 May 2001 11:19:21 PDT
Subject: Memory problems
In-Reply-To: Your message of Tue, 22 May 2001 13:26:20 +0100. <3B0A5AEC.8CD4D20B@weenie.inesc.pt>
Message-ID: <200105221819.LAA01941@huge>

Nuno,

you will have to raise the count thresholds on your bigrams, trigrams
and fourgrams to the point where you can fit things into memory, or
at least where you can tolerate the paging. (The LM estimation traverses
the count and LM data structures in a fairly localized fashion, so some
amount of paging is certainly tolerable.)

Use make-big-lm and play with the -gt2min, -gt3min and -gt4min parameters
until the memory requirements become manageable. Try eliminating the
high-order ngrams first; they have a smaller effect on LM performance.
I have successfully built 5-gram models in 512 MB of memory from about
1.3 GB of gzipped counts using

	-gt2min 1 -gt3min 2 -gt4min 4 -gt5min 4

As an independent measure, you could recompile the LM library and tools
with -DUSE_SARRAY_TRIE -DUSE_SARRAY, which switches to a slower, but less
memory-wasting version of the data structures (the default setting is to
optimize for speed).
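
To illustrate what raising the per-order count thresholds does to the count table, here is a minimal C++ sketch. It is not SRILM code: `ngramOrder` and `applyMinCounts` are hypothetical names, and the sketch assumes that an N-gram whose count falls below the minimum for its order is simply left out of the model, which is how the -gtNmin options are understood here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of SRILM): the order of an n-gram is the
// number of whitespace-separated words it contains.
std::size_t ngramOrder(const std::string &ngram) {
    std::istringstream in(ngram);
    std::string word;
    std::size_t n = 0;
    while (in >> word) ++n;
    return n;
}

// Keep only the n-grams whose count reaches the minimum for their order.
// minCounts[n - 1] plays the role assumed for -gtNmin; orders beyond the
// end of the vector fall back to a minimum of 1 (keep everything seen).
std::map<std::string, unsigned long>
applyMinCounts(const std::map<std::string, unsigned long> &counts,
               const std::vector<unsigned long> &minCounts) {
    std::map<std::string, unsigned long> kept;
    for (const auto &entry : counts) {
        std::size_t order = ngramOrder(entry.first);
        if (order == 0) continue;  // skip empty keys
        unsigned long minCount =
            order <= minCounts.size() ? minCounts[order - 1] : 1;
        if (entry.second >= minCount) kept.insert(entry);
    }
    return kept;
}
```

With minimums of 1, 1, 2 and 4 for orders one to four, a trigram seen once and a fourgram seen three times are both dropped, while every unigram and bigram survives. Since singleton high-order n-grams usually make up most of a large count file, this is where most of the memory saving comes from.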
Once you have managed to build the LM you probably also want to apply
entropic pruning to it (ngram -prune) to further reduce memory use and
loading time without sacrificing much performance.

The better approach would be to integrate the pruning with the estimation
so that irrelevant counts are excluded up front, but that will have to
wait on my (or someone's) to-do list.

Hope this helps.

--Andreas

From sarahs at cs.washington.edu Wed May 23 13:44:09 2001
From: sarahs at cs.washington.edu (Sarah E. Schwarm)
Date: Wed, 23 May 2001 13:44:09 -0700 (PDT)
Subject: question about warning message
Message-ID:

hi all,

I am running SRILM 1.0.1 on two different platforms (linux and
solaris) and got different results using the same data with exactly the
same commands. I'm hoping that someone else might have some insight...

I'm not doing anything fancy - in this case, I just used ngram-count to
build a trigram lm using the default settings for GT discounting, etc.
Still, I get noticeably different results ( ppl= 18.0975 ppl1= 40.7525 in
linux and ppl= 17.2411 ppl1= 38.3 in solaris )

The solaris version gives the following warning, but the linux version
does not:

	warning: discount coeff 1 is out of range: 0.900585

I turned on the -debug 3 flag to get more information, and the output of
the two versions is nearly identical.
The differences are the warning above; also, one version discards
1 1-gram prob predicting a pseudo-event while the other discards 2,
and in the end they have very different left-over probability masses
( 0.00388768 vs. 4.55956e-06, where the second number corresponds with
the warning I quoted above ), although they distribute these over the
same number of unseen events and write the same number of n-grams.
The GT-count numbers are also all the same in both versions.

I found the warning message in the code (in lm/src/Discount.cc) but I
don't really understand what's causing it, and I certainly don't
understand why I get it on one installation and not the other. If anyone
has any insight to offer, I'd greatly appreciate it.

thanks much,
Sarah

________________________
Sarah Schwarm
sarahs at cs.washington.edu

From stolcke at speech.sri.com Wed May 23 14:47:45 2001
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 23 May 2001 14:47:45 PDT
Subject: question about warning message
In-Reply-To: Your message of Wed, 23 May 2001 13:44:09 -0700.
Message-ID: <200105232147.OAA11771@huge>

Sarah,

there are differences in floating point arithmetic between Sparc and
Intel CPUs, but it looks like something else is going on.
Please send me the input data (just the counts would be enough) and
how exactly you invoke ngram-count and I'll try to figure it out.
It might be a good old-fashioned bug ...

--Andreas

From stolcke at speech.sri.com Wed May 23 20:28:29 2001
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 23 May 2001 20:28:29 PDT
Subject: question about warning message
In-Reply-To: Your message of Wed, 23 May 2001 13:44:09 -0700.
Message-ID: <200105240328.UAA29344@huge>

Sarah,

this discrepancy was indeed caused by the different floating point
precision on x86 machines. To check for an anomaly of the
counts-of-counts in Good-Turing discounting the code was checking whether
two numbers were the same. This test turned out true on the Sparc machine,
but false on Intel-based CPUs (they were ever-so-slightly off due to the
extra bits in x86 floating point registers).
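
The kind of check at issue can be sketched in C++ as follows. This is an illustration under stated assumptions, not the SRILM source: `discountCoeffs` and `kEpsilon` are hypothetical names, the normalizing term is assumed to be the usual Katz-style (k+1) * n[k+1] / n[1], and the tolerance value is a guess standing in for the toolkit's own constant. The point is that the anomaly test is applied to the final coefficient with a tolerance, so two quantities that are mathematically equal but differ in the last few bits are treated the same way on every CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed tolerance; the real constant used by the toolkit is not given
// in this thread.
const double kEpsilon = 3e-6;

// Discount coefficients for counts 1..maxCount, computed from the
// counts-of-counts n[r] (number of distinct n-grams seen exactly r times;
// n[0] is unused). Preconditions for this sketch: n.size() >= maxCount + 2,
// n[r] > 0 for 1 <= r <= maxCount, and the normalizing term is not 1.
std::vector<double> discountCoeffs(const std::vector<double> &n,
                                   std::size_t maxCount) {
    std::vector<double> coeffs(maxCount + 1, 1.0);
    // Assumption: Katz-style normalizing term.
    double commonTerm = (maxCount + 1) * n[maxCount + 1] / n[1];
    for (std::size_t i = 1; i <= maxCount; ++i) {
        double coeff0 = (i + 1) * n[i + 1] / (i * n[i]);
        double coeff = (coeff0 - commonTerm) / (1.0 - commonTerm);
        // Compare the result against a tolerance rather than testing
        // coeff0 <= commonTerm directly: when the two are equal up to
        // rounding, an exact comparison can go either way.
        if (coeff <= kEpsilon || coeff0 > 1.0) {
            coeff = 1.0;  // anomalous counts-of-counts: do not discount
        }
        coeffs[i] = coeff;
    }
    return coeffs;
}
```

Note that with maxCount set to 1, the expression for coeff0 at i = 1 and the normalizing term are the same quantity computed along two paths, which is the situation where an exact comparison becomes sensitive to extra register precision.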
The patch below fixes this problem and makes the behavior consistent
(apply it to Discount.cc and rebuild the Linux version).
It is really annoying that Intel couldn't just implement standard-precision
IEEE arithmetic...

Beyond that, however, you should use a higher threshold for unigram
discounting to avoid the problem of anomalous (non-smooth) counts-of-counts
in the first place. Try "-gt1min 5".

*** /tmp/T00vP_Q5	Wed May 23 20:18:39 2001
--- Discount.cc	Wed May 23 20:02:53 2001
***************
*** 185,197 ****
  	    } else {
  		double coeff0 = (i + 1) * (double)countOfCounts[i+1] /
  					    (i * (double)countOfCounts[i]);
! 		if (coeff0 <= commonTerm || coeff0 > 1.0) {
  		    cerr << "warning: discount coeff " << i
! 			 << " is out of range: " << coeff0 << "\n";
  		    coeff = 1.0;
- 		} else {
- 		    coeff = (coeff0 - commonTerm) / (1.0 - commonTerm);
- 		}
  	    }
  	    discountCoeffs[i] = coeff;
--- 185,195 ----
  	    } else {
  		double coeff0 = (i + 1) * (double)countOfCounts[i+1] /
  					    (i * (double)countOfCounts[i]);
! 		coeff = (coeff0 - commonTerm) / (1.0 - commonTerm);
! 		if (coeff <= Prob_Epsilon || coeff0 > 1.0) {
  		    cerr << "warning: discount coeff " << i
! 			 << " is out of range: " << coeff << "\n";
  		    coeff = 1.0;
  		}
  	    }
  	    discountCoeffs[i] = coeff;

--Andreas

From stolcke at speech.sri.com Sat Jun 9 11:47:21 2001
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 09 Jun 2001 11:47:21 PDT
Subject: SRILM
In-Reply-To: Your message of Sat, 09 Jun 2001 07:01:30 +0300. <3B219F9A.FD1D392E@cs.bilkent.edu.tr>
Message-ID: <200106091847.LAA18679@huge>

In message <3B219F9A.FD1D392E at cs.bilkent.edu.tr> you wrote:
> Hello Andreas,
>
> My name is Umut Topkara, and I am an MS student in Bilkent University,
> in Turkey. I have been using SRILM for my MS thesis. I would like to
> thank you for providing the code publicly. I've really benefited from it
> a lot. I have made a few additions to the code to use it for deriving
> and applying different language models for prefixes and suffixes of
> Turkish words. I preferred wrapping my code around SRILM code rather
> than changing parts of it. At the time I started writing my code,
> multi-ngram was not available. As far as I see from the source code, it
> could have been a good starting point to add code for a language model
> that exploits morphology.
>
> I have a comment on the toolkit that I want to share with you.
> For my particular case I can say that, if the toolkit had supported a
> mapping from input words to words looked up in the language models
> through a user defined function, it would have been invaluable. That way
> a morphological processing of the words can be done on the run and can be
> easily integrated into language modeling. Although this might be of
> limited benefit for English, it will have a good impact on modeling of
> languages with more productive and rich morphology.
>
> Thank you very much again for the toolkit.

Umut,

I'm glad the toolkit was useful to you, and thanks much for your input.

If you just want a one-to-one mapping of "surface" words to an "internal"
vocabulary you can do that with classes. Just prepare a class definition
file that looks like

	INTERNAL_WORD 1.0 surface_word

etc. and use it with the ngram -classes option. The LM then needs to
be in terms of internal words (i.e., word classes). For training you need
to prepare the data to contain internal words yourself, but that shouldn't
be a problem.

Also, an internal word (i.e., class) can actually expand to a sequence of
surface words (but not the other way round).

Hope this helps

--Andreas

From yangl at ecn.purdue.edu Wed Jun 20 12:27:08 2001
From: yangl at ecn.purdue.edu (Yang Liu)
Date: Wed, 20 Jun 2001 14:27:08 -0500 (EST)
Subject: compile problem
Message-ID: <200106201927.f5KJR8j21168@sohmm.ecn.purdue.edu>

Hello, all:

I have a "silly" question. I tried to install SRILM, and I just ran into
a problem.

After I changed the variable SRILM in the Makefile and ran make,
I got an error msg:

	make: Fatal error in reader: Makefile, line 9: Unexpected end of line seen.

I could not see anything wrong with the Makefile and have no clue about this.
BTW, I am using SunOS.

Does anybody know what caused such an error? I am wondering if the compiler
is not working well :)

Thanks.
Yang

From ge204 at eng.cam.ac.uk Wed Jun 20 14:08:49 2001
From: ge204 at eng.cam.ac.uk (Gunnar Evermann)
Date: 20 Jun 2001 22:08:49 +0100
Subject: compile problem
In-Reply-To: <200106201927.f5KJR8j21168@sohmm.ecn.purdue.edu>
References: <200106201927.f5KJR8j21168@sohmm.ecn.purdue.edu>
Message-ID:

Yang Liu writes:

> After I changed the variable SRILM in the Makefile and ran make,
> I got an error msg:
> make: Fatal error in reader: Makefile, line 9: Unexpected end of line seen.
>
> I could not see anything wrong with the Makefile and have no clue about this.
> BTW, I am using SunOS.

I can offer a wild guess: I have seen such errors when using Sun's make
with makefiles that relied on GNU extensions. Try using GNU make.

Andreas might know whether any GNU-specific features are used in the
makefiles.

Gunnar