From souto at weenie.inesc.pt  Tue May 22 05:26:20 2001
From: souto at weenie.inesc.pt (Nuno Souto)
Date: Tue, 22 May 2001 13:26:20 +0100
Subject: Memory problems
Message-ID: <3B0A5AEC.8CD4D20B@weenie.inesc.pt>

Hi!
I'm trying to use the SRILM toolkit to create a language model. As I'm using
large amounts of text and I'm creating 4-gram language models, the
counts files (created using the make-batch-counts/merge-batch-counts
scripts) get too big: 1.3 GB gzipped. When I try to create the language
model using ngram-count or make-big-lm, the programs abort because there
isn't enough memory (500 MB). How can I solve this problem?
Regards

Souto


From stolcke at speech.sri.com  Tue May 22 11:19:21 2001
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 22 May 2001 11:19:21 PDT
Subject: Memory problems 
In-Reply-To: Your message of Tue, 22 May 2001 13:26:20 +0100.
             <3B0A5AEC.8CD4D20B@weenie.inesc.pt> 
Message-ID: <200105221819.LAA01941@huge>


Nuno,

you will have to raise the count thresholds on your bigrams, trigrams
and fourgrams to the point where you can fit things into memory, or 
at least where you can tolerate the paging.  (The LM estimation traverses
the count and LM data structures in a fairly localized fashion, so some
amount of paging is certainly tolerable).  

Use make-big-lm and play with the -gt2min, -gt3min and -gt4min
parameters until the memory requirements become manageable.
Try eliminating the high-order n-grams first; they have a smaller effect 
on LM performance.  I have successfully built 5-gram models in 512 MB 
of memory from about 1.3 GB of gzipped counts using 

	-gt2min 1 -gt3min 2 -gt4min 4 -gt5min 4
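
To illustrate what such minimum-count thresholds do, here is a minimal C++
sketch. It is not SRILM code: the names Ngram, CountTable, keepNgram and
applyCutoffs are hypothetical, and it assumes the simplest reading of the
options, namely that an n-gram whose count is below the threshold for its
order is left out before estimation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical illustration only; this is not SRILM's implementation.
// An n-gram is a sequence of words; its order is the number of words.
using Ngram = std::vector<std::string>;
using CountTable = std::map<Ngram, unsigned long>;

// minCounts[k] is the minimum count for n-grams of order k+1, mirroring
// the roles of -gt1min, -gt2min, and so on.  Orders beyond the end of
// the vector are kept unconditionally.
bool keepNgram(const Ngram &ngram, unsigned long count,
               const std::vector<unsigned long> &minCounts)
{
    assert(!ngram.empty());
    std::size_t order = ngram.size();
    if (order > minCounts.size()) {
        return true;
    }
    return count >= minCounts[order - 1];
}

// Return a copy of the table without the below-threshold n-grams.
CountTable applyCutoffs(const CountTable &counts,
                        const std::vector<unsigned long> &minCounts)
{
    CountTable kept;
    for (const auto &entry : counts) {
        if (keepNgram(entry.first, entry.second, minCounts)) {
            kept.insert(entry);
        }
    }
    return kept;
}
```

With minCounts set to {1, 1, 2, 4}, a trigram seen once and a fourgram seen
three times are dropped, while all unigrams and bigrams are kept.  Since the
bulk of distinct high-order n-grams are typically rare, raising the
thresholds for the highest orders is where most of the memory is saved.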

As an independent measure, you could recompile the LM library and tools
with -DUSE_SARRAY_TRIE -DUSE_SARRAY, which switches to a slower, but 
less memory-wasting version of the data structures (the default setting 
is to optimize for speed).

Once you have managed to build the LM you probably also want to
apply entropic pruning to it (ngram -prune) to further reduce memory use
and loading time without sacrificing much performance.
The better approach would be to integrate the pruning with the estimation
so that irrelevant counts are excluded up front, but that will have to
wait on my (or someone's) to-do list.

Hope this helps.

--Andreas



From sarahs at cs.washington.edu  Wed May 23 13:44:09 2001
From: sarahs at cs.washington.edu (Sarah E. Schwarm)
Date: Wed, 23 May 2001 13:44:09 -0700 (PDT)
Subject: question about warning message
Message-ID: <Pine.LNX.4.21.0105231331160.3828-100000@titanium.cs.washington.edu>

hi all,

I am running SRILM 1.0.1 on two different platforms (linux and
solaris) and got different results using the same data with exactly the
same commands.  I'm hoping that someone else might have some insight...  

I'm not doing anything fancy - in this case, I just used ngram-count to
build a trigram lm using the default settings for GT discounting, etc.  
Still, I get noticeably different results ( ppl= 18.0975 ppl1= 40.7525 in
linux and  ppl= 17.2411 ppl1= 38.3 in solaris )

The solaris version gives the following warning, but the linux version
does not:
 warning: discount coeff 1 is out of range: 0.900585

I turned on the -debug 3 flag to get more information, and the output of
the two versions is nearly identical.  The differences are the warning
above and the fact that one version discards 1 1-gram prob predicting a
pseudo-event while the other discards 2.  In the end, they have very different
left-over probability masses ( 0.00388768 vs.  4.55956e-06, where the
second number corresponds with the warning I quoted above )
although they distribute these over the same number of
unseen events and write the same number of n-grams.  The GT-count numbers
are also all the same in both versions.

I found the warning message in the code (in lm/src/Discount.cc) but I
don't really understand what's causing it, and I certainly don't
understand why I get it on one installation and not the other.  If anyone
has any insight to offer, I'd greatly appreciate it. 

thanks much,
Sarah

________________________
Sarah Schwarm
sarahs at cs.washington.edu


From stolcke at speech.sri.com  Wed May 23 14:47:45 2001
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 23 May 2001 14:47:45 PDT
Subject: question about warning message 
In-Reply-To: Your message of Wed, 23 May 2001 13:44:09 -0700.
             <Pine.LNX.4.21.0105231331160.3828-100000@titanium.cs.washington.edu> 
Message-ID: <200105232147.OAA11771@huge>


Sarah,

there are differences in floating point arithmetic between Sparc and Intel
CPUs, but it looks like something else is going on.  Please send me the 
input data (just the counts would be enough) and how exactly you invoke
ngram-count and I'll try to figure it out.  It might be a good old-fashioned
bug ...

--Andreas



From stolcke at speech.sri.com  Wed May 23 20:28:29 2001
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 23 May 2001 20:28:29 PDT
Subject: question about warning message 
In-Reply-To: Your message of Wed, 23 May 2001 13:44:09 -0700.
             <Pine.LNX.4.21.0105231331160.3828-100000@titanium.cs.washington.edu> 
Message-ID: <200105240328.UAA29344@huge>


Sarah,

this discrepancy was indeed caused by the different floating point precision on 
x86 machines.  To check for an anomaly of the counts-of-counts in Good-Turing
discounting the code was checking whether two numbers were the same.  This test
turned out true on the Sparc machine, but false on Intel-based CPUs (they were
ever-so-slightly off due to the extra bits in x86 floating point registers).
The patch below fixes this problem and makes the behavior consistent (apply it to
Discount.cc and rebuild the Linux version).  It is really annoying that Intel
couldn't just implement standard-precision IEEE arithmetic...

Beyond that, however, you should use a higher threshold for unigram discounting 
to avoid the problem of anomalous (non-smooth) counts-of-counts in the first place.
Try "-gt1min 5".

*** /tmp/T00vP_Q5	Wed May 23 20:18:39 2001
--- Discount.cc	Wed May 23 20:02:53 2001
***************
*** 185,197 ****
  	    } else {
  		double coeff0 = (i + 1) * (double)countOfCounts[i+1] /
  					    (i * (double)countOfCounts[i]);
! 		if (coeff0 <= commonTerm || coeff0 > 1.0) {
  		    cerr << "warning: discount coeff " << i
! 			 << " is out of range: " << coeff0 << "\n";
  		    coeff = 1.0;
- 		} else {
- 		    coeff = (coeff0 - commonTerm) / (1.0 - commonTerm);
- 
  		}
  	    }
  	    discountCoeffs[i] = coeff;
--- 185,195 ----
  	    } else {
  		double coeff0 = (i + 1) * (double)countOfCounts[i+1] /
  					    (i * (double)countOfCounts[i]);
! 		coeff = (coeff0 - commonTerm) / (1.0 - commonTerm);
! 		if (coeff <= Prob_Epsilon || coeff0 > 1.0) {
  		    cerr << "warning: discount coeff " << i
! 			 << " is out of range: " << coeff << "\n";
  		    coeff = 1.0;
  		}
  	    }
  	    discountCoeffs[i] = coeff;
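
For readers without the full source at hand, the patched logic can be
sketched as a standalone C++ function.  This is a hedged reconstruction, not
the actual Discount.cc: the function name goodTuringCoeffs and the constant
kProbEpsilon (standing in for Prob_Epsilon, with an assumed value) are
hypothetical, commonTerm is assumed to follow Katz's formulation since its
definition is not part of the patch, and the guards for empty counts and for
a commonTerm close to 1 are additions of this sketch.

```cpp
#include <cassert>
#include <iostream>
#include <vector>

// Assumed tolerance standing in for SRILM's Prob_Epsilon.
const double kProbEpsilon = 3e-06;

// countOfCounts[r] is the number of distinct n-grams seen exactly r times
// (index 0 is unused).  Returns discount coefficients for counts
// 1..maxCount; index 0 of the result is unused.  A coefficient of 1.0
// means "no discounting".
std::vector<double> goodTuringCoeffs(
    const std::vector<unsigned long> &countOfCounts, unsigned maxCount)
{
    assert(maxCount >= 1);
    assert(countOfCounts.size() >= maxCount + 2u);
    assert(countOfCounts[1] > 0);

    std::vector<double> coeffs(maxCount + 1, 1.0);

    // Assumed definition (Katz); not shown in the patch itself.
    double commonTerm = (maxCount + 1) * (double)countOfCounts[maxCount + 1] /
                        (double)countOfCounts[1];
    if (1.0 - commonTerm <= kProbEpsilon) {
        return coeffs;          // sketch-only guard against dividing by ~0
    }

    for (unsigned i = 1; i <= maxCount; i++) {
        if (countOfCounts[i] == 0) {
            continue;           // sketch-only guard: nothing to estimate from
        }
        double coeff0 = (i + 1) * (double)countOfCounts[i + 1] /
                        (i * (double)countOfCounts[i]);
        // Compute the final coefficient first, then test it against a
        // tolerance instead of comparing coeff0 and commonTerm directly.
        double coeff = (coeff0 - commonTerm) / (1.0 - commonTerm);
        if (coeff <= kProbEpsilon || coeff0 > 1.0) {
            std::cerr << "warning: discount coeff " << i
                      << " is out of range: " << coeff << "\n";
            coeff = 1.0;
        }
        coeffs[i] = coeff;
    }
    return coeffs;
}
```

The point of the change is that two values which are mathematically equal
but differ in the last bits now land on the same side of the test on every
platform, because the comparison is against a small tolerance rather than
an exact boundary.  Under the definition of commonTerm assumed here, calling
the function with maxCount equal to 1 makes coeff0 and commonTerm the same
expression, which is exactly the kind of tie the tolerance is meant to catch.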

--Andreas



From stolcke at speech.sri.com  Sat Jun  9 11:47:21 2001
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 09 Jun 2001 11:47:21 PDT
Subject: SRILM 
In-Reply-To: Your message of Sat, 09 Jun 2001 07:01:30 +0300.
             <3B219F9A.FD1D392E@cs.bilkent.edu.tr> 
Message-ID: <200106091847.LAA18679@huge>


In message <3B219F9A.FD1D392E at cs.bilkent.edu.tr> you wrote:
> Hello Andreas,
> 
> My name is Umut Topkara, and I am an MS student in Bilkent University,
> in Turkey. I have been using SRILM for my MS thesis. I would like to
> thank you for providing the code publicly. I've really benefited from it
> a lot. I have made a few additions to the code to use it for deriving
> and applying different language models for prefixes and suffixes of
> Turkish words. I preferred wrapping my code around SRILM code rather
> than changing parts of it. At the time I started writing my code,
> multi-ngram was not available. As far as I see from the source code, it
> could have been a good starting point to add code for a language model
> that exploits morphology.
> 
> I have a comment on the toolkit that I want to share with you. For my
> particular case I can say that, if the toolkit had supported a mapping
> from input words to words looked up in the language models through a
> user defined function, it would have been invaluable. That way a
> morphological processing of the words can be done on the run and can be
> easily integrated into language modeling. Although this might be of
> limited benefit for English, it will have a good impact on modeling of
> languages with more productive and rich morphology.
> 
> Thank you very much again for the toolkit.

Umut,

I'm glad the toolkit was useful to you, and thanks much for your input.   

If you just want a one-to-one mapping of "surface" words to an "internal"
vocabulary you can do that with classes.  Just prepare a class definition
file that looks like

	INTERNAL_WORD 1.0 surface_word
	etc.

and use it with the ngram -classes option.
The LM then needs to be in terms of internal words (i.e., word classes).
For training you need to prepare the data to contain internal words yourself,
but that shouldn't be a problem.

Also, an internal word (i.e., class) can actually expand to a sequence of 
surface words (but not the other way round).
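
The data preparation step mentioned above could be sketched in C++ roughly
as follows.  This is an illustration under stated assumptions, not part of
SRILM: the names SurfaceMap, parseClassDefs and mapToInternal are
hypothetical, each definition line is assumed to have exactly the
three-field form shown above (internal word, probability, one surface word),
and words without a definition are passed through unchanged.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Maps each surface word to its internal word (hypothetical helper type).
using SurfaceMap = std::map<std::string, std::string>;

// Read lines of the form "INTERNAL_WORD 1.0 surface_word".
// Lines that do not have these three fields are skipped.
SurfaceMap parseClassDefs(std::istream &in)
{
    SurfaceMap toInternal;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string internal, surface;
        double prob;
        if (!(fields >> internal >> prob >> surface)) {
            continue;
        }
        toInternal[surface] = internal;
    }
    return toInternal;
}

// Rewrite a whitespace-separated sentence in terms of internal words,
// so that it can be used as LM training text.
std::string mapToInternal(const std::string &sentence,
                          const SurfaceMap &toInternal)
{
    std::istringstream words(sentence);
    std::string word, result;
    while (words >> word) {
        if (!result.empty()) {
            result += ' ';
        }
        SurfaceMap::const_iterator found = toInternal.find(word);
        if (found != toInternal.end()) {
            result += found->second;
        } else {
            result += word;
        }
    }
    return result;
}
```

Running every training sentence through mapToInternal yields text in the
internal vocabulary; the same definition file is then the one given to
ngram -classes when the model is applied to surface text.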

Hope this helps

--Andreas


From yangl at ecn.purdue.edu  Wed Jun 20 12:27:08 2001
From: yangl at ecn.purdue.edu (Yang Liu)
Date: Wed, 20 Jun 2001 14:27:08 -0500 (EST)
Subject: compile problem 
Message-ID: <200106201927.f5KJR8j21168@sohmm.ecn.purdue.edu>

Hello, all:
I have a "silly" question. 
I tried to install SRILM and ran into a problem.

After I changed the variable SRILM in the Makefile and ran make, 
I got an error msg: 
make: Fatal error in reader: Makefile, line 9: Unexpected end of line seen.

I could not see anything wrong with the Makefile and have no clue about this.
BTW, I am using SunOS.

Does anybody know what caused such an error?  I am wondering if the compiler is 
not working well:) 

Thanks.
Yang


From ge204 at eng.cam.ac.uk  Wed Jun 20 14:08:49 2001
From: ge204 at eng.cam.ac.uk (Gunnar Evermann)
Date: 20 Jun 2001 22:08:49 +0100
Subject: compile problem
In-Reply-To: <200106201927.f5KJR8j21168@sohmm.ecn.purdue.edu>
References: <200106201927.f5KJR8j21168@sohmm.ecn.purdue.edu>
Message-ID: <mqqu21an9jy.fsf@eng.cam.ac.uk>

Yang Liu <yangl at ecn.purdue.edu> writes:

> After I changed the variable SRILM in the Makefile and ran make, 
> I got an error msg: 
> make: Fatal error in reader: Makefile, line 9: Unexpected end of line seen.
>
> I could not see anything wrong with the Makefile and have no clue about this.
> BTW, I am using SunOS.

I can offer a wild guess:

I have seen such errors when using Sun's make with makefiles that
relied on GNU extensions. Try using GNU make.

Andreas might know whether any GNU-specific features are used in the
makefiles.

  Gunnar