From nlp at pobox.sk  Tue Jul 20 08:26:23 2004
From: nlp at pobox.sk (Robert Wagner)
Date: Tue, 20 Jul 2004 17:26:23 +0200
Subject: Interpolation of word-based and POS-based ngrams
Message-ID: <200407201526.i6KFQNr4006030@www3.pobox.sk>

Hello SRILM users!
 Does anybody know whether SRILM implements interpolation weights?
I have an ordinary word-based n-gram model and a part-of-speech-based
n-gram model, and I want to interpolate them to build an HMM for
disfluency detection (using the hidden-ngram tool). Is it possible to
do this directly in SRILM?
 Regards
       Robert Wagner




From stolcke at speech.sri.com  Tue Jul 20 08:54:16 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 20 Jul 2004 08:54:16 PDT
Subject: Interpolation of word-based and POS-based ngrams
In-Reply-To: Your message of Tue, 20 Jul 2004 17:26:23 +0200.
             <200407201526.i6KFQNr4006030@www3.pobox.sk> 
Message-ID: <200407201554.i6KFsGj09165@conga.speech.sri.com>


In message <200407201526.i6KFQNr4006030 at www3.pobox.sk> you wrote:
> Hello SRILM users!
>  Does anybody know whether SRILM implements interpolation weights?
> I have an ordinary word-based n-gram model and a part-of-speech-based
> n-gram model, and I want to interpolate them to build an HMM for
> disfluency detection (using the hidden-ngram tool). Is it possible to
> do this directly in SRILM?

By using the options

	-lm
	-classes
	-simple-classes
	-lambda
	-mix-lm

with hidden-ngram you can tell it to use an interpolated LM where
(one or both of) the component models are class-based.

For details see the man page.
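
For reference, the interpolation these options set up amounts to a
per-word linear mixture of the two component models. Below is a minimal
Python sketch, assuming `lam` plays the role of -lambda (taken here to
be the weight on the main -lm model) and that the two probabilities come
from the component models for the same word in the same context. The
function name and the numbers are illustrative only and are not part of
SRILM.

```python
import math


def interpolate(p_main: float, p_mix: float, lam: float = 0.5) -> float:
    """Linear interpolation of two conditional word probabilities.

    lam is the weight on the main model; (1 - lam) goes to the mix model.
    The default of 0.5 is a choice made for this sketch.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_main + (1.0 - lam) * p_mix


# Hypothetical probabilities for one word under the two models.
p = interpolate(p_main=0.02, p_mix=0.08, lam=0.75)
print(f"P = {p:.4f}, log10 P = {math.log10(p):.4f}")
```

Since both inputs are proper probabilities and the weights sum to one,
the mixture is again a proper distribution over the vocabulary.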

--Andreas 


From nlp at pobox.sk  Tue Jul 20 10:01:10 2004
From: nlp at pobox.sk (Robert Wagner)
Date: Tue, 20 Jul 2004 19:01:10 +0200
Subject: Interpolation of word-based and POS-based ngrams
Message-ID: <200407201701.i6KH1A9R016054@www6.pobox.sk>

Hi Andreas,
 my problem is that I use different data for the two models. The
word-based model is trained on text consisting of the recognized words;
the POS-based class model is trained on text consisting of the
recognized words' POS tags. I estimated the latter simply by running
ngram-count on the text with the words replaced by their POS tags.
 POS-based classes are also not typical "simple classes"...
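
As an illustration of the preprocessing described above, here is a
minimal Python sketch that turns tagged text into the tag-only text one
would then pass to ngram-count. The `word/TAG` token format, the
function name and the example sentence are assumptions made for this
sketch; the message does not say how the tagged data is stored.

```python
def to_tag_sentence(tagged_line: str, sep: str = "/") -> str:
    """Replace every word/TAG token by its TAG, keeping token order."""
    tags = []
    for token in tagged_line.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains "/" (e.g. "1/2/CD") still yields the right tag.
        word, found, tag = token.rpartition(sep)
        if not found or not word or not tag:
            raise ValueError(f"token is not in word{sep}TAG form: {token!r}")
        tags.append(tag)
    return " ".join(tags)


# Hypothetical tagged sentence (assumed input format).
print(to_tag_sentence("i/PRP uh/UH want/VBP a/DT flight/NN"))
```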
  
Robert

P.S.
 It would be ideal to obtain the interpolation weights with SRILM as well ;-)



From katrin at ssli-mail.ee.washington.edu  Tue Jul 20 11:25:54 2004
From: katrin at ssli-mail.ee.washington.edu (Katrin Kirchhoff)
Date: Tue, 20 Jul 2004 11:25:54 -0700
Subject: Interpolation of word-based and POS-based ngrams
In-Reply-To: <200407201701.i6KH1A9R016054@www6.pobox.sk>; from nlp@pobox.sk on Tue, Jul 20, 2004 at 07:01:10PM +0200
References: <200407201701.i6KH1A9R016054@www6.pobox.sk>
Message-ID: <20040720112554.B16976@duck.ee.washington.edu>


As far as I know, you need to write your own script to compute
P(word|POS) and create the classes file. There is a
compute-best-mix.gawk script in SRILM for estimating
interpolation weights.
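
Both steps can be sketched in a few lines of Python. The first part
estimates P(word|POS) by relative frequency from (word, tag) pairs and
formats the result as one `CLASS PROB WORD` entry per line, which is
assumed here to match the classes file layout (check classes-format(5)
before relying on it). The second part shows the general EM update for
a two-model mixture weight on held-out data. It is a sketch of the
technique, not a transcription of compute-best-mix.gawk, and it assumes
that every held-out word gets non-zero probability from at least one
model. All names and numbers are made up.

```python
from collections import Counter, defaultdict


def word_given_tag(pairs):
    """Relative-frequency estimate of P(word | tag).

    pairs: a list of (word, tag) tuples.
    """
    pair_counts = Counter(pairs)
    tag_counts = Counter(tag for _, tag in pairs)
    probs = defaultdict(dict)
    for (word, tag), c in pair_counts.items():
        probs[tag][word] = c / tag_counts[tag]
    return dict(probs)


def class_lines(probs):
    """Format as 'CLASS PROB WORD' lines (assumed classes-file layout)."""
    return [f"{tag} {p:.6g} {word}"
            for tag in sorted(probs)
            for word, p in sorted(probs[tag].items())]


def em_mixture_weight(p1, p2, lam=0.5, iters=50):
    """EM for the lam that maximizes sum(log(lam*p1[i] + (1-lam)*p2[i])).

    p1, p2: per-word probabilities of the same held-out text under
    model 1 and model 2.
    """
    for _ in range(iters):
        # E-step: posterior that each held-out word came from model 1.
        post = [lam * a / (lam * a + (1.0 - lam) * b) for a, b in zip(p1, p2)]
        # M-step: the new weight is the average posterior.
        lam = sum(post) / len(post)
    return lam


# Hypothetical tagged data and held-out probabilities.
pairs = [("the", "DT"), ("a", "DT"), ("the", "DT"), ("flight", "NN")]
print("\n".join(class_lines(word_given_tag(pairs))))
print(em_mixture_weight([0.1, 0.2, 0.05], [0.02, 0.3, 0.01]))
```

The weight should be estimated on data that neither model was trained
on; otherwise EM tends to favor whichever model has seen the text.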

KK


-- 
-----------------------------------------------------------------
Katrin Kirchhoff
Dept of Electrical Engineering, University of Washington
M422 EE/CS Building, Box 352500, Seattle, WA, 98195
Phone: (206) 616 5494
katrin at ee.washington.edu
-----------------------------------------------------------------


From stolcke at speech.sri.com  Tue Jul 20 11:50:22 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 20 Jul 2004 11:50:22 PDT
Subject: Interpolation of word-based and POS-based ngrams
In-Reply-To: Your message of Tue, 20 Jul 2004 19:01:10 +0200.
             <200407201701.i6KH1A9R016054@www6.pobox.sk> 
Message-ID: <200407201850.LAA27699@huge>


In message <200407201701.i6KH1A9R016054 at www6.pobox.sk> you wrote:
> Hi Andreas,
>  my problem is that I use different data for the two models. The
> word-based model is trained on text consisting of the recognized words;
> the POS-based class model is trained on text consisting of the
> recognized words' POS tags. I estimated the latter simply by running
> ngram-count on the text with the words replaced by their POS tags.
>  POS-based classes are also not typical "simple classes"...

You are right, of course.  While hidden-ngram can theoretically
handle general class N-grams, the current implementation cannot
handle anything but toy examples.  The reason is that general
class-based models are no longer Markovian: they require the
complete word history.  This means hidden-ngram has to keep a
complete, distinct history for every hypothesis, which quickly
becomes infeasible.

With some small changes to the code, one could approximate the full
class N-gram by truncating the N-gram context to a fixed length
(say 4).  This might not hurt you much in practice, and would enable
the use of class-based N-grams in hidden-ngram decoding.
Let me know if you're interested in that.
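
To make the truncation idea concrete: if the search identifies a
hypothesis only by its last k history words, hypotheses that agree on
that suffix can be merged, so the number of live states is bounded by
the number of distinct k-word contexts instead of growing with the full
histories. The Python sketch below illustrates that recombination step
in isolation. It is not hidden-ngram code; the function name, the data
layout and the scores are invented for the example.

```python
def recombine(hyps, k=4):
    """Merge hypotheses whose last k history words agree, keeping the best.

    hyps: iterable of (history, log_score) pairs, history being a tuple
    of words. Returns one surviving (history, log_score) per context.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    best = {}
    for history, score in hyps:
        key = tuple(history[-k:])  # truncated context acts as the state
        if key not in best or score > best[key][1]:
            best[key] = (history, score)
    return list(best.values())


# Hypothetical hypotheses; the first two share their last four words.
hyps = [
    (("i", "uh", "want", "a", "flight"), -12.0),
    (("we", "uh", "want", "a", "flight"), -11.5),
    (("i", "want", "to", "book", "it"), -13.0),
]
print(recombine(hyps, k=4))
```

With k=4 the first two hypotheses collapse into the better-scoring one;
with k=5 all three survive, which is the growth the truncation avoids.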

The other possibility for now (also an approximation) is to expand
the class N-gram into a word N-gram, but this might also fail due to
resource limitations, depending on your vocabulary size.
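
The resource problem is easy to see in a small Python sketch of the
expansion for bigrams. It assumes, for simplicity, that every word
belongs to exactly one class (which is not true of POS tags in general,
hence the approximation), so that P(w2|w1) = P(c2|c1) * P(w2|c2). The
function name and the toy probabilities are illustrative only.

```python
from itertools import product


def expand_class_bigrams(class_bigrams, members):
    """Expand class bigram probabilities into word bigram probabilities.

    class_bigrams: {(c1, c2): P(c2 | c1)}
    members:       {c: {word: P(word | c)}}, each word in exactly one class.
    Returns {(w1, w2): P(w2 | w1)}.
    """
    word_bigrams = {}
    for (c1, c2), p_class in class_bigrams.items():
        # Every word of c1 pairs with every word of c2.
        for w1, (w2, p_member) in product(members[c1], members[c2].items()):
            word_bigrams[(w1, w2)] = p_class * p_member
    return word_bigrams


# Toy model: two classes with two member words each.
members = {
    "DT": {"the": 0.6, "a": 0.4},
    "NN": {"flight": 0.5, "seat": 0.5},
}
class_bigrams = {("DT", "NN"): 0.8, ("NN", "DT"): 0.1}
expanded = expand_class_bigrams(class_bigrams, members)
print(len(expanded), expanded[("the", "flight")])
```

Each class bigram turns into |c1| * |c2| word bigrams, so classes with
thousands of members make the expanded model very large, and higher
orders multiply the effect.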

--Andreas



From duh at ee.washington.edu  Fri Aug 13 14:52:33 2004
From: duh at ee.washington.edu (Kevin Duh)
Date: Fri, 13 Aug 2004 14:52:33 -0700
Subject: Memory problem with ngram-count
Message-ID: <411D3821.8090308@ee.washington.edu>

Hi,

I'm running into some memory limitations with ngram-count and am 
wondering if anyone has any suggestions.

I have a very large text file (more than 1GB) as input to ngram-count. I 
divided this text into smaller files and used the 'make-batch-counts' 
and 'merge-batch-counts' commands to create a large count-file. Then, I 
tried to use 'ngram-count -read myfile.counts -lm ...' to estimate a 
language model. I receive the following error:

ngram-count: /SRILM/include/LHash.cc:127: void LHash<KeyT, 
DataT>::alloc(unsigned int) [with KeyT = VocabIndex, DataT = 
Trie<VocabIndex, unsigned int>]: Assertion `body != 0' failed.

Does anyone have any suggestions for solving this problem?

Thanks in advance,
Kevin

-----------------------------
Kevin Duh
Graduate Research Assistant
Dept. of Electrical Engineering
University of Washington
http://ssli.ee.washington.edu/people/duh


From stolcke at speech.sri.com  Fri Aug 13 20:07:15 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 13 Aug 2004 20:07:15 PDT
Subject: Memory problem with ngram-count 
In-Reply-To: Your message of Fri, 13 Aug 2004 14:52:33 -0700.
             <411D3821.8090308@ee.washington.edu> 
Message-ID: <200408140307.UAA13114@huge>


In message <411D3821.8090308 at ee.washington.edu> you wrote:
> Hi,
> 
> I'm running into some memory limitations with ngram-count and am 
> wondering if anyone has any suggestions.
> 
> I have a very large text file (more than 1GB) as input to ngram-count. I 
> divided this text into smaller files and used the 'make-batch-counts' 
> and 'merge-batch-counts' commands to create a large count-file. Then, I 
> tried to use 'ngram-count -read myfile.counts -lm ...' to estimate a 
> language model. I receive the following error:
> 
> ngram-count: /SRILM/include/LHash.cc:127: void LHash<KeyT, 
> DataT>::alloc(unsigned int) [with KeyT = VocabIndex, DataT = 
> Trie<VocabIndex, unsigned int>]: Assertion `body != 0' failed.
> 
> Does anyone have any suggestions for solving this problem?

1. Use a binary compiled for "compact" memory use.
   If you are lucky (the person who installed SRILM did a thorough job)
   you should find these installed in 

	$SRILM/bin/${MACHINE_TYPE}_c/ ...

2. Use the make-big-lm script.  See the training-scripts(1) man page
   for details.

3. Find a machine with more memory or swap space.

4. Some combination of the above.
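
For readers unfamiliar with the batch approach mentioned in the
question, the counting stage can be pictured as follows: count each
chunk of text separately, keep each result sorted, and then merge the
sorted results as a stream so that no step needs all raw text in memory
at once. The Python sketch below shows that idea only. It does not
reproduce the file formats or options of make-batch-counts and
merge-batch-counts, it omits sentence-boundary tokens, and the function
names are invented.

```python
import heapq
from collections import Counter
from itertools import groupby
from operator import itemgetter


def count_ngrams(lines, order=3):
    """Count all n-grams up to `order` in one batch of sentences."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    # Sorted output lets several batches be merged lazily later on.
    return sorted(counts.items())


def merge_counts(*sorted_batches):
    """Stream-merge sorted (ngram, count) batches, summing equal n-grams."""
    merged = heapq.merge(*sorted_batches, key=itemgetter(0))
    for ngram, group in groupby(merged, key=itemgetter(0)):
        yield ngram, sum(c for _, c in group)


# Two tiny hypothetical batches.
batch1 = count_ngrams(["a b a"], order=2)
batch2 = count_ngrams(["a b"], order=2)
for ngram, count in merge_counts(batch1, batch2):
    print(" ".join(ngram), count)
```

Merging keeps the counting stage cheap, but estimating the model from
the merged counts can still need a lot of memory, which is the stage
the suggestions above address.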

--Andreas 


From dw229 at hermes.cam.ac.uk  Mon Aug 23 06:21:37 2004
From: dw229 at hermes.cam.ac.uk (Daniel Walker)
Date: Mon, 23 Aug 2004 14:21:37 +0100 (BST)
Subject: ngram-count and float counts
Message-ID: <Pine.LNX.4.60.0408231421100.20065@hermes-1.csi.cam.ac.uk>

I'm trying to create a back-off file using ngram-count and a count file that 
has float "counts" derived from another model but get the error message:

error in discount estimator for order 1

The -float-counts option says that it only supports certain types of 
discounting but doesn't say which. Could this be the problem? Otherwise, what 
sort of data problems could cause this message, does anyone know?

Thanks,

Dan


From stolcke at speech.sri.com  Mon Aug 23 07:47:28 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 23 Aug 2004 07:47:28 PDT
Subject: ngram-count and float counts 
In-Reply-To: Your message of Mon, 23 Aug 2004 14:21:37 +0100.
             <Pine.LNX.4.60.0408231421100.20065@hermes-1.csi.cam.ac.uk> 
Message-ID: <200408231447.HAA21348@huge>


The only discounting methods that support fractional counts are 

-cdiscount*
-wbdiscount*

(You can look in include/Discount.h and see which classes do NOT have 
a member function

    virtual Boolean estimate(NgramCounts<FloatCount> &counts, unsigned order)
        { return false; };
)
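
Some background on why the list is short, offered as general knowledge
and not as a statement about SRILM's code: Good-Turing-style estimators
rely on counts of counts (how many n-grams occur exactly once, twice,
and so on), which are not well defined for fractional counts, whereas
Witten-Bell needs only the total count and the number of distinct word
types seen in a context. The Python sketch below is textbook Witten-Bell
for a single context, written so that it accepts fractional counts. How
SRILM itself treats types with fractional counts may differ; the
function name and the numbers are made up.

```python
def witten_bell(counts):
    """Witten-Bell discounting for one context.

    counts: {word: count}; counts may be fractional but must be >= 0.
    Returns (discounted probabilities, probability mass left for backoff).
    """
    total = sum(counts.values())
    types = sum(1 for c in counts.values() if c > 0)
    if total <= 0:
        raise ValueError("need at least one positive count")
    denom = total + types
    probs = {w: c / denom for w, c in counts.items() if c > 0}
    return probs, types / denom


# Hypothetical fractional counts for the words following one context.
probs, backoff_mass = witten_bell({"a": 2.5, "b": 0.5, "c": 1.0})
print(probs, backoff_mass)
```

The discounted probabilities and the backoff mass always sum to one,
whatever the scale of the counts, which is what makes the method usable
with counts derived from another model.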


--Andreas



From wangc at csail.mit.edu  Fri Sep  3 10:27:44 2004
From: wangc at csail.mit.edu (Chao Wang)
Date: Fri, 03 Sep 2004 13:27:44 -0400
Subject: -unk flag
Message-ID: <4138A990.5060500@csail.mit.edu>

Could someone please tell me what the -unk flag does to the probability
model? It seems that, with the -unk flag, the language model gives a very
good probability to unknown words, even when the training sentences don't
contain any unknown words. In fact, I found that the probability of a sentence
in the training data is lower than that of a sentence composed entirely of
unknown words (the number of words is the same in the two sentences). This is
quite unexpected.

Thanks a lot!

Chao
-- 
Chao Wang, PhD
Spoken Language Systems Group
MIT CSAIL
http://www.sls.csail.mit.edu/wangc


From stolcke at speech.sri.com  Mon Sep  6 13:54:16 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 06 Sep 2004 13:54:16 PDT
Subject: -unk flag 
In-Reply-To: Your message of Fri, 03 Sep 2004 13:27:44 -0400.
             <4138A990.5060500@csail.mit.edu> 
Message-ID: <200409062054.NAA11957@huge>


In message <4138A990.5060500 at csail.mit.edu> you wrote:
> Could someone please tell me what the -unk flag does to the probability
> model? It seems that, with the -unk flag, the language model gives a very
> good probability to unknown words, even when the training sentences don't
> contain any unknown words. In fact, I found that the probability of a sentence
> in the training data is lower than that of a sentence composed entirely of
> unknown words (the number of words is the same in the two sentences). This is
> quite unexpected.

ngram-count -unk builds an LM that has <unk> as a word type and assigns
non-zero probability to it (the default is not to include <unk> in the LM).
All words in the training data that are not listed in the -vocab file are
mapped to <unk>.  This is what is commonly known as an "open-vocabulary" LM.

ngram -unk should be used to evaluate an LM that contains <unk>.
(A warning will be issued if -unk is not specified and the LM contains
a non-zero probability for <unk>.)
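
The mapping step can be illustrated with a short Python sketch: every
training word outside the vocabulary is replaced by <unk> before
counting, so <unk> pools the counts of all out-of-vocabulary word types.
The unigram estimate below is plain relative frequency, with no
smoothing, and is not SRILM's estimator. The function names, vocabulary
and sentences are invented, and the sketch is not meant as a diagnosis
of the behavior reported above.

```python
from collections import Counter

UNK = "<unk>"


def map_oov(words, vocab):
    """Replace every word outside `vocab` by the unknown-word token."""
    return [w if w in vocab else UNK for w in words]


def unigram_mle(sentences, vocab):
    """Relative-frequency unigram probabilities after OOV mapping."""
    counts = Counter(w for s in sentences for w in map_oov(s.split(), vocab))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


# Hypothetical vocabulary and training sentences.
vocab = {"the", "flight", "leaves"}
sentences = ["the flight leaves today", "the late flight"]
probs = unigram_mle(sentences, vocab)
print(probs[UNK], probs["leaves"])
```

In this toy example <unk> ends up more probable than the in-vocabulary
word "leaves", simply because two different out-of-vocabulary words
were folded into it.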

The behavior you describe is certainly not expected.  But to figure out 
why it happens one would have to look at the data and exact command
invocations you are using.

--Andreas