From thomae at ei.tum.de  Wed Oct  2 06:09:46 2002
From: thomae at ei.tum.de (Matthias Thomae)
Date: Wed, 02 Oct 2002 15:09:46 +0200
Subject: N-Gram without backoff?
Message-ID: <3D9AF01A.4010505@ei.tum.de>

Hello SRILM users,

does anyone know if and how it is possible to construct n-gram language
models without backoff, and to convert them into pfsg format? I could
not find any corresponding option for ngram or ngram-count. I tried
manually deleting the lower-order n-grams from the ARPA format file, but
I am not sure if the weights are still correct then.

Regards.
Matthias


From stolcke at speech.sri.com  Wed Oct  2 10:04:56 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 02 Oct 2002 10:04:56 PDT
Subject: N-Gram without backoff? 
In-Reply-To: Your message of Wed, 02 Oct 2002 15:09:46 +0200.
             <3D9AF01A.4010505@ei.tum.de> 
Message-ID: <200210021704.KAA04741@huge>


You can disable probability smoothing with ngram-count -gt1max 0 -gt2max 0 ...
This will still include lower-order N-grams in the model, but
they are effectively never used because no probability mass is left
for backing off.  You could then remove the lower-order N-grams to save
space (but leave the unigrams in).
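
To illustrate why no probability mass is left for backing off, here is a
minimal Python sketch. It is not SRILM code; the function names and the toy
counts are invented for illustration. It only shows that undiscounted
(maximum-likelihood) estimates for a given history already sum to one.

```python
from collections import defaultdict

def ml_bigram_probs(bigram_counts):
    """Maximum-likelihood bigram estimates with no discounting."""
    history_totals = defaultdict(int)
    for (history, _word), count in bigram_counts.items():
        history_totals[history] += count
    return {
        (history, word): count / history_totals[history]
        for (history, word), count in bigram_counts.items()
    }

def leftover_mass(probs, history):
    """Probability mass not assigned to the observed successors of `history`."""
    seen = sum(p for (h, _w), p in probs.items() if h == history)
    return 1.0 - seen

# Toy counts, invented for this example.
counts = {("a", "b"): 3, ("a", "c"): 1, ("b", "c"): 2}
probs = ml_bigram_probs(counts)
for h in ("a", "b"):
    # With no discounting the left-over mass is zero (up to rounding),
    # so a backoff distribution would never receive any weight.
    print(h, leftover_mass(probs, h))
```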

the conversion to pfsg should be unaffected by any of this.

--Andreas



From woosung at clsp.jhu.edu  Wed Oct  2 21:15:18 2002
From: woosung at clsp.jhu.edu (Woosung Kim)
Date: Thu, 3 Oct 2002 00:15:18 -0400
Subject: [Q] on mix-lm?
Message-ID: <20021003001518.430ac8d8.woosung@clsp.jhu.edu>

Dear Dr. Stolcke,

I am doing some experiments with interpolated LMs, and
I've noticed that mixed LMs give slightly different PPLs
from what they should be, i.e., from PPLs calculated by taking
weighted sums of the respective models' word probabilities.
Do you have any documentation or explanation of how 'mix-lm'
works in your toolkit, or how it differs from the correct way?
Of course, the best way would be to look at the source code,
but I am looking for an easier way.
According to my experiments, mix-lm gives better results when
the baseline model (before mixing) is good (PPL less than 300),
but it gives worse results when it is not good (PPL above 500).

Thanks in advance,
-- 
Woosung Kim


From anand at speech.sri.com  Wed Oct  2 22:49:49 2002
From: anand at speech.sri.com (Anand Venkataraman)
Date: Wed, 2 Oct 2002 22:49:49 -0700 (PDT)
Subject: [Q] on mix-lm?
Message-ID: <200210030549.WAA01531@huge>

Dear Woosung,

>I am doing some experiments using interpolated LMs, and
>I've noticed that mixed LMs give slightly different
>PPLs from PPLs that should be. I mean, PPLs calculated
>by getting weighted sums after getting respective
>models' word probs.  Do you have any documentations or
>explanations how that 'mix-lm' works in your toolkit or
>how it is different from the correct way?

There is no one "correct way".  But I presume you mean
by that the unmixed estimation procedure.

mix-lm simply computes \sum_i \lambda_i P_i(w|h), where
each P_i is the backed-off word-level N-gram probability
of model i.  You can in fact calculate this value by
hand quite easily from the individual ngram -ppl
outputs using the above expression.
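
A minimal Python sketch of that hand calculation follows. It assumes the
per-token probabilities of each model have already been extracted (for
example from per-word ngram -debug 2 -ppl output); the function name and
the numbers are invented for illustration and are not part of SRILM.

```python
import math

def mixture_perplexity(word_probs_per_model, lambdas):
    """Perplexity of a word-level linear mixture of several models.

    word_probs_per_model[i][t] is model i's probability for token t.
    """
    assert abs(sum(lambdas) - 1.0) < 1e-9, "weights must sum to one"
    n_tokens = len(word_probs_per_model[0])
    log10_total = 0.0
    for t in range(n_tokens):
        # Interpolate the probabilities first, then take the log.
        p = sum(lam * probs[t]
                for lam, probs in zip(lambdas, word_probs_per_model))
        log10_total += math.log10(p)
    return 10 ** (-log10_total / n_tokens)

# Invented per-token probabilities for two models over four tokens.
model_a = [0.10, 0.02, 0.30, 0.05]
model_b = [0.05, 0.08, 0.10, 0.20]
print(mixture_perplexity([model_a, model_b], [0.6, 0.4]))
```

Note that the interpolation happens per token, before the logarithm;
averaging the two models' perplexities or log probabilities would give a
different number.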

However, there is a slight nuance involved.  One should
generally use lambdas that were estimated to maximize
the likelihood of some held out data in the domain.
The awk script compute-best-mix will do this for you.
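
The weight estimation can be sketched as a plain EM loop over held-out
per-token probabilities. This is a Python illustration of the general
technique, not a port of compute-best-mix; the function name, the fixed
iteration count, and the data are assumptions made for the example.

```python
def em_mixture_weights(word_probs_per_model, iterations=50):
    """Estimate mixture weights that maximize held-out likelihood via EM.

    word_probs_per_model[i][t] is model i's probability for held-out
    token t.  Assumes every token gets non-zero probability from at
    least one model.
    """
    n_models = len(word_probs_per_model)
    n_tokens = len(word_probs_per_model[0])
    lambdas = [1.0 / n_models] * n_models  # start from uniform weights
    for _ in range(iterations):
        posterior_sums = [0.0] * n_models
        for t in range(n_tokens):
            weighted = [lambdas[i] * word_probs_per_model[i][t]
                        for i in range(n_models)]
            total = sum(weighted)
            # E-step: posterior that token t was generated by model i.
            for i in range(n_models):
                posterior_sums[i] += weighted[i] / total
        # M-step: new weight is the average posterior.
        lambdas = [s / n_tokens for s in posterior_sums]
    return lambdas

# Invented held-out probabilities for two models.
held_out_a = [0.10, 0.02, 0.30, 0.05]
held_out_b = [0.05, 0.08, 0.10, 0.20]
print(em_mixture_weights([held_out_a, held_out_b]))
```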

You can also calculate a sentence level mixture
similarly interpolated with tuned weights (see
compute-best-sentence-mix).  This uses sentence level
probabilities (as for instance obtained from ngram
-debug 1 -ppl).

>experiments, mix-lm gives better results when the
>baseline model (before mixing) is good (PPL less than
>300), but it gives worse results when it is not good
>(PPL above 500).
>

Regardless of the quality of the LMs, the mixed
likelihood on the held-out set should always be at
least as high as the likelihood of the most likely
component, because the EM procedure that computes the
best weights maximises this quantity.  Of course the
test set likelihood may not necessarily be higher (nor,
consequently, the PPL lower), but it usually is.

hope this helps.

&


From stolcke at speech.sri.com  Thu Oct  3 08:25:58 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 03 Oct 2002 08:25:58 PDT
Subject: [Q] on mix-lm? 
In-Reply-To: Your message of Thu, 03 Oct 2002 00:15:18 -0400.
             <20021003001518.430ac8d8.woosung@clsp.jhu.edu> 
Message-ID: <200210031525.IAA20309@tonga>


Woosung,

I suspect that you are noticing the difference between "static" and
"dynamic" interpolation.  The former is sometimes called N-gram "merging",
while the latter is the commonly used mixture of probabilities.
ngram -bayes 0 -mix-lm performs dynamic interpolation. 
Without the -bayes option you get static interpolation.
This is also explained in the man page:

       -mix-lm file
              Read a second N-gram model for interpolation purposes.
              The second and any additional interpolated models can
              also be class N-grams (using the same -classes
              definitions), but are otherwise constrained to be
              standard N-grams, i.e., the options -df, -tagged,
              -skip, and -hidden-vocab do not apply to them.
              NOTE: Unless -bayes (see below) is specified, -mix-lm
              triggers a static interpolation of the models in
              memory.  In most cases a more efficient, dynamic
              interpolation is sufficient, requested by -bayes 0.

There is some discussion of the two methods in the paper that just 
appeared in ICSLP
(http://www.speech.sri.com/cgi-bin/run-distill?papers/icslp2002-srilm.ps.gz,
last paragraph of section 3.2).

--Andreas



From mirjam.sepesy at uni-mb.si  Tue Oct  8 05:34:37 2002
From: mirjam.sepesy at uni-mb.si (Mirjam Sepesy Maucec)
Date: Tue, 08 Oct 2002 14:34:37 +0200
Subject: class LM
Message-ID: <3DA2D0DD.AE6387DB@uni-mb.si>

Andreas!

Thank you for your answers.

A few more questions:

1.)
I understand transitions like:

[2gram]POSITION = 2 FROM: <504,NULL> TO: <756 504,NULL> WORD = primeri
PROB = -1.76748 EXPANDPROB = 0.0106105

(504 and 756 are classes),

but not transitions like:

[OOV]POSITION = 2 FROM: <504,NULL> TO: <,NULL> WORD = primeri PROB =
-inf

What does [OOV] mean? These transitions are not present in the test
example of the toolkit.

2.) In which case is the history string cleared (FROM: <504,NULL> TO:
<,NULL>)?

3.) Is the vocabulary size in SRI-LM limited?

Thanks a lot,

Mirjam


From stolcke at speech.sri.com  Tue Oct  8 08:52:48 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 08 Oct 2002 08:52:48 PDT
Subject: class LM 
In-Reply-To: Your message of Tue, 08 Oct 2002 14:34:37 +0200.
             <3DA2D0DD.AE6387DB@uni-mb.si> 
Message-ID: <200210081552.IAA10950@huge>


In message <3DA2D0DD.AE6387DB at uni-mb.si>you wrote:
> Andreas!
> 
> Thank you for your answers.
> 
> Few more questions:
> 
> 1.)
> I understand the transitions like:
> 
> [2gram]POSITION = 2 FROM: <504,NULL> TO: <756 504,NULL> WORD = primeri
> PROB = -1.76748 EXPANDPROB = 0.0106105
> 
> (504, 756 are classs),
> 
> but not the transitions like:
> 
> [OOV]POSITION = 2 FROM: <504,NULL> TO: <,NULL> WORD = primeri PROB =
> -inf
> 
> What does [OOV] mean? These transitions are not present in  the test
> example of the toolkit.

[OOV] means a word was not found even in the unigrams of your model.
The ClassNgram code handles LMs that contain both word and class ngrams.
It therefore always tries to also find an N-gram probability for each
word (without class lookup), and if you don't include all class member words
in your vocabulary when building the LM you will get this "OOV" condition.
But it is harmless since presumably all your words get some probability
by virtue of being members of some class.


> 2.) In which case  is the history string cleaned (FROM: <504,NULL> TO:
> <,NULL>) ?

When there are no histories in the LM that start with the given class
(504).  The history is kept only as long as it needs to be to compute
subsequent N-gram probabilities (so as to minimize the state space).

> 
> 3.) Is the vocabulary size in SRI-LM limited?

To the range of unsigned integers (2^32).

--Andreas


From jachym at kky.zcu.cz  Fri Oct 11 04:47:17 2002
From: jachym at kky.zcu.cz (Jáchym Kolář)
Date: Fri, 11 Oct 2002 13:47:17 +0200
Subject: Problem with language-specific characters in segment
Message-ID: <000c01c2711b$f461a1d0$3f2fe493@ui.kky.fav.zcu.cz>

Hi to all!
I have the following problem with the segment tool. In the output of
segment, the <unk> token appears instead of words that include
language-specific characters, although they are stored correctly in the
language model file and the input text file has the same encoding
(ISO Latin 2) as the training text.
Does anybody know what the problem is?

The language model was built using:
ngram-count -write-vocab vocabulary -text train2.txt -write probs -lm lmfile2

The segment tool was run with the options:
segment -lm lmfile2 -text test3.txt -unk -posteriors -continuous

With the -unk option disabled I get the right words in the output, but the
posteriors are probably not correct.

Jachym Kolar
Department of Cybernetics
University of West-Bohemia
Pilsen, Czech Republic


From stolcke at speech.sri.com  Sun Oct 13 08:20:53 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 13 Oct 2002 08:20:53 -0700
Subject: Problem with language-specific characters in segment
References: <000c01c2711b$f461a1d0$3f2fe493@ui.kky.fav.zcu.cz>
Message-ID: <3DA98F55.3000604@speech.sri.com>


Hi,

sorry to hear about the problems.  I think it has to do with the fact
that the locale is never set in segment.cc.  Try putting

    setlocale(LC_CTYPE, "");
    setlocale(LC_COLLATE, "");

right at the beginning of main() in segment.cc.  (This applies to
several other programs as well, and will be fixed in the next release.)

BTW, the -unk option only makes sense if your LM was trained with
instances of <unk> (or the ngram-count -unk option).  Otherwise unknown
words will get zero probability either way.

--Andreas



From iris_jing_2000 at yahoo.com  Fri Oct 25 10:47:07 2002
From: iris_jing_2000 at yahoo.com (Bing Jing)
Date: Fri, 25 Oct 2002 10:47:07 -0700 (PDT)
Subject: Q: probabilities calculation
In-Reply-To: <3DA98F55.3000604@speech.sri.com>
Message-ID: <20021025174707.57738.qmail@web12501.mail.yahoo.com>


Hello there,

Does anyone know how the SRI tool generates
unigram probabilities for words that do NOT
occur in the training transcript but are covered
by the training dictionary? From reading
NgramLM.cc, I thought all those words are
assigned a probability of LogP_Zero, but it
seems to me that the value varies between
different LMs.

I used two quite small sets of transcriptions to
train LMs, with the same training dictionary
(46K). The numbers of unique words in trans1 and trans2
are 620 and 700, respectively. For the words
that are covered by the lexicon but not in the training
transcripts, the unigram probabilities are -5.337341 and
-5.383736, respectively. I still can't figure out how
these two numbers are generated.

Thanks in advance!

Bing




From stolcke at speech.sri.com  Sun Oct 27 10:15:16 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 27 Oct 2002 10:15:16 -0800
Subject: Q: probabilities calculation
References: <20021025174707.57738.qmail@web12501.mail.yahoo.com>
Message-ID: <3DBC2D34.4050306@speech.sri.com>

Bing,

words with zero unigram counts can still get a non-zero probability as a
result of probability smoothing.  The discounting method applied to
unigrams will cause the total probability mass of the observed unigrams
to be less than one.  SRILM then effectively implements a backing off to
a "zero-gram" (uniform) distribution.  Since the DARPA format has no
provision for such a backoff this is done implicitly: If there is at
least one word with zero counts (sometimes called a "zeroton") then the
left-over unigram probability mass is distributed evenly over all
zeroton words.  If all words in the vocabulary had non-zero counts
(i.e., no zerotons) then the left-over probability is split evenly among
all words and added to the previously estimated unigram probabilities.

This is all implemented in Ngram::distributeProb(), which in turn is
invoked as part of the backoff weight normalization step.

So the short answer is that depending on the discounting method chosen
for unigrams, zerotons get some non-zero probability via backoff to a
uniform distribution.  If you want to disable that you just need to
disable unigram discounting (-gt1max 0).
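
The two cases described above can be sketched in a few lines of Python.
This is only an illustration of the stated rule, not a transcription of
Ngram::distributeProb(); the function name, the rounding threshold, and
the toy numbers are assumptions.

```python
def distribute_leftover(unigram_probs, vocab):
    """Spread left-over unigram mass uniformly.

    unigram_probs: discounted probabilities of the observed words.
    vocab: all vocabulary words, observed or not.
    """
    leftover = 1.0 - sum(unigram_probs.values())
    result = dict(unigram_probs)
    if leftover < 1e-12:
        return result  # no discounting, so nothing to distribute
    zerotons = [w for w in vocab if w not in unigram_probs]
    if zerotons:
        # Case 1: share the left-over mass among the unseen words only.
        share = leftover / len(zerotons)
        for w in zerotons:
            result[w] = share
    else:
        # Case 2: no unseen words, so add an equal share to every word.
        share = leftover / len(vocab)
        for w in vocab:
            result[w] += share
    return result

# Toy example: two observed words whose discounted mass sums to 0.8.
print(distribute_leftover({"a": 0.5, "b": 0.3}, ["a", "b", "c", "d"]))
```

Under this rule the probability of a zeroton depends on both the amount of
discounted mass and the number of zerotons, which is consistent with the
value differing between two LMs built over the same dictionary.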

I hope this answers your question.

--Andreas



From stolcke at speech.sri.com  Tue Nov  5 16:30:04 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 05 Nov 2002 16:30:04 PST
Subject: SRILM 1.3.2 
In-Reply-To: Your message of Tue, 05 Nov 2002 14:00:13 -0500.
             <3DC8153D.E75FFED2@crim.ca> 
Message-ID: <200211060030.QAA10980@huge>


In message <3DC8153D.E75FFED2 at crim.ca>you wrote:
> 
> Hi,
> 
> I did many tests to find the best suited language model for a given text
> with the "ngram" program with the -prune option and I maybe have
> discovered a bug with the OOV displayed in ngram.
> 
> With a command like:
> jarjar jfbeaumo/mlf> ngram -order 3 -vocab vocab20k.txt -unk -lm
> transtalk10.arpa -ppl test.txt
> file test.txt: 635 sentences, 9448 words, 0 OOVs
> 0 zeroprobs, logprob= -17926.9 ppl= 59.9706 ppl1= 78.9647
> 
> I am always ending with 0 OOV. The language model does contain the <unk>
> token. I supposed with a sufficient large value for -prune I will begin
> to get OOV word but it is fixed on 0. If I specified an empty vocabulary
> file, again, there is 0 OOV and I suppose this isn't correct. Maybe
> ngram is taking its vocabulary from the LM but then, there will be no
> use for the switch -vocab.
> 
> Can you help me? Did I miss something?
> 
> Best regards,
> 
> JF
> --
> Jean-François Beaumont - Agent de recherche (jfbeaumont at crim.ca)
> CRIM - 550, rue Sherbrooke Ouest Bureau 100 (www.crim.ca)
> Montréal (Québec) H3A 1B9  Tél.: 514.840-1235 #3625

Dear JF,

it is actually a feature (not a bug) that ngram -unk counts OOVs as regular
words.   They would only be counted as OOVs in the ppl output if the
LM did not contain the <unk> token, or if it had probability 0.
Of course whether this is what you expect is debatable.
You can get the OOV count you want by grepping the ngram -debug 2 -ppl output
for "p( <unk> | ".
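
The same count can be obtained with a few lines of Python instead of grep.
This is a sketch only: the function name is made up, and the sample lines
below are invented and merely approximate the per-word output, so check
them against what your version actually prints.

```python
def count_unk_tokens(debug_lines):
    """Count per-word lines in which <unk> is the predicted token."""
    marker = "p( <unk> | "
    return sum(1 for line in debug_lines if marker in line)

# Invented sample lines that only approximate the real output format.
sample = [
    "\tp( hello | <s> ) \t= [2gram] 0.01 [ -2 ]",
    "\tp( <unk> | hello ...) \t= [1gram] 0.001 [ -3 ]",
    "\tp( </s> | <unk> ...) \t= [1gram] 0.1 [ -1 ]",
]
print(count_unk_tokens(sample))
```

Including the leading "p( " in the search string matters: the third sample
line has <unk> only in the history, so it is not counted.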

--Andreas


From geetu at clsp.jhu.edu  Tue Nov 12 09:29:06 2002
From: geetu at clsp.jhu.edu (Geetu Ambwani)
Date: Tue, 12 Nov 2002 12:29:06 -0500 (EST)
Subject: Class Language Modelling
Message-ID: <Pine.GSO.4.05.10211121223360.4504-100000@dc02.clsp.jhu.edu>


Hi, 
I am trying to use the SRILM toolkit to calculate perplexity results for
the following language model: a regular trigram model
interpolated with the class model P(w0/CW0,CW1,CW2) * P(CW0/CW1,CW2), where
CW0, CW1 & CW2 are the equivalence classes for the predicted word and the 2
preceding words, respectively. I generated the equivalence classifications
for the words myself, and I want to know if it is possible to use the
toolkit to do the perplexity measurements if I input the class files as
data files. Can this be done at all? If any of you know how to do this,
please reply pointing out the relevant sections of the manual I should
look up.
Thanks a ton,
Geetu


From yangl at ecn.purdue.edu  Tue Nov 12 10:57:37 2002
From: yangl at ecn.purdue.edu (Yang Liu)
Date: Tue, 12 Nov 2002 13:57:37 -0500 (EST)
Subject: Class Language Modelling
In-Reply-To: <Pine.GSO.4.05.10211121223360.4504-100000@dc02.clsp.jhu.edu>
Message-ID: <Pine.GSO.4.33.0211121346470.1909-100000@min.ecn.purdue.edu>


Hi Geetu,
If your own class definitions are already in SRILM's
classes-format, then you can easily get the PP using the mixed LMs
(word based and class based) from 'ngram'.
Check the ngram manual for details.

I'm not sure I understand your question correctly.
If this does not help, then please wait for an answer from Andreas.

Regards.
Yang




From geetu at clsp.jhu.edu  Tue Nov 19 08:18:45 2002
From: geetu at clsp.jhu.edu (Geetu Ambwani)
Date: Tue, 19 Nov 2002 11:18:45 -0500 (EST)
Subject: Class Language Modelling
Message-ID: <Pine.GSO.4.21.0211191112300.13692-100000@c06.clsp.jhu.edu>


Suppose I wish to build a language model P(w0/CW0,CW1,CW2) where CW0, CW1
& CW2 are the equivalence classes for the predicted word and the 2
preceding words respectively, and I wish to use absolute discounting with a
fixed D. The input files I have available are (1) a trigram count file
(format: w0 w1 w2 count), (2) a vocab file, and (3) 3 class files (format:
classno word1 word2 ....) for the w0, w1 & w2 positions.
Can someone please tell me the syntax of the ngram-count command needed to
build a LM using this sort of class LM, as I am not sure I
understand it clearly.
Thanks,
Geetu
  

From stolcke at speech.sri.com  Tue Nov 19 20:20:01 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 19 Nov 2002 20:20:01 PST
Subject: Class Language Modelling 
In-Reply-To: Your message of Tue, 19 Nov 2002 11:18:45 -0500.
             <Pine.GSO.4.21.0211191112300.13692-100000@c06.clsp.jhu.edu> 
Message-ID: <200211200420.UAA24301@huge>


In message <Pine.GSO.4.21.0211191112300.13692-100000 at c06.clsp.jhu.edu>you wrote:
> 
> Suppose i wish to build a language model P(w0/CW0,CW1,CW2) where CW0, CW1
> & CW2 are the equivalence classes for the predicted word and the 2
> preceding words respectively amd i wish to use absolute discounting with a
> fixed D. The input files i have available are (1) a trigram count file
> (format - w0 w1 w2 count) (2) a vocab file (3) 3 class files in format
> classno word1 word2 ....) for w0, w1 & w2 positions . 
> Can someone please tell me the syntax of the ngram-count command needed to
> build a LM using this sort of a class LM as i am not very sure i
> understand it clearly.
> Thanks,
> Geetu

Geetu,

SRILM does not currently support class LMs with separate class membership
functions for the different positions in an N-gram.  All word positions
must share the same class definitions.

Under these constraints, we typically train class LM as follows:

1. prepare class definition file in the format described in the 
   classes-format(5) manual page.  this can be done by hand or from other
   knowledge sources, or automatically using word clustering algorithms
   (see ngram-class(1)).

   it is a bad idea to use plain numbers as class names.  when in doubt 
   use names like CLASS1, CLASS2, etc.  this avoids confusion in places where
   a file can be either a class name, word, or integer count.

2. condition the training data or counts to replace words with class labels,
   using the "replace-words-with-classes" filter (see training-scripts(1) 
   man page).

3. run ngram-count on the result of step 2.

Although multiple class definitions for different word positions are not 
supported by the above training procedure, or the LM evaluation code,
there is a fairly straightforward way to fake it.
I'm assuming now that classes expand to exactly one word at a time,
and that a word has a unique class in a given ngram position.

You need to write a filter that maps word ngram counts to
class ngram counts (w1 w2 w3 N -> c1 c2 c3 N, and similarly for unigrams and
bigrams).  Then you can train and evaluate your class LM by operating on
counts rather than text.
To train:

	ngram-count -text DATA -write - | word-to-class-filter | \
	ngram-count -read - -lm LM [smoothing-options]

Similarly, you can map the test data to counts, filter them, and use the
ngram -counts option to compute perplexities and log probabilities from
counts.

There is one detail in LM estimation: you need to prevent class labels that
can only occur in the history portion of an ngram from receiving backoff
probability mass as a result of smoothing.  You can accomplish that
by listing those not-to-be-predicted classes in a file, and specifying
them with the ngram-count -nonevents option.  See the man page for
details.

You also need to keep track of the probabilities incurred
for replacing a word by its class for each word in the test set
(the filter script could do that as a side effect), and add the log
probability for class expansions to the log probability for
class ngrams.
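
A possible shape for such a filter is sketched below in Python. Everything
in it is hypothetical: "word-to-class-filter" is not an SRILM program, and
the function names, the class labels, and the choice to align positions
from the right (so the last word of every ngram is the predicted position)
are assumptions made for the example. It expects count lines of the form
"w1 w2 w3<TAB>N".

```python
import math
from collections import Counter

def map_counts_to_classes(count_lines, class_maps):
    """Map word ngram counts to class ngram counts.

    class_maps[d] maps a word to its class when the word sits d
    positions before the predicted word (d = 0 is the predicted word).
    Tokens without an entry (e.g. <s>, </s>) are passed through as-is.
    """
    class_counts = Counter()
    for line in count_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        *words, count = fields
        n = len(words)
        classes = tuple(
            class_maps[n - 1 - k].get(word, word)
            for k, word in enumerate(words)
        )
        # Different word ngrams can map to the same class ngram.
        class_counts[classes] += int(count)
    return class_counts

def expansion_logprob(predicted_words, membership_probs):
    """Sum of log10 P(word | class) over the predicted words."""
    return sum(math.log10(membership_probs[w]) for w in predicted_words)

# Invented class maps for distances 0, 1 and 2, and invented counts.
class_maps = [{"c": "X0", "d": "X0"}, {"b": "Y1"}, {"a": "Z2"}]
lines = ["a b c\t2", "a b d\t3", "b c\t1", "c\t4"]
for classes, n in sorted(map_counts_to_classes(lines, class_maps).items()):
    print(" ".join(classes) + "\t" + str(n))
```

The value returned by expansion_logprob would be added to the class-ngram
log probability when scoring test data, as described above.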

hope this helps,

--Andreas


From David.Mas at limsi.fr  Wed Nov 27 06:45:17 2002
From: David.Mas at limsi.fr (David Mas)
Date: Wed, 27 Nov 2002 15:45:17 +0100
Subject: Memory issues
Message-ID: <3DE4DA7D.9662ED5E@limsi.fr>

Hi,

I'm a French PhD student, using the toolkit to compute ngram and
class-ngram models on Hub4 and Hub5 data.

I recently tried to mix several models with ngram -mix-lm, which works
fine except for big models (trained on Hub4).

It seems to be a matter of memory, so I used the -memuse option to get an
idea of the memory load.

But this option doesn't reflect the actual memory load. It reports
900M, while running top on the same machine shows about 2.5G in use.

So my 2 questions are:
- is it normal that the -memuse option gives a wrong result?
- is it normal that the toolkit uses so much memory, or have I done
something wrong in the installation?

Any help is welcome.

David Mas

-- 
David Mas
LIMSI/CNRS, groupe TLP
Tel : 01 69 85 80 05
http://www.limsi.fr/Individu/mas/


From stolcke at speech.sri.com  Wed Nov 27 10:25:49 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 27 Nov 2002 10:25:49 PST
Subject: Memory issues 
In-Reply-To: Your message of Wed, 27 Nov 2002 15:45:17 +0100.
             <3DE4DA7D.9662ED5E@limsi.fr> 
Message-ID: <200211271825.KAA29413@huge>


In message <3DE4DA7D.9662ED5E at limsi.fr>you wrote:
> Hi,
> 
> I'm a french PhD Student, using the toolkit to compute ngram and
> class-ngram models on Hub4 and Hub5 data.
> 
> I recently tried to mix several models with ngram -mix-lm, which works
> fine except for big models (learned on Hub4).
> 
> It seems to be matter of memory. So I used the -memuse option to have an
> idea of the memory load.
> 
> But this option doesn't reflect the actual load of the memory. It says
> 900M when a top running of the same machine gives a amount a 2,5G used.

That's because -memuse only calculates the memory used by the final model.
For static interpolation with -mix-lm the program needs to temporarily 
allocate both the input models and the resulting mixture model, so 2.5 GB
doesn't sound too outlandish.

(I know one could implement this operation without requiring all models
to be fully in memory, but i preferred to keep the code simple.)

> So my 2 questions are :
> - is it normal that the -memuse option gives a wrong result ?

see above.

> - is it normal that the toolkit use so much memory, or have I done
> something wrong in the installation ?

The default build optimizes data structures for speed, not space.
that's why you see a significant portion of memory "wasted" (according to
-memuse output).  That's the extra space needed to keep hash tables sparse.

As of SRILM version 1.3.2, you can build a separate version of the binaries
optimized for space, and that's usually worth it once you start dealing with
Hub4 ;-)  Follow the instructions under item 9 in the INSTALL file.

--Andreas


From valsan at sony.de  Tue Dec  3 06:13:11 2002
From: valsan at sony.de (Valsan, Zica)
Date: Tue, 3 Dec 2002 15:13:11 +0100 
Subject: perplexity evaluation
Message-ID: <B0793DB946E52942A49C1E8152A1358C8E3781@leo.wins.fb.sony.de>

Hi all, 

I'm a new user of the toolkit and I need a little support in order to
understand how the perplexity is computed and why it is different from the
expected value.

For instance, I have the training data in the file train.text, which contains
only one line:
<s> a b c </s>
and the vocabulary (train.vocab), which contains all these words. I want
to generate a unigram-only LM and evaluate it on the same
training data. I don't want any discounting strategy to be applied.
Here are the commands I used:

ngram-count -order 1 -vocab train.vocab -text train.text -lm lm.arpa gt1max
0
ngram -lm out.arpa -debug 2 -vocab train.vocab -ppl train.text > out.ppl


So, according to the theory, the expected value for perplexity is PP=3 if
the context cues are not taken into account. This is also what one gets
using the CMU toolkit.
Using this toolkit and the above commands, what I actually got is PP=4.
Looking inside the created ARPA model, I could see that </s> has the
same probability as any of the real words (a, b, c).
Could anybody explain to me why this is so? Did I make a mistake, or is
there something I am missing?

Thank you in advance for your support, 
Zica


From stolcke at speech.sri.com  Tue Dec  3 08:48:13 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 03 Dec 2002 08:48:13 PST
Subject: perplexity evaluation 
In-Reply-To: Your message of Tue, 03 Dec 2002 15:13:11 +0100.
             <B0793DB946E52942A49C1E8152A1358C8E3781@leo.wins.fb.sony.de> 
Message-ID: <200212031648.IAA22088@huge>


In message <B0793DB946E52942A49C1E8152A1358C8E3781 at leo.wins.fb.sony.de>you wrote:
> Hi all, 
> 
> I'm a new user of the toolkit and I need a little bit support in order to
> understand how the perplexity is computed and why it is different from the
> expected value.
> 
> For instance, I have the training data in the file train.text that contain
> only a line:
> <s> a b c </s>
> and the vocabulary (train.vocab) that contains all these words, and I want
> to generate a LM based on unigram only and to evaluate it on the same
> training data. I don't want any discounting strategy to be applied. 
> Here are the commands I used:
> 
> ngram-count -order 1 -vocab train.vocab -text train.text -lm lm.arpa gt1max
> 0
> ngram -lm out.arpa -debug 2 -vocab train.vocab -ppl train.text > out.ppl
> 
> 
> So, according to the theory, the expected value for perplexity is PP=3 if
> the context cues are not taken into account. This is also what one can get
> using CMU toolkit. 
> Using this toolkit and the above commands what I've got actually, is PP=4.
> Looking inside of the created arpa model , I could see that </s> has the
> same probability as any of the real word (a, b,c). 
> Does anybody could explain me why is like this? Did I make a mistake or is
> something that miss me?

You didn't make a mistake and this is the right answer as far as I can tell.
</s> needs to get a probability in order to be able to compute 
a probability for the whole "sentence".
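
The arithmetic behind the two numbers can be checked with a short Python
sketch. It assumes undiscounted unigram estimates in which <s> is never
predicted, so the four predicted tokens a, b, c and </s> each get
probability 1/4; the function name is invented for illustration.

```python
import math

def unigram_perplexity(tokens, probs):
    """Perplexity of a unigram model over the predicted tokens."""
    logprob = sum(math.log10(probs[t]) for t in tokens)
    return 10 ** (-logprob / len(tokens))

# </s> is predicted like any other token: four events, each with p = 1/4.
with_eos = unigram_perplexity(
    ["a", "b", "c", "</s>"],
    {"a": 0.25, "b": 0.25, "c": 0.25, "</s>": 0.25},
)

# Context cues ignored entirely: three events, each with p = 1/3.
without_eos = unigram_perplexity(
    ["a", "b", "c"],
    {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3},
)

print(with_eos, without_eos)  # about 4 and about 3
```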

Are you saying that the CMU software doesn't give any probability to </s>?
That would be quite odd.

Maybe someone on this list who is more familiar with the CMU toolkit can
contribute an explanation.

--Andreas


From valsan at sony.de  Wed Dec  4 00:21:31 2002
From: valsan at sony.de (Valsan, Zica)
Date: Wed, 4 Dec 2002 09:21:31 +0100 
Subject: perplexity evaluation 
Message-ID: <B0793DB946E52942A49C1E8152A1358C8E3782@leo.wins.fb.sony.de>


Thank you for your prompt answer.
I have understood that </s> is taken into account, but the question is why
only it and not the other one (<s>), too? I have read papers where people
resort to this strategy (choosing only one), but the reason why they do so
is not clear to me.

Regarding the CMU toolkit, I did not say it doesn't output any probabilities
for these context cues, but it outputs the same small value for each of
them (-98.999, very close to the values output by the SRILM toolkit). This is
somehow "equivalent" to saying they are not taken into account for
perplexity computation, I think.

Regards, 
Zica


-----Original Message-----
From: Andreas Stolcke [mailto:stolcke at speech.sri.com]
Sent: Dienstag, 3. Dezember 2002 17:48
To: Valsan, Zica
Cc: 'srilm-user at speech.sri.com'
Subject: Re: perplexity evaluation 


In message <B0793DB946E52942A49C1E8152A1358C8E3781 at leo.wins.fb.sony.de>you wrote:
> Hi all, 
> 
> I'm a new user of the toolkit and I need a little support to understand
> how the perplexity is computed and why it differs from the expected
> value.
> 
> For instance, I have training data in the file train.text that contains
> only one line:
> <s> a b c </s>
> and a vocabulary (train.vocab) that contains all these words. I want to
> generate a unigram-only LM and evaluate it on the same training data.
> I don't want any discounting strategy to be applied.
> Here are the commands I used:
> 
> ngram-count -order 1 -vocab train.vocab -text train.text -lm lm.arpa -gt1max 0
> ngram -lm out.arpa -debug 2 -vocab train.vocab -ppl train.text > out.ppl
> 
> 
> According to the theory, the expected perplexity is PP=3 if the context
> cues are not taken into account. This is also what one gets with the CMU
> toolkit.
> With this toolkit and the above commands, what I actually get is PP=4.
> Looking inside the generated ARPA model, I can see that </s> has the same
> probability as any of the real words (a, b, c).
> Could anybody explain why this is so? Did I make a mistake, or is there
> something I am missing?

You didn't make a mistake and this is the right answer as far as I can tell.
</s> needs to get a probability in order to be able to compute 
a probability for the whole "sentence".

Are you saying that the CMU software doesn't give any probability to </s>?
That would be quite odd.

Maybe someone on this list who is more familiar with the CMU toolkit can
contribute an explanation.

--Andreas


From melis at cs.utwente.nl  Tue Dec 17 05:38:52 2002
From: melis at cs.utwente.nl (Paul Melis)
Date: Tue, 17 Dec 2002 14:38:52 +0100
Subject: Unexpected "ngram-count -recompute" result
Message-ID: <20021217143852.A7495@luistervink.cs.utwente.nl>

Hello,

We just noticed the following when using the -recompute flag of ngram-count. We're just trying to generate uni- and bigram counts from trigram counts, but some are missing:

[1 - directly summing uni-, bi- and trigram counts of a simple text file]

melis at luistervink:/local/export/melis/lm> cat t
<s> this is a test </s>

melis at luistervink:/local/export/melis/lm> ngram-count -text t -sort
</s>    1
<s>     1
<s> this        1
<s> this is     1
a       1
a test  1
a test </s>     1
is      1
is a    1
is a test       1
test    1
test </s>       1
this    1
this is 1
this is a       1

[2 - only summing trigram counts]

melis at luistervink:/local/export/melis/lm> ngram-count -text t -write-order 3 -sort
<s> this is     1
a test </s>     1
is a test       1
this is a       1

[3 - using the previous trigram counts to generate uni- and bigram counts]

melis at luistervink:/local/export/melis/lm> ngram-count -text t -write-order 3 -sort | ngram-count -recompute -sort -read -
<s>     1
<s> this        1
<s> this is     1
a       1
a test  1
a test </s>     1
is      1
is a    1
is a test       1
this    1
this is 1
this is a       1

We expected the output of 1 and 3 to be the same, but notice the missing unigrams "</s>" and "test". Also, the bigram "test </s>" is missing. 
Is this a bug, or is there something we're missing here? It seems to be related to the end of sentence symbol. 
This is with SRILM 1.3.2, BTW.

Regards,
Paul

-- 
melis at cs.utwente.nl


From mirjam.sepesy at uni-mb.si  Wed Dec 18 04:58:41 2002
From: mirjam.sepesy at uni-mb.si (Mirjam Sepesy Maucec)
Date: Wed, 18 Dec 2002 13:58:41 +0100
Subject: missing counts
Message-ID: <3E007101.7A413D18@uni-mb.si>

Hi,

I have the following problem.

The n-gram counts are computed from raw text corpus by using
'ngram-count' and  'ngram-merge'.
I experiment with different vocabularies and bigram and trigram models.
In each experiment I run again 'ngram-count -vocab -order' and make the
language model with ' make-big-lm -trust-totals'.
I test language models on my test set and noticed some mistakes. Some
bigrams, which are present in the bigram model get lost in the trigram
model. When I omit the -trust-totals option, the results are OK.
Why should I not trust the totals in my case?  Are the counts of
different orders made by 'ngram-count' and  'ngram-merge' not in line?

Regards,

Mirjam.

From stolcke at speech.sri.com  Wed Dec 18 22:21:20 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 18 Dec 2002 22:21:20 PST
Subject: missing counts 
In-Reply-To: Your message of Wed, 18 Dec 2002 13:58:41 +0100.
             <3E007101.7A413D18@uni-mb.si> 
Message-ID: <200212190621.WAA01439@huge>


In message <3E007101.7A413D18 at uni-mb.si>you wrote:
> 
> Hi,
> 
> I have the following problem.
> 
> The n-gram counts are computed from raw text corpus by using
> 'ngram-count' and  'ngram-merge'.
> I experiment with different vocabularies and bigram and trigram models.
> In each experiment I run again 'ngram-count -vocab -order' and make the
> language model with ' make-big-lm -trust-totals'.
> I test language models on my test set and noticed some mistakes. Some
> bigrams, which are present in the bigram model get lost in the trigram
> model. When I omit the -trust-totals option, the results are OK.
> Why should I not trust the totals in my case?  Are the counts of
> different orders made by 'ngram-count' and  'ngram-merge' not in line?
> 
> Regards,
> 
> Mirjam.

This is indeed a little strange. However, the -trust-totals option
is obsolete, as it does not interact well with some discounting 
methods (e.g., KN).  It was always a hack, and the latest version of
make-big-lm uses a different strategy for saving memory on ngrams discarded by
cutoffs (the ngram-count -meta-tag and -read-with-mincounts options,
see the man page).

Still, if you can reduce your problem to a small test case I could look
at it to understand exactly what's going on.

--Andreas


From mirjam.sepesy at uni-mb.si  Fri Dec 20 00:12:33 2002
From: mirjam.sepesy at uni-mb.si (Mirjam Sepesy Maucec)
Date: Fri, 20 Dec 2002 09:12:33 +0100
Subject: missing counts
References: <200212190621.WAA01439@huge>
Message-ID: <3E02D0F1.FF6EBC23@uni-mb.si>

Andreas Stolcke wrote:

> In message <3E007101.7A413D18 at uni-mb.si>you wrote:
> >
> > Hi,
> >
> > I have the following problem.
> >
> > The n-gram counts are computed from raw text corpus by using
> > 'ngram-count' and  'ngram-merge'.
> > I experiment with different vocabularies and bigram and trigram models.
> > In each experiment I run again 'ngram-count -vocab -order' and make the
> > language model with ' make-big-lm -trust-totals'.
> > I test language models on my test set and noticed some mistakes. Some
> > bigrams, which are present in the bigram model get lost in the trigram
> > model. When I omit the -trust-totals option, the results are OK.
> > Why should I not trust the totals in my case?  Are the counts of
> > different orders made by 'ngram-count' and  'ngram-merge' not in line?
> >
> > Regards,
> >
> > Mirjam.
>
> This is indeed a little strange. However, the -trust-totals option
> is obsolete, as it does not interact well with some discounting
> methods (e.g., KN).  It was always a hack, and the latest version of
> make-big-lm uses a different strategy for saving memory on ngrams discarded by
> cutoffs (the ngram-count -meta-tag and -read-with-mincounts options,
> see the man page).
>
> Still, if you can reduce your problem to a small test case I could look
> at it to understand exactly what's going on.
>
> --Andreas

Thank you for answering so quickly.
You are right. I used KN discounting.  I see, it's time to switch from
version 1.3.1 to 1.3.2.
I will report the results.

Have nice holidays!

Mirjam



From stolcke at speech.sri.com  Fri Dec 20 00:54:21 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 20 Dec 2002 00:54:21 PST
Subject: missing counts 
In-Reply-To: Your message of Fri, 20 Dec 2002 09:12:33 +0100.
             <3E02D0F1.FF6EBC23@uni-mb.si> 
Message-ID: <200212200854.AAA21640@huge>


In message <3E02D0F1.FF6EBC23 at uni-mb.si>you wrote:
> Andreas Stolcke wrote:
> 
> > In message <3E007101.7A413D18 at uni-mb.si>you wrote:
> > >
> > > Hi,
> > >
> > > I have the following problem.
> > >
> > > The n-gram counts are computed from raw text corpus by using
> > > 'ngram-count' and  'ngram-merge'.
> > > I experiment with different vocabularies and bigram and trigram models.
> > > In each experiment I run again 'ngram-count -vocab -order' and make the
> > > language model with ' make-big-lm -trust-totals'.
> > > I test language models on my test set and noticed some mistakes. Some
> > > bigrams, which are present in the bigram model get lost in the trigram
> > > model. When I omit the -trust-totals option, the results are OK.
> > > Why should I not trust the totals in my case?  Are the counts of
> > > different orders made by 'ngram-count' and  'ngram-merge' not in line?
> > >
> > > Regards,
> > >
> > > Mirjam.
> >
> > This is indeed a little strange. However, the -trust-totals option
> > is obsolete, as it does not interact well with some discounting
> > methods (e.g., KN).  It was always a hack, and the latest version of
> > > make-big-lm uses a different strategy for saving memory on ngrams discarded by
> > cutoffs (the ngram-count -meta-tag and -read-with-mincounts options,
> > see the man page).
> >
> > Still, if you can reduce your problem to a small test case I could look
> > at it to understand exactly what's going on.
> >
> > --Andreas
> 
> Thank you for answering so quickly.
> You are right. I used KN discounting.  I see, it's time to switch from
> version 1.3.1 to 1.3.2.
> I will report the results.

And of course KN discounting modifies the lower-order counts, so with a
cutoff > 1 you might lose ngrams because, after the KN method is applied,
their counts fall below the cutoff.  This is consistent with your observation
that a bigram is not in the trigram model while it is in the bigram model.
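
A toy illustration of this effect, in Python. It is a simplified sketch and
not SRILM's implementation: it assumes the modified lower-order count of a
bigram is the number of distinct words seen immediately before it, ignores
sentence boundaries, and uses a made-up corpus, a made-up cutoff value and
hypothetical helper names.

```python
from collections import Counter, defaultdict

def raw_counts(tokens, n):
    """Plain occurrence counts of all n-grams of length n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def continuation_counts(tokens, n):
    """Number of distinct words seen immediately before each n-gram."""
    left_contexts = defaultdict(set)
    for i in range(1, len(tokens) - n + 1):
        left_contexts[tuple(tokens[i:i + n])].add(tokens[i - 1])
    return Counter({ng: len(ctx) for ng, ctx in left_contexts.items()})

# made-up corpus in which "b c" is frequent but always preceded by "a"
tokens = "a b c x a b c y a b c".split()
min_count = 2                    # assumed cutoff: keep n-grams with count >= 2
bigram = ("b", "c")

raw = raw_counts(tokens, 2)               # what a bigram model would use
modified = continuation_counts(tokens, 2) # bigram level of a trigram KN model

kept_in_bigram_model = raw[bigram] >= min_count
kept_in_trigram_model = modified[bigram] >= min_count

print(raw[bigram], modified[bigram])
print(kept_in_bigram_model, kept_in_trigram_model)
```

Here "b c" occurs three times but only ever after "a", so its modified count
is 1: it passes the cutoff as a highest-order bigram and fails it as a
lower-order bigram inside the trigram model.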

--Andreas


From stolcke at speech.sri.com  Fri Dec 20 02:03:18 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 20 Dec 2002 02:03:18 PST
Subject: Unexpected "ngram-count -recompute" result 
In-Reply-To: Your message of Tue, 17 Dec 2002 14:38:52 +0100.
             <20021217143852.A7495@luistervink.cs.utwente.nl> 
Message-ID: <200212201003.CAA23504@huge>


In message <20021217143852.A7495 at luistervink.cs.utwente.nl>you wrote:
> Hello,
> 
> We just noticed the following when using the -recompute flag of ngram-count.
> We're just trying to generate uni- and bigram counts from trigram counts, but
> some are missing:
> 
> [1 - directly summing uni-, bi- and trigram counts of a simple text file]
> 
> melis at luistervink:/local/export/melis/lm> cat t
> <s> this is a test </s>
> 
> melis at luistervink:/local/export/melis/lm> ngram-count -text t -sort
> </s>    1
> <s>     1
> <s> this        1
> <s> this is     1
> a       1
> a test  1
> a test </s>     1
> is      1
> is a    1
> is a test       1
> test    1
> test </s>       1
> this    1
> this is 1
> this is a       1
> 
> [2 - only summing trigram counts]
> 
> melis at luistervink:/local/export/melis/lm> ngram-count -text t -write-order 3 -sort
> <s> this is     1
> a test </s>     1
> is a test       1
> this is a       1
> 
> [3 - using the previous trigram counts to generate uni- and bigram counts]
> 
> melis at luistervink:/local/export/melis/lm> ngram-count -text t -write-order 3 -sort | ngram-count -recompute -sort -read -
> <s>     1
> <s> this        1
> <s> this is     1
> a       1
> a test  1
> a test </s>     1
> is      1
> is a    1
> is a test       1
> this    1
> this is 1
> this is a       1
> 
> We expected the output of 1 and 3 to be the same, but notice the missing
> unigrams "</s>" and "test". Also, the bigram "test </s>" is missing. 
> Is this a bug, or is there something we're missing here? It seems to be
> related to the end of sentence symbol. 
> This is with SRILM 1.3.2, BTW.
> 
> Regards,
> Paul
> 

It's a bug of sorts, or a feature depending on your point of view.

Because </s> is not followed by anything, discarding unigrams and bigrams
ending in </s> will in fact discard information that is not contained
in the trigrams.  I'm not sure why you are doing what you describe,
but a quick solution would be to introduce "dummy" N-grams that 
complete the ngrams ending in </s> to the full length of the counts 
you want to keep.  The little script below does that.
If you call it "complete-eos-ngrams" then

ngram-count -text t -write - | \
complete-eos-ngrams | \
ngram-count -read - -write-order 3 | \
ngram-count -recompute -sort -read - 

will produce the output you expect.
Alternatively, you could tack dummy words onto the end of your input 
sentences.  In either case you have to delete the dummy ngrams from the 
final output.
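
Why the information is lost can be seen in a small Python sketch. It is not
SRILM code: it assumes that -recompute rebuilds each lower-order count by
summing the highest-order counts over the N-gram's prefixes, and the function
names are invented for the illustration.

```python
from collections import Counter

def count_ngrams(tokens, max_order=3):
    """Count all n-grams of length 1..max_order in one sentence."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def recompute_from_highest(counts, order=3):
    """Keep only the order-grams and rebuild shorter counts from their prefixes."""
    highest = {ng: c for ng, c in counts.items() if len(ng) == order}
    rebuilt = Counter(highest)
    for ngram, c in highest.items():
        for n in range(1, order):
            rebuilt[ngram[:n]] += c
    return rebuilt

tokens = "<s> this is a test </s>".split()
full = count_ngrams(tokens)
recomputed = recompute_from_highest(full)

# n-grams that are never a prefix of any trigram cannot be recovered
missing = sorted(" ".join(ng) for ng in set(full) - set(recomputed))
print(missing)
```

The three N-grams reported as missing are exactly those that never occur as
the prefix of a trigram, because nothing follows </s>.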

--Andreas

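#!/usr/local/bin/gawk -f
#
# complete-eos-ngrams: pad count lines whose N-gram ends in </s> with
# DUMMY words, so that they survive as full-length N-grams.

BEGIN {
	order = 3;	# length of the highest-order N-grams to keep
}

# pass every input line through unchanged
{
	print;
}

# the last word (the field before the count) is </s>
$(NF - 1) == "</s>" { 
	count = $NF;

	# overwrite the count field with DUMMY, then keep appending DUMMY
	# until the N-gram has "order" words; print each padded version
	for (i = NF; i <= order; i ++) {
		$i = "DUMMY"; 
		print $0, count;
	}
}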
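
The same idea can be checked end to end in Python. This is a sketch only, not
a replacement for the gawk script: it mimics the padding step in memory,
again assumes that -recompute sums highest-order counts over prefixes, and
all function names are hypothetical.

```python
from collections import Counter

ORDER = 3
DUMMY = "DUMMY"

def count_ngrams(tokens, max_order=ORDER):
    """Count all n-grams of length 1..max_order in one sentence."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def pad_eos(counts, order=ORDER):
    """Add DUMMY-padded copies of every n-gram that ends in </s>."""
    padded = Counter(counts)
    for ngram, c in counts.items():
        if ngram[-1] == "</s>":
            for n in range(len(ngram) + 1, order + 1):
                padded[ngram + (DUMMY,) * (n - len(ngram))] += c
    return padded

def recompute_from_highest(counts, order=ORDER):
    """Keep only the order-grams and rebuild shorter counts from their prefixes."""
    highest = {ng: c for ng, c in counts.items() if len(ng) == order}
    rebuilt = Counter(highest)
    for ngram, c in highest.items():
        for n in range(1, order):
            rebuilt[ngram[:n]] += c
    return rebuilt

def strip_dummy(counts):
    """Delete every n-gram that contains the DUMMY word."""
    return Counter({ng: c for ng, c in counts.items() if DUMMY not in ng})

full = count_ngrams("<s> this is a test </s>".split())
recovered = strip_dummy(recompute_from_highest(pad_eos(full)))
print(recovered == full)
```

For the one-sentence example, the counts recovered after padding,
recomputing and deleting the dummy N-grams are identical to the counts taken
directly from the text.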