Andreas Stolcke wrote:
> In message <3E007101.7A413D18 at ADDRESS HIDDEN> you wrote:
> >
> > Hi,
> >
> > I have the following problem.
> >
> > The n-gram counts are computed from a raw text corpus using
> > 'ngram-count' and 'ngram-merge'.
> > I experiment with different vocabularies and with bigram and trigram models.
> > In each experiment I rerun 'ngram-count -vocab -order' and build the
> > language model with 'make-big-lm -trust-totals'.
> > When I tested the language models on my test set, I noticed some errors:
> > some bigrams that are present in the bigram model are missing from the
> > trigram model. When I omit the -trust-totals option, the results are OK.
> > Why should I not trust the totals in my case? Are the counts of
> > different orders produced by 'ngram-count' and 'ngram-merge' inconsistent?
> >
> > Regards,
> >
> > Mirjam.
>
> This is indeed a little strange. However, the -trust-totals option
> is obsolete, as it does not interact well with some discounting
> methods (e.g., KN). It was always a hack, and the latest version of
> make-big-lm uses a different strategy for saving memory on ngrams discarded by
> cutoffs (the ngram-count -meta-tag and -read-with-mincounts options,
> see the man page).
>
> Still, if you can reduce your problem to a small test case I could look
> at it to understand exactly what's going on.
>
> --Andreas
Thank you for answering so quickly.
You are right: I used KN discounting. I see it's time to switch from
version 1.3.1 to 1.3.2.
I will report the results.
Have a nice holiday!
Mirjam

Mirjam Sepesy Maucec, PhD
Faculty of Electrical Engineering and Computer Science, Smetanova 17, 2000 Maribor
Phone: ++386 (0)2 220-7225
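
The question raised in this thread, whether the counts of different orders are "in line", can be checked directly on a count file before building a model. The following is only a rough sketch, not part of SRILM: it assumes each line of the count file holds the n-gram words followed by the count as the last whitespace-separated field, and the function names parse_counts and inconsistent_prefixes are made up for illustration. It reports every (N-1)-gram whose own count is missing or smaller than the summed counts of the N-grams that extend it.

```python
from collections import defaultdict


def parse_counts(lines):
    """Parse lines of the assumed form 'w1 w2 ... count' into {ngram_tuple: count}."""
    counts = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        *words, count = parts
        counts[tuple(words)] = int(count)
    return counts


def inconsistent_prefixes(counts, order):
    """Return {prefix: (own_count, extension_total)} for every prefix of
    length order-1 whose own count is missing (None) or smaller than the
    summed counts of its order-length extensions."""
    totals = defaultdict(int)
    for ngram, count in counts.items():
        if len(ngram) == order:
            totals[ngram[:-1]] += count

    problems = {}
    for prefix, total in totals.items():
        own = counts.get(prefix)
        if own is None or own < total:
            problems[prefix] = (own, total)
    return problems


# Toy data, invented for illustration only.
sample = [
    "a b\t3",
    "a b c\t2",
    "a b d\t1",
    "b c\t1",
    "b c e\t2",
    "x y z\t1",
]

counts = parse_counts(sample)
problems = inconsistent_prefixes(counts, 3)
for prefix, (own, total) in sorted(problems.items()):
    print(" ".join(prefix), "own count:", own, "sum of trigram counts:", total)
```

A non-empty result would suggest that the bigram and trigram counts were not derived from the same data, vocabulary, or cutoffs, which is the kind of mismatch that trusting lower-order totals cannot tolerate. It does not by itself explain the behaviour described above.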