
missing counts

From: Mirjam Sepesy Maucec <mirjam.sepesy at ADDRESS HIDDEN>
Date: Wed, 18 Dec 2002 13:58:41 +0100


Hi,

I have the following problem.

I compute the n-gram counts from a raw text corpus using 'ngram-count'
and 'ngram-merge'.
I am experimenting with different vocabularies and with bigram and trigram
models. For each experiment I rerun 'ngram-count -vocab -order' and build the
language model with 'make-big-lm -trust-totals'.
When I tested the language models on my test set, I noticed some errors: some
bigrams that are present in the bigram model are missing from the trigram
model. When I omit the -trust-totals option, the results are OK.
Why should I not trust the totals in my case? Are the counts of different
orders produced by 'ngram-count' and 'ngram-merge' inconsistent with each
other?

Regards,

Mirjam.
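
One way to look into the question above is to check the merged counts directly. The following is only a minimal sketch and not part of SRILM: it assumes a text count file with one n-gram per line and the count as the last whitespace-separated field, and it assumes that "in line" means each lower-order count equals the sum of the counts of its one-word extensions. The function names parse_counts and find_mismatches are hypothetical, and n-grams ending in the sentence-end token are skipped because they have no extensions.

```python
from collections import defaultdict


def parse_counts(lines):
    """Parse lines like 'w1 w2 ... wN<TAB>count' into {ngram_tuple: count}.

    Assumption: the count is the last whitespace-separated field.
    """
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[tuple(fields[:-1])] = int(fields[-1])
    return counts


def find_mismatches(counts, order, end_token="</s>"):
    """Compare each (order-1)-gram count with the summed counts of the
    order-grams that extend it by one word.

    Returns {ngram: (stated_count, summed_count)} for every disagreement.
    """
    summed = defaultdict(int)
    for ngram, count in counts.items():
        if len(ngram) == order:
            summed[ngram[:-1]] += count

    mismatches = {}
    for ngram, count in counts.items():
        # N-grams ending in the sentence-end token cannot be extended.
        if len(ngram) == order - 1 and ngram[-1] != end_token:
            total = summed.get(ngram, 0)
            if total != count:
                mismatches[ngram] = (count, total)
    return mismatches


# Made-up toy data, only to show the check; not taken from any real corpus.
sample = [
    "a b\t3",
    "a b c\t2",
    "a b d\t1",
    "b c\t2",
    "b c e\t1",
    "c </s>\t1",
]
result = find_mismatches(parse_counts(sample), order=3)
print(result)
```

In the toy data the bigram "b c" is stated as 2 but its trigram extensions sum to 1, so it is the only entry reported. On a real count file, any reported entries would be places where the lower-order totals and the higher-order counts disagree.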

