[Fwd: Ask about the practical usage of SRILM for Machine Translation]

From: Cuong Huy To <cuong at ADDRESS HIDDEN>
Date: Thu, 03 Aug 2006 14:09:49 +0200

Hi all,
I forgot to mention the result for my best combination of options so far:

Given this training text of 512,000 sentences and my test set of
2,000 sentences (57,951 words), the best combination I have found so far
among the LMs with order=7 is:
-order 7 -kndiscount 1 -kndiscount 2 -kndiscount 3 -kndiscount 4
-kndiscount 5 -kndiscount 6 -kndiscount 7 -interpolate

logprob = -107526, ppl = 63.4007, ppl1 = 73.214
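
For reference, the two perplexity figures can be reproduced from the log probability, assuming the definitions given in the SRILM documentation: ppl normalises by the scored words plus one end-of-sentence token per sentence, ppl1 by the scored words only. The OOV count is not stated in this message; the value 284 below is an assumption, picked because it makes both reported figures come out consistently, and zero-probability words are assumed to be absent.

```python
def perplexities(logprob, words, sentences, oovs=0, zeroprobs=0):
    """Return (ppl, ppl1) from a base-10 log probability, following the
    definitions used for the output of SRILM's `ngram -ppl`."""
    scored_words = words - oovs - zeroprobs
    # ppl also counts one end-of-sentence token per sentence.
    ppl = 10 ** (-logprob / (scored_words + sentences))
    # ppl1 leaves the end-of-sentence tokens out of the normalisation.
    ppl1 = 10 ** (-logprob / scored_words)
    return ppl, ppl1


# logprob, words and sentences are from the message; oovs=284 is inferred,
# not reported.
ppl, ppl1 = perplexities(logprob=-107526, words=57951, sentences=2000, oovs=284)
print(f"ppl = {ppl:.2f}, ppl1 = {ppl1:.2f}")
```

Small differences in the last digits are expected, because the reported logprob is itself rounded.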

Thanks
Cuong

-------- Original Message --------
Message-ID: <44D1E6CD.8070606 at ADDRESS HIDDEN>
Date: Thu, 03 Aug 2006 14:06:37 +0200
From: Cuong Huy To <cuong at ADDRESS HIDDEN>
To: srilm-user at ADDRESS HIDDEN
Subject: Ask about the practical usage of SRILM for Machine Translation

Hi everyone,

This question concerns SRILM 1.4.1.

I am working on Statistical Machine Translation. Basically, the problem
is to find the best sentence e (English) given the input sentence f
(foreign):
e = argmax_e p(e|f) = argmax_e p(f|e) * p(e)
Here p(f|e) is the translation model (including the lexicon and
alignment models).
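
The second equality in the formula above is Bayes' rule with the denominator dropped. Spelled out as a short derivation, writing \hat{e} for the best sentence (p(f) does not depend on e, so it cannot change which e attains the maximum):

```latex
\begin{align*}
\hat{e} &= \arg\max_{e} \, p(e \mid f) \\
        &= \arg\max_{e} \, \frac{p(f \mid e)\, p(e)}{p(f)} \\
        &= \arg\max_{e} \, p(f \mid e)\, p(e)
\end{align*}
```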

What I am concerned about is p(e), the language model.

My corpus is EuroParl (European Parliament Sessions). I am currently
working with 512,000 sentences (10,228,002 words), which yield 54,182
unigrams, 1,044,600 bigrams, 765,141 trigrams, ...
My questions are:

1. Which combination of the options currently available in
ngram-count should I use?
2. How many words per parameter should I use? (Joshua Goodman, in his
tutorial research.microsoft.com/~joshuago/lm-tutorial-v7-handouts.ps,
recommends that the ratio of number of words to number of parameters be
greater than 100 or 1000.)
3. Normally, an option -X stands for the corresponding options for every
order of n-gram (e.g. -interpolate is like -interpolate1 -interpolate2 ...
-interpolateN), so why doesn't this work for -kndiscount?
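
To make question 2 concrete, here is a rough sketch of the words-per-parameter ratio for the counts quoted above. It assumes, as a simplification, that each distinct n-gram counts as one parameter; the orders above 3 are elided in the message, so they are left out, and including them would only lower the ratio.

```python
def words_per_parameter(training_words, ngram_counts):
    """Map each maximum order to the words-per-parameter ratio, treating
    every distinct n-gram up to that order as one parameter (an assumption)."""
    ratios = {}
    parameters = 0
    for order in sorted(ngram_counts):
        parameters += ngram_counts[order]
        ratios[order] = training_words / parameters
    return ratios


# Counts as quoted in the message.
counts = {1: 54_182, 2: 1_044_600, 3: 765_141}
for order, ratio in words_per_parameter(10_228_002, counts).items():
    print(f"up to order {order}: {ratio:.1f} words per parameter")
```

On these figures only the unigram model clears the threshold of 100; once bigrams are included the ratio falls below 10.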

Given this training text of 512,000 sentences and my test set of
2,000 sentences (57,951 words), the best combination I have found so far
among the LMs with order=7 is:
-order 7 -kndiscount 1 -kndiscount 2 -kndiscount 3 -kndiscount 4
-kndiscount 5 -kndiscount 6 -kndiscount 7 -interpolate

(A related question about -kndiscount: if I use -kndiscount alone, I
get the message: "warning: discount coeff 1 is out of range:
5.96382e-17")
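
The two variants being compared can be written down as argument lists. This is only a sketch: it assumes the per-order flags are spelled -kndiscount1 ... -kndiscountN without a space, as in the ngram-count manual page (the message writes them with a space), and the file names train.txt and europarl.7gram.lm are hypothetical. It builds the command line but does not run SRILM.

```python
def ngram_count_args(order, text, lm, per_order_kn=True):
    """Build an argument list for SRILM's ngram-count.

    per_order_kn=True emits -kndiscount1 ... -kndiscountN, one flag per order;
    per_order_kn=False emits the single -kndiscount flag instead.
    """
    args = ["ngram-count", "-order", str(order), "-text", text, "-lm", lm]
    if per_order_kn:
        args += [f"-kndiscount{n}" for n in range(1, order + 1)]
    else:
        args.append("-kndiscount")
    args.append("-interpolate")
    return args


# Hypothetical file names; print both variants for comparison.
print(" ".join(ngram_count_args(7, "train.txt", "europarl.7gram.lm")))
print(" ".join(ngram_count_args(7, "train.txt", "europarl.7gram.lm",
                                per_order_kn=False)))
```

Either list could be handed to `subprocess.run` on a machine where ngram-count is on the PATH.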

Thanks for reading this long email, and thanks in advance to anyone who
answers.
Best,
Cuong
