
Interpolating Lang Models for Indonesian

From: <tl.martin at ADDRESS HIDDEN>
Date: Wed, 7 Sep 2005 10:55:09 +1000 (EST)

Hi All,

At present I am trying to use the SRI tools to improve the
LM for an Indonesian ASR system we are building. We have
just over ten hours of Australian Broadcast Commission (ABC)
training data and at present the system gets just over 80%
on a held-out test set, with a bigram LM trained on the
training data. However, we also have approximately 12
million words of text from the Indonesian papers Kompass and
Tempo, and were hoping that we could interpolate an LM built
from these with the existing ABC LM to improve the n-gram
estimates and, in turn, the perplexity.

Evaluating the ABC LM on a separate dev transcript gives
ppl=297.

Noting the advice given in the package notes on using
limited vocabs (the vocab is 11000 words), I computed the
discount coefficients first on the unlimited vocab and then
used these as input to a second pass of ngram-count to get
the LM. I used Good-Turing discounting.

I then ran

ngram -lm $sPATH_OUTPUT/lm/arpa.bo.lm -order 2 -vocab $DESIRED_VOCAB -limit-vocab -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans

to get the perplexity score.

Applying the same procedure to the much larger set of
Kompass text produces a ppl of 808 when evaluated on
the ABC dev set.

All is well and good until I try to interpolate the two. I
have trialled two approaches. The first uses the dynamic
interpolation capability incorporated in ngram. Using

ngram -bayes 0 -lm ./ABC/lm/arpa.bo.lm -mix-lm ./Kompass/lm/arpa.bo.lm -debug 2 -ppl ./sri_trans/ABC.dev.sri.trans

gives a ppl of 342, i.e. much worse than the original 297.
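For reference, here is a minimal Python sketch of what plain linear interpolation with a fixed weight computes, assuming the weight applies to the main (-lm) model and that both models score exactly the same tokens. The function names and the per-token probabilities are made up for illustration; they are not taken from the ABC or Kompass models.

```python
import math


def perplexity(probs):
    """Perplexity of a sequence of per-token probabilities."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


def interpolated_perplexity(probs_a, probs_b, lam):
    """Perplexity of the mixture lam * P_a + (1 - lam) * P_b."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same tokens")
    mixed = [lam * pa + (1.0 - lam) * pb for pa, pb in zip(probs_a, probs_b)]
    return perplexity(mixed)


# Hypothetical per-token probabilities from two models on one dev set.
probs_abc = [0.02, 0.001, 0.05, 0.0004, 0.01]
probs_kompass = [0.001, 0.03, 0.002, 0.006, 0.004]

for lam in (1.0, 0.5, 0.0):
    ppl = interpolated_perplexity(probs_abc, probs_kompass, lam)
    print(f"lambda={lam:.1f}  ppl={ppl:.1f}")
```

With a weight of 1.0 the mixture reduces to the first model and with 0.0 to the second, so a comparison between a mixed and an unmixed perplexity is only meaningful when both are computed over the same tokens with the same vocabulary handling.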

I then tried the "compute-best-mix" utility, which
starts off as expected at lambda values 0.5 and 0.5 and
iterates to 0.66 and 0.33. Plugging these values into

ngram -lm ./ABC/lm/arpa.bo.lm -lambda 0.66 -mix-lm ./Kompass/lm/arpa.bo.lm -debug 1 -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans | tail

yields

         ppl= 331.8 ppl1= 608.504

still worse. I would have expected the perplexity to stay
the same at worst, with the weights iterating to values that
exclude the Kompass data, but the estimated weights seem to
be at odds with the ppl score. I then trialled the same
technique using Switchboard and Gigaword and got the normal
expected behaviour, i.e. an improvement.
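To make that expectation concrete, here is a minimal Python sketch of EM estimation of a two-way mixture weight from per-token probabilities. It is not the compute-best-mix script itself; the function names and the probabilities are made up for illustration, and it assumes both models have scored exactly the same tokens.

```python
import math


def perplexity(probs):
    """Perplexity of a sequence of per-token probabilities."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


def mixture_perplexity(probs_a, probs_b, lam):
    """Perplexity of lam * P_a + (1 - lam) * P_b over the same tokens."""
    return perplexity([lam * pa + (1.0 - lam) * pb
                       for pa, pb in zip(probs_a, probs_b)])


def best_mix_weight(probs_a, probs_b, lam=0.5, iterations=50):
    """EM estimate of the weight on model A in a two-model mixture."""
    for _ in range(iterations):
        # E-step: posterior probability that model A produced each token.
        posteriors = [lam * pa / (lam * pa + (1.0 - lam) * pb)
                      for pa, pb in zip(probs_a, probs_b)]
        # M-step: the new weight is the average posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam


# Hypothetical per-token probabilities from two models on one dev set.
probs_abc = [0.02, 0.001, 0.05, 0.0004, 0.01]
probs_kompass = [0.001, 0.03, 0.002, 0.006, 0.004]

lam = best_mix_weight(probs_abc, probs_kompass)
print("estimated weight on first model:", round(lam, 3))
print("first model alone :", round(perplexity(probs_abc), 1))
print("second model alone:", round(perplexity(probs_kompass), 1))
print("mixture           :", round(mixture_perplexity(probs_abc, probs_kompass, lam), 1))
```

Because a weight of 1 or 0 recovers either component, the mixture at the likelihood-maximising weight cannot have a higher perplexity than the better component on the tokens used for estimation. An interior weight such as 0.66 together with a worse perplexity therefore points to the two figures not being computed over the same tokens or with the same settings.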

Unsure whether this was because the Kompass data was
unsuitable or I was just making a foolish error somewhere, I
trialled the CMU LM toolkit. Again using Good-Turing
discounting to build an LM and evaluating on the ABC dev set
gives a ppl of 268, which was a little surprising. More
surprising was the result when I used their interpolation
tools. To cut the story short, it produces:

weights: 0.547  0.453  (7843 items) --> PP=152.624029

=============>  TOTAL PP = 152.624

No doubt the devil is in the detail, but has anyone got any
suggestions?

Cheers

Terry Martin
QUT Speech Lab
Australia
