Re: Class-based LM using the SRILM toolkit?

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Tue, 24 Apr 2007 10:29:37 PDT

In message <d4929ad00704181101t30f6d973s986f692b0010e2ca at ADDRESS HIDDEN> you wrote:
> Dear Dr. Stolcke,
>
> Thank you for your attention.
>
> Is there no way to construct a class-based LM by pre-defining the
> classes to be used (vis-a-vis inducing them)? The class-format man
> page does mention how classes may be defined by hand, but this format
> requires the specification of the class expansion probabilities as
> well. Can these probabilities be calculated by a program in the
> toolkit? Correct me if I'm wrong, but these probabilities are given by
> (for a certain word wi, and class ci) : Number of times wi occurs in
> class ci/Number of times words in class ci occur.

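Yes, you can do this in two steps:

(1) Define your classes by hand, using dummy probabilities.
(2) Run the replace-words-with-classes script with the options
    outfile=FILE normalize=1
    on some training data. FILE then receives the class definitions with
    expansion probabilities estimated from that data. This is documented
    in the training-scripts(5) man page.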
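
To illustrate the estimate involved (the count of a word divided by the
total count of all words in its class), here is a minimal Python sketch.
It is not the SRILM implementation. The function name, the class names and
the toy data are all hypothetical, and it assumes each word belongs to
exactly one class.

```python
from collections import Counter, defaultdict


def class_expansion_probs(word_to_class, tokens):
    """Estimate P(word | class) = count(word) / count(all words in class).

    Assumes every word belongs to at most one class (an assumption of
    this sketch, not a statement about SRILM).
    """
    # Count only tokens that are members of some hand-defined class.
    word_counts = Counter(t for t in tokens if t in word_to_class)

    # Sum the member counts per class to get the denominators.
    class_totals = defaultdict(int)
    for word, n in word_counts.items():
        class_totals[word_to_class[word]] += n

    # Relative frequency of each observed word within its class.
    return {
        (word_to_class[w], w): n / class_totals[word_to_class[w]]
        for w, n in word_counts.items()
    }


# Hypothetical hand-defined classes and toy training text.
word_to_class = {"monday": "DAY", "tuesday": "DAY", "paris": "CITY"}
tokens = "meet me monday or tuesday or monday in paris".split()
probs = class_expansion_probs(word_to_class, tokens)
```

Class members that never occur in the training data get no entry in this
sketch, so in practice some form of smoothing may be wanted.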

> Also, is the file that is generated by the ngram-class -class-counts
> option in the same format as class-format? Can a file in the
> class-format format be used directly by the ngram-count program to
> learn a class-based LM?

The -class-counts output is in the right format to be used as a count
input file for ngram-count to estimate a bigram LM for the class labels.
However, this will only work for bigram LMs since ngram-class doesn't
use higher-order statistics.  The recommended procedure is to
again use the replace-words-with-classes command to insert class
labels in your LM training data, and then use ngram-count on
the transformed data to estimate the class ngram probabilities.
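
As a rough illustration of that procedure, the Python sketch below maps
class members in the text to their class labels and then collects N-gram
counts over the result. It only mimics the idea. The function names and
toy data are hypothetical, each word is assumed to belong to one class,
and sentence-boundary tags are left out.

```python
from collections import Counter


def replace_words_with_class_labels(tokens, word_to_class):
    """Replace each class member by its class label; keep other words."""
    return [word_to_class.get(t, t) for t in tokens]


def ngram_counts(tokens, order):
    """Count all N-grams of length 1..order in a token sequence."""
    counts = Counter()
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical hand-defined classes and a toy training sentence.
word_to_class = {"monday": "DAY", "tuesday": "DAY", "paris": "CITY"}
sentence = "meet me monday in paris".split()

labeled = replace_words_with_class_labels(sentence, word_to_class)
counts = ngram_counts(labeled, order=3)
```

Because the counts are taken over the relabeled text, trigram and
higher-order statistics involving class labels are available, which the
bigram-only -class-counts output does not provide.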

Andreas