Language Model output problem using FLM

amittai e axelrod amittai at mit.edu
Thu Feb 15 07:49:46 PST 2007


On 2/15/07, Antoine Ghaoui <Antoine.Ghaoui at jinny.ie> wrote:
> ## word trigram
> 1
> W : 2 W(-1) W(-2) ntextfile_99.flm.cnt ntextfile_99.flm.lm 3
> W1W2    W2      kndiscount gtmin 1 interpolate
> W1      W1      kndiscount gtmin 1 interpolate
> 0       0       kndiscount gtmin 1
> Can you please help on this? Is it normal to have ngram 0x2=0?

Yes, for a regular trigram LM in FLM format. The short answer is that
the count is zero because you have no histories that consist of W2 alone.

> How can I get the old format?

You can't. This is the standard FLM file format, but it's really
equivalent to the LM format; it's just labelled a bit differently.

Because an FLM lets you select arbitrary combinations of factors
to use as the ngram history, the header of the FLM file contains a
count for each possible combination of factors that could serve as a
history. However, since your FLM specification narrows down
which factor combinations are valid histories, some (or many) of the
entries in the FLM header will have a count of zero.

For example, the header for an FLM over a trigram with
3 factors per word might look something like this:
<<<
\data\
ngram 0x0=61628
ngram 0x1=1267167
ngram 0x2=278079
ngram 0x4=1136820
ngram 0x8=2021099
ngram 0x10=0
ngram 0x3=1352676
ngram 0x5=1267167
ngram 0x6=1339994
ngram 0x9=0
ngram 0xA=2824147
ngram 0xC=4578754
ngram 0x11=0
ngram 0x12=0
ngram 0x14=0
ngram 0x18=0
ngram 0x7=1352676
ngram 0xB=0
ngram 0xD=0
ngram 0xE=4702913
ngram 0x13=0
ngram 0x15=4497090
ngram 0x16=4534847
ngram 0x19=0
ngram 0x1A=2824147
ngram 0x1C=4578754
ngram 0xF=0
ngram 0x17=4542579
ngram 0x1B=0
ngram 0x1D=0
ngram 0x1E=425916
ngram 0x1F=325041
>>>

...and this is also normal. Where a normal trigram LM would show
"1-gram", "2-gram", etc., an FLM just numbers all the nodes in the
possible backoff graph and uses each node's label in the header rather
than writing out which factor combination it represents. If
you want to figure out which factor combination each hex
label means, I think the counting mechanism is commented in the FLM
code.
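
As a rough illustration of how such labels could be decoded, here is a
minimal Python sketch. It assumes (this is my reading, not something
checked against the FLM code) that each hex label is a bitmask over the
parents in the order they appear in the spec line, so that bit 0 is the
first parent and bit 1 the second. The helper name decode_label is
made up for this example.

```python
def decode_label(label, parents):
    """Return the parent names whose bits are set in a hex node label.

    Assumption: bit i of the label corresponds to the i-th parent
    listed in the FLM spec line (bit 0 = first parent).
    """
    mask = int(label, 16)
    if mask >> len(parents):
        raise ValueError(f"{label} has bits beyond the {len(parents)} parents")
    return [name for bit, name in enumerate(parents) if mask & (1 << bit)]


# Parents from the quoted spec line "W : 2 W(-1) W(-2) ...".
parents = ["W1", "W2"]
for label in ["0x0", "0x1", "0x2", "0x3"]:
    names = decode_label(label, parents)
    print(label, "+".join(names) if names else "(empty history)")
```

Under that assumption, 0x3 is the full W1+W2 history, 0x1 is W1 alone,
0x0 is the empty history, and 0x2 is W2 alone.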

In the case of a trigram model, though, there's only one combination
of factors that's not used as a history and thus has zero entries
(namely that of W2 alone), and therefore that's the one labelled 0x2
:)
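
To make that concrete, the following Python sketch walks the backoff
steps from the quoted spec (W1W2 drops W2, then W1 drops W1) and lists
which node labels are ever reached. It is only a sketch: the bitmask
numbering (bit 0 = W1, bit 1 = W2) is an assumption, and the names
BACKOFF, node_mask and reachable_nodes are invented for this example.

```python
PARENTS = ["W1", "W2"]
# Assumed numbering: bit 0 = W1, bit 1 = W2.
BIT = {name: 1 << i for i, name in enumerate(PARENTS)}

# (node, parent dropped at that node), copied from the quoted spec:
#   W1W2  W2  ...
#   W1    W1  ...
BACKOFF = [(("W1", "W2"), "W2"), (("W1",), "W1")]


def node_mask(names):
    """Combine parent names into a single bitmask label."""
    mask = 0
    for name in names:
        mask |= BIT[name]
    return mask


def reachable_nodes(backoff):
    """Collect every node that appears as a history or a backoff target."""
    used = set()
    for node, dropped in backoff:
        mask = node_mask(node)
        used.add(mask)                  # the node itself
        used.add(mask & ~BIT[dropped])  # the node it backs off to
    return used


used = reachable_nodes(BACKOFF)
unused = [hex(m) for m in range(1 << len(PARENTS)) if m not in used]
print("used:  ", sorted(hex(m) for m in used))  # ['0x0', '0x1', '0x3']
print("unused:", unused)                        # ['0x2']
```

The only label the backoff path never touches is 0x2, which matches the
zero count in the header.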

~amittai


