a)
Make sure you have built the SRILM binaries either for a 64-bit machine
(e.g., MACHINE_TYPE=i686-m64 OPTION=_c) or with 64-bit counts (OPTION=_l).
This is necessary because the data contains N-gram counts that exceed
the range of 32-bit integers.
Be sure to invoke all commands below using the path to the appropriate
binary executable directory.
b)
Prepare a mapping file for some vocabulary mismatches and call it
google.aliases:
<S> <s>
</S> </s>
<UNK> <unk>
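One way to create this three-line file directly from the shell:

```shell
# Write the alias file mapping Google's sentence-boundary and unknown-word
# tokens to SRILM's conventions (contents exactly as listed above).
cat > google.aliases <<'EOF'
<S> <s>
</S> </s>
<UNK> <unk>
EOF
```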
c)
Prepare an initial count-LM parameter file
google.countlm.0:
order 5
vocabsize 13588391
totalcount 1024908267229
countmodulus 40
mixweights 15
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
google-counts PATH
where PATH points to the location of the Google N-grams, i.e., the
directory containing the subdirectories "1gms", "2gms", etc.
Note that the vocabsize and totalcount values were obtained from the
1gms/vocab.gz and 1gms/total files, respectively.
(Check that they match your copy of the data and modify them as needed.)
For an explanation of the parameters, see the -count-lm option in ngram(1).
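The vocabsize and totalcount values can be checked against the data itself.
A minimal sketch of that check, using toy stand-ins for the Web 1T files
(one "word<TAB>count" line per unigram in 1gms/vocab.gz, and the grand
total in 1gms/total); on the real data, run the last two commands from
the directory given by PATH:

```shell
# Toy stand-ins for the Google data files (real counts are much larger).
mkdir -p 1gms
printf 'the\t100\nof\t60\n' | gzip > 1gms/vocab.gz
printf '160\n' > 1gms/total
gunzip -c 1gms/vocab.gz | wc -l   # number of unigram types = vocabsize
cat 1gms/total                    # total token count = totalcount
```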
d)
Prepare a text file tune.text containing data for estimating the
mixture weights.
This data should be representative of, but different from, your test data.
Compute the vocabulary of this data using
ngram-count -text tune.text -write-vocab tune.vocab
The vocabulary size should not exceed a few thousand to keep memory
requirements in the following steps manageable.
e)
Estimate the mixture weights:
ngram-count -debug 1 -order 5 -count-lm \
-text tune.text -vocab tune.vocab \
-vocab-aliases google.aliases \
-limit-vocab \
-init-lm google.countlm.0 \
-em-iters 100 \
-lm google.countlm
This will write the estimated LM to google.countlm.
The output is identical to the initial LM file, except for the
updated interpolation weights.
f)
Prepare a test data file test.text and its vocabulary test.vocab,
as in step d) above.
Then apply the LM to the test data:
ngram -debug 2 -order 5 -count-lm \
-lm google.countlm \
-vocab test.vocab \
-vocab-aliases google.aliases \
-limit-vocab \
-ppl test.text > test.ppl
The perplexity output will appear in test.ppl.
g)
Note that the Google data uses mixed-case spellings.
To apply the LM to lowercase data, one needs to prepare a much more
extensive vocabulary mapping table for the -vocab-aliases option,
namely one that maps all upper- and mixed-case spellings to their
lowercase forms.
This mapping file should be restricted to the words appearing in
tune.text and test.text, respectively, to avoid defeating the effect
of -limit-vocab.
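A sketch (not from the SRILM documentation) of how such a table could be
generated, in the "alias word" format used by -vocab-aliases. Toy files
stand in for the real inputs here; in practice the first input would be
test.vocab (or tune.vocab, for the estimation run) and the second the
word column of 1gms/vocab.gz:

```shell
printf 'the\nbook\n' > test.vocab              # toy lowercase vocabulary
printf 'The\nBOOK\nthe\nCar\n' > google.words  # toy Google spellings
awk 'NR==FNR { want[$1] = 1; next }            # first file: allowed words
     { lc = tolower($1)                        # second file: Google words
       if (lc != $1 && lc in want) print $1, lc }' \
    test.vocab google.words > google.case.aliases
cat google.case.aliases   # -> "The the" and "BOOK book"
```

On the real data the second input would be produced with something like
gunzip -c 1gms/vocab.gz | cut -f1; restricting the output to the words in
the respective .vocab file keeps -limit-vocab effective.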