Question about hidden-ngram

Jachym Kolar jachym at kky.zcu.cz
Fri Nov 21 04:15:19 PST 2003


Hi,
 I've just tried using the hidden-ngram tool to automatically punctuate an
unpunctuated text, but I got unexpected results: every word was tagged
with *noevent*.

I used training text in the following form:

...
for more than a century <COM> the fingerprint has been the quintessential piece
of crime scene evidence <PER>
but now the palm is getting its due <PER>
...
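
As a rough illustration of how text in this form could be produced from
ordinary punctuated text, here is a minimal Python sketch. The function name
tag_punctuation and the punctuation-to-tag mapping are assumptions made for
this example: <COM> and <PER> appear in the training text above, and <QM> is
inferred from the <qm> column in the output further down. It is not part of
SRILM.

```python
import re

# Assumed mapping from punctuation marks to event tags (hypothetical;
# adjust it to whatever tag inventory the training data actually uses).
PUNCT_TAGS = {",": "<COM>", ".": "<PER>", "?": "<QM>"}


def tag_punctuation(text):
    """Replace each mapped punctuation mark with its event tag token.

    Words and punctuation marks are split apart first, so "century,"
    becomes the two tokens "century" and "<COM>". Note that this naive
    split would also break up numbers such as "3.5".
    """
    tokens = re.findall(r"[^\s,.?]+|[,.?]", text)
    return " ".join(PUNCT_TAGS.get(tok, tok) for tok in tokens)


if __name__ == "__main__":
    print(tag_punctuation("but now the palm is getting its due."))
```

With -tolower, the tags end up lowercased along with the words, which matches
the <com> and <per> labels in the output below.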

Then I trained a 3-gram model with:

ngram-count -write-vocab vocabulary -tolower -text trainingtext -write output \
    -lm lmfile

... and then I ran the hidden-ngram tool with the following options:

hidden-ngram -text test4.txt -lm lmfile -tolower -hidden-vocab tags -continuous \
    -posteriors

... and received something like this:

6        *noevent* 0.998811 <com> 0.00117427 <per> 1.46659e-05 <qm> 7.92597e-10
měsíců   *noevent* 0.999898 <com> 9.326e-05 <per> 9.07804e-06 <qm> 4.61643e-10
do       *noevent* 1 <com> 4.19776e-09 <per> 5.76912e-09 <qm> 6.25918e-12
jednoho  *noevent* 0.999998 <com> 4.18691e-07 <per> 1.24419e-06 <qm> 8.63805e-11
roku     *noevent* 0.197671 <com> 0.801881 <per> 0.000340206 <qm> 0.000107651
jak      *noevent* 0.99997 <com> 2.44243e-05 <per> 1.32587e-06 <qm> 4.09674e-06
je       *noevent* 0.999857 <com> 0.000142836 <per> 2.47722e-07 <qm> 2.47757e-07
to       *noevent* 0.972235 <com> 0.0266202 <per> 0.000937748 <qm> 0.000206936
<unk>    *noevent* 0.979455 <com> 0.0205446 <per> 2.70218e-07 <qm> 1.33261e-07
uvedeno  *noevent* 0.933133 <com> 0.0538742 <per> 0.0129924 <qm> 6.16205e-08
na       *noevent* 0.999965 <com> 4.71218e-07 <per> 3.39777e-05 <qm> 1.57228e-07
výrobku  *noevent* 0.736376 <com> 0.168451 <per> 0.0947272 <qm> 0.00044499
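
To read this output more easily, the following minimal Python sketch picks the
event with the highest posterior for each word. It assumes the layout shown
above (a word followed by whitespace-separated event/posterior pairs); the
function names parse_posteriors and best_event are hypothetical helpers for
this example, not SRILM tools.

```python
def parse_posteriors(line):
    """Split one output line into (word, {event: posterior}).

    Assumes the first field is the word and the remaining fields
    alternate between an event label and its posterior probability.
    """
    fields = line.split()
    word, rest = fields[0], fields[1:]
    posteriors = {rest[i]: float(rest[i + 1]) for i in range(0, len(rest), 2)}
    return word, posteriors


def best_event(line):
    """Return (word, event) where event has the highest posterior."""
    word, posteriors = parse_posteriors(line)
    return word, max(posteriors, key=posteriors.get)


if __name__ == "__main__":
    sample = "roku     *noevent* 0.197671 <com> 0.801881 <per> 0.000340206 <qm> 0.000107651"
    print(best_event(sample))
```

Applied to the lines above, this selects *noevent* for every word except
"roku", where <com> has the largest posterior.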

Can somebody please tell me what I did wrong? Also, is there a tool in SRILM to
obtain a text-map from the training text?

Thanks, Jachym




