From thomae at ei.tum.de Thu Jul 4 06:54:49 2002
From: thomae at ei.tum.de (Matthias Thomae)
Date: Thu, 04 Jul 2002 15:54:49 +0200
Subject: make-ngram-pfsg cannot handle 1-grams?
Message-ID: <3D2453A9.7050802@ei.tum.de>
Hello,
I am using version 1.3.1 and would like to generate a unigram lm in pfsg
format. I encounter problems when calling make-ngram-pfsg, see example
below. Any ideas?
Regards.
Matthias
-------------------------------------------------------------------------
> cat test.txt
rote kugel
gruene kugel
> ngram-count -text test.txt -lm test.lm -order 1
warning: discount coeff 1 is out of range: -0
tho at odin: ~/worktho/nadia/lm > cat test.lm
\data\
ngram 1=5
\1-grams:
-0.4929155 </s>
-99 <s>
-0.748188 gruene
-0.4929155 kugel
-0.748188 rote
\end\
> make-ngram-pfsg test.lm
output_for_node: got empty name
tag undefined in LM
From stolcke at speech.sri.com Thu Jul 4 10:15:39 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 04 Jul 2002 10:15:39 PDT
Subject: make-ngram-pfsg cannot handle 1-grams?
In-Reply-To: Your message of Thu, 04 Jul 2002 15:54:49 +0200.
<3D2453A9.7050802@ei.tum.de>
Message-ID: <200207041715.KAA08739@huge>
A clear bug, and my fault. When I changed make-ngram-pfsg to handle
arbitrary N-gram lengths the unigram case was broken (it requires some
special handling).
A fixed version is included below. Put it in $SRILM/utils/src and
run "make release" in that directory.
--Andreas
In message <3D2453A9.7050802 at ei.tum.de>you wrote:
> Hello,
>
> I am using version 1.3.1 and would like to generate a unigram lm in pfsg
> format. I encounter problems when calling make-ngram-pfsg, see example
> below. Any ideas?
>
> Regards.
> Matthias
>
#!/usr/local/bin/gawk -f
#
# make-ngram-pfsg --
# Create a Decipher PFSG from an N-gram language model
#
# usage: make-ngram-pfsg [debug=1] [check_bows=1] [maxorder=N] backoff-lm > pfsg
#
# $Header: /home/srilm/devel/utils/src/RCS/make-ngram-pfsg.gawk,v 1.23 2002/07/04 16:59:59 stolcke Exp $
#
#########################################
#
# Output format specific code
#
BEGIN {
    logscale = 2.30258509299404568402 * 10000.5;
    round = 0.5;
    start_tag = "<s>";
    end_tag = "</s>";
    null = "NULL";
    if ("pid" in PROCINFO) {
        pid = PROCINFO["pid"];
    } else {
        getline pid < "/dev/pid";
    }
    tmpfile = "tmp.pfsg.trans." pid;
    debug = 0;
}
function rint(x) {
    if (x < 0) {
        return int(x - round);
    } else {
        return int(x + round);
    }
}
function scale_log(x) {
    return rint(x * logscale);
}
function output_for_node(name) {
    num_words = split(name, words);
    if (num_words == 0) {
        print "output_for_node: got empty name" > "/dev/stderr";
        exit(1);
    } else if (words[1] == bo_name) {
        return null;
    } else if (words[num_words] == end_tag || \
               words[num_words] == start_tag)
    {
        return null;
    } else {
        return words[num_words];
    }
}
function node_exists(name) {
    return (name in node_num);
}
function node_index(name) {
    i = node_num[name];
    if (i == "") {
        i = num_nodes ++;
        node_num[name] = i;
        node_string[i] = output_for_node(name);
        if (debug) {
            print "node " i " = " name ", output = " node_string[i] \
                        > "/dev/stderr";
        }
    }
    return i;
}
function start_grammar(name) {
    num_trans = 0;
    num_nodes = 0;
    return;
}
function end_grammar(name) {
    if (!node_exists(start_tag)) {
        print start_tag " tag undefined in LM" > "/dev/stderr";
        exit(1);
    } else if (!node_exists(end_tag)) {
        print end_tag " tag undefined in LM" > "/dev/stderr";
        exit(1);
    }
    printf "%d pfsg nodes\n", num_nodes > "/dev/stderr";
    printf "%d pfsg transitions\n", num_trans > "/dev/stderr";
    print "name " name;
    printf "nodes %s", num_nodes;
    for (i = 0; i < num_nodes; i ++) {
        printf " %s", node_string[i];
    }
    printf "\n";
    print "initial " node_index(start_tag);
    print "final " node_index(end_tag);
    print "transitions " num_trans;
    fflush();
    if (close(tmpfile) < 0) {
        print "error closing tmp file" > "/dev/stderr";
        exit(1);
    }
    system("/bin/cat " tmpfile "; /bin/rm -f " tmpfile);
}
function add_trans(from, to, prob) {
    #print "add_trans " from " -> " to " " prob > "/dev/stderr";
    num_trans ++;
    print node_index(from), node_index(to), scale_log(prob) > tmpfile;
}
#########################################
#
# Generic code for parsing backoff file
#
BEGIN {
    maxorder = 0;
    grammar_name = "PFSG";
    bo_name = "BO";
    check_bows = 0;
    epsilon = 1e-5;    # tolerance for lowprob detection
}
NR == 1 {
    start_grammar(grammar_name);
}
NF == 0 {
    next;
}
/^ngram *[0-9][0-9]*=/ {
    num_grams = substr($2,index($2,"=")+1);
    if (num_grams > 0) {
        order = substr($2,1,index($2,"=")-1);
        # limit maximal N-gram order if desired
        if (maxorder > 0 && order > maxorder) {
            order = maxorder;
        }
        if (order == 1) {
            grammar_name = "UNIGRAM_PFSG";
        } else if (order == 2) {
            grammar_name = "BIGRAM_PFSG";
        } else if (order == 3) {
            grammar_name = "TRIGRAM_PFSG";
        } else {
            grammar_name = "NGRAM_PFSG";
        }
    }
    next;
}
/^\\[0-9]-grams:/ {
    currorder = substr($0,2,1);
    next;
}
/^\\/ {
    next;
}
#
# unigram parsing
#
currorder == 1 {
    first_word = last_word = ngram = $2;
    ngram_prefix = ngram_suffix = "";
    # we need all unigram backoffs (except for </s>),
    # so fill in missing bow where needed
    if (NF == 2 && last_word != end_tag) {
        $3 = 0;
    }
}
#
# bigram parsing
#
currorder == 2 {
    ngram_prefix = first_word = $2;
    ngram_suffix = last_word = $3;
    ngram = $2 " " $3;
}
#
# trigram parsing
#
currorder == 3 {
    first_word = $2;
    last_word = $4;
    ngram_prefix = $2 " " $3;
    ngram_suffix = $3 " " $4;
    ngram = ngram_prefix " " last_word;
}
#
# higher-order N-gram parsing
#
currorder >= 4 && currorder <= order {
    first_word = $2;
    last_word = $(currorder + 1);
    ngram_infix = $3;
    for (i = 4; i <= currorder; i ++ ) {
        ngram_infix = ngram_infix " " $i;
    }
    ngram_prefix = first_word " " ngram_infix;
    ngram_suffix = ngram_infix " " last_word;
    ngram = ngram_prefix " " last_word;
}
#
# shared code for N-grams of all orders
#
currorder <= order {
    prob = $1;
    bow = $(currorder + 2);
    # skip backoffs that exceed maximal order,
    # but always include unigram backoffs
    if (bow != "" && (currorder == 1 || currorder < order)) {
        # remember all LM contexts for creation of N-gram transitions
        bows[ngram] = bow;
        # insert backoff transitions
        if (currorder < order - 1) {
            add_trans(bo_name " " ngram, bo_name " " ngram_suffix, bow);
            add_trans(ngram, bo_name " " ngram, 0);
        } else {
            add_trans(ngram, bo_name " " ngram_suffix, bow);
        }
    }
    if (last_word == start_tag) {
        if (currorder > 1) {
            printf "warning: ignoring ngram into start tag %s -> %s\n", \
                        ngram_prefix, last_word > "/dev/stderr";
        }
    } else {
        # insert N-gram transition to maximal suffix of target context
        if (last_word == end_tag) {
            target = end_tag;
        } else if (ngram in bows || currorder == 1) {
            # the minimal context is unigram
            target = ngram;
        } else if (ngram_suffix in bows) {
            target = ngram_suffix;
        } else {
            target = ngram_suffix;
            for (i = 3; i <= currorder; i ++) {
                target = substr(target, length($i) + 2);
                if (target in bows) break;
            }
        }
        if (currorder == 1 || currorder < order) {
            add_trans(bo_name " " ngram_prefix, target, prob);
        } else {
            add_trans(ngram_prefix, target, prob);
        }
        if (check_bows) {
            if (currorder < order) {
                probs[ngram] = prob;
            }
            if (ngram_suffix in probs && \
                probs[ngram_suffix] + bows[ngram_prefix] - prob > epsilon)
            {
                printf "warning: ngram loses to backoff %s -> %s\n", \
                            ngram_prefix, last_word > "/dev/stderr";
            }
        }
    }
}
END {
    end_grammar(grammar_name);
}
From thomae at ei.tum.de Mon Jul 8 01:04:50 2002
From: thomae at ei.tum.de (Matthias Thomae)
Date: Mon, 08 Jul 2002 10:04:50 +0200
Subject: 0-grams
Message-ID: <3D2947A2.7040304@ei.tum.de>
Hello,
I'd like to create 0-grams as well as higher-order n-grams, but when I
call ngram-count with option -order 0 I get a segmentation fault (SRI LM
1.3.1).
Regards
Matthias
From stolcke at speech.sri.com Mon Jul 8 11:40:41 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 08 Jul 2002 11:40:41 PDT
Subject: 0-grams
In-Reply-To: Your message of Mon, 08 Jul 2002 10:04:50 +0200.
<3D2947A2.7040304@ei.tum.de>
Message-ID: <200207081840.LAA08023@huge>
There are no 0-gram models, mostly because the DARPA format does not
support that. Because of that, SRILM handles the backoff probability mass
at the unigram level in a special way: it is distributed over all unobserved
words. This is equivalent to having a backoff to 0-th order distribution.
In practical terms, you use
ngram-count -vocab VOCAB -order 1 -lm LM
Since no ngram counts or text data are supplied, the mechanism that
distributes backoff probability mass for unigrams will spread all
probability uniformly over the entire vocabulary (which you have to
supply of course).
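As a rough illustration of the uniform fallback described above, the following Python sketch computes the log10 probability each word would receive if all probability mass were spread evenly over a vocabulary. The function name and the toy vocabulary are made up for illustration, and the sketch ignores how ngram-count treats the sentence-boundary tokens, so the numbers need not match a real LM file exactly.

```python
import math

def uniform_unigram_logprobs(vocab):
    """Return {word: log10(1/|V|)}, i.e. a uniform 0-th order distribution."""
    words = sorted(set(vocab))
    if not words:
        raise ValueError("vocabulary must not be empty")
    logprob = -math.log10(len(words))
    return {word: logprob for word in words}

# Toy vocabulary (hypothetical); prints one "logprob<TAB>word" line per word.
for word, logprob in uniform_unigram_logprobs(["rote", "gruene", "kugel"]).items():
    print(f"{logprob:.6f}\t{word}")
```

Every word gets the same log probability, which is what a backoff to a 0-th order distribution amounts to.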
Of course -order 0 should not make the program core dump -- I'll fix that.
--Andreas
In message <3D2947A2.7040304 at ei.tum.de>you wrote:
> Hello,
>
> I'd like to create 0-grams as well as higher-order n-grams, but when I
> call ngram-count with option -order 0 I get a segmentation fault (SRI LM
> 1.3.1).
>
> Regards
> Matthias
>
From stolcke at speech.sri.com Fri Jul 12 12:20:23 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 12 Jul 2002 12:20:23 PDT
Subject: inquiry about SRI Toolkit
In-Reply-To: Your message of Fri, 12 Jul 2002 10:19:25 -0400.
Message-ID: <200207121920.MAA17618@huge>
Bing,
the answer to your question is in the ppl-scripts(1) man page.
The script "compute-best-mix" will estimate the optimal interpolation
weights given the ppl output from each of the component models on
a tuning corpus.
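To give an idea of what estimating interpolation weights involves, here is a minimal Python sketch of EM for a linear mixture. It is not the code of compute-best-mix; the function name, the input layout (one list of per-word probabilities per model) and the stopping rule are assumptions made for illustration.

```python
def estimate_mixture_weights(prob_streams, iterations=50, tol=1e-6):
    """EM for linear interpolation weights.

    prob_streams[m][t] is the probability that model m assigns to word t
    of the tuning corpus (plain probabilities, not logs).
    """
    num_models = len(prob_streams)
    num_tokens = len(prob_streams[0])
    if any(len(stream) != num_tokens for stream in prob_streams):
        raise ValueError("all models must score the same words")
    lambdas = [1.0 / num_models] * num_models
    for _ in range(iterations):
        # E-step: accumulate each model's posterior share of every word.
        posteriors = [0.0] * num_models
        for t in range(num_tokens):
            mix = sum(l * s[t] for l, s in zip(lambdas, prob_streams))
            if mix <= 0.0:
                continue  # word has zero probability under every model
            for m in range(num_models):
                posteriors[m] += lambdas[m] * prob_streams[m][t] / mix
        total = sum(posteriors)
        if total == 0.0:
            break
        # M-step: renormalize the shares into new weights.
        new_lambdas = [p / total for p in posteriors]
        converged = max(abs(a - b) for a, b in zip(lambdas, new_lambdas)) < tol
        lambdas = new_lambdas
        if converged:
            break
    return lambdas
```

The per-word probabilities would come from the -debug 2 -ppl output of each component model on the same tuning text.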
--Andreas
In message you wrote:
>
> Dear Dr. Stolcke,
>
>
> I'm a graduate student at the ECE Dept. of Northeastern University.
> Currently, I am working on language modelling and trying to
> use the SRI Toolkit to do LM interpolation. As I read your online
> document (the command manual), I can see that the "ngram" and
> "ngram-count" programs can perform the interpolation operation. However,
> it seems to me that the "ngram" program can only perform
> model interpolation given a user-defined weight, so how would I
> usually define such a weight? Say, if I have a set of validation data,
> can I use the SRI Toolkit to obtain optimal weights (by the EM algorithm
> or some other optimization strategy)? I would appreciate it if you could
> give me some advice on that. Thank you very much.
>
> Best regards
>
> Bing
>
From anand at speech.sri.com Tue Jul 16 12:50:25 2002
From: anand at speech.sri.com (Anand Venkataraman)
Date: Tue, 16 Jul 2002 12:50:25 -0700 (PDT)
Subject: inquiry about SRI Toolkit
In-Reply-To: <200207121920.MAA17618@huge> (message from Andreas Stolcke on
Fri, 12 Jul 2002 12:20:23 PDT)
Message-ID: <200207161950.MAA22380@chalumeau>
Bing,
I hope you found the info you wanted in the man page that Andreas pointed you
to. If you want an example of how to use the compute-best-mix program, the
following script may be useful. This script will probably be included in
the toolkit with the next release. It computes the mixed log probability
and perplexity on a given corpus according to a dynamic mixture of up to 6
language models by jack-knifing, i.e., the mixture coefficients for one
half of the corpus are those estimated using compute-best-mix on the other
half.
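The jack-knifing idea can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, with made-up function names: lines are dealt alternately into two sets, and each set is then scored with the weights tuned on the other one.

```python
def split_alternating(lines, n1=1, n2=1):
    """Deal lines into two sets in repeating blocks of n1 and n2 lines."""
    set1, set2 = [], []
    period = n1 + n2
    for index, line in enumerate(lines):
        (set1 if index % period < n1 else set2).append(line)
    return set1, set2

def jackknife_pairs(set1, set2):
    """Yield (tuning_set, evaluation_set): weights from the first, scores on the second."""
    yield set1, set2
    yield set2, set1

set1, set2 = split_alternating(["line 1", "line 2", "line 3", "line 4", "line 5"])
for tuning, evaluation in jackknife_pairs(set1, set2):
    print(len(tuning), "lines for tuning,", len(evaluation), "lines for evaluation")
```

With n1 = n2 = 1 this gives the alternate-line split; no word is ever scored with weights that were tuned on it.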
cheers.
&
#!/bin/ksh
#
# Computes the "fairly" (word level) interpolated probability of the
# given data set using all of the (upto 6) given language models. The
# procedure is to estimate lambdas on one half, mix by this proportion
# on the second half and vice versa. Usage example:
# compute-mixed-logprob -lm lm1 -lm lm2 ... -text text -sets set1 set2
#
# $Header: $
#
LMS="";
TEXT="-";
PWD=`pwd`
EXPT=`basename $PWD`
function split_lines
{
    prefix="lines"
    if [ x$1 = "x-prefix" ]; then
        prefix=$2;
        shift; shift;
    fi
    gawk -v f1=$prefix.set1 -v f2=$prefix.set2 -v n1=$1 -v n2=$2 '
    BEGIN {
        n=n1+n2;
    }
    (NR-1) % n < n1 {
        print >f1;
        next;
    } {
        print >f2;
        next;
    }'
}
#----------------------------------------------------------------------
# Main
#
while [ $# -gt 0 ]; do
    case $1 in
    -lm)        LMS="$LMS $2"; shift; shift;;
    -lm?flags)  LMFLAGS="$2"; shift; shift;;
    -text)      TEXT=$2; shift; shift;;
    -expt)      EXPT=$2; shift; shift;;
    -sets)      set1=$2; set2=$3; shift; shift; shift;;
    *)          echo "Incorrect usage. Refer to man page ppl-scripts(1)."; exit 1;
    esac
done
LOG=$EXPT.log
EXPTDIR=`dirname $EXPT`
mkdir -p $EXPTDIR
exec 2>>$LOG
echo "The following is the log of $0 starting at `date`" 1>&2
set -x
# Divide input text into two chunks. This will produce
# $EXPT.set1 and $EXPT.set2
#
if [ -z "$set1" -o -z "$set2" ]; then
    cat $TEXT | split_lines -prefix $EXPT 1 1
    set1=$EXPT.set1
    set2=$EXPT.set2
fi
# Compute logprobs according to each lm on each half.
#
for lm in $LMS; do
    for set in $set1 $set2; do
        ngram $LMFLAGS -debug 2 -lm $lm -ppl $set >$set-`basename $lm`.ppl
    done
done
# Compute best mix
#
for set in $set1 $set2; do
    ppl_files="";
    for lm in $LMS; do
        ppl_files="$ppl_files $set-`basename $lm`.ppl"
    done
    compute-best-mix $ppl_files >$set-lambdas
done
# Interpolate each set, with lambdas from the other set.
#
(echo $set1 $set2; echo $set2 $set1;) | while read s1 s2; do
    main_lm=`echo $LMS | gawk '{print $1}'`
    lm_flags="$LMFLAGS -lm $main_lm"
    if [ ! -s $s1-lambdas ]; then
        echo Could not read $s1-lambdas 1>&2
        exit 1;
    fi
    set `cat $s1-lambdas | sed 's/^.*(\(.*\))/\1/'`
    shift;
    if [ $# -gt 0 ]; then
        mix_lm=`echo $LMS | gawk '{print $2}'`
        lambdas="-lambda $1";
        lm_flags="$lm_flags -mix-lm $mix_lm"
        shift;
    fi
    for i in 2 3 4 5; do
        if [ $# -gt 0 ]; then
            lambdas="$lambdas -mix-lambda$i $1";
            mix_lm=`echo $LMS | gawk -v i=$i '{print $(i+1)}'`
            if [ -z "$mix_lm" ]; then
                echo No mix lm found for lambda $1
                exit;
            fi
            lm_flags="$lm_flags -mix-lm$i $mix_lm"
            shift;
        fi
    done
    ngram_flags="$lm_flags $lambdas"
    ngram $ngram_flags -ppl $s2
done | \
gawk '{
    print;
}
$1 ~ /^file$/ {
    nsents += $3;
    nwords += $5;
    noovs += $7;
    next;
}
$2 ~ /^zeroprobs,$/ {
    nzeroprobs+= $1;
    logprob += $4;
    next;
}
END {
    printf "file both: %d sentences, %d words, %d OOVs\n",
        nsents, nwords, noovs;
    printf "%d zeroprobs, logprob= %g ppl= %g ppl1= %g\n",
        nzeroprobs, logprob,
        10^(-logprob/(nsents+nwords-noovs)),
        10^(-logprob/(nwords-noovs));
}'
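The arithmetic of the final gawk END block, restated as a small Python sketch for anyone who wants to check the combined numbers by hand. The function and dictionary keys are names chosen here for illustration; the formulas mirror the END block as written (log probabilities are base 10, ppl counts one end-of-sentence event per sentence, ppl1 does not).

```python
def combined_perplexity(parts):
    """Pool per-set counts and log10 probabilities, then return (ppl, ppl1)."""
    parts = list(parts)
    nsents = sum(p["sentences"] for p in parts)
    nwords = sum(p["words"] for p in parts)
    noovs = sum(p["oovs"] for p in parts)
    logprob = sum(p["logprob"] for p in parts)
    ppl = 10 ** (-logprob / (nsents + nwords - noovs))
    ppl1 = 10 ** (-logprob / (nwords - noovs))
    return ppl, ppl1

# Two toy halves (made-up numbers, not real ngram output).
halves = [
    {"sentences": 1, "words": 3, "oovs": 0, "logprob": -4.0},
    {"sentences": 1, "words": 3, "oovs": 0, "logprob": -4.0},
]
ppl, ppl1 = combined_perplexity(halves)
print(f"ppl= {ppl:g} ppl1= {ppl1:g}")
```

Note that the log probabilities are summed before exponentiating; averaging the two per-set perplexities would give a different (and wrong) number when the sets differ in size.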
From anand at speech.sri.com Tue Jul 16 12:55:34 2002
From: anand at speech.sri.com (Anand Venkataraman)
Date: Tue, 16 Jul 2002 12:55:34 -0700 (PDT)
Subject: inquiry about SRI Toolkit
In-Reply-To: <200207121920.MAA17618@huge> (message from Andreas Stolcke on
Fri, 12 Jul 2002 12:20:23 PDT)
Message-ID: <200207161955.MAA23970@chalumeau>
Man page for compute-mixed-logprob:
compute-mixed-logprob computes the log probability of a given corpus
of text according to the best mixture of the given component language
models. The interpolation is done fairly. That is, the given corpus is
split into two sets (with alternate lines belonging to different sets)
and the mixture coefficients for each set are those computed using EM
on the other set. Up to six language models may be specified on the
command line using the -lm flag. If the splitting of the corpus into
two sets by alternate line order is not the method desired, the user
may explicitly specify two sets on the command line using
-sets set1 set2 instead of giving a single -text corpus option. The
-lm-flags option may be given to supply additional options passed on to
ngram during perplexity calculations, for instance, if the language
models are class language models and a class file needs to be specified
with -classes classfile. Language model ngram orders may likewise be
passed on to ngram using -lm-flags '-order n'. All such options that
are to be passed to ngram must be quoted and passed to
compute-mixed-logprob as a single option. Note, however, that the
supplied ngram options will be used for all the language models
specified.
Further, the -expt exptID option may be used to specify the prefix
used for all ancillary files created by the program. The exptID may
include a path, and any missing directories in this path will be
created.
Final output will include the ngram outputs for each separate set and
a combined output in the same format for both sets. A logfile of the
procedure is produced in exptID.log.
Examples:
compute-mixed-logprob -expt 001/mix -text swbd.txt -lm swbd.4bo.gz \
    -lm bn.3bo.gz -lm ch.3bo.gz \
    -lm-flags "-order 4 -classes train400.classes"
compute-mixed-logprob -expt 001/mix -sets swbd-set1.txt swbd-set2.txt \
    -lm swbd.4bo.gz -lm bn.3bo.gz -lm ch.3bo.gz \
    -lm-flags "-order 4 -classes train.400classes"
&
From stolcke at speech.sri.com Wed Jul 24 09:01:30 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 24 Jul 2002 09:01:30 PDT
Subject: Help
In-Reply-To: Your message of Wed, 24 Jul 2002 17:10:22 +0100.
<002101c2332c$9d6cb540$7b081b93@telecom.tuc.gr>
Message-ID: <200207241601.JAA17949@huge>
Dimitris,
your email to the list did not go through because at the time you
sent it you were not subscribed to the list (to prevent spam we
only allow list members to post).
Regarding your question: indeed the perplexity of the mixed LM should
be much closer to what compute-best-mix outputs.
There are two ways to create an interpolated model:
"on-the-fly"    This is the traditional approach: you keep the component
                models separate and compute the interpolated probabilities
                when you evaluate the model.
                The command for this is
                    ngram -lm ... -mix-lm ... -lambda L -bayes 0
"merged"        You create a single static model that implements an
                approximation to the on-the-fly method.
                The command for this is
                    ngram -lm ... -mix-lm ... -lambda L
                (no -bayes option).
                The -write-lm option outputs the merged model if desired.
In the "merged" case you only get an approximation because in general
it is not possible to create a single back-off model that exactly
implements the mixed probabilities of the two component models (without
expanding out all possible N-grams and effectively bypassing the
backoff mechanism).
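For reference, the on-the-fly computation is just a per-word weighted sum of the two component probabilities. The Python sketch below shows that sum and the resulting perplexity; the function names and the toy numbers are illustrative assumptions, and lam plays the role of the -lambda weight of the main (-lm) model.

```python
import math

def interpolate(p_main, p_mix, lam):
    """Word-level linear interpolation: lam * P_main + (1 - lam) * P_mix."""
    return lam * p_main + (1.0 - lam) * p_mix

def perplexity(probs):
    """Perplexity of a sequence of per-word probabilities."""
    log_sum = sum(math.log10(p) for p in probs)
    return 10 ** (-log_sum / len(probs))

# Toy per-word probabilities from two models on the same three words.
main_probs = [0.20, 0.05, 0.10]
mix_probs = [0.10, 0.20, 0.10]
mixed = [interpolate(a, b, 0.85) for a, b in zip(main_probs, mix_probs)]
print(perplexity(mixed))
```

The mixing is done on probabilities, word by word, not on log probabilities or on the perplexities of the two models.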
As explained in the ICSLP paper, the "merged" approach is usually slightly
better than the traditional interpolation. However, it only works if
you have two models of the same type (both word-based or both class-based).
When you merge a word-based and a class-based model the approximation
doesn't work anymore. I suspect that's what you did in your experiment.
Rerun ngram with the -bayes 0 option and see if you get the perplexity
you expect.
--Andreas
In message <002101c2332c$9d6cb540$7b081b93 at telecom.tuc.gr>you wrote:
> Hi Andreas,
>
> Five days ago I sent an e-mail to srilm-user at speech.sri.com
> and I still haven't received an answer.
> I repeat it here. Please inform me...
>
> Hi
>
> I interpolate a 3-gram with a class 3-gram.
>
> The output of compute-best-mix is:
> compute-best-mix debug2-LM1 debug2-LM2
> iteration 19, lambda = (0.849536 0.150464), ppl = 150.787
>
> The perplexity of the interpolated model on the held-out data that I used
> to produce debug2-LM1 and debug2-LM2 is 169.52.
>
> Shouldn't this be the same as the output of compute-best-mix, i.e. 150.787?
> Am I doing something wrong?
>
> Regards,
> Dimitris
>
>
From stolcke at speech.sri.com Thu Aug 1 09:34:37 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 01 Aug 2002 09:34:37 PDT
Subject: question on the tool
In-Reply-To: Your message of Thu, 01 Aug 2002 08:03:17 -0700.
<20020801150317.36624.qmail@web12503.mail.yahoo.com>
Message-ID: <200208011634.g71GYcN19001@huge>
In message <20020801150317.36624.qmail at web12503.mail.yahoo.com>you wrote:
>
> Dear Dr. Stolcke,
>
> I have one more question on your
> "replace-words-with-classes" tool, please.
>
> I used the "ngram-class" program to generate a set of
> classes using some broadcast news corpus
> (223,091 unique words) and specifying the vocab to be
> a 36875 words dict. And the output of the classes
> contains the mapping of 35325 words, as I can see,
> 187,766 OOVs have no mappings for them, since they've
> been treated as the unknown words. This should be no
> problem. But when using the generate classes to
> replace the word-based trans to be class-based trans,
> problem occured. The OOVs could not be mapped into any
> classes (since there is no mapping for such words in
> the classes file), thus they remain there! But in my
> knowledge, if we want to learn the class-ngram, an
> usual form for it to be interpolate with word-ngram is
> like:
>
> ^
> P (w3 | w1, w2) = lambda * Pw (w3 | w1, w2)
> + (1-lambda) * P (w3 | G3) * Pc (G3| G1, G2)
> where wi belongs to class Gi, i=1, 2, 3, respectively.
>
> So my question is, with the classes/words mixed trans,
> can we really obtain the correct class-ngram
> probabilities?
>
> Here are the commands I've been using:
>
> 1) ngram-class -debug 0 -text -vocab
> <36K-dict> -numclasses 1000 -classes
>
> 2) replace-words-with-classes classes=
> >
You need to add the class definition for the "unknown" word class
yourself. I would recommend that you prevent from being merged
with any other word class. You can do this by creating a file containing
and then invoking ngram-class -noclass-vocab and that file as argument.
Then you add a new unknown word class to the class definitions from
ngram-class, and put all the remaining words in that class.
(This assumes you actually want your overall LM vocabulary to contain
all 223,091 words. If the word ngram maps those to then
the class-ngram should do the same, and no modifications to the
class definitions are needed.)
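A sketch of how such extra class definitions could be generated, in Python. The class name UNKCLASS, the function name and the use of relative frequencies as membership probabilities are assumptions for illustration; the output lines follow the "class probability word" layout as I understand it from classes-format(5), so check that man page before relying on it.

```python
def unknown_class_definitions(word_counts, in_vocab, class_name="UNKCLASS"):
    """Build class-definition lines for all words outside the class vocabulary.

    word_counts: {word: count} over the training text.
    in_vocab: set of words that already have a class from ngram-class.
    """
    oov_counts = {w: c for w, c in word_counts.items() if w not in in_vocab}
    total = sum(oov_counts.values())
    if total == 0:
        return []
    # Membership probability P(word | class) estimated by relative frequency.
    return [
        f"{class_name} {count / total:.6g} {word}"
        for word, count in sorted(oov_counts.items())
    ]

# Toy counts (made up): only "kugel" already has a class.
for line in unknown_class_definitions({"rote": 3, "gruene": 1, "kugel": 4}, {"kugel"}):
    print(line)
```

The resulting lines would be appended to the class file written by ngram-class.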
>
> I did a small perl script to post-process the
> mixed-trans, but then I think there could be another
> problem. Too many unknown words will be mapped into
> one single unknown class, which somehow, could disturb
> the real probabilities of the class-ngram that we
> should have.
I'm not sure what you mean by "disturb the real probabilities".
But if you want all the words in the class-lm then they have to
get their probability somewhere, and a single class seems like
a reasonable approach. This will smooth their probabilities when
interpolated with the word ngram, which treats all those low-frequency
words as separate. A more sophisticated approach would maybe try
to distinguish the words based on their morphology, but that would require
some significant work.
> Also, I used the command mentioned in your paper to
> expand the built-up class-ngram model:
>
> ngram -lm -prune 1e-5 -expand-exact 3
> -write-lm
>
> but as I read the expanded model, I can see there
> are only probabilities for classes (1,2,3-grams), but
> no membership distribution, i.e. no P (w3 | G3). Then
> how can it be interpolated with the word-level LM
> correctly?
First, the ngram -expand function also needs the -classes option
to read in the class definitions. However, I suspect that with
a corpus like BN it will not be feasible to expand the class-ngram
to a word-ngram: there are just too many word ngrams resulting
from such an expansion. Even the pruning won't help you because
pruning happens AFTER the expansion.
You don't need to expand a class-ngram to interpolate with a word ngram.
Just use
ngram -lm WORD-LM -mix-lm CLASS-LM -classes CLASSDEFS \
-lambda WORD-LM-WEIGHT -bayes 0
followed by other options to compute perplexity etc.
> if there is any news board on using SRI Toolkit, then
> I could turn to the community for help, instead of
> taking too much of your time. Many thanks!
Indeed there is a mailing list for SRILM users.
To join, mail the line "subscribe srilm-user" (in the message body)
to majordomo at speech.sri.com.
Regards,
--Andreas
From tolos at sony.de Wed Aug 21 02:06:22 2002
From: tolos at sony.de (Tolos, Marta)
Date: Wed, 21 Aug 2002 11:06:22 +0200
Subject: Backoff missing
Message-ID:
Hi all,
I have a problem using the toolkit, I create a language model using only the
ngram-count command:
ngram-count -text my.text -lm my.arpa -wbdiscount1 -wbdiscount3 -wbdiscount3
My text file has the sentence markers <s> </s>.
And then in the arpa file I get, the unigram </s> has no backoff weight and
also all the bigrams that contain </s> as the second word in the bigram have
no backoff either.
Does someone know how to get the backoff weight? My problem is that the
recognizer complains about the format of my language model, since all the
bigrams without the backoff are not considered and then at the end since
there are so many it stops.
I also have another question about the format of the arpa file created.
Between the probabilities and the words there is not a single space and this
causes problems also with the recognizer I am using. What I am doing right
now to avoid this problem is to use a perl script to fix the format and then
use the converted file that has only a single space, is there an option to
get a single space??
Thanks a lot.
Best,
Marta
From hliu at inzigo.com Wed Aug 21 06:58:51 2002
From: hliu at inzigo.com (Hongqin Liu)
Date: Wed, 21 Aug 2002 09:58:51 -0400
Subject: Bigram pfsg with Nunace
Message-ID: <3D639C9B.2AAAC7CC@inzigo.com>
Hi, all,
I was trying to use Nuance compiler to compile a PFSG Bigram LM. But got
the message:
ERROR: PfsgAssembleUnflattenedNodeArrayFromPfsgArray: Assembled 10852
nodes, but expected only 10299!
FREESimple: not freeing untagged memory at 92f12a0
FREESimple: not freeing untagged memory at 92f12b0
....
FREESimple: not freeing untagged memory at 92f3520
ERROR: CompilePfsg: Couldn't assemble node array for grammar .TOP.
ERROR: GSLCompiler::PFSGToNodeArray: Couldn't compile pfsg into node
array
Does anyone know why? I succeeded with the trigram, but am stuck at the bigram.
Many thanks!
Hongqin Liu
From stolcke at speech.sri.com Wed Aug 21 13:48:17 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 21 Aug 2002 13:48:17 PDT
Subject: Backoff missing
In-Reply-To: Your message of Wed, 21 Aug 2002 11:06:22 +0200.
Message-ID: <200208212048.NAA17105@huge>
In message you wrote:
> Hi all,
>
> I have a problem using the toolkit, I create a language model using only the
> ngram-count command:
>
> ngram-count -text my.text -lm my.arpa -wbdiscount1 -wbdiscount3 -wbdiscount3
>
>
> My text file has the sentence markers <s> </s>.
>
> And then in the arpa file I get, the unigram </s> has no backoff weight and
> also all the bigrams that contain </s> as the second word in the bigram have
> no backoff either.
> Does someone know how to get the backoff weight? My problem is that the
> recognizer complains about the format of my language model, since all the
> bigrams without the backoff are not considered and then at the end since
> there are so many it stops.
We get this question a lot. Technically speaking, backoff weights are
only required for N-grams that are prefixes of longer N-grams (by the
definition of backoff weights). Practically speaking, there is a lot
of software out there that assumes that backoff weights
are assigned to all N-grams except those of highest order. This is
very wasteful once you are dealing with pruned (or so-called "variable
length") ngram models. The script add-dummy-bows will add those backoff
weights that your software is missing.
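To make concrete what adding dummy backoff weights means, here is a minimal Python sketch that pads an ARPA-format model held as a list of lines. It is not the add-dummy-bows script itself; the function name and the choice of 0 (log10 of 1) as the dummy weight are assumptions for illustration.

```python
def add_dummy_bows(arpa_lines, dummy_bow="0"):
    """Append a backoff weight to every lower-order N-gram line that lacks one."""
    lines = [line.rstrip("\n") for line in arpa_lines]
    # The highest order is the largest N with a nonzero count in the header.
    max_order = 0
    for line in lines:
        if line.startswith("ngram ") and "=" in line:
            order, count = line[len("ngram "):].split("=", 1)
            if int(count) > 0:
                max_order = max(max_order, int(order))
    current = 0  # order of the section being read, 0 outside any section
    out = []
    for line in lines:
        if line.startswith("\\") and line.endswith("-grams:"):
            current = int(line[1:line.index("-")])
        elif line.startswith("\\"):
            current = 0
        elif line.strip() and 0 < current < max_order:
            # logprob + N words means the backoff weight field is absent
            if len(line.split()) == current + 1:
                line = line + "\t" + dummy_bow
        out.append(line)
    return out
```

Highest-order N-grams are left alone, since they never carry a backoff weight.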
>
> I also have another question about the format of the arpa file created.
> Between the probabilities and the words there is not a single space and this
> causes problems also with the recognizer I am using. What I am doing right
> now to avoid this problem is to use a perl script to fix the format and then
> use the converted file that has only a single space, is there an option to
> get a single space??
The toolkit outputs a tab after the probabilities and before the backoff
weights, so as to make things line up visually and make the file more readable.
This is also convenient to search for ngrams or prefixes or suffixes of
ngrams in the file (by including \t in your search pattern).
Again, if your software is too naive about the format then you need
to bridge the gap, just as you have been doing. Since all the tools
can read/write stdio you can do this on the fly with a command like
ngram-count ... -lm - | my-script-to-replace-tabs-with-spaces | \
gzip > my-fixed-lm.gz
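The filter in the middle of that pipeline can be very small. The following is one possible Python sketch, offered as an illustration only; the names `tabs_to_single_spaces` and `filter_stream` are made up here, and the sketch assumes the recognizer wants every field separated by exactly one space.

```python
import io

def tabs_to_single_spaces(line):
    """Collapse each run of tabs or spaces in a line to a single space."""
    return " ".join(line.split())

def filter_stream(src, dst):
    """Copy src to dst line by line, normalizing the whitespace."""
    for line in src:
        dst.write(tabs_to_single_spaces(line) + "\n")

# Demonstration on an in-memory "file" with tab-separated ARPA entries.
src = io.StringIO("-0.748188\tgruene\t-0.30103\n-0.4929155\tkugel\n")
dst = io.StringIO()
filter_stream(src, dst)
print(dst.getvalue(), end="")
```

Calling `filter_stream(sys.stdin, sys.stdout)` (after `import sys`) would turn this into a stdin-to-stdout filter usable in the pipeline above.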
Hope this helps.
--Andreas
From hliu at inzigo.com Tue Sep 3 08:27:00 2002
From: hliu at inzigo.com (Hongqin Liu)
Date: Tue, 03 Sep 2002 11:27:00 -0400
Subject: general help
Message-ID: <3D74D4C4.2A5F65BB@inzigo.com>
Hi, Crouching tigers & hidden dragons:
I am using a word-based trigram (GT backoff) for an application, and
trying to make further improvements. I tried a class-based model, but it
did not seem as good as the word-based one. A higher-order model (4gram)
also seems worse than the 3gram. The WER (word error rate) I get now is
about 8-10%, so it seems that there is still some room for improvement.
Does anyone have good ideas, staying within ngram models? Thanks in advance.
Hongqin Liu
From stolcke at speech.sri.com Tue Sep 3 08:45:06 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 03 Sep 2002 08:45:06 PDT
Subject: general help
In-Reply-To: Your message of Tue, 03 Sep 2002 11:27:00 -0400.
<3D74D4C4.2A5F65BB@inzigo.com>
Message-ID: <200209031545.IAA00058@huge>
Hongqin,
Two suggestions:
- interpolate your class-based LM with the word-based one
(class-based LMs alone usually don't give an improvement over word-based ones
except in very limited domains).
- use Kneser-Ney smoothing (with interpolation) for the 4gram LM:
-kndiscount1 -interpolate1 -kndiscount2 -interpolate2
-kndiscount3 -interpolate3 -kndiscount4 -interpolate4
You should see a perplexity reduction over the 3gram, and over GT
discounting. Of course you never know about WER...
--Andreas
From tolos at sony.de Tue Sep 3 09:01:25 2002
From: tolos at sony.de (Tolos, Marta)
Date: Tue, 3 Sep 2002 18:01:25 +0200
Subject: Comparision between SRILM and CMU
Message-ID:
Hi all,
I have a general question about the toolkit. I have just started using the
SRILM toolkit; before, I always used the CMU toolkit, so I wanted to compare
the language models created with each of them. I built language models from
the same corpus with both toolkits and computed the perplexities with each
toolkit (that is, using the same toolkit for creating the model and for
evaluating its perplexity). The perplexities were quite different, and always
better for SRILM.
I then computed the perplexities of the CMU language models with the SRILM
toolkit and got strange results: most of the time the same CMU language model
performed better when its perplexity was computed with SRILM instead of CMU,
except for one case where the value that SRILM gave was extremely high.
After this, I did it the other way around and used the CMU toolkit to evaluate
the SRILM language models. After some trouble with the format and some special
requirements of the CMU toolkit, I got worse results when using the CMU toolkit
to evaluate the perplexity of the SRILM language models (and when the text used
for evaluating perplexity contained OOV words, CMU gave an error).
My question is: what is the difference in the computation of perplexity between
the two toolkits? And also, what is the meaning of the "ppl1" value that the
SRILM toolkit reports?
Thanks a lot,
Marta
From hliu at inzigo.com Tue Sep 3 09:04:24 2002
From: hliu at inzigo.com (Hongqin Liu)
Date: Tue, 03 Sep 2002 12:04:24 -0400
Subject: mix LM
Message-ID: <3D74DD88.8E7C2BD8@inzigo.com>
Hi,
First I appreciate the quick response from Andreas, the guy with the
Long Quan sword.
His first suggestion reminds me of the mixture LM. Actually I made some
tests on the interpolation approach, including class + word LM. I
always found that the perplexity (and WER) is a linear function of the
interpolation parameter (lambda), so the best results are always at the
end points, which makes the interpolation trivial. Did I miss something, or is
this the case for some domains?
Best,
Hongqin
From stolcke at speech.sri.com Tue Sep 3 09:13:18 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 03 Sep 2002 09:13:18 PDT
Subject: mix LM
In-Reply-To: Your message of Tue, 03 Sep 2002 12:04:24 -0400.
<3D74DD88.8E7C2BD8@inzigo.com>
Message-ID: <200209031613.JAA01419@huge>
In message <3D74DD88.8E7C2BD8 at inzigo.com>you wrote:
> His first suggestion reminds me of the mixture LM. Actually I made some
> tests on the interpolation approach, including class + word LM. I
> always found that the perplexity (and WER) is a linear function of the
> interpolation parameter (lambda), so the best results are always at the
> end points, which makes the interpolation trivial. Did I miss something, or is
> this the case for some domains?
>
Hongqin,
How did you find the best interpolation weight? I hope you didn't
use trial-and-error and used the compute-best-mix script instead.
In my experience the perplexity is not a linear function of lambda,
unless maybe your class-based LM is very bad. Rather, ppl should be
a U-shaped function as lambda varies between 0 and 1.
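To make the shape of that curve concrete, here is a small Python sketch with made-up per-word probabilities (they do not come from any real LM). It assumes the two models assign probabilities to the same sequence of held-out words; `mixture_ppl` is a hypothetical helper, not an SRILM function.

```python
import math

def mixture_ppl(p_a, p_b, lam):
    """Perplexity of the mixture lam*p_a + (1-lam)*p_b over aligned word probabilities."""
    logs = [math.log10(lam * a + (1 - lam) * b) for a, b in zip(p_a, p_b)]
    return 10 ** (-sum(logs) / len(logs))

# Toy data: each model is confident on different words.
p_word = [0.20, 0.01, 0.30, 0.02]
p_class = [0.02, 0.10, 0.05, 0.15]

for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"lambda={lam:.2f}  ppl={mixture_ppl(p_word, p_class, lam):.2f}")
```

Because the log of a mixture is concave in lambda, the perplexity curve is either U-shaped or monotone; it is monotone only when one model has nothing to add on the data used for tuning.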
--Andreas
From stolcke at speech.sri.com Tue Sep 3 09:25:11 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 03 Sep 2002 09:25:11 PDT
Subject: Comparision between SRILM and CMU
In-Reply-To: Your message of Tue, 03 Sep 2002 18:01:25 +0200.
Message-ID: <200209031625.JAA01982@huge>
Marta,
I think what you are doing is an excellent idea, and I'm sure people here
would like to see the results, once you have figured out the bugs.
Regarding your last question: ppl1 is the perplexity excluding
end-of-sentence tokens. That is, you normalize the total log likelihood
by the number of words, rather than (number of words + number of </s> tags)
for computing perplexity. This is a little more meaningful (though not
perfect) when comparing perplexities on test sets that follow different
rules for sentence segmentation.
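In code, the two normalizations look roughly like this. This is a Python sketch with toy numbers, not SRILM source; the function name `perplexities` is hypothetical, and subtracting OOV words from the word count is an assumption based on my understanding of how the toolkit reports these figures.

```python
def perplexities(logprob, words, sentences, oovs=0):
    """Return (ppl, ppl1) from a total log10 probability and token counts."""
    scored_words = words - oovs                        # assumption: OOVs are not scored
    ppl = 10 ** (-logprob / (scored_words + sentences))  # counts one </s> per sentence
    ppl1 = 10 ** (-logprob / scored_words)                # words only
    return ppl, ppl1

# Toy example: one 4-word sentence with total log10 probability -6.49.
ppl, ppl1 = perplexities(logprob=-6.49, words=4, sentences=1)
print(f"ppl = {ppl:.2f}, ppl1 = {ppl1:.2f}")
```

Since the same log likelihood is divided by fewer tokens, ppl1 is never smaller than ppl.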
About the discrepancies between CMU and SRI toolkits: I think the only
way to resolve this is to dump out the word-level probabilities and
compare them one-by-one. This should allow you to tell how the two
differ in their perplexity computation. In SRILM, you can use
ngram -debug 2 -ppl for this. My suspicion is that it has something to do
with the way OOV words are handled.
Also, I'd be interested to know what prevented the SRILM-built LM
from working with the CMU tools. If it's something simple we will
fix it (unless it is clearly a CMU bug).
--Andreas
From hliu at inzigo.com Tue Sep 3 10:12:22 2002
From: hliu at inzigo.com (Hongqin Liu)
Date: Tue, 03 Sep 2002 13:12:22 -0400
Subject: mix LM
References: <200209031613.JAA01419@huge>
Message-ID: <3D74ED76.F46393D4@inzigo.com>
Andreas,
Sorry for bothering you again. I was trying to use 'compute-best-mix' to
get the weight (lambda), but got:
fatal: division by zero attempted
The inputs were two ppl files produced by ngram from the word-based and
class-based models:
./compute-best-mix /home/hliu/language_model/word/lm.word.3.ppl
/home/hliu/language_model/class/lm.class.3.ppl
I guess I missed something in using this script (compute-best-mix)?
Best HL.
From anand at speech.sri.com Tue Sep 3 11:50:24 2002
From: anand at speech.sri.com (Anand Venkataraman)
Date: Tue, 3 Sep 2002 11:50:24 -0700 (PDT)
Subject: mix LM
In-Reply-To: <3D74ED76.F46393D4@inzigo.com> (message from Hongqin Liu on Tue,
03 Sep 2002 13:12:22 -0400)
Message-ID: <200209031850.LAA08883@stockholm>
Hongqin
> fatal: division by zero attempted
>
> The inputs for it were two ppl files from class and word based models,
> respectively by ngram ->
>
> ./compute-best-mix /home/hliu/language_model/word/lm.word.3.ppl
> /home/hliu/language_model/class/lm.class.3.ppl
Did you make sure to use the -classes "file" option to expand classes while
computing ppls using the class lm? If not, the EM algorithm could get
thrown off track by spurious probability values from the oovs. In any
case, if you email the ppl output for the first sentence by each lm, we
could confirm what's going on.
&
From hliu at inzigo.com Wed Sep 4 06:36:32 2002
From: hliu at inzigo.com (Hongqin Liu)
Date: Wed, 04 Sep 2002 09:36:32 -0400
Subject: compute-best-mix
Message-ID: <3D760C5F.1414D47D@inzigo.com>
Hi, Guys,
As suggested by Andreas and Anand, I got two huge files (the ppl output for the
class-based and word-based models), then ran 'compute-best-mix'. It has been
running for 17 hours (?!), and is now in iteration 2 (still running):
iteration 1, lambda = (0.5 0.5), ppl = 4.91204
iteration 2, lambda = (0.514374 0.485626), ppl = 4.908
ppl(word)=4.80, ppl(class)=5.08 for reference.
It seems that it's working, but very slowly; my corpus is 265K sentences,
not too big.
Is this OK? If so, next (if it stops somewhere) should I use ngram
-mix-lm to get the new LM with the lambda from here?
Best,
Hongqin
From hliu at inzigo.com Wed Sep 4 07:06:02 2002
From: hliu at inzigo.com (Hongqin Liu)
Date: Wed, 04 Sep 2002 10:06:02 -0400
Subject: update
Message-ID: <3D761349.864CCE73@inzigo.com>
Hi,
I got iteration 3:
iteration 1, lambda = (0.5 0.5), ppl = 4.91204
iteration 2, lambda = (0.514374 0.485626), ppl = 4.908
iteration 3, lambda = (0.528604 0.471396), ppl = 4.90404
It seems that the final ppl will not be less than that of the word-based
trigram (4.80); in other words, there is no minimum between the two end
points. The other end (class) is 5.08, not too bad. I'll wait until the
iteration stops.
Good day,
Hongqin
From anand at speech.sri.com Wed Sep 4 10:21:56 2002
From: anand at speech.sri.com (Anand Venkataraman)
Date: Wed, 4 Sep 2002 10:21:56 -0700 (PDT)
Subject: compute-best-mix
In-Reply-To: <3D760C5F.1414D47D@inzigo.com> (message from Hongqin Liu on Wed,
04 Sep 2002 09:36:32 -0400)
Message-ID: <200209041721.KAA11960@clara>
Hongqin
> It seems that it's working, but very slow, my corpus is 265K sentences,
> not too big.
17+ hours indicates that something unusual is going on. Is the process
swapping heavily on a slow machine? We have tuned parameters on much
larger corpora in shorter times.
If you are having real efficiency problems, you may want to reduce the size
of your dev test set. After all you only want a representative sample to
estimate the mixture coeffs.
> It seems that the final ppl will not less than that from word-based
> trigram (4.80), in other words, there is no minimum between the two end
The final ppl *MUST* be no larger than either of the individual values in
your case, unless the probabilities in your ppl files have been messed up.
In the worst case, if the algorithm decides that one of the models is useless,
it will converge to a lambda of 0 for that model.
Please check.
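For reference, the kind of EM re-estimation being discussed can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (two models, aligned per-word probabilities, a fixed number of iterations), not the compute-best-mix script; `em_best_lambda` and `mixture_ppl` are hypothetical names and the probabilities are toy values.

```python
import math

def mixture_ppl(p_a, p_b, lam):
    """Perplexity of lam*p_a + (1-lam)*p_b over aligned word probabilities."""
    logs = [math.log10(lam * a + (1 - lam) * b) for a, b in zip(p_a, p_b)]
    return 10 ** (-sum(logs) / len(logs))

def em_best_lambda(p_a, p_b, lam=0.5, iterations=50):
    """Re-estimate the weight of model A by EM, starting from lam."""
    for _ in range(iterations):
        # E-step: posterior probability that model A generated each word.
        posteriors = [lam * a / (lam * a + (1 - lam) * b)
                      for a, b in zip(p_a, p_b)]
        # M-step: the new weight is the average posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam

p_word = [0.20, 0.01, 0.30, 0.02]
p_class = [0.02, 0.10, 0.05, 0.15]
lam = em_best_lambda(p_word, p_class)
print(f"lambda(word) = {lam:.3f}")
print(f"mixture ppl  = {mixture_ppl(p_word, p_class, lam):.2f}")
print(f"word ppl     = {mixture_ppl(p_word, p_class, 1.0):.2f}")
print(f"class ppl    = {mixture_ppl(p_word, p_class, 0.0):.2f}")
```

Each EM step cannot decrease the likelihood of the tuning data, which is why the mixture perplexity on that data should not end up worse than at the starting point.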
&
From stolcke at speech.sri.com Wed Sep 4 13:21:09 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 04 Sep 2002 13:21:09 PDT
Subject: update
In-Reply-To: Your message of Wed, 04 Sep 2002 10:06:02 -0400.
<3D761349.864CCE73@inzigo.com>
Message-ID: <200209042021.NAA01129@zap.speech.sri.com>
Hongqin,
Let me venture a guess: you are using your LM training data to do the mixture
optimization. You should be using a held-out data set that has NOT been
used to estimate the component models. If you are optimizing on the LM training
data then it is no surprise that the word-ngram gets weight 1.
--Andreas
From mirjam.sepesy at uni-mb.si Fri Sep 20 05:01:56 2002
From: mirjam.sepesy at uni-mb.si (Mirjam Sepesy Maucec)
Date: Fri, 20 Sep 2002 14:01:56 +0200
Subject: class LM
Message-ID: <3D8B0E33.3148460E@uni-mb.si>
Hi all,
I have a question about the class-based models. I have just started to
use them.
First I want to understand the test example in the toolkit.
I have trouble understanding the probability computation for
devtest.text.
Can you please explain to me which 1grams, 2grams, 3grams... are meant,
for example in this sentence:
kaybeck and lost ok
p( kaybeck | <s> ) = [1gram][2gram] 0.000845361 [ -3.07296 ] / 1
p( and | kaybeck ...) = [1gram][3gram] 0.443827 [ -0.352786 ] / 1
p( lost | and ...) = [2gram][2gram][4gram][4gram] 0.0305452 [ -1.51506 ] / 1
p( ok | lost ...) = [3gram][3gram][4gram][4gram] 0.0703371 [ -1.15282 ] / 0.999999
p( </s> | ok ...) = [3gram][4gram] 0.401395 [ -0.396428 ] / 1
I am familiar with the class model, where all words are mapped to
classes.
In this example, there are only two classes (GRIDLABEL and
SPELLED_GRIDLABEL) and
in the model we have ngrams of words and ngrams of words and classes.
I understand the idea that if an n-gram of words exists it is better to
use it,
and if not, classes should help.
But what are the steps in probability computation?
Please, help!
Have a nice weekend!
Mirjam
From stolcke at speech.sri.com Fri Sep 20 12:48:34 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 20 Sep 2002 12:48:34 PDT
Subject: class LM
In-Reply-To: Your message of Fri, 20 Sep 2002 14:01:56 +0200.
<3D8B0E33.3148460E@uni-mb.si>
Message-ID: <200209201948.MAA23667@huge>
In message <3D8B0E33.3148460E at uni-mb.si>you wrote:
> Hi all,
>
> I have a question about the class-based models. I have just started to
> use them.
> First I want to understand the test example in the toolkit.
> I have problems with understanding the probability computation of the
> devtest.text
> Can you, please, explain me, which 1grams, 2grams, 3grams.... are meant
> for example in this sentence:
The ClassNgram LM performs dynamic programming to compute the prefix
probabilities of sentences (and from that, the conditional word probabilities).
This is done because a given word can be "generated" by the LM either
as a plain word or as a member of one of several classes (and in the
case of multi-word class expansions at different positions in the
expansion). So the states in the DP trellis correspond to the different
classes and the positions of the word in the expansion. (This is
very similar to the notion of a "dotted item" in context-free
parsing, in case you're familiar with that).
As a result, many N-gram lookups are performed to go from one word to the
next: one for each state transition in the trellis.
So when you see something like
>
> kaybeck and lost ok
>
> p( kaybeck | <s> ) = [1gram][2gram] 0.000845361 [ -3.07296 ] / 1
it means that both a unigram and a bigram lookup happened. In this
particular case this makes sense because p(kaybeck | <s>) is probably
not among the bigrams (hence backoff to unigram), but
p(GRIDLABEL | <s>) is a bigram (kaybeck is a member of class GRIDLABEL).
The probabilities of both cases are summed to obtain the total probability of
the word.
I believe you can set -debug 4 to trace the state transitions in the DP trellis.
It becomes a little unwieldy to follow as the number of states increases as
you get deeper into the sentence.
> p( and | kaybeck ...) = [1gram][3gram] 0.443827 [ -0.352786 ] / 1
> p( lost | and ...) = [2gram][2gram][4gram][4gram] 0.0305452 [ -1.51506 ] / 1
> p( ok | lost ...) = [3gram][3gram][4gram][4gram] 0.0703371 [ -1.15282 ] / 0.999999
> p( </s> | ok ...) = [3gram][4gram] 0.401395 [ -0.396428 ] / 1
>
> I am familiar with the class model, where all words are mapped to
> classes.
> In this example, there are only two classes (GRIDLABEL and
> SPELLED_GRIDLABEL) and
> in the model we have ngrams of words and ngrams of words and classes.
We generalized the class model so that mixed N-grams of words and classes
are allowed for convenience. This is however just equivalent to having
an extra class for each word that contains only that word itself.
>
> I understand the idea that if an n-gram of words exists it is better to
> use it,
> and if not, classes should help.
As I explained above, it's not an either-or. You compute probabilities
for all the ways of generating a word, and sum.
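As a toy illustration of that sum for a single word position, here is a Python sketch. All numbers and the helper name `word_prob` are invented for the example; it ignores the history bookkeeping that the real DP trellis has to do and only shows how the plain-word path and the class path add up.

```python
def word_prob(word, ngram_prob, class_members):
    """Sum the plain-word path and every class path that can generate `word`."""
    total = ngram_prob.get(word, 0.0)               # word generated as itself
    for cls, members in class_members.items():
        if word in members:
            # class predicted by the N-gram, then expanded to the word
            total += ngram_prob.get(cls, 0.0) * members[word]
    return total

# Hypothetical probabilities in the context <s>.
ngram_prob = {"kaybeck": 0.0001, "GRIDLABEL": 0.05}   # p(token | <s>)
class_members = {"GRIDLABEL": {"kaybeck": 0.015}}     # p(word | class)

p = word_prob("kaybeck", ngram_prob, class_members)
print(f"p(kaybeck | <s>) = {p:.5f}")
```

With these made-up values the word path contributes 0.0001 and the class path 0.05 * 0.015, and the two are simply added.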
> But what are the steps in probability computation?
Again, tracing the DP with the -debug option will give you a sense of
the details. You might have to also read the code for ClassNgram::prefixProb()
to get the full picture.
--Andreas