From thomae at ei.tum.de Thu Jul 4 06:54:49 2002
From: thomae at ei.tum.de (Matthias Thomae)
Date: Thu, 04 Jul 2002 15:54:49 +0200
Subject: make-ngram-pfsg cannot handle 1-grams?
Message-ID: <3D2453A9.7050802@ei.tum.de>

Hello,

I am using version 1.3.1 and would like to generate a unigram lm in pfsg
format. I encounter problems when calling make-ngram-pfsg, see example
below. Any ideas?

Regards.
Matthias

-------------------------------------------------------------------------
> cat test.txt
rote kugel
gruene kugel
> ngram-count -text test.txt -lm test.lm -order 1
warning: discount coeff 1 is out of range: -0
tho at odin: ~/worktho/nadia/lm
> cat test.lm

\data\
ngram 1=5

\1-grams:
-0.4929155 </s>
-99 <s>
-0.748188 gruene
-0.4929155 kugel
-0.748188 rote

\end\
> make-ngram-pfsg test.lm
output_for_node: got empty name
 tag undefined in LM

From stolcke at speech.sri.com Thu Jul 4 10:15:39 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 04 Jul 2002 10:15:39 PDT
Subject: make-ngram-pfsg cannot handle 1-grams?
In-Reply-To: Your message of Thu, 04 Jul 2002 15:54:49 +0200. <3D2453A9.7050802@ei.tum.de>
Message-ID: <200207041715.KAA08739@huge>

A clear bug, and my fault. When I changed make-ngram-pfsg to handle
arbitrary N-gram lengths the unigram case was broken (it requires
some special handling).

A fixed version is included below. Put it in $SRILM/utils/src and
"make release" in that directory.

--Andreas

In message <3D2453A9.7050802 at ei.tum.de>you wrote:
> Hello,
>
> I am using version 1.3.1 and would like to generate a unigram lm in pfsg
> format. I encounter problems when calling make-ngram-pfsg, see example
> below. Any ideas?
>
> Regards.
> Matthias
>

#!/usr/local/bin/gawk -f
#
# make-ngram-pfsg --
#       Create a Decipher PFSG from an N-gram language model
#
# usage: make-ngram-pfsg [debug=1] [check_bows=1] [maxorder=N] backoff-lm > pfsg
#
# $Header: /home/srilm/devel/utils/src/RCS/make-ngram-pfsg.gawk,v 1.23 2002/07/04 16:59:59 stolcke Exp $
#

#########################################
#
# Output format specific code
#

BEGIN {
    logscale = 2.30258509299404568402 * 10000.5;
    round = 0.5;
    start_tag = "<s>";
    end_tag = "</s>";
    null = "NULL";

    if ("pid" in PROCINFO) {
        pid = PROCINFO["pid"];
    } else {
        getline pid < "/dev/pid";
    }
    tmpfile = "tmp.pfsg.trans." pid;
    debug = 0;
}

function rint(x) {
    if (x < 0) {
        return int(x - round);
    } else {
        return int(x + round);
    }
}

function scale_log(x) {
    return rint(x * logscale);
}

function output_for_node(name) {
    num_words = split(name, words);

    if (num_words == 0) {
        print "output_for_node: got empty name" > "/dev/stderr";
        exit(1);
    } else if (words[1] == bo_name) {
        return null;
    } else if (words[num_words] == end_tag || \
               words[num_words] == start_tag) {
        return null;
    } else {
        return words[num_words];
    }
}

function node_exists(name) {
    return (name in node_num);
}

function node_index(name) {
    i = node_num[name];
    if (i == "") {
        i = num_nodes ++;
        node_num[name] = i;
        node_string[i] = output_for_node(name);

        if (debug) {
            print "node " i " = " name ", output = " node_string[i] \
                    > "/dev/stderr";
        }
    }
    return i;
}

function start_grammar(name) {
    num_trans = 0;
    num_nodes = 0;
    return;
}

function end_grammar(name) {
    if (!node_exists(start_tag)) {
        print start_tag " tag undefined in LM" > "/dev/stderr";
        exit(1);
    } else if (!node_exists(end_tag)) {
        print end_tag " tag undefined in LM" > "/dev/stderr";
        exit(1);
    }

    printf "%d pfsg nodes\n", num_nodes > "/dev/stderr";
    printf "%d pfsg transitions\n", num_trans > "/dev/stderr";

    print "name " name;
    printf "nodes %s", num_nodes;
    for (i = 0; i < num_nodes; i ++) {
        printf " %s", node_string[i];
    }
    printf "\n";

    print "initial " node_index(start_tag);
    print "final " node_index(end_tag);
    print "transitions " num_trans;
    fflush();

    if (close(tmpfile) < 0) {
        print "error closing tmp file" > "/dev/stderr";
        exit(1);
    }
    system("/bin/cat " tmpfile "; /bin/rm -f " tmpfile);
}

function add_trans(from, to, prob) {
    #print "add_trans " from " -> " to " " prob > "/dev/stderr";
    num_trans ++;
    print node_index(from), node_index(to), scale_log(prob) > tmpfile;
}

#########################################
#
# Generic code for parsing backoff file
#

BEGIN {
    maxorder = 0;
    grammar_name = "PFSG";
    bo_name = "BO";
    check_bows = 0;
    epsilon = 1e-5;             # tolerance for lowprob detection
}

NR == 1 {
    start_grammar(grammar_name);
}

NF == 0 {
    next;
}

/^ngram *[0-9][0-9]*=/ {
    num_grams = substr($2,index($2,"=")+1);
    if (num_grams > 0) {
        order = substr($2,1,index($2,"=")-1);

        # limit maximal N-gram order if desired
        if (maxorder > 0 && order > maxorder) {
            order = maxorder;
        }

        if (order == 1) {
            grammar_name = "UNIGRAM_PFSG";
        } else if (order == 2) {
            grammar_name = "BIGRAM_PFSG";
        } else if (order == 3) {
            grammar_name = "TRIGRAM_PFSG";
        } else {
            grammar_name = "NGRAM_PFSG";
        }
    }
    next;
}

/^\\[0-9]-grams:/ {
    currorder = substr($0,2,1);
    next;
}
/^\\/ {
    next;
}

#
# unigram parsing
#
currorder == 1 {
    first_word = last_word = ngram = $2;
    ngram_prefix = ngram_suffix = "";

    # we need all unigram backoffs (except for </s>),
    # so fill in missing bow where needed
    if (NF == 2 && last_word != end_tag) {
        $3 = 0;
    }
}

#
# bigram parsing
#
currorder == 2 {
    ngram_prefix = first_word = $2;
    ngram_suffix = last_word = $3;
    ngram = $2 " " $3;
}

#
# trigram parsing
#
currorder == 3 {
    first_word = $2;
    last_word = $4;
    ngram_prefix = $2 " " $3;
    ngram_suffix = $3 " " $4;
    ngram = ngram_prefix " " last_word;
}

#
# higher-order N-gram parsing
#
currorder >= 4 && currorder <= order {
    first_word = $2;
    last_word = $(currorder + 1);
    ngram_infix = $3;
    for (i = 4; i <= currorder; i ++ ) {
        ngram_infix = ngram_infix " " $i;
    }
    ngram_prefix = first_word " " ngram_infix;
    ngram_suffix = ngram_infix " " last_word;
    ngram = ngram_prefix " " last_word;
}

#
# shared code for N-grams of all orders
#
currorder <= order {
    prob = $1;
    bow = $(currorder + 2);

    # skip backoffs that exceed maximal order,
    # but always include unigram backoffs
    if (bow != "" && (currorder == 1 || currorder < order)) {
        # remember all LM contexts for creation of N-gram transitions
        bows[ngram] = bow;

        # insert backoff transitions
        if (currorder < order - 1) {
            add_trans(bo_name " " ngram, bo_name " " ngram_suffix, bow);
            add_trans(ngram, bo_name " " ngram, 0);
        } else {
            add_trans(ngram, bo_name " " ngram_suffix, bow);
        }
    }

    if (last_word == start_tag) {
        if (currorder > 1) {
            printf "warning: ignoring ngram into start tag %s -> %s\n", \
                    ngram_prefix, last_word > "/dev/stderr";
        }
    } else {
        # insert N-gram transition to maximal suffix of target context
        if (last_word == end_tag) {
            target = end_tag;
        } else if (ngram in bows || currorder == 1) {
            # the minimal context is unigram
            target = ngram;
        } else if (ngram_suffix in bows) {
            target = ngram_suffix;
        } else {
            target = ngram_suffix;
            for (i = 3; i <= currorder; i ++) {
                target = substr(target, length($i) + 2);
                if (target in bows) break;
            }
        }

        if (currorder == 1 || currorder < order) {
            add_trans(bo_name " " ngram_prefix, target, prob);
        } else {
            add_trans(ngram_prefix, target, prob);
        }

        if (check_bows) {
            if (currorder < order) {
                probs[ngram] = prob;
            }

            if (ngram_suffix in probs && \
                probs[ngram_suffix] + bows[ngram_prefix] - prob > epsilon) {
                printf "warning: ngram loses to backoff %s -> %s\n", \
                        ngram_prefix, last_word > "/dev/stderr";
            }
        }
    }
}

END {
    end_grammar(grammar_name);
}

From thomae at ei.tum.de Mon Jul 8 01:04:50 2002
From: thomae at ei.tum.de (Matthias Thomae)
Date: Mon, 08 Jul 2002 10:04:50 +0200
Subject: 0-grams
Message-ID: <3D2947A2.7040304@ei.tum.de>

Hello,

I'd like to create 0-grams as well as higher-order n-grams, but when I
call ngram-count with option -order 0 I get a segmentation fault (SRI LM
1.3.1).

Regards
Matthias

From stolcke at speech.sri.com Mon Jul 8 11:40:41 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 08 Jul 2002 11:40:41 PDT
Subject: 0-grams
In-Reply-To: Your message of Mon, 08 Jul 2002 10:04:50 +0200. <3D2947A2.7040304@ei.tum.de>
Message-ID: <200207081840.LAA08023@huge>

There are no 0-gram models, mostly because the DARPA format does not
support that. Because of that, SRILM handles the backoff probability mass
at the unigram level in a special way: it is distributed over all unobserved
words. This is equivalent to having a backoff to 0-th order distribution.

In practical terms, you use

    ngram-count -vocab VOCAB -order 1 -lm LM

Since no ngram counts or text data are supplied, the mechanism that
distributes backoff probability mass for unigrams will spread all probability
uniformly over the entire vocabulary (which you have to supply of course).

Of course -order 0 should not make the program core dump -- I'll fix that.

--Andreas

In message <3D2947A2.7040304 at ei.tum.de>you wrote:
> Hello,
>
> I'd like to create 0-grams as well as higher-order n-grams, but when I
> call ngram-count with option -order 0 I get a segmentation fault (SRI LM
> 1.3.1).
>
> Regards
> Matthias
>

From stolcke at speech.sri.com Fri Jul 12 12:20:23 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 12 Jul 2002 12:20:23 PDT
Subject: inquiry about SRI Toolkit
In-Reply-To: Your message of Fri, 12 Jul 2002 10:19:25 -0400.
Message-ID: <200207121920.MAA17618@huge>

Bing,

the answer to your question is in the ppl-scripts(1) man page. The script
"compute-best-mix" will estimate the optimal interpolation weights
given the ppl output from each of the component models on a tuning corpus.

--Andreas

In message you wrote:
>
> Dear Dr. Stolcke,
>
>
> I'm a graduate student at ECE Dept. of Northeastern University.
> Currently, I've been working on the language modelling and trying to
> use the SRI Toolkit to do the LM interpolation.
> As I read your online
> document (the command manual), I can see that the "ngram" and
> "ngram-count" program can perform the interpolation operation. However,
> it seems to me that the "ngram" program can only perform
> model interpolation by given a user defined weight, so usually, how
> can I define such weight? Say, if I have a set of validation data,
> can I use the SRI Toolkit to obtain some optimal weights (by EM algorithm
> or some other optimization strategy)? I do appreciate if you could
> give me some advice on that. Thank you very much.
>
> Best regards
>
> Bing
>

From anand at speech.sri.com Tue Jul 16 12:50:25 2002
From: anand at speech.sri.com (Anand Venkataraman)
Date: Tue, 16 Jul 2002 12:50:25 -0700 (PDT)
Subject: inquiry about SRI Toolkit
In-Reply-To: <200207121920.MAA17618@huge> (message from Andreas Stolcke on Fri, 12 Jul 2002 12:20:23 PDT)
Message-ID: <200207161950.MAA22380@chalumeau>

Bing,

I hope you found the info you want in the man page that Andreas
pointed you at.

If you want an example of how to use the compute-best-mix program,
the following script may be useful. This script will probably be
included into the toolkit with the next release. It computes the mixed
log probability and perplexity on a given corpus according to a
dynamic mixture of up to 6 language models by jack-knifing, i.e., the
mixture coefficients for one half of the corpus are those estimated
using compute-best-mix on the other half.

cheers.

&

#!/bin/ksh
#
# Computes the "fairly" (word level) interpolated probability of the
# given data set using all of the (up to 6) given language models. The
# procedure is to estimate lambdas on one half, mix by this proportion
# on the second half and vice versa. Usage example:
#       compute-mixed-logprob -lm lm1 -lm lm2 ... -text text -sets set1 set2
#
# $Header: $
#

LMS="";
TEXT="-";

PWD=`pwd`
EXPT=`basename $PWD`

function split_lines {
    prefix="lines"
    if [ x$1 = "x-prefix" ]; then
        prefix=$2; shift; shift;
    fi

    gawk -v f1=$prefix.set1 -v f2=$prefix.set2 -v n1=$1 -v n2=$2 '
        BEGIN { n=n1+n2; }
        (NR-1) % n < n1 { print >f1; next; }
        { print >f2; next; }'
}

#----------------------------------------------------------------------
# Main
#
while [ $# -gt 0 ]; do
    case $1 in
        -lm) LMS="$LMS $2"; shift; shift;;
        -lm?flags) LMFLAGS="$2"; shift; shift;;
        -text) TEXT=$2; shift; shift;;
        -expt) EXPT=$2; shift; shift;;
        -sets) set1=$2; set2=$3; shift; shift; shift;;
        *) echo "Incorrect usage. Refer to man page ppl-scripts(1)."; exit 1;
    esac
done

LOG=$EXPT.log
EXPTDIR=`dirname $EXPT`
mkdir -p $EXPTDIR

exec 2>>$LOG
echo "The following is the log of $0 starting at `date`" 1>&2
set -x

# Divide input text into two chunks. This will produce
# $EXPT.set1 and $EXPT.set2
#
if [ -z "$set1" -o -z "$set2" ]; then
    cat $TEXT | split_lines -prefix $EXPT 1 1
    set1=$EXPT.set1
    set2=$EXPT.set2
fi

# Compute logprobs according to each lm on each half.
#
for lm in $LMS; do
    for set in $set1 $set2; do
        ngram $LMFLAGS -debug 2 -lm $lm -ppl $set >$set-`basename $lm`.ppl
    done
done

# Compute best mix
#
for set in $set1 $set2; do
    ppl_files="";
    for lm in $LMS; do
        ppl_files="$ppl_files $set-`basename $lm`.ppl"
    done
    compute-best-mix $ppl_files >$set-lambdas
done

# Interpolate each set, with lambdas from the other set.
#
(echo $set1 $set2; echo $set2 $set1;) | while read s1 s2; do
    main_lm=`echo $LMS | gawk '{print $1}'`
    lm_flags="$LMFLAGS -lm $main_lm"
    if [ ! -s $s1-lambdas ]; then
        echo Could not read $s1-lambdas 1>&2
        exit 1;
    fi
    set `cat $s1-lambdas | sed 's/^.*(\(.*\))/\1/'`
    shift;
    if [ $# -gt 0 ]; then
        mix_lm=`echo $LMS | gawk '{print $2}'`
        lambdas="-lambda $1";
        lm_flags="$lm_flags -mix-lm $mix_lm"
        shift;
    fi
    for i in 2 3 4 5; do
        if [ $# -gt 0 ]; then
            lambdas="$lambdas -mix-lambda$i $1";
            mix_lm=`echo $LMS | gawk -v i=$i '{print $(i+1)}'`
            if [ -z "$mix_lm" ]; then
                echo No mix lm found for lambda $1
                exit;
            fi
            lm_flags="$lm_flags -mix-lm$i $mix_lm"
            shift;
        fi
    done
    ngram_flags="$lm_flags $lambdas"
    ngram $ngram_flags -ppl $s2
done | \
gawk '{ print; }
    $1 ~ /^file$/ { nsents += $3; nwords += $5; noovs += $7; next; }
    $2 ~ /^zeroprobs,$/ { nzeroprobs+= $1; logprob += $4; next; }
    END {
        printf "file both: %d sentences, %d words, %d OOVs\n",
            nsents, nwords, noovs;
        printf "%d zeroprobs, logprob= %g ppl= %g ppl1= %g\n",
            nzeroprobs, logprob,
            10^(-logprob/(nsents+nwords-noovs)),
            10^(-logprob/(nwords-noovs));
    }'

From anand at speech.sri.com Tue Jul 16 12:55:34 2002
From: anand at speech.sri.com (Anand Venkataraman)
Date: Tue, 16 Jul 2002 12:55:34 -0700 (PDT)
Subject: inquiry about SRI Toolkit
In-Reply-To: <200207121920.MAA17618@huge> (message from Andreas Stolcke on Fri, 12 Jul 2002 12:20:23 PDT)
Message-ID: <200207161955.MAA23970@chalumeau>

Man page for compute-mixed-logprob:

compute-mixed-logprob computes the log probability of a given
corpus of text according to the best mixture of the given component
language models. The interpolation is done fairly. That is,
the given corpus is split into two sets (with alternate lines
belonging to different sets) and the mixture coefficients for each
set are those computed using EM on the other set. Up to six
language models may be specified on the command line using the -lm flag.
If the splitting of the corpus into two sets by alternate
line order is not the method desired, the user may explicitly
specify two sets on the command line using -sets set1 set2
instead of giving a single -text corpus option.

The -lm-flags option may be given to supply additional options
passed on to ngram during perplexity calculations, for instance,
if the language models are class language models and a class file
needs to be specified with -classes classfile. Language model
ngram orders may also likewise be passed on to ngram using
-lm-flags '-order n'. All such options that are to be passed to
ngram must be quoted and passed to compute-mixed-logprob as a
single option. However, note that the supplied ngram options will
be used for all the language models specified.

Further, the -expt exptID option may be used to specify the prefix
used for all ancillary files created by the program. The
exptID may include a path and any missing directories in this
path will be created.

Final output will include the ngram outputs for each separate set
and a combined output in the same format for both sets. A logfile
of the procedure is produced in exptID.log

Examples:

    compute-mixed-logprob -expt 001/mix -text swbd.txt -lm swbd.4bo.gz
        -lm bn.3bo.gz -lm ch.3bo.gz
        -lm-flags "-order 4 -classes train400.classes"

    compute-mixed-logprob -expt 001/mix -sets swbd-set1.txt swbd-set2.txt
        -lm swbd.4bo.gz -lm bn.3bo.gz -lm ch.3bo.gz
        -lm-flags "-order 4 -classes train.400classes"

&

From stolcke at speech.sri.com Wed Jul 24 09:01:30 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 24 Jul 2002 09:01:30 PDT
Subject: Help
In-Reply-To: Your message of Wed, 24 Jul 2002 17:10:22 +0100. <002101c2332c$9d6cb540$7b081b93@telecom.tuc.gr>
Message-ID: <200207241601.JAA17949@huge>

Dimitris,

your email to the list did not go through because at the time you sent it
you were not subscribed to the list (to prevent spam we only allow list
members to post).

Regarding your question: indeed the perplexity of the mixed LM should be
much closer to what compute-best-mix outputs.

There are two ways to create an interpolated model:

"on-the-fly"    this is the traditional approach: you keep the component
                models separate, and compute the interpolated probabilities
                when you evaluate the model.
                The command for this is
                        ngram -lm ... -mix-lm ... -lambda L -bayes 0

"merged"        you create a single static model that implements an
                approximation to the on-the-fly method.
                The command for this is
                        ngram -lm ... -mix-lm ... -lambda L
                (no -bayes option). The -write-lm option outputs the
                merged model if desired.

In the "merged" case you only get an approximation because in general it is
not possible to create a single back-off model that exactly implements the
mixed probabilities of the two component models (without expanding out all
possible N-grams and effectively bypassing the backoff mechanism).

As explained in the ICSLP paper, the "merged" approach is usually slightly
better than the traditional interpolation. However, it only works if you
have two models of the same type (both word-based or both class-based).
When you merge a word-based and a class-based model the approximation
doesn't work anymore. I suspect that's what you did in your experiment.
Rerun ngram with the -bayes 0 option and see if you get the perplexity you
expect.

--Andreas

In message <002101c2332c$9d6cb540$7b081b93 at telecom.tuc.gr>you wrote:
> Hi Andreas,
>
> Before five days I sent an e-mail at srilm-user at speech.sri.com
> and I still haven't receive an answer.
> I repeat it here. Please inform me...
>
> Hi
>
> I interpolate a 3-gram with a class 3-gram
>
> The output of compute-best-mix is:
> compute-best-mix debug2-LM1 debug2-LM2
> iteration 19, lambda = (0.849536 0.150464), ppl = 150.787
>
> The PP of the interpolated model at the held-out data I used to take
> debug2-LM1 and debug2-LM2 is 169.52
>
> This ain't to be the same with the output of compute-best-mix, eg 150.787?
> Do I something wrong?
>
> Regards,
> Dimitris
>
>

From stolcke at speech.sri.com Thu Aug 1 09:34:37 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 01 Aug 2002 09:34:37 PDT
Subject: question on the tool
In-Reply-To: Your message of Thu, 01 Aug 2002 08:03:17 -0700. <20020801150317.36624.qmail@web12503.mail.yahoo.com>
Message-ID: <200208011634.g71GYcN19001@huge>

In message <20020801150317.36624.qmail at web12503.mail.yahoo.com>you wrote:
>
> Dear Dr. Stolcke,
>
> I have one more question on your
> "replace-words-with-classes" tool, please.
>
> I used the "ngram-class" program to generate a set of
> classes using some broadcast news corpus
> (223,091 unique words) and specifying the vocab to be
> a 36875 words dict. And the output of the classes
> contains the mapping of 35325 words, as I can see,
> 187,766 OOVs have no mappings for them, since they've
> been treated as the unknown words. This should be no
> problem. But when using the generate classes to
> replace the word-based trans to be class-based trans,
> problem occured. The OOVs could not be mapped into any
> classes (since there is no mapping for such words in
> the classes file), thus they remain there! But in my
> knowledge, if we want to learn the class-ngram, an
> usual form for it to be interpolate with word-ngram is
> like:
>
>  ^
>  P (w3 | w1, w2) = lambda * Pw (w3 | w1, w2)
>                    + (1-lambda) * P (w3 | G3) * Pc (G3| G1, G2)
> where wi belongs to class Gi, i=1, 2, 3, respectively.
>
> So my question is, with the classes/words mixed trans,
> can we really obtain the correct class-ngram
> probabilities?
>
> Here are the commands I've been using:
>
> 1) ngram-class -debug 0 -text -vocab
> <36K-dict> -numclasses 1000 -classes
>
> 2) replace-words-with-classes classes=
>
>

You need to add the class definition for the "unknown" word class
yourself.
I would recommend that you prevent <unk> from being merged with any
other word class. You can do this by creating a file containing <unk>
and then invoking ngram-class -noclass-vocab and that file as argument.
Then you add a new unknown word class to the class definitions
from ngram-class, and put all the remaining words in that class.

(This assumes you actually want your overall LM vocabulary to contain
all 223,091 words. If the word ngram maps those to <unk> then the
class-ngram should do the same, and no modifications to the class
definitions are needed.)

>
> I did a small perl script to post-process the
> mixed-trans, but then I think there could be another
> problem. Too many unknown words will be mapped into
> one single unknown class, which somehow, could disturb
> the real probabilities of the class-ngram that we
> should have.

I'm not sure what you mean by "disturb the real probabilities".
But if you want all the words in the class-lm then they have to get
their probability somewhere, and a single class seems like a reasonable
approach. This will smooth their probabilities when interpolated
with the word ngram, which treats all those low-frequency words as
separate. A more sophisticated approach would maybe try to distinguish
the words based on their morphology, but that would require some
significant work.

> Also, I used the command mentioned in your paper to
> expand the built-up class-ngram model:
>
> ngram -lm -prune 1e-5 -expand-exact 3
> -write-lm
>
> but as I read the expanded model, I can see there
> are only probabilites for class (1,2,3-grams), but
> no membership distribution, i.e. no P (w3 | G3). Then
> how can it be interpolate with the word-level LM
> correctly?

First, the ngram -expand function also needs the -classes option to read
in the class definitions. However, I suspect that with a corpus like BN
it will not be feasible to expand the class-ngram to a word-ngram, there are
just too many word ngrams resulting from such an expansion. Even the pruning
won't help you because pruning happens AFTER the expansion.

You don't need to expand a class-ngram to interpolate with a word ngram.
just use

    ngram -lm WORD-LM -mix-lm CLASS-LM -classes CLASSDEFS \
            -lambda WORD-LM-WEIGHT -bayes 0

followed by other options to compute perplexity etc.

> if there is any news board on using SRI Toolkit, then
> I could turn to the community for help, instead of
> taking too much of your time. Many thanks!

Indeed there is a mailing list for SRILM users. To join, mail the line
"subscribe srilm-user" (in the message body) to majordomo at speech.sri.com.

Regards,

--Andreas

From tolos at sony.de Wed Aug 21 02:06:22 2002
From: tolos at sony.de (Tolos, Marta)
Date: Wed, 21 Aug 2002 11:06:22 +0200
Subject: Backoff missing
Message-ID:

Hi all,

I have a problem using the toolkit, I create a language model using only the
ngram-count command:

ngram-count -text my.text -lm my.arpa -wbdiscount1 -wbdiscount3 -wbdiscount3

My text file has the sentence markers <s> </s>.

And then the arpa file I get, for the unigram </s> has no backoff weight and
also all the bigrams that contain </s> as the second word in the bigram have
no backoff either.
Does someone know how to get the backoff weight? My problem is that the
recognizer complains about the format of my language model, since all the
bigrams without the backoff are not considered and then at the end since
there are so many it stops.

I also have another question about the format of the arpa file created.
Between the probabilities and the words there is not a single space and this
causes problems also with the recognizer I am using. What I am doing right
now to avoid this problem is to use a perl script to fix the format and then
use the converted file that has only a single space, is there an option to
get a single space??

Thanks a lot.

Best,
Marta

From hliu at inzigo.com Wed Aug 21 06:58:51 2002
From: hliu at inzigo.com (Hongqin Liu)
Date: Wed, 21 Aug 2002 09:58:51 -0400
Subject: Bigram pfsg with Nunace
Message-ID: <3D639C9B.2AAAC7CC@inzigo.com>

Hi, all,

I was trying to use Nuance compiler to compile a PFSG Bigram LM. But got
the message:

ERROR: PfsgAssembleUnflattenedNodeArrayFromPfsgArray: Assembled 10852 nodes, but expected only 10299!
FREESimple: not freeing untagged memory at 92f12a0
FREESimple: not freeing untagged memory at 92f12b0
....
FREESimple: not freeing untagged memory at 92f3520
ERROR: CompilePfsg: Couldn't assemble node array for grammar .TOP.
ERROR: GSLCompiler::PFSGToNodeArray: Couldn't compile pfsg into node array

Does anyone know why? I succeeded in trigram, but stuck at bigram.

Many thanks!

Hongqin Liu

From stolcke at speech.sri.com Wed Aug 21 13:48:17 2002
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 21 Aug 2002 13:48:17 PDT
Subject: Backoff missing
In-Reply-To: Your message of Wed, 21 Aug 2002 11:06:22 +0200.
Message-ID: <200208212048.NAA17105@huge>

In message you wrote:
> Hi all,
>
> I have a problem using the toolkit, I create a language model using only the
> ngram-count command:
>
> ngram-count -text my.text -lm my.arpa -wbdiscount1 -wbdiscount3 -wbdiscount3
>
>
> My text file has the sentence markers <s> </s>.
>
> And then the arpa file I get, for the unigram </s> has no backoff weight and
> also all the bigrams that contain </s> as the second word in the bigram have
> no backoff either.
> Does someone know how to get the backoff weight? My problem is that the
> recognizer complains about the format of my language model, since all the
> bigrams without the backoff are not considered and then at the end since
> there are so many it stops.

We get this question a lot. Technically speaking, backoff weights are
only required for N-grams that are prefixes of longer N-grams
(by the definition of backoff weights).
Practically speaking, there is a lot of software out there that
assumes that backoff weights are assigned to all N-grams except those of
highest order. This is very wasteful once you are dealing with pruned
(or so-called "variable length") ngram models.

The script add-dummy-bows will add those backoff weights that your software
is missing.

>
> I also have another question about the format of the arpa file created.
> Between the probabilities and the words there is not a single space and this
> causes problems also with the recognizer I am using. What I am doing right
> now to avoid this problem is to use a perl script to fix the format and then
> use the converted file that has only a single space, is there an option to
> get a single space??

The toolkit outputs a tab after the probabilities and before the
backoff weights, so as to make things line up visually and make the
file more readable. This is also convenient to search for ngrams
or prefixes or suffixes of ngrams in the file (by including \t in your
search pattern).

Again, if your software is too naive about the format then you need to
bridge the gap, just as you have been doing. Since all the tools can
read/write stdio you can do this on the fly with a command like

    ngram-count ... -lm - | my-script-to-replace-tabs-with-spaces | \
            gzip > my-fixed-lm.gz

Hope this helps.
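
The filter stage in the pipeline above could be sketched as follows. This is only a minimal illustration, not part of SRILM: the name my-script-to-replace-tabs-with-spaces in the message is a placeholder, and the function name tabs_to_spaces and the choice of Python are assumptions made here. It reads ARPA-format lines on stdin and writes them to stdout with every tab replaced by one space.

```python
import sys


def tabs_to_spaces(line: str) -> str:
    """Replace every tab in one ARPA-format line with a single space.

    Hypothetical helper, not an SRILM function.
    """
    return line.replace("\t", " ")


def main() -> None:
    # Stream stdin to stdout so the filter can sit in the middle of a pipeline.
    for line in sys.stdin:
        sys.stdout.write(tabs_to_spaces(line))


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as tabs_to_spaces.py, it would take the place of the placeholder script in the command above, between ngram-count and gzip.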
--Andreas From hliu at inzigo.com Tue Sep 3 08:27:00 2002 From: hliu at inzigo.com (Hongqin Liu) Date: Tue, 03 Sep 2002 11:27:00 -0400 Subject: general help Message-ID: <3D74D4C4.2A5F65BB@inzigo.com> Hi, Crouching tigers & hidden dragons: I am using a word based trigram (GT backoff) for an application, and trying to make further improvement. I tried to use class based, but it seemed not so good as word based. Higher gram (4gram) seems also worse than 3gram. The WER (word error rate) I got now is about 8-10%, it seems that there is still some room for improvement. Anyone got good ideas -- within ngram. Thanks in advance. Hongqin Liu From stolcke at speech.sri.com Tue Sep 3 08:45:06 2002 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Tue, 03 Sep 2002 08:45:06 PDT Subject: general help In-Reply-To: Your message of Tue, 03 Sep 2002 11:27:00 -0400. <3D74D4C4.2A5F65BB@inzigo.com> Message-ID: <200209031545.IAA00058@huge> Hongqin, Two suggestions: - interpolate your class-based LM with the word-based one (class-based LMs alone usually don't give an improvement over word-based ones except in very limited domains). - use Kneser-Ney smoothing (with interpolation) for the 4gram LM: -kndiscount1 -interpolate1 -kndiscount2 -interpolate2 -kndiscount3 -interpolate3 -kndiscount4 -interpolate4 You should see a perplexity reduction over the 3gram, and over GT discounting. Of course you never know about WER... --Andreas In message <3D74D4C4.2A5F65BB at inzigo.com> you wrote: > Hi, Crouching tigers & hidden dragons: > > I am using a word based trigram (GT backoff) for an application, and > trying to make further improvement. I tried to use class based, but > it seemed not so good as word based. Higher gram (4gram) seems also worse > than 3gram. The WER (word error rate) I got now is about 8-10%, it seems > that there is still some room for improvement. Anyone got good ideas -- > within ngram. Thanks in advance.
> > Hongqin Liu > > From tolos at sony.de Tue Sep 3 09:01:25 2002 From: tolos at sony.de (Tolos, Marta) Date: Tue, 3 Sep 2002 18:01:25 +0200 Subject: Comparision between SRILM and CMU Message-ID: Hi all, I have a general question about the toolkit. I have just started using this SRILM toolkit, before I always used CMU toolkit, so I wanted to do a comparison between the language models created with one and with the other toolkit. So I created language models with the same corpus using both toolkits, and I computed the perplexities with each toolkit (I mean, that I use the same toolkit for creation and evaluation of the perplexity) and the perplexities were quite different, always better for the SRILM, so then I tried to compute the perplexities of the CMU language models with the SRILM toolkit and then I got strange results, since most of the time the performance of the same CMU language model was better when computing the perplexity with SRILM instead of CMU, except for one case where the value that the SRILM gave was extremely high. After this, I did it the other way around, I used the CMU to evaluate the SRILM language models, and after some trouble because of the format and some special requirements of the CMU toolkit, I got worse results when using the CMU toolkit for evaluating the perplexity of the SRILM language models (and when the text used for evaluating perplexity contained OOV words, CMU gave an error.) My question is what is the difference in the computation of perplexity in the two toolkits. And also what is the meaning of the "ppl1" that SRILM toolkit gives. Thanks a lot, Marta From hliu at inzigo.com Tue Sep 3 09:04:24 2002 From: hliu at inzigo.com (Hongqin Liu) Date: Tue, 03 Sep 2002 12:04:24 -0400 Subject: mix LM Message-ID: <3D74DD88.8E7C2BD8@inzigo.com> Hi, First I appreciate the quick response from Andreas, the guy with the Long Quan sword. His first suggestion reminds me of the mixture LM.
Actually I made some tests on the interpolation approach, including class + word LM. I always found that the perplexity (and WER) is a linear function of the interpolation parameter (Lambda), so the best results are always at the ends, which makes the interpolation trivial. Did I miss something, or is it the case for some domains? Best, Hongqin From stolcke at speech.sri.com Tue Sep 3 09:13:18 2002 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Tue, 03 Sep 2002 09:13:18 PDT Subject: mix LM In-Reply-To: Your message of Tue, 03 Sep 2002 12:04:24 -0400. <3D74DD88.8E7C2BD8@inzigo.com> Message-ID: <200209031613.JAA01419@huge> In message <3D74DD88.8E7C2BD8 at inzigo.com> you wrote: > His first suggestion reminds me of the mixture LM. Actually I made some > tests on the interpolation approach, including class + word LM. I > always found that the perplexity (and WER) is a linear function of the > interpolation parameter (Lambda), so the best results are always at the > ends, which makes the interpolation trivial. Did I miss something, or is it > the case for some domains? > Hongqin, how did you find the best interpolation weight? I hope you didn't use trial-and-error and used the compute-best-mix script instead. In my experience the perplexity is not a linear function of lambda, unless maybe your class-based LM is very bad. Rather, ppl should be a U-shaped function as lambda varies between 0 and 1. --Andreas From stolcke at speech.sri.com Tue Sep 3 09:25:11 2002 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Tue, 03 Sep 2002 09:25:11 PDT Subject: Comparision between SRILM and CMU In-Reply-To: Your message of Tue, 03 Sep 2002 18:01:25 +0200. Message-ID: <200209031625.JAA01982@huge> In message you wrote: > Hi all, > > I have a general question about the toolkit. I have just started using this > SRILM toolkit, before I always used CMU toolkit, so I wanted to do a > comparison between the language models created with one and with the other > toolkit.
So I created language models with the same corpus using both > toolkits, and I computed the perplexities with each toolkit (I mean, that I > use the same toolkit for creation and evaluation of the perplexity) and the > perplexities were quite different, always better for the SRILM, so then I > tried to compute the perplexities of the CMU language models with the SRILM > toolkit and then I got strange results, since most of the time the > performance of the same CMU language model was better when computing the > perplexity with SRILM instead of CMU, except for one case where the value > that the SRILM gave was extremely high. After this, I did it the other way > around, I used the CMU to evaluate the SRILM language models, and after > some trouble because of the format and some special requirements of the CMU > toolkit, I got worse results when using the CMU toolkit for evaluating the > perplexity of the SRILM language models (and when the text used for > evaluating perplexity contained OOV words, CMU gave an error.) My question > is what is the difference in the computation of perplexity in the two > toolkits. And also what is the meaning of the "ppl1" that SRILM toolkit > gives. Marta, I think what you are doing is an excellent idea, and I'm sure people here would like to see the results, once you have figured out the bugs. Regarding your last question: ppl1 is the perplexity excluding end-of-sentence tokens. That is, you normalize the total log likelihood by the number of words, rather than (number of words + number of </s> tags) for computing perplexity. This is a little more meaningful (though not perfect) when comparing perplexities on test sets that follow different rules for sentence segmentation. About the discrepancies between CMU and SRI toolkits: I think the only way to resolve this is to dump out the word-level probabilities and compare them one-by-one. This should allow you to tell how the two differ in their perplexity computation.
In SRILM, you can use ngram -debug 2 -ppl for this. My suspicion is that it has something to do with the way OOV words are handled. Also, I'd be interested to know what prevented the SRILM-built LM from working with the CMU tools. If it's something simple we will fix it (unless it is clearly a CMU bug). --Andreas From hliu at inzigo.com Tue Sep 3 10:12:22 2002 From: hliu at inzigo.com (Hongqin Liu) Date: Tue, 03 Sep 2002 13:12:22 -0400 Subject: mix LM References: <200209031613.JAA01419@huge> Message-ID: <3D74ED76.F46393D4@inzigo.com> Andreas, Sorry for bothering you again. I was trying to use 'compute-best-mix' to get the weight (lambda), but got: fatal: division by zero attempted The inputs for it were two ppl files from class and word based models, respectively by ngram -> ./compute-best-mix /home/hliu/language_model/word/lm.word.3.ppl /home/hliu/language_model/class/lm.class.3.ppl I guess I missed something in using this script (compute-best-mix)? Best HL. Stolcke wrote: > In message <3D74DD88.8E7C2BD8 at inzigo.com> you wrote: > > > His first suggestion reminds me of the mixture LM. Actually I made some > > tests on the interpolation approach, including class + word LM. I > > always found that the perplexity (and WER) is a linear function of the > > interpolation parameter (Lambda), so the best results are always at the > > ends, which makes the interpolation trivial. Did I miss something, or is it > > the case for some domains? > > > > Hongqin, > > how did you find the best interpolation weight? I hope you didn't > use trial-and-error and used the compute-best-mix script instead. > In my experience the perplexity is not a linear function of lambda, > unless maybe your class-based LM is very bad. Rather, ppl should be > a U-shaped function as lambda varies between 0 and 1.
> > --Andreas From anand at speech.sri.com Tue Sep 3 11:50:24 2002 From: anand at speech.sri.com (Anand Venkataraman) Date: Tue, 3 Sep 2002 11:50:24 -0700 (PDT) Subject: mix LM In-Reply-To: <3D74ED76.F46393D4@inzigo.com> (message from Hongqin Liu on Tue, 03 Sep 2002 13:12:22 -0400) Message-ID: <200209031850.LAA08883@stockholm> Hongqin > fatal: division by zero attempted > > The inputs for it were two ppl files from class and word based models, > respectively by ngram -> > > ./compute-best-mix /home/hliu/language_model/word/lm.word.3.ppl > /home/hliu/language_model/class/lm.class.3.ppl Did you make sure to use the -classes "file" option to expand classes while computing ppls using the class lm? If not, the EM algorithm could get thrown off track by spurious probability values from the OOVs. In any case, if you email the ppl output for the first sentence by each lm, we could confirm what's going on. & From hliu at inzigo.com Wed Sep 4 06:36:32 2002 From: hliu at inzigo.com (Hongqin Liu) Date: Wed, 04 Sep 2002 09:36:32 -0400 Subject: compute-best-mix Message-ID: <3D760C5F.1414D47D@inzigo.com> Hi, Guys, As suggested by Andreas and Anand, I got two huge files (.ppls for class-based and word-based), then ran 'compute-best-mix'. It has been running for 17 hours (?!), and is now in iteration 2 (running...): iteration 1, lambda = (0.5 0.5), ppl = 4.91204 iteration 2, lambda = (0.514374 0.485626), ppl = 4.908 ppl(word)=4.80, ppl(class)=5.08 for reference. It seems that it's working, but very slow, my corpus is 265K sentences, not too big. Is this OK? If so, next (if it will stop somewhere) I should use ngram -mix-lm to get the new LM with the lambda from here?
Best, Hongqin From hliu at inzigo.com Wed Sep 4 07:06:02 2002 From: hliu at inzigo.com (Hongqin Liu) Date: Wed, 04 Sep 2002 10:06:02 -0400 Subject: update Message-ID: <3D761349.864CCE73@inzigo.com> Hi, I got iteration 3: iteration 1, lambda = (0.5 0.5), ppl = 4.91204 iteration 2, lambda = (0.514374 0.485626), ppl = 4.908 iteration 3, lambda = (0.528604 0.471396), ppl = 4.90404 It seems that the final ppl will not be less than that from the word-based trigram (4.80), in other words, there is no minimum between the two end points. The other end (class) is 5.08, not too bad. I'll wait until the iteration stops. Good day, Hongqin From anand at speech.sri.com Wed Sep 4 10:21:56 2002 From: anand at speech.sri.com (Anand Venkataraman) Date: Wed, 4 Sep 2002 10:21:56 -0700 (PDT) Subject: compute-best-mix In-Reply-To: <3D760C5F.1414D47D@inzigo.com> (message from Hongqin Liu on Wed, 04 Sep 2002 09:36:32 -0400) Message-ID: <200209041721.KAA11960@clara> Hongqin > It seems that it's working, but very slow, my corpus is 265K sentences, > not too big. 17+ hours indicates that something unusual is going on. Is the process swapping heavily on a slow machine? We have tuned parameters on much larger corpora in shorter times. If you are having real efficiency problems, you may want to reduce the size of your dev test set. After all you only want a representative sample to estimate the mixture coeffs. > It seems that the final ppl will not be less than that from the word-based > trigram (4.80), in other words, there is no minimum between the two end The ppl *MUST* be at most as large as either of the individual values in your case, unless the probabilities in your ppl files have been messed up. In the worst case, if the algorithm decides that either model is useless, it will converge to a lambda of 0 for it. Please check.
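The EM estimation of the mixture weight discussed in this thread can be sketched in a few lines. The Python below is a simplified illustration of the idea, not the compute-best-mix script itself: the function names, the two-model restriction, and the stopping rule are assumptions made here, and it expects the per-word probabilities of both models on the same held-out words as plain lists.

```python
import math


def best_mix(p1, p2, lam=0.5, tol=1e-6, max_iter=100):
    """Estimate the weight of model 1 in a two-model linear mixture by EM.

    p1[i] and p2[i] are the probabilities the two LMs assign to word i of
    the same held-out text. Assumption: for every word at least one of
    the two is greater than zero, otherwise the posterior below divides
    by zero.
    """
    for _ in range(max_iter):
        # E-step: posterior probability that model 1 generated each word.
        post = [lam * a / (lam * a + (1.0 - lam) * b) for a, b in zip(p1, p2)]
        # M-step: the new weight is the average posterior.
        new_lam = sum(post) / len(post)
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam


def mix_perplexity(p1, p2, lam):
    """Perplexity of the mixture lam * p1 + (1 - lam) * p2 on the same words."""
    log_sum = sum(math.log(lam * a + (1.0 - lam) * b) for a, b in zip(p1, p2))
    return math.exp(-log_sum / len(p1))
```

Each EM step cannot increase the perplexity on the data it is run on, which is why the optimized mixture should never come out worse than the better of the two component models on that data. A word with probability zero under both models is one way to get a division by zero in this kind of computation.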
& From stolcke at speech.sri.com Wed Sep 4 13:21:09 2002 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Wed, 04 Sep 2002 13:21:09 PDT Subject: update In-Reply-To: Your message of Wed, 04 Sep 2002 10:06:02 -0400. <3D761349.864CCE73@inzigo.com> Message-ID: <200209042021.NAA01129@zap.speech.sri.com> Hongqin, let me venture a guess: you are using your LM training data to do the mixture optimization. You should be using a held-out data set that has NOT been used to estimate the component models. If you are optimizing on the LM training data then it is no surprise that the word-ngram gets weight 1. --Andreas In message <3D761349.864CCE73 at inzigo.com> you wrote: > Hi, > > I got iteration 3: > > iteration 1, lambda = (0.5 0.5), ppl = 4.91204 > iteration 2, lambda = (0.514374 0.485626), ppl = 4.908 > iteration 3, lambda = (0.528604 0.471396), ppl = 4.90404 > > It seems that the final ppl will not be less than that from the word-based > trigram (4.80), in other words, there is no minimum between the two end > points. The other end (class) is 5.08, not too bad. I'll wait until the > iteration stops. > > Good day, > > Hongqin > > > From mirjam.sepesy at uni-mb.si Fri Sep 20 05:01:56 2002 From: mirjam.sepesy at uni-mb.si (Mirjam Sepesy Maucec) Date: Fri, 20 Sep 2002 14:01:56 +0200 Subject: class LM Message-ID: <3D8B0E33.3148460E@uni-mb.si> Hi all, I have a question about the class-based models. I have just started to use them. First I want to understand the test example in the toolkit. I have problems with understanding the probability computation of the devtest.text Can you, please, explain to me which 1grams, 2grams, 3grams.... are meant for example in this sentence: kaybeck and lost ok p( kaybeck | <s> ) = [1gram][2gram] 0.000845361 [ -3.07296 ] / 1 p( and | kaybeck ...) = [1gram][3gram] 0.443827 [ -0.352786 ] / 1 p( lost | and ...) = [2gram][2gram][4gram][4gram] 0.0305452 [ -1.51506 ] / 1 p( ok | lost ...)
= [3gram][3gram][4gram][4gram] 0.0703371 [ -1.15282 ] / 0.999999 p( </s> | ok ...) = [3gram][4gram] 0.401395 [ -0.396428 ] / 1 I am familiar with the class model, where all words are mapped to classes. In this example, there are only two classes (GRIDLABEL and SPELLED_GRIDLABEL) and in the model we have ngrams of words and ngrams of words and classes. I understand the idea, that if an n-gram of words exists, it is better to use it and if not, classes should help. But what are the steps in probability computation? Please, help! Have a nice weekend! Mirjam From stolcke at speech.sri.com Fri Sep 20 12:48:34 2002 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Fri, 20 Sep 2002 12:48:34 PDT Subject: class LM In-Reply-To: Your message of Fri, 20 Sep 2002 14:01:56 +0200. <3D8B0E33.3148460E@uni-mb.si> Message-ID: <200209201948.MAA23667@huge> In message <3D8B0E33.3148460E at uni-mb.si> you wrote: > Hi all, > > I have a question about the class-based models. I have just started to > use them. > First I want to understand the test example in the toolkit. > I have problems with understanding the probability computation of the > devtest.text > Can you, please, explain to me which 1grams, 2grams, 3grams.... are meant > for example in this sentence: The ClassNgram LM performs dynamic programming to compute the prefix probabilities of sentences (and from that, the conditional word probabilities). This is done because a given word can be "generated" by the LM either as a plain word or as a member of one of several classes (and in the case of multi-word class expansions at different positions in the expansion).
So the states in the DP trellis correspond to the different classes and the positions of the word in the expansion. (This is very similar to the notion of a "dotted item" in context-free parsing, in case you're familiar with that). As a result, many N-gram lookups are performed to go from one word to the next: one for each state transition in the trellis. So when you see something like > > kaybeck and lost ok > > p( kaybeck | <s> ) = [1gram][2gram] 0.000845361 [ -3.07296 ] / 1 it means that both a unigram and a bigram lookup happened. In this particular case this makes sense because p(kaybeck | <s>) is probably not among the bigrams (hence backoff to unigram), but p(GRIDLABEL | <s>) is a bigram (kaybeck is a member of class GRIDLABEL). The probabilities of both cases are summed to obtain the total probability of the word. I believe you can set -debug 4 to trace the state transitions in the DP trellis. It becomes a little unwieldy to follow as the number of states increases as you get deeper into the sentence. > p( and | kaybeck ...) = [1gram][3gram] 0.443827 [ -0.352786 ] / 1 > p( lost | and ...) = [2gram][2gram][4gram][4gram] 0.0305452 [ -1.51506 > ] / 1 > p( ok | lost ...) = [3gram][3gram][4gram][4gram] 0.0703371 [ -1.15282 > ] / 0.999999 > p( </s> | ok ...) = [3gram][4gram] 0.401395 [ -0.396428 ] / 1 > > I am familiar with the class model, where all words are mapped to > classes. > In this example, there are only two classes (GRIDLABEL and > SPELLED_GRIDLABEL) and > in the model we have ngrams of words and ngrams of words and classes. We generalized the class model so that mixed N-grams of words and classes are allowed for convenience. This is however just equivalent to having an extra class for each word that contains only that word itself. > > I understand the idea, that if an n-gram of words exists, it is better to > use it > and if not, classes should help. As I explained above, it's not an either-or. You compute probabilities for all the ways of generating a word, and sum.
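The "sum over all the ways of generating a word" can be illustrated for the first word of a sentence, where there is only one context (<s>) and no trellis yet. The Python below is a toy sketch, not the ClassNgram code: word_prob is a made-up name, the numbers are invented for illustration and are not taken from the toolkit's test example, and multi-word class expansions are ignored.

```python
def word_prob(word, context_probs, class_members):
    """Total probability of `word` in one fixed context.

    context_probs maps a token (a plain word or a class name) to its
    N-gram probability in the current context, after any backoff.
    class_members maps a class name to {member word: P(word | class)}.
    Assumption: every class expands to a single word.
    """
    # Path 1: the word is generated directly as a plain word.
    total = context_probs.get(word, 0.0)
    # Path 2..n: the word is generated as a member of some class.
    for cls, members in class_members.items():
        total += context_probs.get(cls, 0.0) * members.get(word, 0.0)
    return total


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
context_probs = {"kaybeck": 0.0001, "GRIDLABEL": 0.05}
class_members = {"GRIDLABEL": {"kaybeck": 0.01}}
p_total = word_prob("kaybeck", context_probs, class_members)
```

With these made-up values the plain-word path contributes 0.0001 and the class path contributes 0.05 * 0.01 = 0.0005, so the total is 0.0006. Later in the sentence the same sum has to be taken over every trellis state that can precede the word, which is what the dynamic programming handles.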
> But what are the steps in probability computation? Again, tracing the DP with the -debug option will give you a sense of the details. You might also have to read the code for ClassNgram::prefixProb() to get the full picture. --Andreas