From jachym at kky.zcu.cz Fri Nov 21 04:15:19 2003
From: jachym at kky.zcu.cz (Jachym Kolar)
Date: Fri, 21 Nov 2003 13:15:19 +0100
Subject: Question about hidden-ngram
Message-ID: <1069416919.3fbe01d79a3b7@webmail.zcu.cz>

Hi,

I've just tried the hidden-ngram tool to automatically punctuate an
unpunctuated text, but I got some unexpected results: every word was tagged
with *noevent*.

I used a training text in the following form:

...
for more than a century the fingerprint has been the quintessential piece
of crime scene evidence
but now the palm is getting its due
...

Then I trained a 3-gram model with:

ngram-count -write-vocab vocabulary -tolower -text trainingtext -write output -lm lmfile

... and then I used the hidden-ngram tool with the following options:

hidden-ngram -text test4.txt -lm lmfile -tolower -hidden-vocab tags -continuous -posteriors

... and received something like this:

6 *noevent* 0.998811 0.00117427 1.46659e-05 7.92597e-10
měsíců *noevent* 0.999898 9.326e-05 9.07804e-06 4.61643e-10
do *noevent* 1 4.19776e-09 5.76912e-09 6.25918e-12
jednoho *noevent* 0.999998 4.18691e-07 1.24419e-06 8.63805e-11
roku *noevent* 0.197671 0.801881 0.000340206 0.000107651
jak *noevent* 0.99997 2.44243e-05 1.32587e-06 4.09674e-06
je *noevent* 0.999857 0.000142836 2.47722e-07 2.47757e-07
to *noevent* 0.972235 0.0266202 0.000937748 0.000206936
*noevent* 0.979455 0.0205446 2.70218e-07 1.33261e-07
uvedeno *noevent* 0.933133 0.0538742 0.0129924 6.16205e-08
na *noevent* 0.999965 4.71218e-07 3.39777e-05 1.57228e-07
výrobku *noevent* 0.736376 0.168451 0.0947272 0.00044499

Please, can somebody tell me what I did wrong? And is there a tool in SRILM to
obtain a text-map from the training text?

Thanks
Jachym

From carmena at mailandnews.com Fri Nov 21 05:32:17 2003
From: carmena at mailandnews.com (Carmen Alvarez)
Date: Fri, 21 Nov 2003 08:32:17 -0500
Subject: Question about hidden-ngram
References: <1069416919.3fbe01d79a3b7@webmail.zcu.cz>
Message-ID: <009601c3b033$e26f9510$fcabca18@Beige>

Try the flag -force-event for hidden-ngram:

hidden-ngram -text test4.txt -lm lmfile -tolower -hidden-vocab tags -continuous -posteriors -force-event

Carmen

From stolcke at speech.sri.com Fri Nov 21 09:23:35 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 21 Nov 2003 09:23:35 PST
Subject: Question about hidden-ngram
In-Reply-To: Your message of Fri, 21 Nov 2003 08:32:17 -0500.
             <009601c3b033$e26f9510$fcabca18@Beige>
Message-ID: <200311211723.JAA28488@huge>

In message <009601c3b033$e26f9510$fcabca18 at Beige> you wrote:
> Try the flag -force-event for hidden-ngram:
>
> hidden-ngram -text test4.txt -lm lmfile -tolower -hidden-vocab
> tags -continuous -posteriors -force-event
>

The -force-event flag is only appropriate if you encode the absence
of punctuation by a special tag, too.

I suspect the problem is in the training of the LM.
Your training data sample has a single sentence split across 3 lines.
Yet the standard behavior of ngram-count is that each line represents
one sentence, so the <s> and </s> tags are added on each line.

What you need to do to match the hidden-ngram -continuous way of
running the LM is to train an LM on a continuous stream of tokens,
without <s> and </s> at the line breaks.
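
To make the mismatch concrete, here is a minimal sketch of the two counting
conventions on the training sample from this thread. It is a toy illustration
only, not how ngram-count or continuous-ngram-count are implemented: the
function names are made up for this example, and only trigrams are counted
(the real tools also record lower-order n-grams).

```python
from collections import Counter


def ngram_counts(tokens, order):
    """Count all n-grams of exactly `order` tokens in a token list."""
    return Counter(
        tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)
    )


def per_line_counts(lines, order=3):
    """One sentence per line: wrap every line in <s> ... </s> before counting."""
    counts = Counter()
    for line in lines:
        counts.update(ngram_counts(["<s>"] + line.split() + ["</s>"], order))
    return counts


def continuous_counts(lines, order=3):
    """Treat the whole text as one token stream and ignore line breaks."""
    tokens = [tok for line in lines for tok in line.split()]
    return ngram_counts(tokens, order)


lines = [
    "for more than a century the fingerprint has been the quintessential piece",
    "of crime scene evidence",
    "but now the palm is getting its due",
]
per_line = per_line_counts(lines)
continuous = continuous_counts(lines)

# A trigram that spans the first line break, and one that ends at it.
across_break = ("quintessential", "piece", "of")
at_break = ("quintessential", "piece", "</s>")
print("across break:", per_line[across_break], continuous[across_break])
print("at break:    ", per_line[at_break], continuous[at_break])
```

With per-line counting, the model never sees word sequences that cross a line
break and instead learns sentence-boundary n-grams at every line end, which
is not what a model applied to a continuous token stream should expect.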
You can train such an LM like this:

continuous-ngram-count order=3 trainingtext | \
    ngram-count -read - -write-vocab vocabulary -tolower -write output -lm lmfile

The continuous-ngram-count script is documented in the training-scripts(1)
man page. It generates counts that ignore line breaks.

Hope this solves your problem. I should note that using a word-based LM
for punctuation restoration is probably not going to work very well,
unless your vocabulary is small and/or you have tons of training data.
A class-based LM, or an interpolated word/class LM, should do better.

--Andreas

From yangl at ecn.purdue.edu Fri Nov 21 09:43:38 2003
From: yangl at ecn.purdue.edu (Yang Liu)
Date: Fri, 21 Nov 2003 12:43:38 -0500 (EST)
Subject: Question about hidden-ngram
In-Reply-To: <200311211723.JAA28488@huge>
References: <200311211723.JAA28488@huge>
Message-ID:

I guess Andreas has probably answered some of your questions; I just want to
add something that he skipped. Here is one line from your output:

roku *noevent* 0.197671 0.801881 0.000340206 0.000107651

It means that at that interword boundary a comma is the most likely
punctuation (those numbers are the posterior probabilities for each tag), yet
you said that you got 'noevent' for every location. Your output looks okay
to me.

--Yang

From stolcke at speech.sri.com Tue Dec 9 13:04:43 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 09 Dec 2003 13:04:43 PST
Subject: SRILM nan probablities in language models.
In-Reply-To: Your message of Tue, 09 Dec 2003 14:00:55 +0200.
Message-ID: <200312092104.NAA26683@tonga>

In message you wrote:
> Dear Mr. Stolcke,
>
> I'm a CS M.Sc. student in the Technion, Israel. I've been using SRILM
> during the last few months for tagging Hebrew.
>
> I use ngram-count for creating language models. I tried it on 5
> randomly created test sets.
> In 4 out of 5, the language model was created successfully, although I
> got warnings such as:
>
> warning: discount coeff X is out of range: Y (two warnings or so for
> each file).
>
> But for one set I get a "nan" probability for all the unigrams (but the
> bigram probabilities seem OK), and of course, disambig performs poorly
> with this language model.
> I have no idea what the difference is between this text file and the others.
>
> The command line I used:
>
> ngram-count -order 2 -text train.tagseq -lm train.lm.bigram
>
> I attached the input text file, the output language model, and the debug
> messages I got (with -debug 1)
>
> I would be very grateful if you could help me find out what the problem
> is and how I can solve it.

This is a bug in the GT discounting method that was fixed recently.
A quick patch is included below. It will also be fixed in the next release.
(You will have to apply this patch by hand since the RCS IDs differ from
your version.)

--Andreas

*** /tmp/T00Q9yJ8 Tue Dec 9 13:01:25 2003
--- Discount.cc Tue Nov 11 11:35:29 2003
***************
*** 5,14 ****
   */

  #ifndef lint
! static char Copyright[] = "Copyright (c) 1995-2002 SRI International. All Rights Reserved.";
! static char RcsId[] = "@(#)$Header: /home/srilm/devel/lm/src/RCS/Discount.cc,v 1.18 2003/08/03 18:52:54 stolcke Exp $";
  #endif

  #include "Discount.h"
  #include "Array.cc"
--- 5,19 ----
   */

  #ifndef lint
! static char Copyright[] = "Copyright (c) 1995-2003 SRI International. All Rights Reserved.";
! static char RcsId[] = "@(#)$Header: /home/srilm/devel/lm/src/RCS/Discount.cc,v 1.19 2003/11/11 19:35:20 stolcke Exp $";
  #endif

+ #include
+ #if defined(sun) || defined(sgi)
+ #include
+ #endif
+
  #include "Discount.h"
  #include "Array.cc"
***************
*** 193,199 ****
      double coeff0 = (i + 1) * (double)countOfCounts[i+1] /
                          (i * (double)countOfCounts[i]);
      coeff = (coeff0 - commonTerm) / (1.0 - commonTerm);
!     if (coeff <= Prob_Epsilon || coeff0 > 1.0) {
          cerr << "warning: discount coeff " << i
               << " is out of range: " << coeff << "\n";
          coeff = 1.0;
--- 198,204 ----
      double coeff0 = (i + 1) * (double)countOfCounts[i+1] /
                          (i * (double)countOfCounts[i]);
      coeff = (coeff0 - commonTerm) / (1.0 - commonTerm);
!     if (!finite(coeff) || coeff <= Prob_Epsilon || coeff0 > 1.0) {
          cerr << "warning: discount coeff " << i
               << " is out of range: " << coeff << "\n";
          coeff = 1.0;

From teknee at yahoo.com Fri Dec 12 07:40:54 2003
From: teknee at yahoo.com (Jean-Francois Beaumont)
Date: Fri, 12 Dec 2003 10:40:54 -0500 (EST)
Subject: MAP adaptation in SRILM
Message-ID: <20031212154054.96724.qmail@web41713.mail.yahoo.com>

Hi,

I wonder if anyone has implemented MAP adaptation in SRILM yet. Is it
available somewhere?

Thanks!

JF

From stolcke at speech.sri.com Fri Dec 12 15:39:45 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 12 Dec 2003 15:39:45 PST
Subject: MAP adaptation in SRILM
In-Reply-To: Your message of Fri, 12 Dec 2003 10:40:54 -0500.
             <20031212154054.96724.qmail@web41713.mail.yahoo.com>
Message-ID: <200312122339.PAA03600@huge>

In message <20031212154054.96724.qmail at web41713.mail.yahoo.com> you wrote:
> Hi,
>
> I wonder if anyone has implemented MAP adaptation in SRILM yet. Is it
> available somewhere?
>

Jean-Francois,

there are two ways to do something like MAP adaptation.

The traditional way to do MAP adaptation estimates a new model from a
weighted mixture of background data counts and adaptation data counts.
You can do this easily by manipulating the N-gram count files and then
giving the combined counts to ngram-count to estimate a new model.
For example, say you have your background data in BDATA, and your
adaptation data in ADATA, then you would do something like

ngram-count -text BDATA -write BDATA.counts
ngram-count -text ADATA -write ADATA.counts
cat BDATA.counts ADATA.counts ADATA.counts ADATA.counts | \
    ngram-count -read - -lm ADAPTED-LM

(I'm omitting options controlling ngram order, smoothing, etc.)
In this case I'm weighting the adaptation data 3 times, just by repeating
the counts. In general you want to write a little script that takes a
count file and multiplies the counts by some constant (i.e., the
adaptation weight).

However, this approach has some problems, because by manipulating the
data through weighting (with a weight other than 1) you are messing up
the count-of-count statistics that underlie most of the discounting
schemes (GT and KN). So you might have to use a smoothing algorithm such
as Witten-Bell that doesn't care about this, but isn't as good.
If you want to use a non-integer adaptation weight you have to use
ngram-count -float-counts, which limits your choice of smoothing
algorithms in a similar way.

The other, more commonly used LM adaptation approach is simple model
interpolation. You could achieve a similar effect as in the example above
(weighting the adaptation data three times) with

ngram-count -text BDATA -lm BDATA.lm
ngram-count -text ADATA -lm ADATA.lm
ngram -lm ADATA.lm -mix-lm BDATA.lm -lambda L -write-lm ADAPTED-LM

where L is the weight given to the *model* for the adaptation data
(as opposed to the data itself). Because the two source models are
normalized first and then combined, the value of L will be less dependent
on the relative size of BDATA versus ADATA.
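
The two approaches can be compared on toy data. The following is a minimal
sketch, assuming unsmoothed maximum-likelihood unigram estimates (which is
not what ngram-count produces by default); the function names and the toy
corpora are made up for this illustration. Under these assumptions, adding
the adaptation counts k times gives the same model as interpolating with a
weight of k*N_A / (N_B + k*N_A) on the adaptation model, where N_A and N_B
are the token totals of the adaptation and background data.

```python
from collections import Counter


def mle(counts):
    """Unsmoothed maximum-likelihood unigram probabilities from counts."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def weighted_count_model(bg_counts, ad_counts, weight):
    """Approach 1: add the adaptation counts `weight` times, then estimate."""
    merged = Counter(bg_counts)
    for w, c in ad_counts.items():
        merged[w] += weight * c
    return mle(merged)


def interpolated_model(bg_counts, ad_counts, lam):
    """Approach 2: estimate both models, then mix with weight lam on the
    adaptation model and (1 - lam) on the background model."""
    p_bg, p_ad = mle(bg_counts), mle(ad_counts)
    vocab = set(p_bg) | set(p_ad)
    return {
        w: lam * p_ad.get(w, 0.0) + (1 - lam) * p_bg.get(w, 0.0)
        for w in vocab
    }


bg = Counter("a a a a b b c c c c".split())  # toy background data
ad = Counter("b b c d".split())              # toy adaptation data
k = 3                                        # adaptation data weight
n_bg, n_ad = sum(bg.values()), sum(ad.values())
lam = k * n_ad / (n_bg + k * n_ad)           # equivalent interpolation weight

p1 = weighted_count_model(bg, ad, k)
p2 = interpolated_model(bg, ad, lam)
for w in sorted(p1):
    print(w, round(p1[w], 4), round(p2[w], 4))
```

For higher-order N-grams the equivalent weight would depend on the history
counts, and smoothing breaks the exact equivalence, so this should be read
only as an intuition for how the data weight and L relate.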
If you ignore smoothing and assume MLE estimates you can figure out
a value of L that is equivalent to the first approach for a given amount
of data and adaptation data weight (a recent paper by Michiel Bacchiani
and Brian Roark elaborates on this: Unsupervised language model adaptation,
http://www.research.att.com/~roark/ICASSP03.pdf).

In any case, the second approach is widely used and works quite well.
It is also quite convenient since you can combine a bunch of preexisting
LMs in various ways, without retraining any of them. Also, SRILM has a
tool for estimating the optimal interpolation weight L from held-out data
(see the ppl-scripts(1) man page).

--Andreas

From barhaim at cs.technion.ac.il Sun Dec 21 05:49:48 2003
From: barhaim at cs.technion.ac.il (Roy Bar-Haim)
Date: Sun, 21 Dec 2003 15:49:48 +0200
Subject: GT discounting and backoff
Message-ID: <000901c3c7c9$4dedf460$34284484@cs.technion.ac.il>

Hi,

I have a few questions about the implementation of GT-discounting and
Katz backoff in ngram-count.

1. What is the default value of gtNmin and gtNmax in ngram-count?

2. Is backing off done only for ngrams that don't appear in the language
model at all, or for ngrams that appear less than k>0 times (and what is
this k)? If I want backing off to be done only for counts below some k,
should I set gtNmin to that value?

3. What does the following warning mean:

warning: discount coeff 4 is out of range

Does it mean that the discount for ngrams that appear only 4 times is
very small? Why is it a warning?

Thanks,
Roy.

From stolcke at speech.sri.com Sun Dec 28 15:06:51 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 28 Dec 2003 15:06:51 PST
Subject: GT discounting and backoff
In-Reply-To: Your message of Sun, 21 Dec 2003 15:49:48 +0200.
             <000901c3c7c9$4dedf460$34284484@cs.technion.ac.il>
Message-ID: <200312282306.PAA08825@huge>

Roy,

In message <000901c3c7c9$4dedf460$34284484 at cs.technion.ac.il> you wrote:
> Hi,
>
> I have a few questions about the implementation of GT-discounting and
> Katz backoff in ngram-count.
>
> 1. What is the default value of gtNmin and gtNmax in ngram-count?

It differs for different N. Run ngram-count -help to see all the
default parameters.

>
> 2. Is backing off done only for ngrams that don't appear in the language
> model at all, or for ngrams that appear less than k>0 times (and what is
> this k). If I want backing off to be done only for counts below some k,
> should I set gtNmin to that value?

Exactly. However, for all N-grams in the *language model* the
corresponding conditional N-gram probability is used, always.
So the cutoffs refer not to the LM itself, but to the counts in the
*training data*.

>
> 3. What does the following warning mean:
>
> warning: discount coeff 4 is out of range
>
> Does it mean that the discount for ngrams that appears only 4 times is
> very small? Why is it a warning?

The warning indicates that the GT discount formula yields a value
outside the range 0...1, and therefore cannot be used. This happens when
your counts-of-counts (how many singletons, 2-counts, 3-counts, etc.) are
not smoothly distributed, usually as the result of insufficient data,
or some artificial manipulation of the data (e.g. duplicating some portion
of it). ngram-count simply disables discounting for those ngrams.

If you get this a lot you can try some of the other smoothing methods.
Witten-Bell, for example, is very robust to the kinds of problems that
cause GT to fail.
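
The range check behind this warning can be sketched in a few lines. This is
a rough Python illustration modeled on the Discount.cc fragment quoted
earlier in this archive, not the SRILM code itself. Two things are
assumptions: the common term is taken to be the usual Katz form
(gtmax+1) * n[gtmax+1] / n[1], which the quoted fragment does not show, and
EPSILON is only a stand-in for Prob_Epsilon, whose value is not given here.

```python
import math

EPSILON = 1e-6  # stand-in for Prob_Epsilon; the real value is an assumption


def ratio(num, den):
    """Divide like IEEE floating point: x/0 gives inf or nan, not an error."""
    if den == 0:
        return math.nan if num == 0 else math.copysign(math.inf, num)
    return num / den


def gt_discounts(count_of_counts, gtmax):
    """Return ({count: discount coefficient}, [counts flagged out of range]).

    count_of_counts[r] is the number of distinct n-grams seen exactly r times.
    """
    n = lambda r: count_of_counts.get(r, 0)
    common = ratio((gtmax + 1) * n(gtmax + 1), n(1))  # assumed Katz form
    discounts, out_of_range = {}, []
    for r in range(1, gtmax + 1):
        coeff0 = ratio((r + 1) * n(r + 1), r * n(r))
        coeff = ratio(coeff0 - common, 1.0 - common)
        if not math.isfinite(coeff) or coeff <= EPSILON or coeff0 > 1.0:
            out_of_range.append(r)  # where the warning would be printed
            coeff = 1.0             # discounting disabled for this count
        discounts[r] = coeff
    return discounts, out_of_range


# Smoothly decaying counts-of-counts: every coefficient is usable.
smooth = {1: 1000, 2: 400, 3: 200, 4: 120, 5: 80, 6: 50, 7: 35, 8: 25}
print(gt_discounts(smooth, gtmax=7))

# Every count is even, as if the training data had been duplicated:
# there are no singletons, so the formula breaks down everywhere.
duplicated = {2: 500, 4: 200, 6: 100, 8: 60}
print(gt_discounts(duplicated, gtmax=7))
```

The second example is the kind of artificially manipulated data mentioned
above: with no singletons at all, the coefficients come out as nan, which is
why a finiteness test is needed in addition to the range test.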

--Andreas

From barhaim at cs.technion.ac.il Wed Dec 31 02:48:31 2003
From: barhaim at cs.technion.ac.il (Roy Bar-Haim)
Date: Wed, 31 Dec 2003 12:48:31 +0200
Subject: Implementing Baum-Welch (Forward-Backward) algorithm in SRILM
Message-ID: <005801c3cf8b$a20869d0$34284484@cs.technion.ac.il>

Hi,

I'm using disambig for part-of-speech tagging. I create a language model
over sequences of tags with ngram-count, and provide P(word|tag) in the
map file.

What I would like to do is to start with this model, based on a tagged
corpus, and improve it using the Baum-Welch (forward-backward) algorithm
with an untagged corpus. After each iteration I should get a new language
model for the tags and a new map file. After each iteration I would also
like to test the model on some held-out data, so I know when to stop.

How can I implement that in SRILM?

Thanks,
Roy.