From nbassiou at aiia.csd.auth.gr Tue Jul 1 00:53:47 2008
From: nbassiou at aiia.csd.auth.gr (Nikoletta Bassiou)
Date: Tue, 1 Jul 2008 10:53:47 +0300
Subject: Class n-grams
Message-ID: <001a01c8db4f$9b2c9a30$1904cf9b@aiia.csd.auth.gr>

I would like to build a class trigram using ngram-class but according to the documentation only class bigram is implemented.
If this is true, do you know any other way I can build a class trigram? Is there a provision for extending ngram-class for higher order n-grams (n>3)?

Nikoletta

From stolcke at speech.sri.com Tue Jul 1 09:25:22 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 01 Jul 2008 09:25:22 -0700
Subject: Class n-grams
In-Reply-To: <001a01c8db4f$9b2c9a30$1904cf9b@aiia.csd.auth.gr>
References: <001a01c8db4f$9b2c9a30$1904cf9b@aiia.csd.auth.gr>
Message-ID: <486A5A72.7000500@speech.sri.com>

Nikoletta Bassiou wrote:
> I would like to build a class trigram using ngram-class but according
> to the documentation only class bigram is implemented.
> If this is true, do you know any other way I can build a class
> trigram? Is there a provision for extending ngram-class for higher
> order n-grams (n>3)?
>
> Nikoletta

The bigram restriction only applies to the statistics used to learn the
word classes. Once you have the classes you can apply them to your text
and build an ngram of any order.

Andreas

From stolcke at speech.sri.com Thu Jul 3 22:17:59 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 03 Jul 2008 22:17:59 -0700
Subject: Class n-grams
In-Reply-To: <20080702111233.9779bb00@cronus.aiia.csd.auth.gr>
References: <20080702111233.9779bb00@cronus.aiia.csd.auth.gr>
Message-ID: <486DB287.7080805@speech.sri.com>

Basiou Nikoletta wrote:
> Dear Andreas,
>
> thanks a lot for your answer. Actually, i want to build the classes
> from trigram statistics/counts.
> Is there any provision for such an
> implementation in the near future or there are restrictions due to
> higher memory and process requirements?

It would take a lot longer and is currently not implemented.
I vaguely recall a paper by Hermann Ney and colleagues many years ago
showing that inducing classes based on higher-order statistics doesn't
buy that much (i.e., it is sufficient to learn the classes using bigram
statistics, and then use them in higher-order class-based models).

Andreas

>
> Looking forward to your answer,
> Nikoletta
>
> ------------------------------------------------------------------------
> *From:* Andreas Stolcke [mailto:stolcke at speech.sri.com]
> *To:* Nikoletta Bassiou [mailto:nbassiou at aiia.csd.auth.gr]
> *Cc:* srilm-user at speech.sri.com
> *Sent:* Tue, 01 Jul 2008 19:25:22 +0300
> *Subject:* Re: Class n-grams
>
> Nikoletta Bassiou wrote:
> > I would like to build a class trigram using ngram-class but according
> > to the documentation only class bigram is implemented.
> > If this is true, do you know any other way I can build a class
> > trigram? Is there a provision for extending ngram-class for higher
> > order n-grams (n>3)?
> >
> > Nikoletta
> The bigram restriction only applies to the statistics used to learn the
> word classes. Once you have the classes you can apply them to your text
> and build an ngram of any order.
>
> Andreas
>

From marco.turchi at gmail.com Tue Jul 29 18:03:40 2008
From: marco.turchi at gmail.com (marco turchi)
Date: Wed, 30 Jul 2008 02:03:40 +0100
Subject: strange symbols
Message-ID: <79a042480807291803w44eb15c8ic8d4c4ef4a0e8182@mail.gmail.com>

Dear all,
I'm using srilm on some data crawled from the Web.
The lm contains some strange symbols as these:
\1-grams:
-6.774207 ^A 0
-6.774207 ^C
-6.774207 ^D
-6.774207 ^E 0
-6.774207 ^F 0
-6.774207 ^G 0
-6.774207 ^H 0
-6.774207 ^K 0
-6.774207 ^N 0
-6.774207 ^O
-6.774207 ^P
-6.774207 ^T 0
-6.774207 ^X
-6.774207 ^Y 0
-6.774207 ^\
-6.774207 ^]
-6.774207 ^^ 0
-6.774207 ^_

these symbols are not the simple combination of ^ and a letter but it seems to be something different as a character that has been truncated or something similar.
Do u have an idea what they are and how to remove them?

thanks a lot
Marco

From stolcke at speech.sri.com Tue Jul 29 23:27:13 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 29 Jul 2008 23:27:13 PDT
Subject: strange symbols
In-Reply-To: Your message of Wed, 30 Jul 2008 02:03:40 +0100. <79a042480807291803w44eb15c8ic8d4c4ef4a0e8182@mail.gmail.com>
Message-ID: <200807300627.m6U6RDb17302@huge>

They look like ASCII control characters (character values < 0x20).
You need to do a better job filtering your training data.

--Andreas

In message <79a042480807291803w44eb15c8ic8d4c4ef4a0e8182 at mail.gmail.com>you wrote:
>
> Dear all,
> I'm using srilm on some data crawled from the Web. The lm contains some
> strange symbols as these:
> \1-grams:
> -6.774207 ^A 0
> -6.774207 ^C
> -6.774207 ^D
> -6.774207 ^E 0
> -6.774207 ^F 0
> -6.774207 ^G 0
> -6.774207 ^H 0
> -6.774207 ^K 0
> -6.774207 ^N 0
> -6.774207 ^O
> -6.774207 ^P
> -6.774207 ^T 0
> -6.774207 ^X
> -6.774207 ^Y 0
> -6.774207 ^\
> -6.774207 ^]
> -6.774207 ^^ 0
> -6.774207 ^_
>
> these symbols are not the simple combination of ^ and a letter but it seems
> to be something different as a character that has been truncated or
> something similar.
> Do u have an idea what they are and how to remove them?
>
> thanks a lot
> Marco
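
The filtering suggested in the reply above (dropping ASCII control characters, i.e. values below 0x20) could be sketched as follows before the text is passed to ngram-count. This is only an illustration: the function names are hypothetical, and treating tab, newline and carriage return as ordinary whitespace, also dropping DEL (0x7f), and replacing each control character with a space are choices of this sketch, not something the thread prescribes.

```python
import re

# ASCII control characters (values below 0x20) except tab, newline and
# carriage return, which are handled as ordinary whitespace below.
# DEL (0x7f) is included as an extra choice of this sketch.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(line: str) -> str:
    """Replace control characters with spaces, then collapse whitespace."""
    cleaned = CONTROL_CHARS.sub(" ", line)
    return " ".join(cleaned.split())


def clean_file(src_path: str, dst_path: str) -> None:
    """Write a cleaned copy of src_path, skipping lines that end up empty."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            cleaned = strip_control_chars(line)
            if cleaned:
                dst.write(cleaned + "\n")
```

Replacing with a space rather than deleting means a control character embedded in a token splits that token in two; deleting it instead would be an equally defensible choice.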
From stolcke at speech.sri.com Thu Aug 14 13:16:18 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 14 Aug 2008 13:16:18 -0700
Subject: a naive question need your help
In-Reply-To:
References:
Message-ID: <48A49292.6020603@speech.sri.com>

jian zhu wrote:
> Hi professor stolcke:
> I am a computer programmer from China. Thanks a lot for your great
> work on language model, and unselfishly sharing the perfect slm
> toolkit!
>
> I have a naive question need your help.
> I want to use "disambig" tool for part-of-speech tagging, but I
> have some trouble
> with it.
> I use the tool as following:
> disambig -text file -map wtfile -lm ttfile
>
> file --- word text
> wtfile --- P(word|tag2) emit file
> ttfile --- P(tag2|tag1) transit file
>
> ttfile can be trained using "ngram-count" tool, but i don't know
> how i can get
> wtfile, i don't know how i can get this file by using srilm.
>
> its format is as following:
> -map file
> Specifies the file containing the V1-to-V2 mapping information.
> Each line of file contains the mapping for a single word in V1:
> w1 w21 [p21] w22 [p22] ...
>
> where w1 is a word from V1, which has possible mappings w21, w22,
> ... from V2. Optionally, each of these can be followed by a numeric
> string for the probability p21, which defaults to 1. The number is
> used as the conditional probability P(w1|w21), but the program does
> not depend on these numbers being properly normalized.
>
> Thank you very much!
> Looking forward for your help.
>

There is no ready-made tool for estimating and formatting the map
probabilities. It is such a simple format that you should be able to
write a perl script or similar to estimate these probabilities from data.
Note that for taggers it is usually more convenient to construct the map
file with probabilities p(w21 | w1) and use the -scale option.
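
A minimal sketch of such a script, in Python rather than Perl. It assumes a tagged training corpus in word/TAG format, which the thread does not specify, and the function names are hypothetical. It writes relative frequencies p(tag | word) in the quoted map format (w1 w21 p21 w22 p22 ...) with no smoothing, so unseen words are not covered.

```python
from collections import Counter, defaultdict


def read_tagged(lines, sep="/"):
    """Yield (word, tag) pairs from lines like 'the/DT can/NN'.

    The word/TAG input format is an assumption of this sketch.
    """
    for line in lines:
        for token in line.split():
            word, _, tag = token.rpartition(sep)
            if word and tag:  # skip tokens without a separator
                yield word, tag


def build_map(pairs):
    """Return map-file lines 'word tag1 p1 tag2 p2 ...', p = P(tag | word)."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word][tag] += 1
    lines = []
    for word in sorted(counts):
        total = sum(counts[word].values())
        fields = [word]
        for tag, n in sorted(counts[word].items()):
            fields += [tag, f"{n / total:.6g}"]
        lines.append(" ".join(fields))
    return lines


if __name__ == "__main__":
    corpus = ["the/DT can/NN", "can/MD the/DT can/NN"]
    print("\n".join(build_map(read_tagged(corpus))))
```

Because the numbers are p(tag | word) rather than p(word | tag), the resulting file would be used together with the -scale option mentioned above.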
To estimate p(POS | word) you can count occurrences in a tagged training
corpus (possibly with some smoothing to allow for unseen combinations,
i.e., for unseen words and open-class POS classes).
In the absence of training data you can try a uniform POS distribution.

I know that people have built POS taggers with SRILM. I suggest that
you direct further questions to the srilm-user mailing list.

Andreas

> Best Regards
> jianzhu
> 2008-08-14
>

From liuchangliang at hccl.ioa.ac.cn Tue Aug 26 19:31:45 2008
From: liuchangliang at hccl.ioa.ac.cn (liuchangliang)
Date: Wed, 27 Aug 2008 10:31:45 +0800
Subject: A question about Lattice::LatticeWER( )
Message-ID: <001301c907ed$0c746cd0$255d4670$@ioa.ac.cn>

Hi:

I use lattice-tool to compute the lattice WER. In the result, the insertion error is always very high.

In the source code of function Lattice::LatticeWER( ), there is a sentence:

* NOTE: since we process nodes in topological order this
* will allow chains of multiple insertions.

I don't know what this sentence means. Does that mean the insertion error of the result is not reliable?

Thanks
chliu

From stolcke at speech.sri.com Tue Aug 26 20:34:28 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 26 Aug 2008 20:34:28 -0700
Subject: A question about Lattice::LatticeWER( )
In-Reply-To: <001301c907ed$0c746cd0$255d4670$@ioa.ac.cn>
References: <001301c907ed$0c746cd0$255d4670$@ioa.ac.cn>
Message-ID: <48B4CB44.40803@speech.sri.com>

liuchangliang wrote:
>
> Hi:
>
> I use lattice-tool to compute the lattice WER. In the result, the
> insertion error is always very high.
>
> In the source code of function Lattice::LatticeWER( ), there is a
> sentence:
>
> * NOTE: since we process nodes in topological order this
>
> * will allow chains of multiple insertions.
>
> I don't know what this sentence means. Does that mean the insertion
> error of the result is not reliable?
>

No.
It is a comment on the workings of the algorithm that aligns a word
string to the lattice; it notes that topological order is required for
correct computation of insertions.

Andreas

From stolcke at speech.sri.com Fri Sep 5 07:48:34 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 05 Sep 2008 07:48:34 -0700
Subject: -read-google
In-Reply-To: <01ab01c90f53$415c25b0$ed1610ac@selena>
References: <01ab01c90f53$415c25b0$ed1610ac@selena>
Message-ID: <48C146C2.1090405@speech.sri.com>

Mirjam Sepesy Maučec wrote:
> Hi,
>
> I have my counts in Google directory structure (by make-google-ngrams).
> I would like to use make-big-lm (because ngram-count runs out of memory),
> but the script expects the switch -read (not -read-google)?

Mirjam,

I believe this mailing list is meant for users of the CMU-Cambridge SLM
toolkit, but your question is obviously about SRILM.
Please join the srilm-user mailing list and ask your SRILM questions
there (see http://www.speech.sri.com/projects/srilm/#srilm-user for
instructions).

Regarding your question: make-big-lm does not support the -read-google
option because its approach is incompatible with the google directory
structure. However, you could enumerate all the count files under the
google directory, prepend "-read" to each, and give that long string of
arguments to make-big-lm.

make-big-lm `find /path/to/google-ngrams/data -name \*.gz \! -name \*_cs.gz | xargs -n 1 echo "-read" ` other-options ....

assuming your OS allows command lines this long.

Andreas

>
> Thanks,
> Mirjam

From debond at gmx.net Mon Sep 8 05:39:43 2008
From: debond at gmx.net (Christine de Bond)
Date: Mon, 08 Sep 2008 14:39:43 +0200
Subject: No subject
Message-ID: <20080908123943.146280@gmx.net>

Hello,

I tried out:

ngram-count -write-vocab vocab.txt -text input.txt

and in the resulting file there is an entry " -pau- " which is not in my input.txt.
Does anybody know where this pau comes from and what it means?

Best regards,
Christine

From stolcke at speech.sri.com Mon Sep 8 09:50:11 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 08 Sep 2008 09:50:11 -0700
Subject:
In-Reply-To: <20080908123943.146280@gmx.net>
References: <20080908123943.146280@gmx.net>
Message-ID: <48C557C3.8060203@speech.sri.com>

Christine de Bond wrote:
> Hello,
>
> I tried out:
>
> ngram-count -write-vocab vocab.txt -text input.txt
>
> and in the resulting file there is an entry " -pau- " which is not in my input.txt.
> Does anybody know where this pau comes from and what it means?
>

It's a predefined vocabulary item used to represent nonspeech (e.g., in
lattices). This word does not take up any probability mass so it
doesn't interfere with the LM building.

Andreas

> Best regards,
> Christine
>

From mirjam.sepesy at uni-mb.si Thu Sep 11 03:27:12 2008
From: mirjam.sepesy at uni-mb.si (Mirjam Sepesy Maucec)
Date: Thu, 11 Sep 2008 12:27:12 +0200
Subject: Fw: GT coefficients
Message-ID: <0a4201c913f8$f4221020$ed1610ac@selena>

Hi,

I found an old question and no answer (in the SRI-LM Mailing List Archive). I attach it!
I tackle the same problem:
When I convert decimal , (comma) into a . (dot) in discount files, warnings disappear...
Discount files were produced by make-big-lm script.

Best,

Mirjam

----- Original Message -----
From: ilya oparin
To: srilm-list
Sent: Sunday, June 11, 2006 3:05 PM
Subject: GT coefficients

Hello!

If I count GT coefficients in advance and then feed GT-files (generated by make-gt-discounts) to ngram-count or make-big-lm, I get warnings of the kind

file.gt1: line 9: warning: discount coefficient 1 = 0.0
file.gt1: line 9: warning: discount coefficient 2 = 0.0
...

and so on for all the gt parameters. Files themselves are alright and do not contain any zeroes. Number next to line corresponds to the last line in a gt-file.
The model I get with this differs from that I get when just use ngram-count without loading GT coefficients (it appears much smaller in bigrams and trigrams) with the same gtmin and gtmax values.
Could anybody tell me why it happens like this?

best regards,
Ilya

From stolcke at speech.sri.com Thu Sep 11 03:57:43 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 11 Sep 2008 03:57:43 -0700
Subject: Fw: GT coefficients
In-Reply-To: <0a4201c913f8$f4221020$ed1610ac@selena>
References: <0a4201c913f8$f4221020$ed1610ac@selena>
Message-ID: <48C8F9A7.9020903@speech.sri.com>

Mirjam Sepesy Maucec wrote:
> Hi,
>
> I found an old question and no answer (in the SRI-LM Mailing List
> Archive). I attach it!
> I tackle the same problem:
> When I convert decimal , (comma) into a . (dot) in discount files,
> warnings disappear...
> Discount files were produced by make-big-lm script.

If decimal numbers in the discount files appear with commas instead of
decimal points, that's almost certainly a locale setting issue.
The CHANGES file has the following:

* Matthias Thomae found that make-ngram-pfsg (and probably other gawk
  scripts) may not work correctly with recent versions of gawk unless the
  environment is set to LC_NUMERIC=C.

Note that the gt files are computed by gawk scripts.

What I can do is set LC_NUMERIC=C in make-big-lm to avoid the problem in
most common cases.

Andreas

>
> Best,
>
> Mirjam
>
> ----- Original Message -----
> *From:* ilya oparin
> *To:* srilm-list
> *Sent:* Sunday, June 11, 2006 3:05 PM
> *Subject:* GT coefficients
>
> Hello!
>
> If I count GT coefficients in advance and then feed GT-files
> (generated by make-gt-discounts) to ngram-count or make-big-lm, I get
> warnings of the kind
>
> file.gt1: line 9: warning: discount coefficient 1 = 0.0
> file.gt1: line 9: warning: discount coefficient 2 = 0.0
> ...
>
> and so on for all the gt parameters. Files themselves are alright and
> do not contain any zeroes. Number next to line corresponds to the last
> line in a gt-file.
> The model I get with this differs from that I get when just use
> ngram-count without loading GT coefficients (it appears much smaller
> in bigrams and trigrams) with the same gtmin and gtmax values.
> Could anybody tell me why it happens like this?
>
>
> best regards,
> Ilya
>

From stolcke at speech.sri.com Thu Sep 25 12:57:36 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 25 Sep 2008 12:57:36 -0700
Subject: Please advise
In-Reply-To: <648456300809220236m5695fe43me23b14d83d0d086d@mail.gmail.com>
References: <648456300809220236m5695fe43me23b14d83d0d086d@mail.gmail.com>
Message-ID: <48DBED30.1010709@speech.sri.com>

Nisha Yadav wrote:
> Hi,
>
> I am a new user of srilm toolkit and have been using the same to
> generate some language model. I will be grateful to have your advice
> regarding the following.
>
> 1) While assigning backoff probabilities <s> is assigned a very small
> probability i.e. 1E-99 but </s> is assigned a non-zero probability
> 0.181. That is to say in the output lm file I can see the following
> entries for <s> and </s>
>
> -0.7421436 </s>
> -99 <s> -0.3938685
>
> Can you please explain why is srilm doing this?

That's because an LM never needs to predict the beginning-of-sentence
token, only the end-of-sentence. The -99 is just a dummy entry to
satisfy the LM format.

>
> 2) For perplexity calculation, ppl command outputs 2 values ppl and
> ppl1.
> Which of these two is to be taken into account to compare
> the model performance generated by 2-order, 3-order...ngrams and so on?

Please use the FAQ first for questions about SRILM. You will find the
answer in
http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html .
When you cannot find the answer send email to srilm-user at
speech.sri.com (you need to join the mailing list first).

>
> 3) How much significance can be attached to these values when the
> difference between them is relatively small or lies in the first digit
> after decimal. That is to say if the perplexity value (ppl) for the
> language models for 1-gram, 2-gram, 3-gram etc. are
>
> for n = 1, 68.17368,
> for n = 2, 26.52578,
> for n = 3, 26.61326,
> for n = 4, 25.89838,
> for n = 5, 25.89838,
>
> can we say that the model performance is better with n = 4 in
> comparison to n = 3 and 2 based on these values? Please note that the
> size of our corpus is not very large, approximately 8000 tokens.
> Thanks in advance,

It looks like n=4 is better but obviously not by much. Whether the
difference matters depends on your application (like MT, ASR, etc.).

Andreas

From dmitry.kan at gmail.com Sat Sep 27 06:34:24 2008
From: dmitry.kan at gmail.com (Dmitry Kan)
Date: Sat, 27 Sep 2008 16:34:24 +0300
Subject: Visualization
Message-ID: <9a4d1d60809270634s4c1fa548s2439571bdd95a448@mail.gmail.com>

Hello list,

I was just wondering are there any visualization tools available for having some diagram (with statistical information for example) of a produced language model?

--
Regards,
Dmitry Kan

From gelbart at icsi.berkeley.edu Mon Sep 29 11:39:50 2008
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Mon, 29 Sep 2008 11:39:50 -0700 (PDT)
Subject: Language model visualization
In-Reply-To: <9a4d1d60809270634s4c1fa548s2439571bdd95a448@mail.gmail.com>
References: <9a4d1d60809270634s4c1fa548s2439571bdd95a448@mail.gmail.com>
Message-ID:

Hi Dmitry,

Here is an example of language model visualization that might be of interest:

http://www.chrisharrison.net/projects/trigramviz/index.html

This is a similar tree-style visualization, in this case not statistical but it might give you some ideas:

http://services.alphaworks.ibm.com/manyeyes/page/Word_Tree.html

On Sat, 27 Sep 2008, Dmitry Kan wrote:

> Hello list,
>
> I was just wondering are there any visualization tools available for
> having some diagram (with statistical information for example) of a
> produced language model?
>