From stolcke at speech.sri.com Thu Jan 1 11:05:48 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 01 Jan 2004 11:05:48 PST
Subject: Implementing Baum-Welch (Forward-Backward) algorithm in SRILM
In-Reply-To: Your message of Wed, 31 Dec 2003 12:48:31 +0200. <005801c3cf8b$a20869d0$34284484@cs.technion.ac.il>
Message-ID: <200401011905.LAA25886@huge>

In message <005801c3cf8b$a20869d0$34284484 at cs.technion.ac.il>you wrote:
> Hi,
>
> I'm using disambig for part-of-speech tagging. I create a language model
> over sequences of tags with ngram-count, and provide P(word|tag) in the
> map file.
>
> What I would like to do is to start with this model, based on tagged
> corpus, and improve it using the Baum-Welch (forward-backward) algorithm,
> with untagged corpus. After each iteration I should get a new language
> model for the tags and a new map file. After each iteration I would
> like to test the model on some held-out data, so I know when to stop.
>
> How can I implement that in SRILM?

You need to write some scripts to manipulate intermediate data, but
you can pretty much do what you want.

To implement EM for your tagger you have two steps:

1. E-step: get expected counts for the tag n-gram and the word/tag mapping.

   a. Tag n-gram expectations.
      This step is unfortunately not well supported by the tools right now.
      Although disambig uses the FB algorithm it doesn't collect (let alone
      output) the expected counts in a way that's suitable for reestimating
      a model from them.
      You can use two approximations.
      First, you could use the 1-best tag sequence as a stand-in for the
      real thing and generate tag N-gram counts from it (that's sometimes
      called the "Viterbi" approximation of EM).
      Second, you can use the -nbest option to generate the top N most
      likely taggings of each sentence along with their scores.
      You then have to normalize the scores to obtain posterior
      probabilities for the tag sequences, weight the tag N-gram counts
      by these posteriors, and total them over your entire training corpus.

   b. Word/tag expectations.
      Here again, you could use the Viterbi approximation, simply pairing
      up the words and their most likely tags (as output by disambig).
      However, the most recent version of disambig actually has an option
      to collect and output the expected word/tag bigram counts.
      I have appended a patch that should allow you to do this with the
      1.3.3 version of disambig. The option that this adds is

      -write-counts file
             Outputs the V2-V1 bigram counts corresponding to the
             tagging performed on the input data. If -fb was
             specified these are expected counts, and otherwise
             they reflect the 1-best tagging decisions.

2. M-step: reestimate the tag N-gram LM and the word/tag mapping
   probabilities.

   a. Once you have the tag N-gram counts (obtained by one of the methods
      suggested above) you just need to run ngram-count on the count file
      to get a new model. Use -float-counts and a suitable discounting
      method if you are using fractional counts.

   b. Again, just use ngram-count to estimate a word/tag bigram model from
      the expected counts. You then have to post-process the LM file to
      extract the word/tag probabilities and format them into a map file
      usable by disambig.

Hope this helps.
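
The N-best approximation in step 1a could be sketched as a small helper
script along the following lines. This is only an illustration, not part
of SRILM: it assumes the -nbest output has already been parsed into
(log score, tag list) pairs per sentence (the parsing is left out on
purpose), that the scores are base-10 log probabilities, and that
sentence boundaries are marked with <s> and </s>. The function names,
and the "n-gram, tab, count" layout produced by format_counts, are
assumptions made for this sketch.

```python
import math
from collections import defaultdict


def nbest_posteriors(log_scores, base=10.0):
    """Normalize N-best log scores into posteriors that sum to 1.

    Subtracting the maximum before exponentiating (log-sum-exp) avoids
    underflow for long sentences with very small probabilities.
    """
    ln = [s * math.log(base) for s in log_scores]
    top = max(ln)
    weights = [math.exp(x - top) for x in ln]
    total = sum(weights)
    return [w / total for w in weights]


def add_weighted_ngrams(counts, tags, weight, order=3):
    """Add `weight` to every tag n-gram of length 1..order in one tagging."""
    seq = ["<s>"] + list(tags) + ["</s>"]  # boundary tokens are an assumption
    for n in range(1, order + 1):
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += weight


def expected_tag_counts(nbest_lists, order=3, base=10.0):
    """Total posterior-weighted tag n-gram counts over a corpus.

    nbest_lists holds one list per sentence; each list contains
    (log_score, tags) pairs for the N best taggings of that sentence.
    """
    counts = defaultdict(float)
    for hyps in nbest_lists:
        posteriors = nbest_posteriors([score for score, _ in hyps], base)
        for (_, tags), p in zip(hyps, posteriors):
            add_weighted_ngrams(counts, tags, p, order)
    return counts


def format_counts(counts):
    """Render counts as 'tag1 tag2 ...<TAB>count' lines (assumed layout)."""
    return ["%s\t%g" % (" ".join(ngram), c)
            for ngram, c in sorted(counts.items())]
```

The resulting fractional counts would then go through the M-step in 2a
(reading the count file with -float-counts). With N = 1 the posterior is
always 1 and the script reduces to the Viterbi approximation.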
Happy New Year,

Andreas

*** /tmp/T00BSlQ1 Wed Dec 31 13:30:23 2003
--- /tmp/T10e6tMs Wed Dec 31 13:30:23 2003
***************
*** 38,46 ****
--- 38,48 ----
  static char *vocab1File = 0;
  static char *vocab2File = 0;
  static char *mapFile = 0;
+ static char *classesFile = 0;
  static char *mapWriteFile = 0;
  static char *textFile = 0;
  static char *textMapFile = 0;
+ static char *countsFile = 0;
  static int keepUnk = 0;
  static int tolower1 = 0;
  static int tolower2 = 0;
***************
*** 63,70 ****
--- 65,74 ----
      { OPT_STRING, "write-vocab1", &vocab1File, "output observable vocabulary" },
      { OPT_STRING, "write-vocab2", &vocab2File, "output hidden vocabulary" },
      { OPT_STRING, "map", &mapFile, "mapping from observable to hidden tokens" },
+     { OPT_STRING, "classes", &classesFile, "mapping in class expansion format" },
      { OPT_TRUE, "logmap", &logMap, "map file contains log probabilities" },
      { OPT_STRING, "write-map", &mapWriteFile, "output map file (for validation)" },
+     { OPT_STRING, "write-counts", &countsFile, "output substitution counts" },
      { OPT_TRUE, "scale", &scale, "scale map probabilities by unigram probs" },
      { OPT_TRUE, "keep-unk", &keepUnk, "preserve unknown words" },
      { OPT_TRUE, "tolower1", &tolower1, "map observable vocabulary to lowercase" },
***************
*** 88,94 ****
   */
  unsigned
  disambiguateSentence(Vocab &vocab, VocabIndex *wids, VocabIndex *hiddenWids[],
!                      LogP totalProb[], VocabMap &map, LM &lm,
                       unsigned numNbest, Boolean positionMapped = false)
  {
      static VocabIndex emptyContext[] = { Vocab_None };
--- 92,98 ----
   */
  unsigned
  disambiguateSentence(Vocab &vocab, VocabIndex *wids, VocabIndex *hiddenWids[],
!                      LogP totalProb[], VocabMap &map, LM &lm, VocabMap *counts,
                       unsigned numNbest, Boolean positionMapped = false)
  {
      static VocabIndex emptyContext[] = { Vocab_None };
***************
*** 236,241 ****
--- 240,256 ----
              }
              hiddenWids[n][len] = Vocab_None;
          }
+
+         /*
+          * update v1-v2 counts if requested
+          */
+         if (counts) {
+             for (unsigned i = 0; i < len; i++) {
+                 counts->put(wids[i], hiddenWids[0][i],
+                             counts->get(wids[i], hiddenWids[0][i]) + 1);
+             }
+         }
+
+
          return numNbest;
      } else {
          /*
***************
*** 426,431 ****
--- 441,460 ----
              }
              cout << endl;
          }
+
+         /*
+          * update v1-v2 counts if requested
+          */
+         if (counts) {
+             symbolIter.init();
+             while (symbolProb = symbolIter.next(symbol)) {
+                 LogP2 posterior = *symbolProb - totalPosterior;
+
+                 counts->put(wids[pos], symbol,
+                             counts->get(wids[pos], symbol) +
+                                         LogPtoProb(posterior));
+             }
+         }
      }

      /*
***************
*** 442,448 ****
   * disambiguate it, and print out the result
   */
  void
! disambiguateFile(File &file, VocabMap &map, LM &lm)
  {
      char *line;
      VocabString sentence[maxWordsPerLine];
--- 471,477 ----
   * disambiguate it, and print out the result
   */
  void
! disambiguateFile(File &file, VocabMap &map, LM &lm, VocabMap *counts)
  {
      char *line;
      VocabString sentence[maxWordsPerLine];
***************
*** 476,482 ****
          LogP totalProb[numNbest];
          unsigned numHyps =
                  disambiguateSentence(map.vocab1, wids, hiddenWids,
!                                         totalProb, map, lm, numNbest);
          if (!numHyps) {
              file.position() << "Disambiguation failed\n";
          } else if (totals) {
--- 505,511 ----
          LogP totalProb[numNbest];
          unsigned numHyps =
                  disambiguateSentence(map.vocab1, wids, hiddenWids,
!                                         totalProb, map, lm, counts, numNbest);
          if (!numHyps) {
              file.position() << "Disambiguation failed\n";
          } else if (totals) {
***************
*** 521,527 ****
   * disambiguate it, and print out the result
   */
  void
! disambiguateFileContinuous(File &file, VocabMap &map, LM &lm)
  {
      char *line;
      Array wids;
--- 550,557 ----
   * disambiguate it, and print out the result
   */
  void
! disambiguateFileContinuous(File &file, VocabMap &map, LM &lm,
!                                                 VocabMap *counts)
  {
      char *line;
      Array wids;
***************
*** 560,566 ****
          LogP totalProb[numNbest];
          unsigned numHyps =
                  disambiguateSentence(map.vocab1, &wids[0], hiddenWids,
!                                         totalProb, map, lm, numNbest);
          if (!numHyps) {
              file.position() << "Disambiguation failed\n";
--- 590,596 ----
          LogP totalProb[numNbest];
          unsigned numHyps =
                  disambiguateSentence(map.vocab1, &wids[0], hiddenWids,
!                                         totalProb, map, lm, counts, numNbest);
          if (!numHyps) {
              file.position() << "Disambiguation failed\n";
***************
*** 593,599 ****
   * disambiguate it, and print out the result
   */
  void
! disambiguateTextMap(File &file, Vocab &vocab, LM &lm)
  {
      char *line;
--- 623,629 ----
   * disambiguate it, and print out the result
   */
  void
! disambiguateTextMap(File &file, Vocab &vocab, LM &lm, VocabMap *counts)
  {
      char *line;
***************
*** 664,670 ****
          LogP totalProb[numNbest];
          unsigned numHyps =
              disambiguateSentence(vocab, &wids[0], hiddenWids, totalProb,
!                                         map, lm, numNbest, true);
          if (!numHyps) {
              file.position() << "Disambiguation failed\n";
--- 694,700 ----
          LogP totalProb[numNbest];
          unsigned numHyps =
              disambiguateSentence(vocab, &wids[0], hiddenWids, totalProb,
!                                         map, lm, counts, numNbest, true);
          if (!numHyps) {
              file.position() << "Disambiguation failed\n";
***************
*** 720,725 ****
--- 750,764 ----
          }
      }
+     if (classesFile) {
+         File file(classesFile, "r");
+
+         if (!map.readClasses(file)) {
+             cerr << "format error in classes file\n";
+             exit(1);
+         }
+     }
+
+
      if (lmFile) {
          File file(lmFile, "r");
***************
*** 734,746 ****
          hiddenLM->debugme(debug);
      }
      if (textFile) {
          File file(textFile, "r");
          if (continuous) {
!             disambiguateFileContinuous(file, map, *hiddenLM);
          } else {
!             disambiguateFile(file, map, *hiddenLM);
          }
      }
--- 773,797 ----
          hiddenLM->debugme(debug);
      }
+     VocabMap *counts;
+     if (countsFile) {
+         counts = new VocabMap(vocab, hiddenVocab);
+         assert(counts != 0);
+
+         counts->remove(vocab.ssIndex, hiddenVocab.ssIndex);
+         counts->remove(vocab.seIndex, hiddenVocab.seIndex);
+         counts->remove(vocab.unkIndex, hiddenVocab.unkIndex);
+     } else {
+         counts = 0;
+     }
+
      if (textFile) {
          File file(textFile, "r");
          if (continuous) {
!             disambiguateFileContinuous(file, map, *hiddenLM, counts);
          } else {
!             disambiguateFile(file, map, *hiddenLM, counts);
          }
      }
***************
*** 747,755 ****
      if (textMapFile) {
          File file(textMapFile, "r");
!         disambiguateTextMap(file, vocab, *hiddenLM);
      }
      if (mapWriteFile) {
          File file(mapWriteFile, "w");
          map.write(file);
--- 798,812 ----
      if (textMapFile) {
          File file(textMapFile, "r");
!         disambiguateTextMap(file, vocab, *hiddenLM, counts);
      }
+     if (countsFile) {
+         File file(countsFile, "w");
+
+         counts->writeBigrams(file);
+     }
+
      if (mapWriteFile) {
          File file(mapWriteFile, "w");
          map.write(file);

From vhquan at itc.it Fri Jan 23 06:36:48 2004
From: vhquan at itc.it (Vu Hai Quan)
Date: Fri, 23 Jan 2004 15:36:48 +0100
Subject: SIRLM for unicode
In-Reply-To: <200312282306.PAA08825@huge>
References: <200312282306.PAA08825@huge>
Message-ID: <40113180.4030109@itc.it>

Dear All,
Is it possible for me to use SRILM for a text corpus that is encoded in
unicode format?
Best regards.

From stolcke at speech.sri.com Fri Jan 23 09:47:31 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 23 Jan 2004 09:47:31 PST
Subject: SIRLM for unicode
In-Reply-To: Your message of Fri, 23 Jan 2004 15:36:48 +0100. <40113180.4030109@itc.it>
Message-ID: <200401231747.JAA12790@huge>

I'm not familiar with unicode, unfortunately.
However, SRILM does not "interpret" characters other than for parsing
lines of text into words. It assumes that words are separated by spaces.
So if unicode uses the same encoding of space characters as ASCII then
you should be fine.
The case mapping functions (-tolower option) in various tools will
probably not work correctly for multi-byte character sets.

--Andreas

In message <40113180.4030109 at itc.it>you wrote:
> Dear All,
> Is it possible for me to use SIRLM for text corpus which was encoded in
> unicode format ?
> Best regards.
>

From nlp at pobox.sk Fri Feb 6 07:14:18 2004
From: nlp at pobox.sk (Robert Wagner)
Date: Fri, 6 Feb 2004 16:14:18 +0100
Subject: Class based 3-gram in SRILM
Message-ID: <200402061514.i16FEIg4005091@www4.pobox.sk>

Hi!
I have the following problem. I've estimated a class-based bigram model
(with some defined words excluded from the clustering process) using
the ngram-class tool. But I want to use a class-based trigram model.
How do I get class-based trigram counts and probabilities using SRILM?

I also want to ask whether anyone knows a freely available tool for
word clustering using trigram counts? And is it possible to create a
class language model based on POS tags in SRILM?

Thank you for your help.
Robert

From stolcke at speech.sri.com Fri Feb 6 13:08:33 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 06 Feb 2004 13:08:33 PST
Subject: Class based 3-gram in SRILM
In-Reply-To: Your message of Fri, 06 Feb 2004 16:14:18 +0100. <200402061514.i16FEIg4005091@www4.pobox.sk>
Message-ID: <200402062108.NAA27238@huge>

In message <200402061514.i16FEIg4005091 at www4.pobox.sk>you wrote:
> Hi!
> I have a following problem. I've estimated a class-based bigram model
> (with some defined words excluded from the clustering process) using
> the ngram-class tool. But I want to use a class-based trigram model.
> How to get class-based trigram counts and probabilities using SRILM?

You use the "replace-words-with-classes" script and apply the class
definitions to your training data.
Then you train a trigram LM in the usual way. See training-scripts(1).

>
> I also want to ask whether anyone knows a freely available tool for
> word clustering using trigram counts? And it is possible to create a
> class language model based on POS-tags in SRILM?

I don't know of any available implementation of trigram-based word
clustering, but it would be quite expensive (slow) to do.
I believe some work by Philips/Aachen researchers showed that the
improvement over bigram-induced classes (in a higher-order class-based LM)
is pretty small. Anyway, bigram-induced classes are what most everybody
uses these days.

As for POS-based LMs, all you need is a tagger (and there are many out
there) to tag your training data. Then you use the tagged data to
train a tag n-gram model in the usual way. (You can also estimate the
class-membership probabilities from the tagging results.)
You could use the disambig tool to do the POS tagging itself, but since
it doesn't deal with morphological and other non-n-gram cues
(e.g., to handle unknown words) it won't be competitive with
state-of-the-art taggers.

--Andreas

From nlp at pobox.sk Wed Feb 11 02:46:13 2004
From: nlp at pobox.sk (Robert Wagner)
Date: Wed, 11 Feb 2004 11:46:13 +0100
Subject: Default smoothing in ngram-count
Message-ID: <200402111046.i1BAk4ut006683@www6.pobox.sk>

Hi to all!

I haven't found anywhere in SRILM's documentation what the default
smoothing option of ngram-count is. Is it Katz backoff?

I have also got the following warning: discount coeff 1 is out of range.
What does it mean? Is it a bad thing?

I would also like to know whether there is some kind of compatibility
between SRILM and the CMU language modeling toolkit, i.e. whether it is
possible to use n-gram counts obtained with the CMU toolkit in SRILM and
vice versa.

And a last question (probably stupid ;-)): What are reverse n-grams good for?

Thanks
Robert

From wavrow at hotmail.com Mon Feb 16 23:04:33 2004
From: wavrow at hotmail.com (Shlomo Wavrow)
Date: Tue, 17 Feb 2004 09:04:33 +0200
Subject: New SRILM released - ngram-class -save option
Message-ID:

Hello,

as a new version of SRILM has been released, I would also like to add one
item to the "wishlist". It would be nice to change the -save option of
ngram-class a bit. Currently you can only make ngram-class save classes
every S iterations, which causes a plethora of class files to be written
to disk. It would be nice to add something like a "-startsave" option that
starts saving classes only after a user-defined number of classes has been
reached. It would also be useful to be able to continue an interrupted
merging run from a previously saved class file.
I hope these remarks will help you.
Regards

Shlomo

From desaikey at egr.msu.edu Wed Feb 18 13:16:58 2004
From: desaikey at egr.msu.edu (desaikey)
Date: Wed, 18 Feb 2004 16:16:58 -0500
Subject: Sentence generation using SRILM
Message-ID: <000001c3f664$8b1e5810$ef8c0923@Keyur>

Hi,

I am trying to generate a set of random sentences using a specified n-gram
language model. The command and related flags I am using are:

    ngram -lm x.lm -gen Xno

When I use a small-vocabulary (AN4 CMU database) whole-word trigram LM,
the tool is able to generate sentences. But with other LMs (bigram or
unigram) the generated sentences are excessively long or the tool takes
too much time. With spelling-based LMs the tool is either not able to
generate sentences at all, or the sentences are again far too long (even
for a trigram).

Please share any ideas/experience you have about such a problem. Thanks in
advance for your help.

Keyur

-------------------------------------------
KEYUR DESAI
Graduate Student
Department of Electrical and Computer Eng.
Michigan State University
Email:desaikey at egr.msu.edu
Phone:(517)664-1802

From tanel.alumae at aqris.com Thu Feb 19 07:13:56 2004
From: tanel.alumae at aqris.com (Tanel Alumäe)
Date: Thu, 19 Feb 2004 17:13:56 +0200
Subject: Class expansion
Message-ID: <1077203635.13538.38.camel@NOOL2>

Hello,

I'm trying to convert a class bigram to its equivalent word n-gram,
using the "ngram" tool with the -expand-classes option. The class model
has 1000 classes, and there are 60000 words. I use the following command
line:

    ngram -lm -classes -expand-classes 2 -write-lm

The process runs about 15 minutes using over 700M of RAM, and then gets
killed by the OS (I'm using Linux), probably when it asked for even more
memory than the OS had (I have 512M of main memory).

Is it normal that the class expansion takes that much RAM? Is there a
way around it?

Thanks and regards,

--
Tanel Alumäe

From stolcke at speech.sri.com Thu Feb 19 10:08:08 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 19 Feb 2004 10:08:08 PST
Subject: Class expansion
In-Reply-To: Your message of Thu, 19 Feb 2004 17:13:56 +0200. <1077203635.13538.38.camel@NOOL2>
Message-ID: <200402191808.KAA13923@huge>

In message <1077203635.13538.38.camel at NOOL2>you wrote:
> Hello,
>
> I'm trying to to convert a class bigram to its equivalent word n-gram,
> using the "ngram" tool with the -expand-classes option. The class model
> has 1000 classes, and there are 60000 words. I use the following command
> line:
>
> ngram -lm -classes -expand-classes 2
> -write-lm
>
> The process runs about 15 minutes using over 700M of RAM, and then gets
> killed by the OS (I'm using Linux), probably when it asked even more
> memory that the OS didn't have (I have 512M of main memory).
>
> Is it normal that the class expansion takes that much RAM? Is there a
> way around it?

It is expected. You're seeing a combinatorial explosion of ngrams as the
classes get expanded. In general it is not feasible to expand a
large-vocabulary class LM with several hundred classes.
ngram -expand-classes was designed for medium-vocabulary class LMs,
especially ones with hand-designed classes. It works fine for domains
like ATIS, SPINE, Communicator, etc.

There is a way around it, but it would require some coding.
You could do the class expansion and interleave it with ngram pruning.
In other words, right after you expand all the class ngrams that share
a word ngram context you perform entropy-based pruning to retain only
those that "matter". This should dramatically reduce the size of the
expanded model.

--Andreas

From wavrow at hotmail.com Mon Feb 23 02:13:57 2004
From: wavrow at hotmail.com (Shlomo Wavrow)
Date: Mon, 23 Feb 2004 12:13:57 +0200
Subject: ngram-count : -tagged option
Message-ID:

Hello everybody!
Does anybody have any experience of using the -tagged option of ngram-count?
I thought that the -tagged option means that ngram-count creates a tag-based
model, but I got strange results. The resulting counts file contains a kind
of mixture of words and tags... My input text file has the following form:

    word1/tag1 word2/tag2 .... wordN/tagN

Regards

Shlomo

From stolcke at speech.sri.com Mon Feb 23 14:03:34 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 23 Feb 2004 14:03:34 PST
Subject: ngram-count : -tagged option
In-Reply-To: Your message of Mon, 23 Feb 2004 12:13:57 +0200.
Message-ID: <200402232203.OAA23323@huge>

In message you wrote:
> Hello everybody!
> Does anybody has any experience of using -tagged option to ngram-count? I
> thought that -tagged option means that ngram-count creates tag-based model,
> but I got strange results. In the resulting counts-file appear a kind of
> mixture of words and tags... My input text file has a following form:
> word1/tag1 word2/tag2 .... wordN/tagN

This option is for building ngram LMs that use the word class for backoff,
and thus hopefully improved smoothing. It is not documented, I'm afraid,
so it will be hard to use unless you are willing to look closely at the code.
I remember someone on this list reported a bug with the code a while back,
so maybe there are some people out there who can help. Also, there is
a small example in the test suite (test/tests/tagged-ngram).

I should note that the "factored N-gram" models recently added to SRILM
(release 1.4) are a generalization of tagged N-grams, and there is good
documentation for those. So you might want to think about reformulating
whatever it is you are thinking of as a factored LM.

--Andreas

From s0343879 at sms.ed.ac.uk Tue Feb 24 07:54:17 2004
From: s0343879 at sms.ed.ac.uk (G Hofer)
Date: Tue, 24 Feb 2004 15:54:17 +0000
Subject: decode lattice
Message-ID: <1077638057.403b73a921e70@sms.ed.ac.uk>

Hi,

We are using the SRILM 1.4 toolkit. So far we have created a lattice in
the HTK format and a 2-gram model in the ARPA format. It is not clear from
the manual page how to decode this lattice using the 2-gram model. Can you
give us the correct options for lattice-tool to accomplish this, if this
is the correct tool to use?

thank you,
Gregor

From stolcke at speech.sri.com Tue Feb 24 08:42:28 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 24 Feb 2004 08:42:28 PST
Subject: decode lattice
In-Reply-To: Your message of Tue, 24 Feb 2004 15:54:17 +0000. <1077638057.403b73a921e70@sms.ed.ac.uk>
Message-ID: <200402241642.IAA11315@huge>

In message <1077638057.403b73a921e70 at sms.ed.ac.uk>you wrote:
> Hi,
>
> We are using the sri lm 1.4 toolkit. As for now we have created a lattice in
> the htk format and a 2gram model in the Arpa format. It is not clear from the
> manual page how to decode this lattice uing the 2-gram model. Can you give us
> the correct options for the lattice-tool to accomplish this if this is the
> correct tool to use?
>
> thank you,
> Gregor

Gregor,

You would run lattice-tool twice, first to rescore the lattices with
your LM, then to extract the best hypothesis.

For LM rescoring use options

    lattice-tool -read-htk -write-htk -order 2 -lm LM -no-nulls

For 1-best decoding use

    lattice-tool -read-htk -htk-lmscale LMWEIGHT -viterbi-decode

You could also perform posterior-based (confusion network) decoding using

    lattice-tool -read-htk -htk-lmscale LMWEIGHT -posterior-decode

where LMWEIGHT is the weight given to the LM scores relative to the
acoustic model scores.

Of course you need to add options specifying the location of
input/output lattices.

A future version of lattice-tool will probably allow you to combine
these two steps into a single run, but for now you have to do it this way.

--Andreas

From stolcke at speech.sri.com Thu Feb 26 11:05:43 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 26 Feb 2004 11:05:43 PST
Subject: decode lattice
In-Reply-To: Your message of Wed, 25 Feb 2004 01:25:56 +0000.
Message-ID: <200402261905.LAA08246@tonga>

In message you wrote:
> Dear Andreas,
>
> I am replying on behalf of my colleague who emailed you earlier regarding
> correct use of the SRILM lattice-tool.
>
> Based on your previous advice I have tried to decode our lattice using our
> bigram model. All files seem to be in the correct format, so far as I can
> tell. However, when lattice-tool rescores the lattice, all the newly added
> LM probabilities "l=..." come out as "-inf". I tried 1-best decoding using
> viterbi on the rescored lattice and the output is simply:
>
> lattice.out
>
> where lattice.out is the utterance name inserted by lattice-tool.
>
> Do you have any idea why we're experiencing behaviour like this? Can you
> suggest any alterations?

John,

the problem is that your lattices use double quotes around the word
strings, but the released version of SRILM doesn't yet implement the HTK
quoting mechanism (an oversight on my part).
You can replace the file lattice/src/HTKLattice.cc with the attached
version and rebuild lattice-tool to make it work.
Or, you can just strip the double quotes in your lattice files and keep
using the old software.

--Andreas

-------------- next part --------------
/*
 * HTKLattice.cc --
 *      HTK Standard Lattice Format support for SRILM lattices
 *
 *      Note: there is no separate HTKLattice class, only I/O methods!
 *
 */

#ifndef lint
static char Copyright[] = "Copyright (c) 2004 SRI International. All Rights Reserved.";
static char RcsId[] = "@(#)$Header: /home/srilm/devel/lattice/src/RCS/HTKLattice.cc,v 1.17 2004/02/26 18:48:22 stolcke Exp $";
#endif

#include
#include
#include
#include
#include
#include
#include

#include "Array.cc"
#include "LHash.cc"
#include "Lattice.h"
#include "MultiwordVocab.h"
#include "NBest.h"              // for phoneSeparator defn

#ifdef INSTANTIATE_TEMPLATES
INSTANTIATE_ARRAY(HTKLink);
#endif

/* from Lattice.cc */
#define DebugPrintFatalMessages         1
#define DebugPrintFunctionality         1

const char *HTKLattice_Version = "1.1";

const float HTK_undef_float = HUGE_VAL;
const unsigned HTK_undef_uint = (unsigned)-1;

const char *HTK_null_word = "!NULL";

const float HTK_def_tscale = 1.0;
const float HTK_def_acscale = 1.0;
const float HTK_def_lmscale = 1.0;
const float HTK_def_ngscale = 1.0;
const float HTK_def_wdpenalty = 0.0;
const float HTK_def_prscale = 1.0;
const float HTK_def_duscale = 0.0;

HTKHeader::HTKHeader()
    : logbase(10), tscale(HTK_def_tscale), acscale(HTK_def_acscale),
      ngscale(HTK_def_ngscale), lmscale(HTK_def_lmscale),
      wdpenalty(HTK_def_wdpenalty), prscale(HTK_def_prscale),
      duscale(HTK_def_duscale), amscale(HTK_undef_float),
      vocab(0), lmname(0), ngname(0), hmms(0),
      wordsOnNodes(false), scoresOnNodes(false)
{
};

HTKHeader::HTKHeader(double acscale, double lmscale, double ngscale,
                     double prscale, double duscale, double wdpenalty)
    : logbase(10), tscale(HTK_def_tscale), acscale(acscale),
      ngscale(ngscale), lmscale(lmscale),
      wdpenalty(wdpenalty), prscale(prscale),
      duscale(duscale), amscale(HTK_undef_float),
      vocab(0), lmname(0), ngname(0), hmms(0),
      wordsOnNodes(false), scoresOnNodes(false)
{
};

HTKHeader::~HTKHeader()
{
    if (vocab) free(vocab);
    if (lmname) free(lmname);
    if (ngname) free(ngname);
    if (hmms) free(hmms);
}

HTKHeader &
HTKHeader::operator= (const HTKHeader &other)
{
    if (&other == this) {
        return *this;
    }

    if (vocab) free(vocab);
    if (lmname) free(lmname);
    if (ngname) free(ngname);
    if (hmms) free(hmms);

    tscale = other.tscale;
    acscale = other.acscale;
    ngscale = other.ngscale;
    lmscale = other.lmscale;
    wdpenalty = other.wdpenalty;
    prscale = other.prscale;
    duscale = other.duscale;
    amscale = other.amscale;

    if (other.vocab == 0) {
        vocab = 0;
    } else {
        vocab = strdup(other.vocab);
        assert(vocab != 0);
    }

    if (other.lmname == 0) {
        lmname = 0;
    } else {
        lmname = strdup(other.lmname);
        assert(lmname != 0);
    }

    if (other.ngname == 0) {
        ngname = 0;
    } else {
        ngname = strdup(other.ngname);
        assert(ngname != 0);
    }

    if (other.hmms == 0) {
        hmms = 0;
    } else {
        hmms = strdup(other.hmms);
        assert(hmms != 0);
    }

    return *this;
}

HTKLink::HTKLink()
    : time(HTK_undef_float), word(Vocab_None), var(HTK_undef_uint), div(0),
      acoustic(HTK_undef_float), ngram(HTK_undef_float),
      language(HTK_undef_float), pron(HTK_undef_float),
      duration(HTK_undef_float), posterior(HTK_undef_float)
{
}

HTKLink::~HTKLink()
{
    if (div) free(div);
}

HTKLink &
HTKLink::operator= (const HTKLink &other)
{
    if (&other == this) {
        return *this;
    }

    if (div) free(div);

    time = other.time;
    word = other.word;
    var = other.var;
    if (other.div == 0) {
        div = 0;
    } else {
        div = strdup(other.div);
        assert(div != 0);
    }
    acoustic = other.acoustic;
    ngram = other.ngram;
    language = other.language;
    pron = other.pron;
    duration = other.duration;
    posterior = other.posterior;

    return *this;
}

/*
 * Format HTKLink (for debugging)
 */
ostream &
operator<< (ostream &stream, HTKLink &link)
{
    stream << "[HTKLink";

    if (link.word != Vocab_None) {
        stream << " WORD=" << link.word;
    }
    if (link.time != HTK_undef_float) {
        stream << " time=" << link.time;
    }
    if (link.var != HTK_undef_uint) {
        stream << " var=" << link.var;
    }
    if (link.div != 0) {
        stream << " div=" << link.div;
    }
    if (link.acoustic != HTK_undef_float) {
        stream << " a=" << link.acoustic;
    }
    if (link.ngram != HTK_undef_float) {
        stream << " n=" << link.ngram;
    }
    if (link.language != HTK_undef_float) {
        stream << " l=" << link.language;
    }
    if (link.pron != HTK_undef_float) {
        stream << " r=" << link.pron;
    }
    if (link.duration != HTK_undef_float) {
        stream << " ds=" << link.duration;
    }
    if (link.posterior != HTK_undef_float) {
        stream << " p=" << link.posterior;
    }
    stream << "]";
    return stream;
}

/*
 * Find the next key=value pair in line, return string value, and
 * advance line pointer past it.
 * The string pointed to by line is modified in the process.
 */
static char *
getHTKField(char *&line, char *&value)
{
    char *cp = line;
    char *key;

    do {
        switch (*cp) {
        case '\0':
        case '#':
            return 0;
            break;
        case ' ':
        case '\t':
        case '\n':
            cp ++;
            break;
        default:
            key = cp;
            while (*cp != '\0' && !isspace(*cp) && *cp != '=') cp++;

            if (*cp == '=') {
                *(cp++) = '\0';         // terminate key string
                value = cp;             // beginning of value string
                char *cpv = cp;         // target location for copying value

                char inquote = '\0';

                /*
                 * Quotes are only treated specially if they
                 * occur in first position
                 */
                if (*cp == '\"' || *cp == '\'') {
                    inquote = *(cp++);
                }

                while (*cp != '\0') {
                    if (*cp == '\\') {
                        /*
                         * Backslash quote processing
                         */
                        cp ++;
                        if (*cp == '\0') {
                            /*
                             * Shouldn't happen, we just ignore it
                             */
                            break;
                        } else if (*cp == '0') {
                            /*
                             * Octal char code
                             */
                            unsigned charcode;
                            unsigned charlen;
                            sscanf(cp, "%o%n", &charcode, &charlen);
                            *(cpv++) = charcode;
                            cp += charlen;
                        } else {
                            /*
                             * Other quoted character
                             */
                            *(cpv++) = *(cp++);
                        }
                    } else if (!inquote && isspace(*cp)) {
                        /*
                         * String delimited by whitespace
                         */
                        cp ++;
                        break;
                    } else if (inquote && *cp == inquote) {
                        /*
                         * String delimited by end quote
                         */
                        cp ++;
                        break;
                    } else {
                        /*
                         * Character in string
                         */
                        *(cpv++) = *(cp++);
                    }
                }
                *cpv = '\0';            // terminate value string
            } else {
                value = cp;             // beginning of value string
                if (*cp != '\0') {
                    *(cp++) = '\0';     // terminate value string
                }
            }

            line = cp;
            return key;
        }
    } while (1);
}

/*
 * Output quoted version of string
 */
static void
printQuoted(FILE *f, const char *name)
{
    Boolean octalPrinted = false;

    for (const char *cp = name; *cp != '\0'; cp ++) {
        if (*cp == ' ' || *cp == '\\' || *cp == '\'' || *cp == '\"' ||
            octalPrinted && isdigit(*cp))
        {
            /*
             * This character needs to be quoted
             */
            putc('\\', f);
            putc(*cp, f);
            octalPrinted = false;
        } else if (!isprint(*cp) || isspace(*cp)) {
            /*
             * Print as octal char code
             */
            fprintf(f, "\\0%o", *cp);
            octalPrinted = true;
        } else {
            /*
             * Print as plain character
             */
            putc(*cp, f);
            octalPrinted = false;
        }
    }
}

/*
 * Input lattice in HTK format
 * Algorithm:
 *      - each HTK node becomes a null node.
 *      - each HTK link becomes a non-null node.
 *      - word and other link information is added to the non-null nodes.
 *      - link information attached to HTK nodes is added to non-null nodes.
 *      - lattice transition weights are computed as a log-linear combination
 *        of HTK scores.
* Arguments: * - if header != 0, supplied scaling parameters override information * from lattice header * - if useNullNodes == false null nodes corresponding to original * HTK nodes are eliminated */ Boolean Lattice::readHTK(File &file, HTKHeader *header, Boolean useNullNodes) { removeAll(); unsigned HTKnumlinks = 0; unsigned HTKnumnodes = 0; float HTKlogbase = M_E; unsigned HTKfinal = HTK_undef_uint; unsigned HTKinitial = HTK_undef_uint; char HTKdirection = 'f'; unsigned HTKfirstnode = HTK_undef_uint; unsigned HTKlastnode = HTK_undef_uint; float HTKinitialtime, HTKfinaltime; LHash nodeMap; // maps HTK nodes->lattice nodes Array nodeInfoMap; // node-based link information // dummy word used temporarily to represent HTK nodes // (could have used null nodes, but this way we preserve null nodes in // the input lattice) const char *HTKNodeWord = "***HTK_Node***"; VocabIndex HTKNodeDummy = useNullNodes ? Vocab_None : vocab.addWord(HTKNodeWord); /* * Override supplied header parameters */ if (header != 0) { if (header->logbase != HTK_undef_float) { htkheader.logbase = header->logbase; } if (header->acscale != HTK_undef_float) { htkheader.acscale = header->acscale; } if (header->lmscale != HTK_undef_float) { htkheader.lmscale = header->lmscale; } if (header->ngscale != HTK_undef_float) { htkheader.ngscale = header->ngscale; } if (header->prscale != HTK_undef_float) { htkheader.prscale = header->prscale; } if (header->duscale != HTK_undef_float) { htkheader.duscale = header->duscale; } if (header->wdpenalty != HTK_undef_float) { htkheader.wdpenalty = header->wdpenalty; } if (header->amscale != HTK_undef_float) { htkheader.amscale = header->amscale; } htkheader.wordsOnNodes = header->wordsOnNodes; htkheader.scoresOnNodes = header->scoresOnNodes; } /* * Parse HTK lattice file */ while (char *line = file.getline()) { char *key; char *value; /* * Parse key=value pairs * (we test for frequent fields first to save time) * We assume that header information comes before node 
         * information,
         * which comes before link information.  However, this is not
         * enforced, and incomplete lattices may result if the input file
         * contains things out of order.
         */
        while (key = getHTKField(line, value)) {
#define keyis(x)    (strcmp(key, (x)) == 0)
            /*
             * Link fields
             */
            if (keyis("J")) {
                unsigned HTKlinkno = atoi(value);

                /*
                 * parse link fields
                 */
                HTKLink *linkinfo = new HTKLink;
                assert(linkinfo != 0);

                // allocates new HTKLink pointer in lattice
                htkinfos[htkinfos.size()] = linkinfo;

                unsigned HTKstartnode, HTKendnode;
                NodeIndex startIndex = NoNode, endIndex = NoNode;

                while (key = getHTKField(line, value)) {
                    if (keyis("S") || keyis("START")) {
                        HTKstartnode = atoi(value);

                        Boolean found;
                        NodeIndex *startIndexPtr =
                                    nodeMap.insert(HTKstartnode, found);
                        if (!found) {
                            // node index not seen before; create it
                            *startIndexPtr = dupNode(Vocab_None);
                        }
                        startIndex = *startIndexPtr;
                    } else if (keyis("E") || keyis("END")) {
                        HTKendnode = atoi(value);

                        Boolean found;
                        NodeIndex *endIndexPtr =
                                    nodeMap.insert(HTKendnode, found);
                        if (!found) {
                            // node index not seen before; create it
                            *endIndexPtr = dupNode(Vocab_None);
                        }
                        endIndex = *endIndexPtr;
                    } else if (keyis("W") || keyis("WORD")) {
                        if (strcmp(value, HTK_null_word) == 0) {
                            linkinfo->word = Vocab_None;
                        } else {
                            linkinfo->word = vocab.addWord(value);
                        }
                    } else if (keyis("v") || keyis("var")) {
                        linkinfo->var = atoi(value);
                    } else if (keyis("d") || keyis("div")) {
                        linkinfo->div = strdup(value);
                        assert(linkinfo->div != 0);
                    } else if (keyis("a") || keyis("acoustic")) {
                        double score = atof(value);
                        if (HTKlogbase > 0.0) {
                            linkinfo->acoustic = score * ProbToLogP(HTKlogbase);
                        } else {
                            linkinfo->acoustic = ProbToLogP(score);
                        }
                    } else if (keyis("n") || keyis("ngram")) {
                        double score = atof(value);
                        if (HTKlogbase > 0.0) {
                            linkinfo->ngram = score * ProbToLogP(HTKlogbase);
                        } else {
                            linkinfo->ngram = ProbToLogP(score);
                        }
                    } else if (keyis("l") || keyis("language")) {
                        double score = atof(value);
                        if (HTKlogbase > 0.0) {
                            linkinfo->language = score
                                        * ProbToLogP(HTKlogbase);
                        } else {
                            linkinfo->language = ProbToLogP(score);
                        }
                    } else if (keyis("r")) {
                        double score = atof(value);
                        if (HTKlogbase > 0.0) {
                            linkinfo->pron = score * ProbToLogP(HTKlogbase);
                        } else {
                            linkinfo->pron = ProbToLogP(score);
                        }
                    } else if (keyis("ds")) {
                        double score = atof(value);
                        if (HTKlogbase > 0.0) {
                            linkinfo->duration = score * ProbToLogP(HTKlogbase);
                        } else {
                            linkinfo->duration = ProbToLogP(score);
                        }
                    } else if (keyis("p")) {
                        linkinfo->posterior = atof(value);
                    } else {
                        file.position() << "unexpected link field name "
                                        << key << endl;
                        if (!useNullNodes) vocab.remove(HTKNodeDummy);
                        return false;
                    }
                }

                if (startIndex == NoNode) {
                    file.position() << "missing start node spec\n";
                    if (!useNullNodes) vocab.remove(HTKNodeDummy);
                    return false;
                }
                if (endIndex == NoNode) {
                    file.position() << "missing end node spec\n";
                    if (!useNullNodes) vocab.remove(HTKNodeDummy);
                    return false;
                }

                /*
                 * fill in unspecified link info from associated node info
                 *      'forward' lattices use end-node information.
                 *      'backward' lattices use start-node information.
                 */
                HTKLink *nodeinfo = 0;
                if (HTKdirection == 'f') {
                    nodeinfo = &nodeInfoMap[HTKendnode];
                } else if (HTKdirection == 'b') {
                    nodeinfo = &nodeInfoMap[HTKstartnode];
                }
                if (nodeinfo != 0) {
                    linkinfo->time = nodeinfo->time;

                    if (linkinfo->word == Vocab_None) {
                        linkinfo->word = nodeinfo->word;
                    }
                    if (linkinfo->var == HTK_undef_uint) {
                        linkinfo->var = nodeinfo->var;
                    }
                    if (linkinfo->div == 0 && nodeinfo->div != 0) {
                        linkinfo->div = strdup(nodeinfo->div);
                        assert(linkinfo->div != 0);
                    }
                    if (linkinfo->acoustic == HTK_undef_float) {
                        linkinfo->acoustic = nodeinfo->acoustic;
                    }
                    if (linkinfo->pron == HTK_undef_float) {
                        linkinfo->pron = nodeinfo->pron;
                    }
                    if (linkinfo->duration == HTK_undef_float) {
                        linkinfo->duration = nodeinfo->duration;
                    }
                }

                /*
                 * Create lattice node
                 */
                NodeIndex newNode = dupNode(linkinfo->word, 0, linkinfo);

                /*
                 * Compute lattice transition weight as a weighted combination
                 * of HTK lattice scores
                 */
                LogP weight = LogP_One;

                if (linkinfo->acoustic != HTK_undef_float) {
                    weight += htkheader.acscale * linkinfo->acoustic;
                }
                if (linkinfo->ngram != HTK_undef_float) {
                    weight += htkheader.ngscale * linkinfo->ngram;
                }
                if (linkinfo->language != HTK_undef_float) {
                    weight += htkheader.lmscale * linkinfo->language;
                }
                if (linkinfo->pron != HTK_undef_float) {
                    weight += htkheader.prscale * linkinfo->pron;
                }
                if (linkinfo->duration != HTK_undef_float) {
                    weight += htkheader.duscale * linkinfo->duration;
                }
                if (!ignoreWord(linkinfo->word)) {
                    weight += htkheader.wdpenalty;  // do we need to scale ?
                }

                /*
                 * Add transitions from start node, and to end node
                 */
                LatticeTransition trans1(weight, 0);
                insertTrans(startIndex, newNode, trans1);
                LatticeTransition trans2(LogP_One, 0);
                insertTrans(newNode, endIndex, trans2);

                continue;

            /*
             * Node fields
             */
            } else if (keyis("I")) {
                unsigned HTKnodeno = atoi(value);

                /*
                 * create a null node for this HTK node,
                 * and record node-related info.
                 */
                NodeIndex nullNodeIndex = dupNode(HTKNodeDummy);
                *nodeMap.insert(HTKnodeno) = nullNodeIndex;

                HTKLink &nodeinfo = nodeInfoMap[HTKnodeno];

                /*
                 * parse node fields
                 */
                while (key = getHTKField(line, value)) {
                    if (keyis("t") || keyis("time")) {
                        nodeinfo.time = atof(value);

                        // remember temporally first node and timestamp
                        // in case input doesn't specify initial node
                        if (HTKfirstnode == HTK_undef_uint ||
                            nodeinfo.time < HTKinitialtime)
                        {
                            HTKfirstnode = HTKnodeno;
                            HTKinitialtime = nodeinfo.time;
                        }

                        // same for last timestamp
                        if (HTKlastnode == HTK_undef_uint ||
                            nodeinfo.time > HTKfinaltime)
                        {
                            HTKlastnode = HTKnodeno;
                            HTKfinaltime = nodeinfo.time;
                        }
                    } else if (keyis("W") || keyis("WORD")) {
                        if (strcmp(value, HTK_null_word) == 0) {
                            nodeinfo.word = Vocab_None;
                        } else {
                            nodeinfo.word = vocab.addWord(value);
                        }
                    } else if (keyis("v") || keyis("var")) {
                        nodeinfo.var = atoi(value);
                    } else if (keyis("d") || keyis("div")) {
                        nodeinfo.div = strdup(value);
                        assert(nodeinfo.div != 0);
                    } else if (keyis("a") || keyis("acoustic")) {
                        double score = atof(value);
                        if (HTKlogbase > 0.0) {
                            nodeinfo.acoustic = score * ProbToLogP(HTKlogbase);
                        } else {
                            nodeinfo.acoustic = ProbToLogP(score);
                        }
                    } else if (keyis("r")) {
                        double score = atof(value);
                        if (HTKlogbase > 0.0) {
                            nodeinfo.pron = score * ProbToLogP(HTKlogbase);
                        } else {
                            nodeinfo.pron = ProbToLogP(score);
                        }
                    } else if (keyis("ds")) {
                        double score = atof(value);
                        if (HTKlogbase > 0.0) {
                            nodeinfo.duration = score * ProbToLogP(HTKlogbase);
                        } else {
                            nodeinfo.duration = ProbToLogP(score);
                        }
                    } else {
                        file.position() << "unexpected node field name "
                                        << key << endl;
                        if (!useNullNodes) vocab.remove(HTKNodeDummy);
                        return false;
                    }
                }

                if (nodeinfo.time != HTK_undef_float) {
                    // record node time, but no word-related info
                    LatticeNode *nullNode = findNode(nullNodeIndex);
                    assert(nullNode != 0);

                    HTKLink *nullInfo = new HTKLink;
                    assert(nullInfo != 0);
                    htkinfos[htkinfos.size()] = nullInfo;

                    nullNode->htkinfo = nullInfo;
                    nullInfo->time = nodeinfo.time;
                }

                continue;

            /*
             * Header fields
             */
            } else if (keyis("V") || keyis("VERSION")) {
                ;   // ignore
            } else if (keyis("U") || keyis("UTTERANCE")) {
                if (name) free((void *)name);

                // HACK: strip duration spec (which shouldn't be there)
                char *p = strstr(value, "(duration=");
                if (p != 0) *p = '\0';

                name = strdup(value);
                assert(name != 0);
            } else if (keyis("base")) {
                HTKlogbase = atof(value);
            } else if (keyis("start")) {
                HTKinitial = atoi(value);
            } else if (keyis("end")) {
                HTKfinal = atoi(value);
            } else if (keyis("dir")) {
                HTKdirection = value[0];
            } else if (keyis("tscale")) {
                htkheader.tscale = atof(value);
            } else if (keyis("hmms")) {
                htkheader.hmms = strdup(value);
                assert(htkheader.hmms != 0);
            } else if (keyis("ngname")) {
                htkheader.ngname = strdup(value);
                assert(htkheader.ngname != 0);
            } else if (keyis("lmname")) {
                htkheader.lmname = strdup(value);
                assert(htkheader.lmname != 0);
            } else if (keyis("vocab")) {
                htkheader.vocab = strdup(value);
                assert(htkheader.vocab != 0);
            } else if (keyis("acscale")) {
                if (header == 0 || header->acscale == HTK_undef_float) {
                    htkheader.acscale = atof(value);
                }
            } else if (keyis("ngscale")) {
                if (header == 0 || header->ngscale == HTK_undef_float) {
                    htkheader.ngscale = atof(value);
                }
            } else if (keyis("lmscale")) {
                if (header == 0 || header->lmscale == HTK_undef_float) {
                    htkheader.lmscale = atof(value);
                }
            } else if (keyis("prscale")) {
                if (header == 0 || header->prscale == HTK_undef_float) {
                    htkheader.prscale = atof(value);
                }
            } else if (keyis("duscale")) {
                if (header == 0 || header->duscale == HTK_undef_float) {
                    htkheader.duscale = atof(value);
                }
            } else if (keyis("wdpenalty")) {
                if (header == 0 || header->wdpenalty == HTK_undef_float) {
                    htkheader.wdpenalty = atof(value);
                }
            } else if (keyis("amscale")) {
                if (header == 0 || header->amscale == HTK_undef_float) {
                    htkheader.amscale = atof(value);
                }
            } else if (keyis("NODES") || keyis("N")) {
                HTKnumnodes = atoi(value);
            } else if (keyis("LINKS") || keyis("L")) {
                HTKnumlinks = atoi(value);
            } else {
                file.position() << "unknown field name " << key << endl;
                if (!useNullNodes) vocab.remove(HTKNodeDummy);
                return false;
            }
#undef keyis
        }
    }

    if (HTKnumnodes == 0) {
        file.position() << "lattice has no nodes\n";
        if (!useNullNodes) vocab.remove(HTKNodeDummy);
        return false;
    }

    /*
     * Set up initial node
     */
    HTKLink *initialinfo;
    LatticeNode *initialNode;

    if (HTKinitial != HTK_undef_uint) {
        initialinfo = &nodeInfoMap[HTKinitial];

        NodeIndex *initialPtr = nodeMap.find(HTKinitial);
        if (initialPtr) {
            initial = *initialPtr;
            initialNode = findNode(initial);
        } else {
            file.position() << "undefined start node " << HTKinitial << endl;
            if (!useNullNodes) vocab.remove(HTKNodeDummy);
            return false;
        }
    } else {
        initialinfo = &nodeInfoMap[HTKfirstnode];

        // search for start node: the one without incoming transitions
        LHashIter<NodeIndex, LatticeNode> nodeIter(nodes);
        NodeIndex nodeIndex;
        while (LatticeNode *node = nodeIter.next(nodeIndex)) {
            if (node->inTransitions.numEntries() == 0) {
                initial = nodeIndex;
                initialNode = node;
                break;
            }
        }
    }

    initialNode->word = vocab.ssIndex();

    // attach HTK initial node info to lattice initial node
    initialNode->htkinfo = new HTKLink;
    *initialNode->htkinfo = *initialinfo;
    htkinfos[htkinfos.size()] = initialNode->htkinfo;

    /*
     * Set up final node
     */
    HTKLink *finalinfo;
    LatticeNode *finalNode;

    if (HTKfinal != HTK_undef_uint) {
        finalinfo = &nodeInfoMap[HTKfinal];

        NodeIndex *finalPtr = nodeMap.find(HTKfinal);
        if (finalPtr) {
            final = *finalPtr;
            finalNode = findNode(final);
        } else {
            file.position() << "undefined end node " << HTKfinal << endl;
            if (!useNullNodes) vocab.remove(HTKNodeDummy);
            return false;
        }
    } else {
        finalinfo = &nodeInfoMap[HTKlastnode];

        // search for end node: the one without outgoing transitions
        LHashIter<NodeIndex, LatticeNode> nodeIter(nodes);
        NodeIndex nodeIndex;
        while (LatticeNode *node = nodeIter.next(nodeIndex)) {
            if (node->outTransitions.numEntries() == 0) {
                final = nodeIndex;
                finalNode = node;
                break;
            }
        }
    }

    finalNode->word = vocab.seIndex();

    // attach HTK final node info to lattice final node
    finalNode->htkinfo = new
                HTKLink;
    *finalNode->htkinfo = *finalinfo;
    htkinfos[htkinfos.size()] = finalNode->htkinfo;

    // eliminate dummy nodes
    if (!useNullNodes) {
        removeAllXNodes(HTKNodeDummy);
        vocab.remove(HTKNodeDummy);
    }

    return true;
}

/*
 * Output lattice in HTK format
 * Algorithm:
 *      - each lattice node becomes an HTK node.
 *      - each lattice transition becomes an HTK link.
 *      - word information is added to the HTK nodes.
 *      - link information attached to each node is added to the HTK link
 *        leading into the node.
 *      - lattice transition weights are mapped to one of the
 *        HTK score fields as indicated by the second argument.
 */
Boolean
Lattice::writeHTK(File &file, HTKScoreMapping scoreMapping,
                                            Boolean printPosteriors)
{
    if (debug(DebugPrintFunctionality)) {
        dout() << "Lattice::writeHTK: writing ";
    }

    fprintf(file, "# Header (generated by SRILM)\n");
    fprintf(file, "VERSION=%s\n", HTKLattice_Version);
    fprintf(file, "UTTERANCE=");
    printQuoted(file, name);
    fputc('\n', file);
    fprintf(file, "base=%g\n", htkheader.logbase);
    fprintf(file, "dir=%s\n", "f");     // forward lattice

    /*
     * Ancillary header information preserved from readHTK()
     */
    if (htkheader.tscale != HTK_def_tscale) {
        fprintf(file, "tscale=%g\n", htkheader.tscale);
    }
    if (htkheader.acscale != HTK_def_acscale) {
        fprintf(file, "acscale=%g\n", htkheader.acscale);
    }
    if (htkheader.lmscale != HTK_def_lmscale) {
        fprintf(file, "lmscale=%g\n", htkheader.lmscale);
    }
    if (htkheader.ngscale != HTK_def_ngscale) {
        fprintf(file, "ngscale=%g\n", htkheader.ngscale);
    }
    if (htkheader.prscale != HTK_def_prscale) {
        fprintf(file, "prscale=%g\n", htkheader.prscale);
    }
    if (htkheader.duscale != HTK_def_duscale) {
        fprintf(file, "duscale=%g\n", htkheader.duscale);
    }
    if (htkheader.amscale != HTK_undef_float && printPosteriors) {
        fprintf(file, "amscale=%g\n", htkheader.amscale);
    }
    if (htkheader.hmms != 0) {
        fprintf(file, "hmms=");
        printQuoted(file, htkheader.hmms);
        fputc('\n', file);
    }
    if (htkheader.lmname != 0) {
        fprintf(file, "lmname=");
        printQuoted(file,
                    htkheader.lmname);
        fputc('\n', file);
    }
    if (htkheader.ngname != 0) {
        fprintf(file, "ngname=");
        printQuoted(file, htkheader.ngname);
        fputc('\n', file);
    }
    if (htkheader.vocab != 0) {
        fprintf(file, "vocab=");    // value is written by printQuoted() below
        printQuoted(file, htkheader.vocab);
        fputc('\n', file);
    }

    /*
     * We remap the internal node indices to consecutive unsigned integers
     * to allow a compact output representation.
     * We iterate over all nodes, renumbering them, and also counting the
     * number of transitions overall.
     */
    LHash<NodeIndex, unsigned> nodeMap;     // map nodeIndex to unsigned
    unsigned numNodes = 0;
    unsigned numTransitions = 0;

    LHashIter<NodeIndex, LatticeNode> nodeIter(nodes, nodeSort);
    NodeIndex nodeIndex;

    while (LatticeNode *node = nodeIter.next(nodeIndex)) {
        *nodeMap.insert(nodeIndex) = numNodes ++;
        numTransitions += node->outTransitions.numEntries();
    }

    fprintf(file, "start=%u end=%u\n", *nodeMap.find(initial),
                                       *nodeMap.find(final));
    fprintf(file, "NODES=%u LINKS=%u\n", numNodes, numTransitions);

    if (debug(DebugPrintFunctionality)) {
        dout() << numNodes << " nodes, "
               << numTransitions << " transitions\n";
    }

    fprintf(file, "# Nodes\n");

    double logscale = 1.0 / ProbToLogP(htkheader.logbase);

    nodeIter.init();
    while (LatticeNode *node = nodeIter.next(nodeIndex)) {
        fprintf(file, "I=%u", *nodeMap.find(nodeIndex));

        if (htkheader.wordsOnNodes) {
            fprintf(file, "\tW=");
            printQuoted(file, (node->word == vocab.ssIndex() ||
                               node->word == vocab.seIndex() ||
                               node->word == Vocab_None) ?
                                    HTK_null_word : vocab.getWord(node->word));
        }

        if (node->htkinfo != 0) {
            HTKLink &htkinfo = *node->htkinfo;

            if (htkinfo.time != HTK_undef_float) {
                fprintf(file, "\tt=%g", htkinfo.time);
            }
            if (htkheader.scoresOnNodes &&
                scoreMapping != mapHTKacoustic &&
                htkinfo.acoustic != HTK_undef_float)
            {
                fprintf(file, "\ta=%g", htkinfo.acoustic * logscale);
            }
            if (htkheader.scoresOnNodes &&
                htkinfo.pron != HTK_undef_float)
            {
                fprintf(file, "\tr=%g", htkinfo.pron * logscale);
            }
            if (htkheader.scoresOnNodes &&
                htkinfo.duration != HTK_undef_float)
            {
                fprintf(file, "\tds=%g", htkinfo.duration * logscale);
            }
            if (htkheader.wordsOnNodes &&
                htkinfo.var != HTK_undef_uint)
            {
                fprintf(file, "\tv=%u", htkinfo.var);
            }
            if (htkheader.wordsOnNodes &&
                htkinfo.div != 0)
            {
                fprintf(file, "\td=%s", htkinfo.div);
            }
        }
        if (printPosteriors) {
            fprintf(file, "\tp=%lg", (double)LogPtoProb(node->posterior));
        }
        fprintf(file, "\n");
    }

    fprintf(file, "# Links\n");

    unsigned linkNumber = 0;

    nodeIter.init();
    while (LatticeNode *node = nodeIter.next(nodeIndex)) {
        unsigned *fromNodeId = nodeMap.find(nodeIndex);

        NodeIndex toNodeIndex;

        TRANSITER_T<NodeIndex,LatticeTransition>
                            transIter(node->outTransitions);
        while (LatticeTransition *trans = transIter.next(toNodeIndex)) {
            LatticeNode *toNode = findNode(toNodeIndex);
            assert(toNode != 0);

            unsigned *toNodeId = nodeMap.find(toNodeIndex);
            assert(toNodeId != 0);

            fprintf(file, "J=%u\tS=%u\tE=%u", linkNumber++,
                                              *fromNodeId, *toNodeId);

            if (!htkheader.wordsOnNodes) {
                fprintf(file, "\tW=");
                printQuoted(file, (toNode->word == vocab.ssIndex() ||
                                   toNode->word == vocab.seIndex() ||
                                   toNode->word == Vocab_None) ?
                                    HTK_null_word : vocab.getWord(toNode->word));
            }

            if (toNode->htkinfo != 0) {
                HTKLink &htkinfo = *toNode->htkinfo;

                if (!htkheader.scoresOnNodes &&
                    scoreMapping != mapHTKacoustic &&
                    htkinfo.acoustic != HTK_undef_float)
                {
                    fprintf(file, "\ta=%g", htkinfo.acoustic * logscale);
                }
                if (!htkheader.scoresOnNodes &&
                    htkinfo.pron != HTK_undef_float)
                {
                    fprintf(file, "\tr=%g", htkinfo.pron * logscale);
                }
                if (!htkheader.scoresOnNodes &&
                    htkinfo.duration != HTK_undef_float)
                {
                    fprintf(file, "\tds=%g", htkinfo.duration * logscale);
                }
                if (!htkheader.wordsOnNodes &&
                    htkinfo.var != HTK_undef_uint)
                {
                    fprintf(file, "\tv=%u", htkinfo.var);
                }
                if (!htkheader.wordsOnNodes &&
                    htkinfo.div != 0)
                {
                    fprintf(file, "\td=%s", htkinfo.div);
                }
                if (scoreMapping != mapHTKngram &&
                    htkinfo.ngram != HTK_undef_float)
                {
                    fprintf(file, "\tn=%g", htkinfo.ngram * logscale);
                }
                if (scoreMapping != mapHTKlanguage &&
                    htkinfo.language != HTK_undef_float)
                {
                    fprintf(file, "\tl=%g", htkinfo.language * logscale);
                }
            }

            /*
             * map transition weight to one of the standard HTK scores
             */
            if (scoreMapping != mapHTKnone) {
                fprintf(file, "\t%c=%g",
                        (scoreMapping == mapHTKacoustic ? 'a' :
                         (scoreMapping == mapHTKngram ? 'n' :
                          (scoreMapping == mapHTKlanguage ?
                                                    'l' : '?'))),
                        trans->weight * logscale);
            }
            fprintf(file, "\n");
        }
    }

    return true;
}

/*
 * Compute pronunciation scores
 * (for nodes with HTKLink information that have phone backtraces)
 */
Boolean
Lattice::scorePronunciations(VocabMultiMap &dictionary, Boolean intlogs)
{
    if (debug(DebugPrintFunctionality)) {
        dout() << "Lattice::scorePronunciations: starting\n";
    }

    Vocab &phoneVocab = dictionary.vocab2;

    /*
     * Go through all HTKLink structures, extract the phone sequences,
     * and look up their probabilities in the dictionary
     */
    for (unsigned i = 0; i < htkinfos.size(); i ++) {
        HTKLink *info = htkinfos[i];

        /*
         * only rescore words that have pronunciations
         * (e.g., don't include NULL nodes)
         */
        if (info->div != 0) {
            assert(info->word != Vocab_None);

            /*
             * parse the phone sequence from the string
             * example:
             * d=:#[s]t,0.12:s[t]r,0.03:t[r]ay,0.05:r[ay]k,0.09:ay[k]#,0.09:
             * and convert into an index string
             */
            char phoneString[strlen(info->div) + 1];
            strcpy(phoneString, info->div);

            Array<VocabIndex> phones;
            unsigned numPhones = 0;

            for (char *s = strtok(phoneString, phoneSeparator);
                 s != 0;
                 s = strtok(NULL, phoneSeparator))
            {
                // skip empty components (at beginning and end)
                if (s[0] == '\0') continue;

                // strip duration part
                char *e = strchr(s, ',');
                if (e != 0) *e = '\0';

                // strip context from triphone labels
                e = strchr(s, '[');
                if (e != 0) s = e + 1;
                e = strrchr(s, ']');
                if (e != 0) *e = '\0';

                phones[numPhones ++] = phoneVocab.addWord(s);
            }
            phones[numPhones] = Vocab_None;

            // find pronunciation prob
            Prob p = dictionary.get(info->word, phones.data());

            if (p == 0.0) {
                // missing pronunciation get score 0
                info->pron = LogP_One;
            } else {
                if (intlogs) {
                    info->pron = IntlogToLogP(p);
                } else {
                    info->pron += ProbToLogP(p);
                }
            }
        }
    }

    return true;
}

From stolcke at speech.sri.com Tue Mar 2 16:15:46 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 02 Mar 2004 16:15:46 PST
Subject: SRILM 1.4
In-Reply-To: Your message of Tue, 02 Mar 2004 15:36:05 -0800.
Message-ID: <200403030015.QAA10968@tonga>

ngram -prune-lowprobs does a -renorm implicitly AFTER eliminating pruned
N-gram probabilities. However, if you do specify both options, the
renormalization is done FIRST, then the pruning.

What this could mean is that your original model is not properly normalized
(so the -renorm operation changes the backoff weights before pruning).
Even if the model is normalized (as it should be if produced by SRILM) you
might see small differences due to rounding or loss of precision when
writing/reading the log probabilities, or other numerical inaccuracies.
Note that even small differences in values might affect the pruning
decisions in some cases, so you probably will end up with slightly
different sets of N-grams. Again, the differences would be small and the
resulting models should perform equivalently in practice.

As a sanity check, compute perplexity of the two models. They should be
essentially identical.

--Andreas

In message you wrote:
> Hello,
>
> If I want to export a LM to an FSM, such as the AT&T FSM library, then I
> need to do -prune-lowprobs... but what about -renorm? I notice that if I
> do/don't add this flag on the command line... it makes a different LM...
> but I'm not sure which one is right. I was assuming I needed both
> -prune-lowprobs and -renorm, but the LM looks a little funny... so now I'm
> not sure.
>
> Thanks,
> Chris
>

From thomae at ei.tum.de Thu Mar 4 06:48:04 2004
From: thomae at ei.tum.de (Matthias Thomae)
Date: Thu, 04 Mar 2004 15:48:04 +0100
Subject: make-ngram-pfsg: bad results with new gawk version
Message-ID: <404741A4.2090104@ei.tum.de>

Hello Andreas,

make-ngram-pfsg gives me different results with different versions of
gawk. The header and the links are the same, but the weights differ
substantially.

I see the old behaviour with gawk 3.1.0 (on debian) and 3.1.1 (on suse),
and the differing one with 3.1.3-1 and 3.1.3-2 (on debian). The newly
created PFSGs cause some ASR error degradation...

Any clues?

Regards.
Matthias

From thomae at ei.tum.de Thu Mar 4 08:13:13 2004
From: thomae at ei.tum.de (Matthias Thomae)
Date: Thu, 04 Mar 2004 17:13:13 +0100
Subject: make-ngram-pfsg: bad results with new gawk version
In-Reply-To: <404741A4.2090104@ei.tum.de>
References: <404741A4.2090104@ei.tum.de>
Message-ID: <40475599.9070700@ei.tum.de>

Hello again,

forgot to say that I tested this with srilm 1.3.3 and 1.3.1.

Matthias

Matthias Thomae wrote:
> Hello Andreas,
>
> make-ngram-pfsg gives me different results with different versions of
> gawk. The header and the links are the same, but the weights differ
> substantially.
>
> I see the old behaviour with gawk 3.1.0 (on debian) and 3.1.1 (on suse),
> and the differing one with 3.1.3-1 and 3.1.3-2 (on debian). The newly
> created PFSGs cause some ASR error degradation...
>
> Any clues?
>
> Regards.
> Matthias

From stolcke at speech.sri.com Thu Mar 4 12:53:00 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 04 Mar 2004 12:53:00 PST
Subject: SRILM 1.4
In-Reply-To: Your message of Thu, 04 Mar 2004 08:07:01 -0800.
Message-ID: <200403042053.MAA09988@huge>

In message you wrote:
> Ok... that seemed to be fine... they did perform similarly. I just wanted
> to make sure everything was ok.
>
> If I wanted to change the backoff order of the LM... is there an easy way
> to do this...? I looked into the NgramLM.cc file... and it seems kind of
> tricky... because I need to know how the trie is used...
>
> ... is there some other code that I should be looking in?
>
> In particular...
> if the ngram is: p(a|b,c,d) I would prefer the backoff to be:
>
>   p(a|b,c,d) => p(a|b,c)bo(b,c,d)   // This is normal
>              => p(a|c)bo(b,c)       // BO normal, p context is not...
>              => p(a)bo(c)           // This is normal...
>
> Or, even better would be:
>
>   p(a|b,c,d) => p(a|b,c)bo(b,c,d)                // This is normal
>              => p(a|b,c)bo(b,c) + p(a|c)bo(b,c)  // ... is something like this possible?
>              => p(a)bo(c)                        // This is normal...
>
> I was also thinking that maybe I could write a script to output a counts
> file given the text file that would somehow "trick" the LM to generate the
> backoff order I'm interested in... is that an option?

This would be one solution. Use ngram-count -read and then ngram -counts.
Just reorder the words in the N-grams to reflect the backoff order you want.

Note that the factored LM stuff in the latest version (courtesy of Jeff
Bilmes) gives you complete flexibility in specifying the backoff order
(and many other things, such as parallel backoff paths and their
combination). Look in $SRILM/flm/doc for details.

--Andreas

From stolcke at speech.sri.com Thu Mar 4 13:35:58 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 04 Mar 2004 13:35:58 PST
Subject: SRILM 1.4
In-Reply-To: Your message of Thu, 04 Mar 2004 12:57:51 -0800.
Message-ID: <200403042135.NAA07730@cuatro>

> > This would be one solution. Use ngram-count -read
> > and then ngram -counts. Just reorder the words in the N-grams to
> > reflect the backoff order you want.
>
> So how exactly would I reorder them supposing I wanted to do the backoff
> as I explained earlier? Can you just give a concrete example of
> reordering them...?

This works only if each backoff level drops exactly one of the history
elements. So if you want to back off

    p(a|b,c,d) -> p(a|b,c) -> p(a|c)

you are dropping history words in the order 3 (farthest), then 1 (nearest),
then 2.
To achieve this, extract N-grams (d c b a) from your data and prepare a
count file with

    d b c a

For training (ngram-count) you also need to generate the lower-order
counts, i.e.,

    b c a
    c a
    a

For testing (ngram -counts) you only need the highest-order counts
(except at the start of sentence, where the length of the N-grams is
limited by the <s> tag).

--Andreas

From john at newington.f9.co.uk Fri Mar 5 03:53:30 2004
From: john at newington.f9.co.uk (john at newington.f9.co.uk)
Date: 5 Mar 2004 11:53:30 -0000
Subject: decode lattice
Message-ID: <20040305115330.newington+john@force9>

Andreas,

I changed our lattice files so that the words were not enclosed in double
quotes. This fixed the initial problem and enabled me to get an output from
lattice-tool. However, I then realised that I needed to scale the output
from my classifier by subtracting the log prior probabilities for each
class before building the lattice. Now, when I try the rescaling and
decoding using lattice-tool it predicts the same (low frequency) label for
almost every token.

Am I wrong to scale my 'acoustic' probabilities before building the
lattice? Does lattice-tool do this for me when I call:

./lattice-tool -read-htk -in-lattice lattice.slf -write-htk \
    -out-lattice lattice.out -lm DAgrammar -no-nulls

Hope you can shed some light on this.

Regards,
John Ferguson

> > Dear Andreas,
> >
> > I am replying on behalf of my colleague who emailed you earlier regarding
> > correct use of the SRILM lattice-tool.
> >
> > Based on your previous advice I have tried to decode our lattice using our
> > bigram model. All files seem to be in the correct format, so far as I can
> > tell. However, when lattice-tool rescores the lattice, all the newly added
> > LM probabilities "l=..." come out as "-inf". I tried 1-best decoding using
> > viterbi on the rescored lattice and the output is simply:
> >
> > lattice.out
> >
> > where lattice.out is the utterance name inserted by lattice-tool.
> >
> > Do you have any idea why we're experiencing behaviour like this? Can you
> > suggest any alterations?
>
> John,
>
> the problem is that your lattices use double-quotes around the word strings,
> but the released version of SRILM doesn't yet implement the HTK quoting
> mechanism (an oversight on my part).
>
> You can replace the file lattice/src/HTKLattice.cc with the attached version
> and rebuild lattice-tool to make it work. Or, you can just strip the
> double quotes in your lattice files and keep using the old software.
>

From thomae at ei.tum.de Fri Mar 5 05:26:00 2004
From: thomae at ei.tum.de (Matthias Thomae)
Date: Fri, 05 Mar 2004 14:26:00 +0100
Subject: make-ngram-pfsg: bad results with new gawk version
In-Reply-To: <200403042048.MAA09526@huge>
References: <200403042048.MAA09526@huge>
Message-ID: <40487FE8.3020708@ei.tum.de>

Hi Andreas,

Andreas Stolcke wrote:
> This is quite odd.

I think so, too :)

> make-ngram-pfsg doesn't perform much arithmetic on the log probabilities
> in the LM. It only scales and rounds them.
>
> Can you apply the scale_log() function in make-ngram-pfsg to your LM
> probabilities and backoff weights, and extract the cases where the output
> differs?

old awk:
add_trans BO -> -0.314718
scale_log(prob) = -7247
add_trans -> BO -2.596963
scale_log(prob) = -59800

new awk:
logscale = 23027
add_trans BO -> -0.314718
scale_log(prob) = 0
add_trans -> BO -2.596963
scale_log(prob) = -46054

Note that I printed the logscale which seems to be correct.
...
I think I found the problem:

The float log-probs (x) seem to be converted to integers when
multiplying them with the logscale:

function scale_log(x) {
    return rint(x * logscale);
}

This seems to be related to the locale settings
http://mail.gnu.org/archive/html/bug-gnu-utils/2002-07/msg00196.html

If I set LC_ALL="C" in my shell, it also works as expected. So the bad
behaviour seems to occur with gawk 3.1.3 AND LC_ALL=""...

Regards.
Matthias

> --Andreas
>
> In message <40475599.9070700 at ei.tum.de>you wrote:
>
>>Hello again,
>>
>>forgot to say that I tested this with srilm 1.3.3 and 1.3.1.
>>
>>Matthias
>>
>>Matthias Thomae wrote:
>>
>>>Hello Andreas,
>>>
>>>make-ngram-pfsg gives me different results with different versions of
>>>gawk. The header and the links are the same, but the weights differ
>>>substantially.
>>>
>>>I see the old behaviour with gawk 3.1.0 (on debian) and 3.1.1 (on suse),
>>>and the differing one with 3.1.3-1 and 3.1.3-2 (on debian). The newly
>>>created PFSGs cause some ASR error degradation...
>>>
>>>Any clues?
>>>
>>>Regards.
>>>Matthias
>>
>
>

From stolcke at speech.sri.com Fri Mar 5 07:52:32 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 05 Mar 2004 07:52:32 PST
Subject: make-ngram-pfsg: bad results with new gawk version
In-Reply-To: Your message of Fri, 05 Mar 2004 14:26:00 +0100. <40487FE8.3020708@ei.tum.de>
Message-ID: <200403051552.HAA01764@huge>

Thanks for tracking this down. I'll add a note somewhere that one had
better set LC_NUMERIC=C or LC_ALL=C for gawk scripts to do proper
arithmetic.

--Andreas

In message <40487FE8.3020708 at ei.tum.de>you wrote:
> Hi Andreas,
>
> Andreas Stolcke wrote:
> > This is quite odd.
>
> I think so, too :)
>
> > make-ngram-pfsg doesn't perform much arithmetic on the log probabilities
> > in the LM. It only scales and rounds them.
> >
> > Can you apply the scale_log() function in make-ngram-pfsg to your LM
> > probabilities and backoff weights, and extract the cases where the output
> > differs?
>
> old awk:
> add_trans BO -> -0.314718
> scale_log(prob) = -7247
> add_trans -> BO -2.596963
> scale_log(prob) = -59800
>
> new awk:
> logscale = 23027
> add_trans BO -> -0.314718
> scale_log(prob) = 0
> add_trans -> BO -2.596963
> scale_log(prob) = -46054
>
> Note that I printed the logscale which seems to be correct.
> ...
> I think I found the problem:
>
> The float log-probs (x) seem to be converted to integers when
> multiplying them with the logscale:
>
> function scale_log(x) {
>     return rint(x * logscale);
> }
>
> This seems to be related to the locale settings
> http://mail.gnu.org/archive/html/bug-gnu-utils/2002-07/msg00196.html
>
> If I set LC_ALL="C" in my shell, it also works as expected. So the bad
> behaviour seems to occur with gawk 3.1.3 AND LC_ALL=""...
>
>
> Regards.
> Matthias
>
>
> > --Andreas
> >
> > In message <40475599.9070700 at ei.tum.de>you wrote:
> >
> >>Hello again,
> >>
> >>forgot to say that I tested this with srilm 1.3.3 and 1.3.1.
> >>
> >>Matthias
> >>
> >>Matthias Thomae wrote:
> >>
> >>>Hello Andreas,
> >>>
> >>>make-ngram-pfsg gives me different results with different versions of
> >>>gawk. The header and the links are the same, but the weights differ
> >>>substantially.
> >>>
> >>>I see the old behaviour with gawk 3.1.0 (on debian) and 3.1.1 (on suse),
> >>>and the differing one with 3.1.3-1 and 3.1.3-2 (on debian). The newly
> >>>created PFSGs cause some ASR error degradation...
> >>>
> >>>Any clues?
> >>>
> >>>Regards.
> >>>Matthias
> >>
> >
> >
>

From stolcke at speech.sri.com Fri Mar 5 08:24:38 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 05 Mar 2004 08:24:38 PST
Subject: decode lattice
In-Reply-To: Your message of 05 Mar 2004 11:53:30 +0000. <20040305115330.newington+john@force9>
Message-ID: <200403051624.IAA04571@huge>

In message <20040305115330.newington+john at force9> you wrote:
>
> Andreas,
>
> I changed our lattice files so that the words were not enclosed in double
> quotes. This fixed the initial problem and enabled me to get an output from
> lattice-tool. However, I then realised that I needed to scale the output
> from my classifier by subtracting the log prior probabilities for each
> class before building the lattice.
Now, when I try the rescaling and decoding using lattice-tool > it predicts the same (low frequency) label for almost every token. I'm a little confused by your description. I gather you have a classifier that operates on word hypotheses and outputs posterior probabilities, which you scale by the priors to obtain pseudo-likelihoods, giving you your acoustic scores. That part sounds reasonable (correct me if I got it wrong). Does the unigram LM you are using encode the priors? What do you mean by "token" in your last sentence? > > Am I wrong to scale my 'acoustic' probabilities before building the lattice? > Does lattice-tool do this for me when I call: > > ./lattice-tool -read-htk -in-lattice lattice.slf -write-htk \ -out-lattice lattice.out > -lm DAgrammar -no-nulls lattice-tool only performs global scaling of the scores in the lattice. By default the scores are interpreted as being natural logs (base e). If you add a header field base=B then scores are taken to be logs base B. So, if your acoustic scores are not natural logs you should either convert them, or insert the "base=" spec in the lattice header. (You can also use straight probabilities as scores by setting base=0.) The default log scale for output lattices is 10 (so that LM scores can be more easily inspected and compared to LM files), so the header of an output lattice will contain "base=10" regardless of the input. However, you can choose a different base with the "-htk-logbase" option. That won't change your result, though, because when the lattice is read back in everything is converted to log base 10 internally. The important thing is that the acoustic scores have the right base in the original lattice so that the LM scores generated by rescoring are compatible. When you decode from the lattice (lattice-tool -viterbi-decode) you can choose to scale the acoustic and LM scores differently to give different weights to these knowledge sources. 
This is controlled by the options -htk-acscale and -htk-lmscale, so you might want to play with those. --Andreas From thomae at ei.tum.de Mon Mar 8 02:46:23 2004 From: thomae at ei.tum.de (Matthias Thomae) Date: Mon, 08 Mar 2004 11:46:23 +0100 Subject: make-ngram-pfsg: bad results with new gawk version In-Reply-To: <200403051552.HAA01764@huge> References: <200403051552.HAA01764@huge> Message-ID: <404C4EFF.9020507@ei.tum.de> Andreas Stolcke wrote: > Thanks for tracking this down. You're welcome. > I'll add a note somewhere that one better > set LC_NUMERIC=C or LC_ALL=C for gawk scripts to do proper arithmetic. Good. Maybe you would even want to set LC_NUMERIC temporarily (from a wrapper script?) or print a warning if it is not set to "C". Regards. Matthias From stolcke at speech.sri.com Fri Mar 12 09:03:09 2004 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Fri, 12 Mar 2004 09:03:09 PST Subject: question about SRILM In-Reply-To: Your message of Fri, 12 Mar 2004 16:43:36 +0100. <4051DAA8.5080700@irisa.fr> Message-ID: <200403121703.JAA04805@huge> In message <4051DAA8.5080700 at irisa.fr>you wrote: > Hi. > I have one question about SRILM. I don't understand how the > log-probability of a unigram is computed. > Isn't it log[P(w)] = log[c(w)] - log[|V|], where c(w) is the frequency > of the word w in the training set and |V| the size of the vocabulary ? > And, if this formula is used, are the tokens <s> and </s> considered to > be part of the vocabulary or not (i.e. are they counted in |V| ?) ? > > Thank you for answering. > Solen Quiniou. > The formula for unigram probabilities (modulo smoothing) is log[P(w)] = log[c(w)] - log[N] where N is the number of word TOKENS in the training corpus (not the vocabulary). End-of-sentence tags are included in the count, since they are among the events that are predicted by the LM, but Beginning-of-sentence is not. You will notice that the log probability of <s> is set to -99 (a stand-in for minus infinity). --Andreas PS. 
Please send your questions to "srilm-user at speech.sri.com" in the future. From solen.quiniou at irisa.fr Mon Mar 15 00:36:49 2004 From: solen.quiniou at irisa.fr (Solen Quiniou) Date: Mon, 15 Mar 2004 09:36:49 +0100 Subject: singleton counts warning Message-ID: <40556B21.8080706@irisa.fr> Hi ! I use SRILM to build a language model on letters. I have a warning that I don't understand : "warning: no singleton counts GT discounting disabled" So, the model computed is wrong since some back-off weight are positives (in log-probability) ! Do you know what does this warning mean ? I thought no counts on single letters were computed but they were so I can't find an explanation ! I've got another question, about the computation of unigram log-probability. When I used the formula : log[P(w)] = log[c(w)] - log[N], where N is the number of word TOKENS in the training corpus, I don't find exactly the value given by SRILM. Is there smoothing on unigram ? And if so, how is it made ? Thank you for answering. Solen. From stolcke at speech.sri.com Mon Mar 15 16:24:53 2004 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Mon, 15 Mar 2004 16:24:53 PST Subject: singleton counts warning In-Reply-To: Your message of Mon, 15 Mar 2004 09:36:49 +0100. <40556B21.8080706@irisa.fr> Message-ID: <200403160024.QAA22852@huge> In message <40556B21.8080706 at irisa.fr>you wrote: > Hi ! > I use SRILM to build a language model on letters. I have a warning that > I don't understand : "warning: no singleton counts > GT discounting disabled" > So, the model computed is wrong since some back-off weight are positives > (in log-probability) ! Do you know what does this warning mean ? I > thought no counts on single letters were computed but they were so I > can't find an explanation ! GT (and also KN) discounting need the number of words that appear only once (singletons) in the training corpus. If that number is 0 the discounting formulae for those methods cannot be applied. 
Please try using a different smoothing method, such as Witten-Bell, for your letter LM, at least for the unigrams. > > I've got another question, about the computation of unigram > log-probability. When I used the formula : log[P(w)] = log[c(w)] - > log[N], where N is the number of word TOKENS in the training corpus, I > don't find exactly the value given by SRILM. Is there smoothing on > unigram ? And if so, how is it made ? Of course there is smoothing. I don't have time to elaborate on the different smoothing algorithms implemented in SRILM, but you can either study the code in Discount.cc, or refer to the excellent survey paper by Chen & Goodman (see the SEE ALSO section of the ngram-count(1) man page). --Andreas From solen.quiniou at irisa.fr Thu Mar 18 02:52:02 2004 From: solen.quiniou at irisa.fr (Solen Quiniou) Date: Thu, 18 Mar 2004 11:52:02 +0100 Subject: positive backoff weight Message-ID: <40597F52.4050803@irisa.fr> Thank you for the past answers to my questions. I've got another question. Sometimes, when I use Good-Turing discounting, some of the backoff weights of the unigrams (I compute a bigram model) are positive log-probabilities. How is that possible? Is it because Good-Turing discounting is disabled on unigrams since there are no unigrams whose frequency is 1? And, more generally, how are backoff weights for unigrams computed in the case of a bigram model? Thanks a lot for your answers. Solen. From solen.quiniou at irisa.fr Thu Mar 25 09:21:49 2004 From: solen.quiniou at irisa.fr (Solen Quiniou) Date: Thu, 25 Mar 2004 18:21:49 +0100 Subject: pfsg-format Message-ID: <4063152D.3060201@irisa.fr> Hi ! I've got one question about the pfsg format : is the transition cost, between 2 states, considered to be 10000.5 times the log-probability of the bigram corresponding to the 2 states ? 
Because, when I use a language model made from an ARPA file (by using the NgramLM class) to compute the probability of a word (my language model is based on letters) and when I use a language model made from a PFSG file (I convert the ARPA thanks to the make-ngram-pfsg script and then by using the LatticeLM class), I don't have the same log-probability from both representations. Why is there a difference ? Since I convert the ARPA file into a PFSG file, it should be the same. Thanks for answering. Solen. From stolcke at speech.sri.com Thu Mar 25 09:33:16 2004 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Thu, 25 Mar 2004 09:33:16 PST Subject: pfsg-format In-Reply-To: Your message of Thu, 25 Mar 2004 18:21:49 +0100. <4063152D.3060201@irisa.fr> Message-ID: <200403251733.JAA19358@huge> In message <4063152D.3060201 at irisa.fr>you wrote: > Hi ! > I've got one question about the pfsg format : is the transition cost, > between 2 states, considered to be 10000.5 times the log-probability of > the bigram corresponding to the 2 states ? correct. > Because, when I use a language model made from an ARPA file (by using > the NgramLM class) to compute the probability of a word (my language > model is based on letters) and when I use a language model made from a > PFSG file (I convert the ARPA thanks to the make-ngram-pfsg script and > then by using the LatticeLM class), I don't have the same > log-probability from both representations. Why is there a difference ? > Since I convert the ARPA file into a PFSG file, it should be the same. How big are the differences? there will be some discrepancy due to rounding the scaled log probabilities to an integer, but it should be a small error. --Andreas From stolcke at speech.sri.com Thu Mar 25 09:58:57 2004 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Thu, 25 Mar 2004 09:58:57 PST Subject: pfsg-format In-Reply-To: Your message of Thu, 25 Mar 2004 09:52:30 -0800. 
Message-ID: <200403251758.JAA21661@huge> Ciprian raises a good point. Before comparing results you should process the LM with ngram -prune-lowprobs. (Otherwise the PFSG may not be an accurate representation of the LM.) --Andreas In message you wrote: > Hi Andreas, > > I am following these threads since they sometimes contain useful > information. > > > > Because, when I use a language model made from an ARPA file (by using > > > the NgramLM class) to compute the probability of a word (my language > > > model is based on letters) and when I use a language model made from a > > > PFSG file (I convert the ARPA thanks to the make-ngram-pfsg script and > > > then by using the LatticeLM class), I don't have the same > > > log-probability from both representations. Why is there a difference ? > > > Since I convert the ARPA file into a PFSG file, it should be the same. > > > > How big are the differences? there will be some discrepancy due to > > rounding the scaled log probabilities to an integer, but it should > > be a small error. > > [Ciprian] I assume PFSG is Probabilistic Finite State Grammar. I do not > know how exactly the conversion is done in the SRILM toolkit, but the > difference could also come from the standard hack used in representing > ARPA back-off models in FSM format --- having a common back-off state > that forgets what higher order n-gram state we arrived there from. Am I > wrong? > > -Ciprian > From barhaim at cs.technion.ac.il Tue Mar 30 07:49:43 2004 From: barhaim at cs.technion.ac.il (Roy Bar Haim) Date: Tue, 30 Mar 2004 17:49:43 +0200 Subject: Disambig n-best scores Message-ID: <009501c4166e$a0b50cd0$34284484@cs.technion.ac.il> Hi, How is the path score in disambig with the n-best option calculated? For example, suppose that I have the sentence: W1 W2 Which is tagged with T1 T2 Then I calculated the path probability as follows: Log10 [ P(T1|<s>)*P(T2|T1)*P(</s>|T2)*P(W1|T1)*P(W2|T2) ] I got it "almost right". 
I checked for two paths: For one I got -20.549 (while disambig returned -120.549) For the other I got -20.837 (while disambig returned -120.837) What is the reason for this difference? Should I always ignore the "1" after the "-"? Thanks, Roy. From stolcke at speech.sri.com Tue Mar 30 15:58:02 2004 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Tue, 30 Mar 2004 15:58:02 PST Subject: Disambig n-best scores In-Reply-To: Your message of Tue, 30 Mar 2004 17:49:43 +0200. <009501c4166e$a0b50cd0$34284484@cs.technion.ac.il> Message-ID: <200403302358.i2UNw3Z02903@conga.speech.sri.com> In message <009501c4166e$a0b50cd0$34284484 at cs.technion.ac.il>you wrote: > Hi, > > How is path score in disambig with n-best option calculated? > > For example, suppose that I have the sentence: > > W1 W2 > Which is tagged with T1 T2 > > Then I calculated the path probability as follows: > > Log10 [ P(T1|<s>)*P(T2|T1)*P(</s>|T2)*P(W1|T1)*P(W2|T2) ] > > I got it "almost right" . I checked for two paths: > For one I got -20.549 (while disambig returned -120.549) > For the other I got -20.837 (while disambig returned -120.837) > > What is the reason for this difference? Should I always ignore the "1" > after the "-"? The -100 comes from an OOV word. When the LM returns a probability of 0 AND the word is not in the LM, it is considered an OOV. To allow the probability computation to go on, a large negative, but finite, log probability of -100 is substituted (cf. the constant LogP_PseudoZero in disambig.cc). --Andreas
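
The arithmetic behind the extra -100 can be illustrated with a small sketch. This is not the disambig code: the function name path_score and the factor values are hypothetical, chosen only so that the known factors sum to the -20.549 from the question. The one thing taken from the reply above is the substitution of -100 for a zero probability.

```python
import math

# Assumption: stands in for the LogP_PseudoZero constant mentioned in the
# reply; the value -100 is the one quoted there.
LOGP_PSEUDO_ZERO = -100.0

def path_score(probs):
    """Sum the log10 probabilities of all factors along one tagging path.

    A factor with probability 0 (e.g. an OOV word) contributes the finite
    pseudo-zero instead of minus infinity.
    """
    total = 0.0
    for p in probs:
        total += math.log10(p) if p > 0.0 else LOGP_PSEUDO_ZERO
    return total

# Hypothetical factor values whose log10 values sum to -20.549.
known = [10 ** -3.2, 10 ** -4.1, 10 ** -2.749, 10 ** -10.5]

without_oov = path_score(known)          # hand computation that skips the OOV
with_oov = path_score(known + [0.0])     # same path with one zero-probability factor

print(round(without_oov, 3), round(with_oov, 3))
```

Under these assumptions the two scores differ by exactly 100, which matches the pattern in the question (-20.549 versus -120.549): the leading "1" is one OOV factor, not a formatting quirk, and a path with two such factors would be shifted by 200.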