From stolcke at speech.sri.com  Thu Jan  1 11:05:48 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 01 Jan 2004 11:05:48 PST
Subject: Implementing Baum-Welch (Forward-Backward) algorithm in SRILM 
In-Reply-To: Your message of Wed, 31 Dec 2003 12:48:31 +0200.
             <005801c3cf8b$a20869d0$34284484@cs.technion.ac.il> 
Message-ID: <200401011905.LAA25886@huge>


In message <005801c3cf8b$a20869d0$34284484 at cs.technion.ac.il>you wrote:
> Hi,
> 
> I'm using disambig for part-of-speech tagging. I create a language model
> over sequences of tags with ngram-count, and provide P(word|tag) in the
> map file. 
> 
> What I would like to do is to start with this model, based on a tagged
> corpus, and improve it using the Baum-Welch (forward-backward) algorithm,
> with an untagged corpus. After each iteration I should get a new language
> model for the tags and a new map file. After each iteration I would
> like to test the model on some held-out data, so I know when to stop.
> 
> How can I implement that in SRILM?

You need to write some scripts to manipulate intermediate data, but you
can pretty much do what you want.
To implement EM for your tagger you have two steps:

1. E-Step:   get expected counts for the tag n-gram and the word/tag mapping.

   a. Tag n-gram expectations.   This step is unfortunately not well supported 
	by the tools right now.  Although disambig uses the FB algorithm
	it doesn't collect (let alone output) the expected counts in a way
	that's suitable for reestimating a model from them.  You can use 
	two approximations.  First, you could use the 1-best tag sequence
	as a stand-in for the real thing and generate tag N-gram counts from
	it (that's sometimes called the "Viterbi" approximation of EM).
	Second, you can use the -nbest option to generate the top N most
	likely taggings of each sentence along with their score.  You then
	have to normalize the scores to obtain posterior probabilities for
	the tag sequences and weight the tag N-gram counts by these 
	posteriors and total them over your entire training corpus.

   b. Word/tag expectations.  Here again, you could use the Viterbi
	approximation, simply pairing up the words and their most likely
	tags (as output by disambig).  However, the most recent version of
	disambig actually has an option to collect and output the 
	expected word/tag bigram counts.  I have appended a patch that 
	should allow you to do this with the 1.3.3 version of disambig.
	The option that this adds is 

       -write-counts file
              Outputs the V2-V1 bigram counts corresponding to the
              tagging performed on the input data.  If -fb was
              specified these are expected counts; otherwise they
              reflect the 1-best tagging decisions.

2. M-step:   reestimate the tag N-gram LM and the word/tag mapping probabilities

   a.  Once you have the tag N-gram counts (obtained by one of the methods
	suggested above) you just need to run ngram-count on the count file
	(using -read) to get a new model.  Use -float-counts and a suitable
	discounting method if you are using fractional counts.  

   b.  Again, just use ngram-count to estimate a word/tag bigram model from the
	expected counts.  You then have to post-process the LM file to
	extract the word/tag probabilities and format them into a map file
	usable by disambig.
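
The N-best approximation from step 1a can be sketched as follows. This is only an illustration: it assumes the N-best scores are log10 probabilities and that the taggings for a sentence have already been parsed into (tag sequence, score) pairs; the function names are hypothetical and not part of SRILM.

```python
import math
from collections import defaultdict

def nbest_posteriors(log10_scores):
    """Normalize log10 scores of an N-best list into posteriors that sum to 1."""
    top = max(log10_scores)
    # Subtract the best score before exponentiating to avoid underflow.
    weights = [10.0 ** (s - top) for s in log10_scores]
    total = sum(weights)
    return [w / total for w in weights]

def accumulate_ngram_counts(counts, nbest, order=3):
    """Add posterior-weighted tag N-gram counts for one sentence.

    nbest is a list of (tag_sequence, log10_score) pairs (assumed input shape).
    """
    posteriors = nbest_posteriors([score for _, score in nbest])
    for (tags, _), post in zip(nbest, posteriors):
        padded = ["<s>"] + list(tags) + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(padded) - n + 1):
                counts[tuple(padded[i:i + n])] += post

# Toy example: two candidate taggings of a two-word sentence.
counts = defaultdict(float)
accumulate_ngram_counts(
    counts, [(["DT", "NN"], -2.0), (["DT", "VB"], -3.0)], order=2)
for ngram, c in sorted(counts.items()):
    print("%s\t%.4f" % (" ".join(ngram), c))
```

Calling accumulate_ngram_counts once per sentence totals the fractional counts over the corpus; the printed lines (N-gram, tab, count) are meant as input for the M-step in 2a.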

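The post-processing in step 2b could look roughly like this. It is a sketch under stated assumptions: the bigrams in the LM are ordered "tag word", the LM is in ARPA text format with log10 probabilities, and each map file line lists a word followed by its tags and probabilities. The helper names are made up for illustration.

```python
from collections import defaultdict

def arpa_bigrams_to_map(arpa_lines):
    """Collect P(word | tag) from the 2-gram section of an ARPA file.

    Assumes each bigram is ordered "tag word".
    Returns {word: [(tag, prob), ...]}.
    """
    word_tags = defaultdict(list)
    in_bigrams = False
    for line in arpa_lines:
        line = line.strip()
        if line.startswith("\\"):
            # Section markers: \data\, \1-grams:, \2-grams:, \end\
            in_bigrams = (line == "\\2-grams:")
            continue
        if not in_bigrams or not line:
            continue
        fields = line.split()
        logprob, tag, word = float(fields[0]), fields[1], fields[2]
        if tag in ("<s>", "</s>") or word in ("<s>", "</s>"):
            continue  # sentence boundaries are not word/tag pairs
        word_tags[word].append((tag, 10.0 ** logprob))
    return word_tags

def format_map(word_tags):
    """Format one line per word: 'word tag1 p1 tag2 p2 ...'."""
    lines = []
    for word in sorted(word_tags):
        pairs = " ".join("%s %.6g" % (tag, prob)
                         for tag, prob in word_tags[word])
        lines.append("%s %s" % (word, pairs))
    return lines

example = ["\\data\\", "ngram 1=3", "ngram 2=3", "",
           "\\1-grams:", "-0.5\tDT", "",
           "\\2-grams:", "-0.30103\tDT the", "-1.0\tNN dog",
           "-0.1\tVB dog", "", "\\end\\"]
for map_line in format_map(arpa_bigrams_to_map(example)):
    print(map_line)
```

Only explicitly listed bigrams are extracted here; word/tag pairs that the LM reaches through backoff are ignored, which is usually what you want for a map file.
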
Hope this helps.

Happy New Year,

Andreas 

*** /tmp/T00BSlQ1	Wed Dec 31 13:30:23 2003
--- /tmp/T10e6tMs	Wed Dec 31 13:30:23 2003
***************
*** 38,46 ****
--- 38,48 ----
  static char *vocab1File = 0;
  static char *vocab2File = 0;
  static char *mapFile = 0;
+ static char *classesFile = 0;
  static char *mapWriteFile = 0;
  static char *textFile = 0;
  static char *textMapFile = 0;
+ static char *countsFile = 0;
  static int keepUnk = 0;
  static int tolower1 = 0;
  static int tolower2 = 0;
***************
*** 63,70 ****
--- 65,74 ----
      { OPT_STRING, "write-vocab1", &vocab1File, "output observable vocabulary" },
      { OPT_STRING, "write-vocab2", &vocab2File, "output hidden vocabulary" },
      { OPT_STRING, "map", &mapFile, "mapping from observable to hidden tokens" },
+     { OPT_STRING, "classes", &classesFile, "mapping in class expansion format" },
      { OPT_TRUE, "logmap", &logMap, "map file contains log probabilities" },
      { OPT_STRING, "write-map", &mapWriteFile, "output map file (for validation)" },
+     { OPT_STRING, "write-counts", &countsFile, "output substitution counts" },
      { OPT_TRUE, "scale", &scale, "scale map probabilities by unigram probs" },
      { OPT_TRUE, "keep-unk", &keepUnk, "preserve unknown words" },
      { OPT_TRUE, "tolower1", &tolower1, "map observable vocabulary to lowercase" },
***************
*** 88,94 ****
   */
  unsigned
  disambiguateSentence(Vocab &vocab, VocabIndex *wids, VocabIndex *hiddenWids[],
! 		     LogP totalProb[], VocabMap &map, LM &lm,
  		     unsigned numNbest, Boolean positionMapped = false)
  {
      static VocabIndex emptyContext[] = { Vocab_None };
--- 92,98 ----
   */
  unsigned
  disambiguateSentence(Vocab &vocab, VocabIndex *wids, VocabIndex *hiddenWids[],
! 		     LogP totalProb[], VocabMap &map, LM &lm, VocabMap *counts,
  		     unsigned numNbest, Boolean positionMapped = false)
  {
      static VocabIndex emptyContext[] = { Vocab_None };
***************
*** 236,241 ****
--- 240,256 ----
  	    }
  	    hiddenWids[n][len] = Vocab_None;
  	}
+ 
+ 	/* 
+ 	 * update v1-v2 counts if requested 
+ 	 */
+ 	if (counts) {
+ 	    for (unsigned i = 0; i < len; i++) {
+ 		counts->put(wids[i], hiddenWids[0][i],
+ 			    counts->get(wids[i], hiddenWids[0][i]) + 1);
+ 	    }
+ 	}
+ 
  	return numNbest;
      } else {
  	/*
***************
*** 426,431 ****
--- 441,460 ----
  		}
  		cout << endl;
  	    }
+ 
+ 	    /* 
+ 	     * update v1-v2 counts if requested 
+ 	     */
+ 	    if (counts) {
+ 		symbolIter.init();
+ 		while (symbolProb = symbolIter.next(symbol)) {
+ 		    LogP2 posterior = *symbolProb - totalPosterior;
+ 
+ 		    counts->put(wids[pos], symbol,
+ 			        counts->get(wids[pos], symbol) +
+ 					LogPtoProb(posterior));
+ 		}
+ 	    }
  	}
  
          /*
***************
*** 442,448 ****
   * disambiguate it, and print out the result
   */
  void
! disambiguateFile(File &file, VocabMap &map, LM &lm)
  {
      char *line;
      VocabString sentence[maxWordsPerLine];
--- 471,477 ----
   * disambiguate it, and print out the result
   */
  void
! disambiguateFile(File &file, VocabMap &map, LM &lm, VocabMap *counts)
  {
      char *line;
      VocabString sentence[maxWordsPerLine];
***************
*** 476,482 ****
  	    LogP totalProb[numNbest];
  	    unsigned numHyps =
  			disambiguateSentence(map.vocab1, wids, hiddenWids,
! 						totalProb, map, lm, numNbest);
  	    if (!numHyps) {
  		file.position() << "Disambiguation failed\n";
  	    } else if (totals) {
--- 505,511 ----
  	    LogP totalProb[numNbest];
  	    unsigned numHyps =
  			disambiguateSentence(map.vocab1, wids, hiddenWids,
! 					totalProb, map, lm, counts, numNbest);
  	    if (!numHyps) {
  		file.position() << "Disambiguation failed\n";
  	    } else if (totals) {
***************
*** 521,527 ****
   * disambiguate it, and print out the result
   */
  void
! disambiguateFileContinuous(File &file, VocabMap &map, LM &lm)
  {
      char *line;
      Array<VocabIndex> wids;
--- 550,557 ----
   * disambiguate it, and print out the result
   */
  void
! disambiguateFileContinuous(File &file, VocabMap &map, LM &lm,
! 							VocabMap *counts)
  {
      char *line;
      Array<VocabIndex> wids;
***************
*** 560,566 ****
  
      LogP totalProb[numNbest];
      unsigned numHyps = disambiguateSentence(map.vocab1, &wids[0], hiddenWids,
! 					    totalProb, map, lm, numNbest);
  
      if (!numHyps) {
  	file.position() << "Disambiguation failed\n";
--- 590,596 ----
  
      LogP totalProb[numNbest];
      unsigned numHyps = disambiguateSentence(map.vocab1, &wids[0], hiddenWids,
! 					totalProb, map, lm, counts, numNbest);
  
      if (!numHyps) {
  	file.position() << "Disambiguation failed\n";
***************
*** 593,599 ****
   * disambiguate it, and print out the result
   */
  void
! disambiguateTextMap(File &file, Vocab &vocab, LM &lm)
  {
      char *line;
  
--- 623,629 ----
   * disambiguate it, and print out the result
   */
  void
! disambiguateTextMap(File &file, Vocab &vocab, LM &lm, VocabMap *counts)
  {
      char *line;
  
***************
*** 664,670 ****
  	    LogP totalProb[numNbest];
  	    unsigned numHyps =
  		    disambiguateSentence(vocab, &wids[0], hiddenWids, totalProb,
! 						    map, lm, numNbest, true);
  
  	    if (!numHyps) {
  		file.position() << "Disambiguation failed\n";
--- 694,700 ----
  	    LogP totalProb[numNbest];
  	    unsigned numHyps =
  		    disambiguateSentence(vocab, &wids[0], hiddenWids, totalProb,
! 					    map, lm, counts, numNbest, true);
  
  	    if (!numHyps) {
  		file.position() << "Disambiguation failed\n";
***************
*** 720,725 ****
--- 750,764 ----
  	}
      }
  
+     if (classesFile) {
+ 	File file(classesFile, "r");
+ 
+ 	if (!map.readClasses(file)) {
+ 	    cerr << "format error in classes file\n";
+ 	    exit(1);
+ 	}
+     }
+ 
      if (lmFile) {
  	File file(lmFile, "r");
  
***************
*** 734,746 ****
  	hiddenLM->debugme(debug);
      }
  
      if (textFile) {
  	File file(textFile, "r");
  
  	if (continuous) {
! 	    disambiguateFileContinuous(file, map, *hiddenLM);
  	} else {
! 	    disambiguateFile(file, map, *hiddenLM);
  	}
      }
  
--- 773,797 ----
  	hiddenLM->debugme(debug);
      }
  
+     VocabMap *counts;
+     if (countsFile) {
+ 	counts = new VocabMap(vocab, hiddenVocab);
+ 	assert(counts != 0);
+ 
+ 	counts->remove(vocab.ssIndex, hiddenVocab.ssIndex);
+ 	counts->remove(vocab.seIndex, hiddenVocab.seIndex);
+ 	counts->remove(vocab.unkIndex, hiddenVocab.unkIndex);
+     } else {
+ 	counts = 0;
+     }
+ 
      if (textFile) {
  	File file(textFile, "r");
  
  	if (continuous) {
! 	    disambiguateFileContinuous(file, map, *hiddenLM, counts);
  	} else {
! 	    disambiguateFile(file, map, *hiddenLM, counts);
  	}
      }
  
***************
*** 747,755 ****
      if (textMapFile) {
  	File file(textMapFile, "r");
  
! 	disambiguateTextMap(file, vocab, *hiddenLM);
      }
  
      if (mapWriteFile) {
  	File file(mapWriteFile, "w");
  	map.write(file);
--- 798,812 ----
      if (textMapFile) {
  	File file(textMapFile, "r");
  
! 	disambiguateTextMap(file, vocab, *hiddenLM, counts);
      }
  
+     if (countsFile) {
+ 	File file(countsFile, "w");
+ 
+ 	counts->writeBigrams(file);
+     }
+ 
      if (mapWriteFile) {
  	File file(mapWriteFile, "w");
  	map.write(file);


From vhquan at itc.it  Fri Jan 23 06:36:48 2004
From: vhquan at itc.it (Vu Hai Quan)
Date: Fri, 23 Jan 2004 15:36:48 +0100
Subject: SIRLM for unicode
In-Reply-To: <200312282306.PAA08825@huge>
References: <200312282306.PAA08825@huge>
Message-ID: <40113180.4030109@itc.it>

Dear All,
Is it possible for me to use SRILM for a text corpus that is encoded in 
Unicode format?
Best regards.


From stolcke at speech.sri.com  Fri Jan 23 09:47:31 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 23 Jan 2004 09:47:31 PST
Subject: SIRLM for unicode 
In-Reply-To: Your message of Fri, 23 Jan 2004 15:36:48 +0100.
             <40113180.4030109@itc.it> 
Message-ID: <200401231747.JAA12790@huge>


I'm not familiar with unicode, unfortunately.  However, SRILM does
not "interpret" characters other than for parsing lines of text into 
words.  It assumes that words are separated by spaces.  So if unicode
uses the same encoding of space characters as ASCII then you should be fine.

The case mapping functions (-tolower option) in various tools will
probably not work correctly for multi-byte character sets.
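
The point about space characters can be checked with a small experiment. The sketch below does not use SRILM; it simply mimics a byte-oriented tokenizer that splits on the ASCII space byte (a simplification, and the function name is hypothetical) and compares UTF-8 with UTF-16 input.

```python
def split_on_ascii_space(data):
    """Split raw bytes on the ASCII space byte (0x20), dropping empty tokens."""
    return [tok for tok in data.split(b" ") if tok]

text = "na\u00efve caf\u00e9"  # two words containing non-ASCII characters

# UTF-8 encodes U+0020 as the single byte 0x20, and bytes of multi-byte
# characters are always >= 0x80, so splitting on 0x20 is safe.
utf8_tokens = split_on_ascii_space(text.encode("utf-8"))
print([tok.decode("utf-8") for tok in utf8_tokens])

# UTF-16 (little endian) encodes the space as 0x20 0x00, so the split
# leaves a stray NUL byte at the start of the second token.
utf16_tokens = split_on_ascii_space(text.encode("utf-16-le"))
print(utf16_tokens)
```

In other words, UTF-8 text should survive byte-level word splitting, while UTF-16 text would need to be converted first.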

--Andreas

In message <40113180.4030109 at itc.it>you wrote:
> Dear All,
> Is it possible for me to use SRILM for a text corpus that is encoded in 
> Unicode format?
> Best regards.
> 


From nlp at pobox.sk  Fri Feb  6 07:14:18 2004
From: nlp at pobox.sk (Robert Wagner)
Date: Fri, 6 Feb 2004 16:14:18 +0100
Subject: Class based 3-gram in SRILM
Message-ID: <200402061514.i16FEIg4005091@www4.pobox.sk>

Hi!
 I have the following problem. I've estimated a class-based bigram model
(with some defined words excluded from the clustering process) using
the ngram-class tool. But I want to use a class-based trigram model.
How to get class-based trigram counts and probabilities using SRILM?

 I also want to ask whether anyone knows of a freely available tool for
word clustering using trigram counts. And is it possible to create a
class language model based on POS tags in SRILM?

Thank you for help.

Robert



From stolcke at speech.sri.com  Fri Feb  6 13:08:33 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 06 Feb 2004 13:08:33 PST
Subject: Class based 3-gram in SRILM 
In-Reply-To: Your message of Fri, 06 Feb 2004 16:14:18 +0100.
             <200402061514.i16FEIg4005091@www4.pobox.sk> 
Message-ID: <200402062108.NAA27238@huge>


In message <200402061514.i16FEIg4005091 at www4.pobox.sk>you wrote:
> Hi!
>  I have the following problem. I've estimated a class-based bigram model
> (with some defined words excluded from the clustering process) using
> the ngram-class tool. But I want to use a class-based trigram model.
> How to get class-based trigram counts and probabilities using SRILM?

You use the "replace-words-with-classes" script and apply the class definitions
to your training data.  Then you train a trigram LM in the usual way.
See training-scripts(1).
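
To illustrate what applying the class definitions to the training data amounts to, here is a minimal sketch. It is not the replace-words-with-classes script itself; it assumes a classes file with lines of the form "CLASS [prob] word", single-word expansions, and each word belonging to at most one class. All function names are hypothetical.

```python
def read_class_map(lines):
    """Build a word -> class mapping from 'CLASS [prob] word' lines."""
    word_to_class = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        cls, word = fields[0], fields[-1]
        word_to_class[word] = cls
    return word_to_class

def replace_words(sentence, word_to_class):
    """Replace every word that has a class by its class label."""
    return " ".join(word_to_class.get(w, w) for w in sentence.split())

classes = ["CLASS-1 0.6 monday", "CLASS-1 0.4 tuesday", "CLASS-2 1 flight"]
mapping = read_class_map(classes)
print(replace_words("a flight on monday", mapping))
```

Words without a class (for example the ones excluded from clustering) pass through unchanged, so the resulting text mixes words and class labels and can be used to train a trigram LM.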

> 
>  I also want to ask whether anyone knows of a freely available tool for
> word clustering using trigram counts. And is it possible to create a
> class language model based on POS tags in SRILM?

I don't know of any available implementations of trigram-based word
clustering, and it would be quite expensive (slow) to do.
I believe some work by Philips/Aachen researchers showed that the 
improvement over bigram-induced classes (in a higher-order class-based LM)
is pretty small.  Anyway, using bigram-induced classes is what almost
everybody does these days.

As for POS-based LMs, all you need is a tagger (and there are many out there)
and tag your training data.  Then you use the tagged data to 
train a tag-n-gram model in the usual way.  (You can also estimate the 
class-membership probabilities from the tagging results.)
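
Estimating the class-membership probabilities from tagged text is a matter of relative frequencies. The sketch below assumes tokens of the form word/tag (the tagger's actual output format may differ) and uses a hypothetical function name.

```python
from collections import Counter

def membership_probs(tagged_sentences):
    """Estimate P(word | tag) by relative frequency from 'word/tag' tokens."""
    pair_counts = Counter()
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for token in sentence.split():
            # Split on the last slash so words containing '/' still work.
            word, sep, tag = token.rpartition("/")
            if not sep:
                continue  # token carries no tag; skip it
            pair_counts[(tag, word)] += 1
            tag_counts[tag] += 1
    return {(tag, word): count / tag_counts[tag]
            for (tag, word), count in pair_counts.items()}

probs = membership_probs(["the/DT dog/NN", "a/DT dog/NN barks/VB"])
for (tag, word), p in sorted(probs.items()):
    print(tag, word, p)
```

These are unsmoothed maximum-likelihood estimates, so words never seen with a tag get no probability at all.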

You could use the disambig tool to do the POS tagging itself, but since it 
doesn't deal with morphological and other non-n-gram cues (e.g.,
to handle unknown words) it won't be competitive with state-of-the-art taggers.

--Andreas 


From nlp at pobox.sk  Wed Feb 11 02:46:13 2004
From: nlp at pobox.sk (Robert Wagner)
Date: Wed, 11 Feb 2004 11:46:13 +0100
Subject: Default smoothing in ngram-count
Message-ID: <200402111046.i1BAk4ut006683@www6.pobox.sk>

Hi to all!
 I haven't found anywhere in SRILM's documentation what the default 
smoothing option of ngram-count is. Is it Katz backoff?

I also got the following warning: discount coeff 1 is out of 
range. What does it mean? Is it a bad thing?

I would also like to know whether there is some kind of compatibility 
between SRILM and the CMU language modeling toolkit, i.e. whether it is 
possible to use n-gram counts obtained with CMU in SRILM and vice versa.

And last question (probably stupid;-)): What are reverse n-grams good 
for?

 Thanks
Robert



From wavrow at hotmail.com  Mon Feb 16 23:04:33 2004
From: wavrow at hotmail.com (Shlomo Wavrow)
Date: Tue, 17 Feb 2004 09:04:33 +0200
Subject: New SRILM released - ngram-class -save option
Message-ID: <Law10-F126UTAPhBcJl00042652@hotmail.com>

Hello,
as a new version of SRILM has been released, I would like to add one 
item to the "wishlist". It would be nice to change the -save option of 
ngram-class a bit. Currently you can only make ngram-class save classes 
every S iterations, which causes a plethora of class files to be saved 
to disk. It would be nice to add a "-startsave" option that starts saving 
classes once a user-defined number of classes has been reached. It would 
also be useful to be able to continue an interrupted merging run from a 
previously saved class file.
I hope these remarks will help you.
Regards Shlomo



From desaikey at egr.msu.edu  Wed Feb 18 13:16:58 2004
From: desaikey at egr.msu.edu (desaikey)
Date: Wed, 18 Feb 2004 16:16:58 -0500
Subject: Sentence generation using SRILM
Message-ID: <000001c3f664$8b1e5810$ef8c0923@Keyur>


Hi,

I am trying to generate a set of random sentences using a specified
n-gram language model. The command and related flags I'm using are:

ngram -lm x.lm -gen Xno 

When I use a small-vocabulary (AN4 CMU database) whole-word trigram LM,
the tool is able to generate sentences. But with other LMs (bigram or
unigram) the generated sentences are excessively long or the tool
takes too much time. With spelling-based LMs the tool either fails to
generate sentences or again produces sentences that are too long (even
for a trigram).

Please share any ideas/experience you have about such a problem.

Thanks in advance for your help.

Keyur
-------------------------------------------
KEYUR DESAI 
Graduate Student
Department of Electrical and Computer Eng.
Michigan State University
Email:desaikey at egr.msu.edu
Phone:(517)664-1802


From tanel.alumae at aqris.com  Thu Feb 19 07:13:56 2004
From: tanel.alumae at aqris.com (Tanel Alumäe)
Date: Thu, 19 Feb 2004 17:13:56 +0200
Subject: Class expansion
Message-ID: <1077203635.13538.38.camel@NOOL2>

Hello,

I'm trying to convert a class bigram to its equivalent word n-gram,
using the "ngram" tool with the -expand-classes option. The class model
has 1000 classes, and there are 60000 words. I use the following command
line:

ngram -lm <classmodel> -classes <classesfile> -expand-classes 2
-write-lm <outputmodel>

The process runs about 15 minutes using over 700M of RAM, and then gets
killed by the OS (I'm using Linux), probably when it asked for even more
memory than the OS had available (I have 512M of main memory).

Is it normal that the class expansion takes that much RAM? Is there a
way around it?

Thanks and regards,

-- 
Tanel Alumäe <tanel.alumae at aqris.com>


From stolcke at speech.sri.com  Thu Feb 19 10:08:08 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 19 Feb 2004 10:08:08 PST
Subject: Class expansion 
In-Reply-To: Your message of Thu, 19 Feb 2004 17:13:56 +0200.
             <1077203635.13538.38.camel@NOOL2> 
Message-ID: <200402191808.KAA13923@huge>


In message <1077203635.13538.38.camel at NOOL2>you wrote:
> Hello,
> 
> I'm trying to convert a class bigram to its equivalent word n-gram,
> using the "ngram" tool with the -expand-classes option. The class model
> has 1000 classes, and there are 60000 words. I use the following command
> line:
> 
> ngram -lm <classmodel> -classes <classesfile> -expand-classes 2
> -write-lm <outputmodel>
> 
> The process runs about 15 minutes using over 700M of RAM, and then gets
> killed by the OS (I'm using Linux), probably when it asked for even more
> memory than the OS had available (I have 512M of main memory).
> 
> Is it normal that the class expansion takes that much RAM? Is there a
> way around it?

It is expected.  You're seeing a combinatorial explosion of ngrams 
as the classes get expanded.   In general it is not feasible to expand
a large-vocabulary class LM with several hundred classes.
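
The size of the explosion is easy to estimate. The arithmetic below uses the figures from the question (1000 classes, 60000 words) and assumes, as a worst case, equally sized classes and a class bigram for every pair of classes; real models have fewer class bigrams but very uneven class sizes.

```python
def expanded_bigram_count(class_sizes, class_bigrams):
    """Number of word bigrams produced by expanding each class bigram.

    A class bigram (c1, c2) expands into |c1| * |c2| word bigrams.
    """
    return sum(class_sizes[c1] * class_sizes[c2] for c1, c2 in class_bigrams)

num_classes = 1000
words_per_class = 60000 // num_classes   # 60 words per class on average
sizes = {c: words_per_class for c in range(num_classes)}

# Worst case: every ordered pair of classes occurs as a class bigram.
all_pairs = ((c1, c2) for c1 in range(num_classes)
             for c2 in range(num_classes))
print(expanded_bigram_count(sizes, all_pairs))  # 3600000000
```

Even a small fraction of 3.6 billion bigrams is far more than fits into a few hundred megabytes of memory.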

ngram -expand-classes was designed for medium-vocabulary class LMs,
especially ones with hand-designed classes.  It works fine for domains
like ATIS, SPINE, Communicator, etc.

There is a way around it, but it would require some coding.  
You could do the class expansion, and interleave it with ngram pruning.
In other words, right after you expand all the class ngrams that share
a word ngram context you perform entropy-based pruning to retain only
those that "matter".  This should dramatically reduce the size of 
the expanded model.

--Andreas 


From wavrow at hotmail.com  Mon Feb 23 02:13:57 2004
From: wavrow at hotmail.com (Shlomo Wavrow)
Date: Mon, 23 Feb 2004 12:13:57 +0200
Subject: ngram-count : -tagged option
Message-ID: <Law10-F558I2lreZkHY0000a10c@hotmail.com>

Hello everybody!
Does anybody have any experience using the -tagged option of ngram-count? I 
thought that the -tagged option means that ngram-count creates a tag-based model, 
but I got strange results. The resulting counts file contains a kind of 
mixture of words and tags... My input text file has the following form:
<s>word1/tag1 word2/tag2 .... wordN/tagN</s>

Regards Shlomo



From stolcke at speech.sri.com  Mon Feb 23 14:03:34 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 23 Feb 2004 14:03:34 PST
Subject: ngram-count : -tagged option 
In-Reply-To: Your message of Mon, 23 Feb 2004 12:13:57 +0200.
             <Law10-F558I2lreZkHY0000a10c@hotmail.com> 
Message-ID: <200402232203.OAA23323@huge>


In message <Law10-F558I2lreZkHY0000a10c at hotmail.com>you wrote:
> Hello everybody!
> Does anybody have any experience using the -tagged option of ngram-count? I 
> thought that the -tagged option means that ngram-count creates a tag-based model, 
> but I got strange results. The resulting counts file contains a kind of 
> mixture of words and tags... My input text file has the following form:
> <s>word1/tag1 word2/tag2 .... wordN/tagN</s>

This option is for building ngram LMs that use the word class for 
backoff, and thus hopefully improved smoothing.  It is not documented,
I'm afraid, so will be hard to use unless you are willing to look closely
at the code.  I remember someone on this list reported a bug with the
code a while back, so maybe there are some people out there who can help.
Also, there is a small example in the test suite (test/tests/tagged-ngram).

I should note that the "factored N-gram" models recently added to
SRILM (release 1.4) are a generalization of tagged N-grams, and 
there is good documentation for those.  So you might want to 
think about reformulating whatever it is you are thinking of as a factored
LM.

--Andreas 


From s0343879 at sms.ed.ac.uk  Tue Feb 24 07:54:17 2004
From: s0343879 at sms.ed.ac.uk (G Hofer)
Date: Tue, 24 Feb 2004 15:54:17 +0000
Subject: decode lattice
Message-ID: <1077638057.403b73a921e70@sms.ed.ac.uk>

Hi, 
 
We are using the SRILM 1.4 toolkit. So far we have created a lattice in 
HTK format and a bigram model in ARPA format. It is not clear from the 
manual page how to decode this lattice using the bigram model. Can you give us 
the correct options for lattice-tool to accomplish this, if this is the 
correct tool to use? 
 
thank you, 
Gregor 
 

From stolcke at speech.sri.com  Tue Feb 24 08:42:28 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 24 Feb 2004 08:42:28 PST
Subject: decode lattice 
In-Reply-To: Your message of Tue, 24 Feb 2004 15:54:17 +0000.
             <1077638057.403b73a921e70@sms.ed.ac.uk> 
Message-ID: <200402241642.IAA11315@huge>


In message <1077638057.403b73a921e70 at sms.ed.ac.uk>you wrote:
> Hi, 
>  
> We are using the SRILM 1.4 toolkit. So far we have created a lattice in 
> HTK format and a bigram model in ARPA format. It is not clear from the 
> manual page how to decode this lattice using the bigram model. Can you give us 
> the correct options for lattice-tool to accomplish this, if this is the 
> correct tool to use? 
>  
> thank you, 
> Gregor 

Gregor,

You would run lattice-tool twice, first to rescore the lattices with your
LM, then to extract the best hypothesis.

For LM rescoring use options

	lattice-tool -read-htk -write-htk -order 2 -lm LM -no-nulls

For 1-best decoding use

	lattice-tool -read-htk -htk-lmscale LMWEIGHT -viterbi-decode

You could also perform posterior-based (confusion network) decoding using

	lattice-tool -read-htk -htk-lmscale LMWEIGHT -posterior-decode

where LMWEIGHT is the weight given to the LM scores relative to the 
acoustic model scores.

Of course you need to add options specifying the location of input/output
lattices. 

A future version of lattice-tool will probably allow you to combine these
two steps into a single run, but for now you have to do it this way.
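
The two runs can be wrapped in a small driver script. The sketch below only builds the command lines; the options for the input and output lattices (-in-lattice, -out-lattice) are assumptions that should be checked against the lattice-tool manual page, and the file names are placeholders.

```python
def rescore_cmd(in_lattice, out_lattice, lm, order=2):
    """Command line for the first run: rescore an HTK lattice with an LM."""
    return ["lattice-tool", "-read-htk", "-write-htk",
            "-order", str(order), "-lm", lm, "-no-nulls",
            "-in-lattice", in_lattice,      # assumed option name
            "-out-lattice", out_lattice]    # assumed option name

def decode_cmd(in_lattice, lm_weight, posterior=False):
    """Command line for the second run: 1-best or posterior decoding."""
    mode = "-posterior-decode" if posterior else "-viterbi-decode"
    return ["lattice-tool", "-read-htk",
            "-htk-lmscale", str(lm_weight), mode,
            "-in-lattice", in_lattice]      # assumed option name

steps = [rescore_cmd("utt1.lat", "utt1.rescored.lat", "bigram.lm"),
         decode_cmd("utt1.rescored.lat", lm_weight=12.0)]
for cmd in steps:
    print(" ".join(cmd))
    # To execute: import subprocess; subprocess.run(cmd, check=True)
```

The second command reads the lattice written by the first, which is why the steps have to run in this order.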

--Andreas 


From stolcke at speech.sri.com  Thu Feb 26 11:05:43 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 26 Feb 2004 11:05:43 PST
Subject: decode lattice 
In-Reply-To: Your message of Wed, 25 Feb 2004 01:25:56 +0000.
             <GLEPKGIJOMFGABPKCAMJEEMLCEAA.john@newington.f9.co.uk> 
Message-ID: <200402261905.LAA08246@tonga>

In message <GLEPKGIJOMFGABPKCAMJEEMLCEAA.john at newington.f9.co.uk>you wrote:
> Dear Andreas,
> 
> I am replying on behalf of my colleague who emailed you earlier regarding
> correct use of the SRILM lattice-tool.
> 
> Based on your previous advice I have tried to decode our lattice using our
> bigram model. All files seem to be in the correct format, so far as I can
> tell. However, when lattice-tool rescores the lattice, all the newly added
> LM probabilities "l=..." come out as "-inf". I tried 1-best decoding using
> viterbi on the rescored lattice and the output is simply:
> 
> lattice.out </s>
> 
> where lattice.out is the utterance name inserted by lattice-tool.
> 
> Do you have any idea why we're experiencing behaviour like this? Can you
> suggest any alterations?

John,

the problem is that your lattices use double-quotes around the word strings,
but the released version of SRILM doesn't yet implement the HTK quoting
mechanism (an oversight on my part).

You can replace the file lattice/src/HTKLattice.cc with the attached version
and rebuild lattice-tool to make it work.  Or, you can just strip the 
double quotes in your lattice files and keep using the old software.
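
Stripping the double quotes can be done with a short filter. This sketch assumes the word fields are written as W="..." or WORD="..." and only touches simple words without spaces, quotes or backslash escapes; anything else is left unchanged. The function name is hypothetical.

```python
import re

# Matches W="word" or WORD="word" where word has no whitespace,
# double quotes or backslashes.
_QUOTED_WORD = re.compile(r'\b(W|WORD)="([^"\s\\]+)"')

def strip_word_quotes(line):
    """Remove double quotes around simple word fields in one lattice line."""
    return _QUOTED_WORD.sub(r"\1=\2", line)

sample = 'J=0 S=0 E=1 W="hello" a=-10.5 l=-1.2'
print(strip_word_quotes(sample))
```

Applied line by line to a lattice file, this leaves all other fields alone.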

--Andreas 

-------------- next part --------------
/*
 * HTKLattice.cc --
 *	HTK Standard Lattice Format support for SRILM lattices
 *
 *	Note: there is no separate HTKLattice class, only I/O methods!
 *
 */

#ifndef lint
static char Copyright[] = "Copyright (c) 2004 SRI International.  All Rights Reserved.";
static char RcsId[] = "@(#)$Header: /home/srilm/devel/lattice/src/RCS/HTKLattice.cc,v 1.17 2004/02/26 18:48:22 stolcke Exp $";
#endif

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <ctype.h>
#include <math.h>
#include <assert.h>

#include "Array.cc"
#include "LHash.cc"
#include "Lattice.h"
#include "MultiwordVocab.h"
#include "NBest.h"		// for phoneSeparator defn

#ifdef INSTANTIATE_TEMPLATES
INSTANTIATE_ARRAY(HTKLink);
#endif

/* from Lattice.cc */
#define DebugPrintFatalMessages         1 
#define DebugPrintFunctionality         1 

const char *HTKLattice_Version = "1.1";

const float HTK_undef_float = HUGE_VAL;
const unsigned HTK_undef_uint = (unsigned)-1;

const char *HTK_null_word = "!NULL";

const float HTK_def_tscale = 1.0;
const float HTK_def_acscale = 1.0;
const float HTK_def_lmscale = 1.0;
const float HTK_def_ngscale = 1.0;
const float HTK_def_wdpenalty = 0.0;
const float HTK_def_prscale = 1.0;
const float HTK_def_duscale = 0.0;

HTKHeader::HTKHeader()
    : logbase(10), tscale(HTK_def_tscale), acscale(HTK_def_acscale),
      ngscale(HTK_def_ngscale), lmscale(HTK_def_lmscale),
      wdpenalty(HTK_def_wdpenalty), prscale(HTK_def_prscale),
      duscale(HTK_def_duscale), amscale(HTK_undef_float),
      vocab(0), lmname(0), ngname(0), hmms(0),
      wordsOnNodes(false), scoresOnNodes(false)
{
};

HTKHeader::HTKHeader(double acscale, double lmscale, double ngscale,
			double prscale, double duscale, double wdpenalty)
    : logbase(10), tscale(HTK_def_tscale), acscale(acscale),
      ngscale(ngscale), lmscale(lmscale),
      wdpenalty(wdpenalty), prscale(prscale),
      duscale(duscale), amscale(HTK_undef_float),
      vocab(0), lmname(0), ngname(0), hmms(0),
      wordsOnNodes(false), scoresOnNodes(false)
{
};

HTKHeader::~HTKHeader()
{
    if (vocab) free(vocab);
    if (lmname) free(lmname);
    if (ngname) free(ngname);
    if (hmms) free(hmms);
}

HTKHeader &
HTKHeader::operator= (const HTKHeader &other)
{
    if (&other == this) {
	return *this;
    }

    if (vocab) free(vocab);
    if (lmname) free(lmname);
    if (ngname) free(ngname);
    if (hmms) free(hmms);

    tscale = other.tscale;
    acscale = other.acscale;
    ngscale = other.ngscale;
    lmscale = other.lmscale;
    wdpenalty = other.wdpenalty;
    prscale = other.prscale;
    duscale = other.duscale;
    amscale = other.amscale;
    if (other.vocab == 0) {
	vocab = 0;
    } else {
	vocab = strdup(other.vocab);
	assert(vocab != 0);
    }
    if (other.lmname == 0) {
	lmname = 0;
    } else {
	lmname = strdup(other.lmname);
	assert(lmname != 0);
    }
    if (other.ngname == 0) {
	ngname = 0;
    } else {
	ngname = strdup(other.ngname);
	assert(ngname != 0);
    }
    if (other.hmms == 0) {
	hmms = 0;
    } else {
	hmms = strdup(other.hmms);
	assert(hmms != 0);
    }

    return *this;
}


HTKLink::HTKLink()
    : time(HTK_undef_float), word(Vocab_None), var(HTK_undef_uint), div(0),
      acoustic(HTK_undef_float), ngram(HTK_undef_float),
      language(HTK_undef_float), pron(HTK_undef_float),
      duration(HTK_undef_float), posterior(HTK_undef_float)
{
}

HTKLink::~HTKLink()
{
    if (div) free(div);
}

HTKLink &
HTKLink::operator= (const HTKLink &other)
{
    if (&other == this) {
	return *this;
    }

    if (div) free(div);

    time = other.time;
    word = other.word;
    var = other.var;
    if (other.div == 0) {
	div = 0;
    } else {
	div = strdup(other.div);
	assert(div != 0);
    }
    acoustic = other.acoustic;
    ngram = other.ngram;
    language = other.language;
    pron = other.pron;
    duration = other.duration;
    posterior = other.posterior;
    return *this;
}

/* 
 * Format HTKLink (for debugging)
 */
ostream &
operator<< (ostream &stream, HTKLink &link)
{
    stream << "[HTKLink";

    if (link.word != Vocab_None) {
	stream << " WORD=" << link.word;
    }
    if (link.time != HTK_undef_float) {
	stream << " time=" << link.time;
    }
    if (link.var != HTK_undef_uint) {
	stream << " var=" << link.var;
    }
    if (link.div != 0) {
	stream << " div=" << link.div;
    }
    if (link.acoustic != HTK_undef_float) {
	stream << " a=" << link.acoustic;
    }
    if (link.ngram != HTK_undef_float) {
	stream << " n=" << link.ngram;
    }
    if (link.language != HTK_undef_float) {
	stream << " l=" << link.language;
    }
    if (link.pron != HTK_undef_float) {
	stream << " r=" << link.pron;
    }
    if (link.duration != HTK_undef_float) {
	stream << " ds=" << link.duration;
    }
    if (link.posterior != HTK_undef_float) {
	stream << " p=" << link.posterior;
    }
    stream << "]";
    return stream;
}


/*
 * Find the next key=value pair in line, return the key (storing the string
 * value in value), and 
 * advance line pointer past it.
 * The string pointed to by line is modified in the process.
 */
static char *
getHTKField(char *&line, char *&value)
{
    char *cp = line;
    char *key;

    do {
	switch (*cp) {
	case '\0':
	case '#':
		return 0;
		break;
	case ' ':
	case '\t':
	case '\n':
		cp ++;
		break;
	default:
		key = cp;

		while (*cp != '\0' && !isspace(*cp) && *cp != '=') cp++;

		if (*cp == '=') {
		    *(cp++) = '\0';	// terminate key string
		    value = cp;		// beginning of value string
		    char *cpv = cp;	// target location for copying value

		    char inquote = '\0';

		    /*
		     * Quotes are only treated specially if they 
		     * occur in first position
		     */
		    if (*cp == '\"' || *cp == '\'') {
			inquote = *(cp++);
		    }

		    while (*cp != '\0') {
			if (*cp == '\\') {
			    /*
			     * Backslash quote processing
			     */
			    cp ++;
			    if (*cp == '\0') {
				/*
				 * Shouldn't happen, we just ignore it
				 */
				break;
			    } else if (*cp == '0') {
				/*
				 * Octal char code
				 */
				unsigned charcode;
				int charlen = 0;	// %n stores into an int
				sscanf(cp, "%o%n", &charcode, &charlen);
				*(cpv++) = charcode;
				cp += charlen;
			    } else {
				/*
				 * Other quoted character
				 */
				*(cpv++) = *(cp++);
			    }
			} else if (!inquote && isspace(*cp)) {
			    /*
			     * String delimited by whitespace
			     */
			    cp ++;
			    break;
			} else if (inquote && *cp == inquote) {
			    /*
			     * String delimited by end quote
			     */
			    cp ++;
			    break;
			} else {
			    /* 
			     * Character in string
			     */
			    *(cpv++) = *(cp++);
			}
		    }
		    *cpv = '\0';	// terminate value string
		} else {
		    value = cp;		// beginning of value string
		    if (*cp != '\0') {
			*(cp++) = '\0';	// terminate value string
		    }
		}

		line = cp;
		return key;
	}
    } while (1);
}

/*
 * Output quoted version of string
 */
static void
printQuoted(FILE *f, const char *name)
{
    Boolean octalPrinted = false;

    for (const char *cp = name; *cp != '\0'; cp ++) {
	if (*cp == ' ' || *cp == '\\' || *cp == '\'' || *cp == '\"' ||
	    octalPrinted && isdigit(*cp))
	{
	    /*
	     * This character needs to be quoted
	     */
	    putc('\\', f);
	    putc(*cp, f);
	    octalPrinted = false;
	} else if (!isprint(*cp) || isspace(*cp)) {
	    /*
	     * Print as octal char code
	     */
	    fprintf(f, "\\0%o", *cp);
	    octalPrinted = true;
	} else {
	    /*
	     * Print as plain character
	     */
	    putc(*cp, f);
	    octalPrinted = false;
	}
    }
}

/*
 * Input lattice in HTK format
 *	Algorithm:
 *	- each HTK node becomes a null node.
 *	- each HTK link becomes a non-null node.
 *	- word and other link information is added to the non-null nodes.
 *	- link information attached to HTK nodes is added to non-null nodes.
 *	- lattice transition weights are computed as a log-linear combination
 *	  of HTK scores.
 * Arguments:
 *	- if header != 0, supplied scaling parameters override information
 *	  from lattice header
 *	- if useNullNodes == false null nodes corresponding to original
 *	  HTK nodes are eliminated
 */
Boolean
Lattice::readHTK(File &file, HTKHeader *header, Boolean useNullNodes)
{
    removeAll();

    unsigned HTKnumlinks = 0;
    unsigned HTKnumnodes = 0;
    float HTKlogbase = M_E;
    unsigned HTKfinal = HTK_undef_uint;
    unsigned HTKinitial = HTK_undef_uint;
    char HTKdirection = 'f';

    unsigned HTKfirstnode = HTK_undef_uint;
    unsigned HTKlastnode = HTK_undef_uint;
    float HTKinitialtime, HTKfinaltime;

    LHash<unsigned, NodeIndex> nodeMap;		// maps HTK nodes->lattice nodes
    Array<HTKLink> nodeInfoMap;			// node-based link information

    // dummy word used temporarily to represent HTK nodes
    // (could have used null nodes, but this way we preserve null nodes in
    // the input lattice)
    const char *HTKNodeWord = "***HTK_Node***";
    VocabIndex HTKNodeDummy = useNullNodes ? Vocab_None :
					     vocab.addWord(HTKNodeWord);

    /*
     * Override supplied header parameters
     */
    if (header != 0) {
	if (header->logbase != HTK_undef_float) {
	    htkheader.logbase = header->logbase;
	}
	if (header->acscale != HTK_undef_float) {
	    htkheader.acscale = header->acscale;
	}
	if (header->lmscale != HTK_undef_float) {
	    htkheader.lmscale = header->lmscale;
	}
	if (header->ngscale != HTK_undef_float) {
	    htkheader.ngscale = header->ngscale;
	}
	if (header->prscale != HTK_undef_float) {
	    htkheader.prscale = header->prscale;
	}
	if (header->duscale != HTK_undef_float) {
	    htkheader.duscale = header->duscale;
	}
	if (header->wdpenalty != HTK_undef_float) {
	    htkheader.wdpenalty = header->wdpenalty;
	}
	if (header->amscale != HTK_undef_float) {
	    htkheader.amscale = header->amscale;
	}
	htkheader.wordsOnNodes = header->wordsOnNodes;
	htkheader.scoresOnNodes = header->scoresOnNodes;
    }


    /*
     * Parse HTK lattice file
     */
    while (char *line = file.getline()) {
	char *key;
	char *value;

	/*
	 * Parse key=value pairs
	 * (we test for frequent fields first to save time)
	 * We assume that header information comes before node information,
	 * which comes before link information.  However, this is not
	 * enforced, and incomplete lattices may result if the input file
	 * contains things out of order.
	 */
	while (key = getHTKField(line, value)) {
#define keyis(x)	(strcmp(key, (x)) == 0)
	    /*
	     * Link fields
	     */
	    if (keyis("J")) {
		unsigned HTKlinkno = atoi(value);

		/*
		 * parse link fields
		 */
		HTKLink *linkinfo = new HTKLink;
		assert(linkinfo != 0);
				// allocates new HTKLink pointer in lattice
		htkinfos[htkinfos.size()] = linkinfo;

		unsigned HTKstartnode, HTKendnode;
		NodeIndex startIndex = NoNode, endIndex = NoNode;

		while (key = getHTKField(line, value)) {
		    if (keyis("S") || keyis("START")) {
			HTKstartnode = atoi(value);
			Boolean found;
			NodeIndex *startIndexPtr =
				nodeMap.insert(HTKstartnode, found);
			if (!found) {
			    // node index not seen before; create it
			    *startIndexPtr = dupNode(Vocab_None);
			}
			startIndex = *startIndexPtr;

		    } else if (keyis("E") || keyis("END")) {
			HTKendnode = atoi(value);
			Boolean found;
			NodeIndex *endIndexPtr =
				nodeMap.insert(HTKendnode, found);
			if (!found) {
			    // node index not seen before; create it
			    *endIndexPtr = dupNode(Vocab_None);
			}
			endIndex = *endIndexPtr;

		    } else if (keyis("W") || keyis("WORD")) {
			if (strcmp(value, HTK_null_word) == 0) {
			    linkinfo->word = Vocab_None;
			} else {
			    linkinfo->word = vocab.addWord(value);
			}
		    } else if (keyis("v") || keyis("var")) {
			linkinfo->var = atoi(value);
		    } else if (keyis("d") || keyis("div")) {
			linkinfo->div = strdup(value);
			assert(linkinfo->div != 0);
		    } else if (keyis("a") || keyis("acoustic")) {
			double score = atof(value);
			if (HTKlogbase > 0.0) {
			    linkinfo->acoustic = score * ProbToLogP(HTKlogbase);
			} else {
			    linkinfo->acoustic = ProbToLogP(score);
			}
		    } else if (keyis("n") || keyis("ngram")) {
			double score = atof(value);
			if (HTKlogbase > 0.0) {
			    linkinfo->ngram = score * ProbToLogP(HTKlogbase);
			} else {
			    linkinfo->ngram = ProbToLogP(score);
			}
		    } else if (keyis("l") || keyis("language")) {
			double score = atof(value);
			if (HTKlogbase > 0.0) {
			    linkinfo->language = score * ProbToLogP(HTKlogbase);
			} else {
			    linkinfo->language = ProbToLogP(score);
			}
		    } else if (keyis("r")) {
			double score = atof(value);
			if (HTKlogbase > 0.0) {
			    linkinfo->pron = score * ProbToLogP(HTKlogbase);
			} else {
			    linkinfo->pron = ProbToLogP(score);
			}
		    } else if (keyis("ds")) {
			double score = atof(value);
			if (HTKlogbase > 0.0) {
			    linkinfo->duration = score * ProbToLogP(HTKlogbase);
			} else {
			    linkinfo->duration = ProbToLogP(score);
			}
		    } else if (keyis("p")) {
			linkinfo->posterior = atof(value);
		    } else {
			file.position() << "unexpected link field name "
					<< key << endl;
			if (!useNullNodes) vocab.remove(HTKNodeDummy);
			return false;
		    }
		}

		if (startIndex == NoNode) {
		    file.position() << "missing start node spec\n";
		    if (!useNullNodes) vocab.remove(HTKNodeDummy);
		    return false;
		}

		if (endIndex == NoNode) {
		    file.position() << "missing end node spec\n";
		    if (!useNullNodes) vocab.remove(HTKNodeDummy);
		    return false;
		}

		/*
		 * fill in unspecified link info from associated node info
		 * 'forward' lattices use end-node information.
		 * 'backward' lattices use start-node information.
		 */
		HTKLink *nodeinfo = 0;
		if (HTKdirection == 'f') {
		    nodeinfo = &nodeInfoMap[HTKendnode];
		} else if (HTKdirection == 'b') {
		    nodeinfo = &nodeInfoMap[HTKstartnode];
		}

		if (nodeinfo != 0) {
		    linkinfo->time = nodeinfo->time;

		    if (linkinfo->word == Vocab_None) {
			linkinfo->word = nodeinfo->word;
		    }
		    if (linkinfo->var == HTK_undef_uint) {
			linkinfo->var = nodeinfo->var;
		    }
		    if (linkinfo->div == 0 && nodeinfo->div != 0) {
			linkinfo->div = strdup(nodeinfo->div);
			assert(linkinfo->div != 0);
		    }
		    if (linkinfo->acoustic == HTK_undef_float) {
			linkinfo->acoustic = nodeinfo->acoustic;
		    }
		    if (linkinfo->pron == HTK_undef_float) {
			linkinfo->pron = nodeinfo->pron;
		    }
		    if (linkinfo->duration == HTK_undef_float) {
			linkinfo->duration = nodeinfo->duration;
		    }
		}

		/*
		 * Create lattice node
		 */
		NodeIndex newNode = dupNode(linkinfo->word, 0, linkinfo);

		/*
		 * Compute lattice transition weight as a weighted combination
		 * of HTK lattice scores
		 */
		LogP weight = LogP_One;

		if (linkinfo->acoustic != HTK_undef_float) {
		    weight += htkheader.acscale * linkinfo->acoustic;
		}
		if (linkinfo->ngram != HTK_undef_float) {
		    weight += htkheader.ngscale * linkinfo->ngram;
		}
		if (linkinfo->language != HTK_undef_float) {
		    weight += htkheader.lmscale * linkinfo->language;
		}
		if (linkinfo->pron != HTK_undef_float) {
		    weight += htkheader.prscale * linkinfo->pron;
		}
		if (linkinfo->duration != HTK_undef_float) {
		    weight += htkheader.duscale * linkinfo->duration;
		}
		if (!ignoreWord(linkinfo->word)) {
		    weight += htkheader.wdpenalty; 	// do we need to scale ?
		}

		/*
		 * Add transitions from start node, and to end node
		 */
		LatticeTransition trans1(weight, 0);
		insertTrans(startIndex, newNode, trans1);

		LatticeTransition trans2(LogP_One, 0);
		insertTrans(newNode, endIndex, trans2);

		continue;

	    /*
	     * Node fields
	     */
	    } else if (keyis("I")) {
		unsigned HTKnodeno = atoi(value);

		/*
		 * create a null node for this HTK node,
		 * and record node-related info.
		 */
		NodeIndex nullNodeIndex = dupNode(HTKNodeDummy);

		*nodeMap.insert(HTKnodeno) = nullNodeIndex;
		HTKLink &nodeinfo = nodeInfoMap[HTKnodeno];

		/*
		 * parse node fields
		 */
		while (key = getHTKField(line, value)) {
		    if (keyis("t") || keyis("time")) {
			nodeinfo.time = atof(value);

			// remember temporally first node and timestamp
			// in case input doesn't specify initial node
			if (HTKfirstnode == HTK_undef_uint ||
			    nodeinfo.time < HTKinitialtime)
			{
			    HTKfirstnode = HTKnodeno;
			    HTKinitialtime = nodeinfo.time;
			}
			// same for last timestamp
			if (HTKlastnode == HTK_undef_uint ||
			    nodeinfo.time > HTKfinaltime)
			{
			    HTKlastnode = HTKnodeno;
			    HTKfinaltime = nodeinfo.time;
			}
		    } else if (keyis("W") || keyis("WORD")) {
			if (strcmp(value, HTK_null_word) == 0) {
			    nodeinfo.word = Vocab_None;
			} else {
			    nodeinfo.word = vocab.addWord(value);
			}
		    } else if (keyis("v") || keyis("var")) {
			nodeinfo.var = atoi(value);
		    } else if (keyis("d") || keyis("div")) {
			nodeinfo.div = strdup(value);
			assert(nodeinfo.div != 0);
		    } else if (keyis("a") || keyis("acoustic")) {
			double score = atof(value);
			if (HTKlogbase > 0.0) {
			    nodeinfo.acoustic = score * ProbToLogP(HTKlogbase);
			} else {
			    nodeinfo.acoustic = ProbToLogP(score);
			}
		    } else if (keyis("r")) {
			double score = atof(value);
			if (HTKlogbase > 0.0) {
			    nodeinfo.pron = score * ProbToLogP(HTKlogbase);
			} else {
			    nodeinfo.pron = ProbToLogP(score);
			}
		    } else if (keyis("ds")) {
			double score = atof(value);
			if (HTKlogbase > 0.0) {
			    nodeinfo.duration = score * ProbToLogP(HTKlogbase);
			} else {
			    nodeinfo.duration = ProbToLogP(score);
			}
		    } else {
			file.position() << "unexpected node field name "
					<< key << endl;
			if (!useNullNodes) vocab.remove(HTKNodeDummy);
			return false;
		    }
		}

		if (nodeinfo.time != HTK_undef_float) {
		    // record node time, but no word-related info
		    LatticeNode *nullNode = findNode(nullNodeIndex);
		    assert(nullNode != 0);

		    HTKLink *nullInfo = new HTKLink;
		    assert(nullInfo != 0);
		    htkinfos[htkinfos.size()] = nullInfo;

		    nullNode->htkinfo = nullInfo;
		    nullInfo->time = nodeinfo.time;
		}

		continue;

	    /*
	     * Header fields
	     */
	    } else if (keyis("V") || keyis("VERSION")) {
		; 		// ignore
	    } else if ( keyis("U") || keyis("UTTERANCE")) {
		if (name) free((void *)name);

		// HACK: strip duration spec (which shouldn't be there)
		char *p = strstr(value, "(duration=");
		if (p != 0) *p = '\0';
		    
		name = strdup(value);
		assert(name != 0);
	    } else if (keyis("base")) {
		HTKlogbase = atof(value);
	    } else if (keyis("start")) {
		HTKinitial = atoi(value);
	    } else if (keyis("end")) {
		HTKfinal = atoi(value);
	    } else if (keyis("dir")) {
		HTKdirection = value[0];
	    } else if (keyis("tscale")) {
		htkheader.tscale = atof(value);
	    } else if (keyis("hmms")) {
		htkheader.hmms = strdup(value);
		assert(htkheader.hmms != 0);
	    } else if (keyis("ngname")) {
		htkheader.ngname = strdup(value);
		assert(htkheader.ngname != 0);
	    } else if (keyis("lmname")) {
		htkheader.lmname = strdup(value);
		assert(htkheader.lmname != 0);
	    } else if (keyis("vocab")) {
		htkheader.vocab = strdup(value);
		assert(htkheader.vocab != 0);
	    } else if (keyis("acscale")) {
		if (header == 0 || header->acscale == HTK_undef_float) {
		    htkheader.acscale = atof(value);
		}
	    } else if (keyis("ngscale")) {
		if (header == 0 || header->ngscale == HTK_undef_float) {
		    htkheader.ngscale = atof(value);
		}
	    } else if (keyis("lmscale")) {
		if (header == 0 || header->lmscale == HTK_undef_float) {
		    htkheader.lmscale = atof(value);
		}
	    } else if (keyis("prscale")) {
		if (header == 0 || header->prscale == HTK_undef_float) {
		    htkheader.prscale = atof(value);
		}
	    } else if (keyis("duscale")) {
		if (header == 0 || header->duscale == HTK_undef_float) {
		    htkheader.duscale = atof(value);
		}
	    } else if (keyis("wdpenalty")) {
		if (header == 0 || header->wdpenalty == HTK_undef_float) {
		    htkheader.wdpenalty = atof(value);
		}
	    } else if (keyis("amscale")) {
		if (header == 0 || header->amscale == HTK_undef_float) {
		    htkheader.amscale = atof(value);
		}
	    } else if (keyis("NODES") || keyis("N")) {
		HTKnumnodes = atoi(value);
	    } else if (keyis("LINKS") || keyis("L")) {
		HTKnumlinks = atoi(value);
	    } else {
		file.position() << "unknown field name " << key << endl;
		if (!useNullNodes) vocab.remove(HTKNodeDummy);
		return false;
	    }
#undef keyis
	}
    }

    if (HTKnumnodes == 0) {
	file.position() << "lattice has no nodes\n";
	if (!useNullNodes) vocab.remove(HTKNodeDummy);
	return false;
    }

    /*
     * Set up initial node
     */
    HTKLink *initialinfo;
    LatticeNode *initialNode;

    if (HTKinitial != HTK_undef_uint) {
	initialinfo = &nodeInfoMap[HTKinitial];
	NodeIndex *initialPtr = nodeMap.find(HTKinitial);
	if (initialPtr) {
	    initial = *initialPtr;
	    initialNode = findNode(initial);
	} else {
	    file.position() << "undefined start node " << HTKinitial << endl;
	    if (!useNullNodes) vocab.remove(HTKNodeDummy);
	    return false;
	}
    } else {
	initialinfo = &nodeInfoMap[HTKfirstnode];

	// search for start node: the one without incoming transitions
	LHashIter<NodeIndex, LatticeNode> nodeIter(nodes);
	NodeIndex nodeIndex;
	while (LatticeNode *node = nodeIter.next(nodeIndex)) {
	    if (node->inTransitions.numEntries() == 0) {
		initial = nodeIndex;
		initialNode = node;
		break;
	    }
	}
    }
    initialNode->word = vocab.ssIndex();

    // attach HTK initial node info to lattice initial node
    initialNode->htkinfo = new HTKLink;
    *initialNode->htkinfo = *initialinfo;
    htkinfos[htkinfos.size()] = initialNode->htkinfo;

    /*
     * Set up final node
     */
    HTKLink *finalinfo;
    LatticeNode *finalNode;

    if (HTKfinal != HTK_undef_uint) {
	finalinfo = &nodeInfoMap[HTKfinal];
	NodeIndex *finalPtr = nodeMap.find(HTKfinal);
	if (finalPtr) {
	    final = *finalPtr;
	    finalNode = findNode(final);
	} else {
	    file.position() << "undefined end node " << HTKfinal << endl;
	    if (!useNullNodes) vocab.remove(HTKNodeDummy);
	    return false;
	}
    } else {
	finalinfo = &nodeInfoMap[HTKlastnode];
	// search for end node: the one without outgoing transitions
	LHashIter<NodeIndex, LatticeNode> nodeIter(nodes);
	NodeIndex nodeIndex;
	while (LatticeNode *node = nodeIter.next(nodeIndex)) {
	    if (node->outTransitions.numEntries() == 0) {
		final = nodeIndex;
		finalNode = node;
		break;
	    }
	}
    }
    finalNode->word = vocab.seIndex();

    // attach HTK final node info to lattice final node
    finalNode->htkinfo = new HTKLink;
    *finalNode->htkinfo = *finalinfo;
    htkinfos[htkinfos.size()] = finalNode->htkinfo;

    // eliminate dummy nodes 
    if (!useNullNodes) {
	removeAllXNodes(HTKNodeDummy);
	vocab.remove(HTKNodeDummy);
    }

    return true;
}

/*
 * Output lattice in HTK format
 *	Algorithm:
 *	- each lattice node becomes an HTK node.
 *	- each lattice transition becomes an HTK link.
 *	- word information is added to the HTK nodes.
 *	- link information attached to each node is added to the HTK link
 *	  leading into the node.
 *	- lattice transition weights are mapped to one of the
 *	  HTK score fields as indicated by the second argument.
 */
Boolean
Lattice::writeHTK(File &file, HTKScoreMapping scoreMapping,
						    Boolean printPosteriors)
{
    if (debug(DebugPrintFunctionality)) {
      dout()  << "Lattice::writeHTK: writing ";
    }

    fprintf(file, "# Header (generated by SRILM)\n");
    fprintf(file, "VERSION=%s\n", HTKLattice_Version);
    fprintf(file, "UTTERANCE="); printQuoted(file, name); fputc('\n', file);
    fprintf(file, "base=%g\n", htkheader.logbase);
    fprintf(file, "dir=%s\n", "f");		// forward lattice

    /* 
     * Ancillary header information preserved from readHTK()
     */
    if (htkheader.tscale != HTK_def_tscale) {
	fprintf(file, "tscale=%g\n", htkheader.tscale);
    }
    if (htkheader.acscale != HTK_def_acscale) {
	fprintf(file, "acscale=%g\n", htkheader.acscale);
    }
    if (htkheader.lmscale != HTK_def_lmscale) {
	fprintf(file, "lmscale=%g\n", htkheader.lmscale);
    }
    if (htkheader.ngscale != HTK_def_ngscale) {
	fprintf(file, "ngscale=%g\n", htkheader.ngscale);
    }
    if (htkheader.prscale != HTK_def_prscale) {
	fprintf(file, "prscale=%g\n", htkheader.prscale);
    }
    if (htkheader.duscale != HTK_def_duscale) {
	fprintf(file, "duscale=%g\n", htkheader.duscale);
    }
    if (htkheader.amscale != HTK_undef_float && printPosteriors) {
	fprintf(file, "amscale=%g\n", htkheader.amscale);
    }
    if (htkheader.hmms != 0) {
	fprintf(file, "hmms=");
	printQuoted(file, htkheader.hmms); fputc('\n', file);
    }
    if (htkheader.lmname != 0) {
	fprintf(file, "lmname=");
	printQuoted(file, htkheader.lmname); fputc('\n', file);
    }
    if (htkheader.ngname != 0) {
	fprintf(file, "ngname=");
	printQuoted(file, htkheader.ngname); fputc('\n', file);
    }
    if (htkheader.vocab != 0) {
	fprintf(file, "vocab=");
	printQuoted(file, htkheader.vocab); fputc('\n', file);
    }
	
    /*
     * We remap the internal node indices to consecutive unsigned integers
     * to allow a compact output representation.
     * We iterate over all nodes, renumbering them, and also counting the
     * number of transitions overall.
     */
    LHash<NodeIndex,unsigned> nodeMap;		// map nodeIndex to unsigned
    unsigned numNodes = 0;
    unsigned numTransitions = 0;

    LHashIter<NodeIndex, LatticeNode> nodeIter(nodes, nodeSort);
    NodeIndex nodeIndex;

    while (LatticeNode *node = nodeIter.next(nodeIndex)) {
	*nodeMap.insert(nodeIndex) = numNodes ++;
	numTransitions += node->outTransitions.numEntries();
    }

    fprintf(file, "start=%u end=%u\n",  *nodeMap.find(initial),
					*nodeMap.find(final));
    fprintf(file, "NODES=%u LINKS=%u\n", numNodes, numTransitions);

    if (debug(DebugPrintFunctionality)) {
      dout()  << numNodes << " nodes, "
	      << numTransitions << " transitions\n";
    }

    fprintf(file, "# Nodes\n");

    double logscale = 1.0 / ProbToLogP(htkheader.logbase);

    nodeIter.init(); 
    while (LatticeNode *node = nodeIter.next(nodeIndex)) {

	fprintf(file, "I=%u", *nodeMap.find(nodeIndex));

 	if (htkheader.wordsOnNodes) {
	    fprintf(file, "\tW=");
	    printQuoted(file, (node->word == vocab.ssIndex() ||
			       node->word == vocab.seIndex() ||
			       node->word == Vocab_None) ?
				    HTK_null_word : vocab.getWord(node->word));
	}

	if (node->htkinfo != 0) {
	    HTKLink &htkinfo = *node->htkinfo;

	    if (htkinfo.time != HTK_undef_float) {
		fprintf(file, "\tt=%g", htkinfo.time);
	    }
	    if (htkheader.scoresOnNodes &&
		scoreMapping != mapHTKacoustic &&
		htkinfo.acoustic != HTK_undef_float)
	    {
		fprintf(file, "\ta=%g", htkinfo.acoustic * logscale);
	    }
	    if (htkheader.scoresOnNodes &&
		htkinfo.pron != HTK_undef_float)
	    {
		fprintf(file, "\tr=%g", htkinfo.pron * logscale);
	    }
	    if (htkheader.scoresOnNodes &&
		htkinfo.duration != HTK_undef_float)
	    {
		fprintf(file, "\tds=%g", htkinfo.duration * logscale);
	    }
	    if (htkheader.wordsOnNodes &&
		htkinfo.var != HTK_undef_uint)
	    {
		fprintf(file, "\tv=%u", htkinfo.var);
	    }
	    if (htkheader.wordsOnNodes &&
		htkinfo.div != 0)
	    {
		fprintf(file, "\td=%s", htkinfo.div);
	    }
	}
	if (printPosteriors) {
	    fprintf(file, "\tp=%lg", (double)LogPtoProb(node->posterior));
	}
	fprintf(file, "\n");
    }

    fprintf(file, "# Links\n");

    unsigned linkNumber = 0;
    nodeIter.init(); 
    while (LatticeNode *node = nodeIter.next(nodeIndex)) {
	unsigned *fromNodeId = nodeMap.find(nodeIndex);

 	NodeIndex toNodeIndex;

	TRANSITER_T<NodeIndex,LatticeTransition>
	  transIter(node->outTransitions);
	while (LatticeTransition *trans = transIter.next(toNodeIndex)) {
	    LatticeNode *toNode = findNode(toNodeIndex);
	    assert(toNode != 0);

	    unsigned *toNodeId = nodeMap.find(toNodeIndex); 
	    assert(toNodeId != 0);

	    fprintf(file, "J=%u\tS=%u\tE=%u",
				linkNumber++, *fromNodeId, *toNodeId);

	    if (!htkheader.wordsOnNodes) {
		fprintf(file, "\tW=");
		printQuoted(file, (toNode->word == vocab.ssIndex() ||
				   toNode->word == vocab.seIndex() ||
				   toNode->word == Vocab_None) ?
				   HTK_null_word : vocab.getWord(toNode->word));
	    }

	    if (toNode->htkinfo != 0) {
		HTKLink &htkinfo = *toNode->htkinfo;

		if (!htkheader.scoresOnNodes &&
		    scoreMapping != mapHTKacoustic &&
		    htkinfo.acoustic != HTK_undef_float)
		{
		    fprintf(file, "\ta=%g", htkinfo.acoustic * logscale);
		}
		if (!htkheader.scoresOnNodes &&
		    htkinfo.pron != HTK_undef_float)
		{
		    fprintf(file, "\tr=%g", htkinfo.pron * logscale);
		}
		if (!htkheader.scoresOnNodes &&
		    htkinfo.duration != HTK_undef_float)
		{
		    fprintf(file, "\tds=%g", htkinfo.duration * logscale);
		}
		if (!htkheader.wordsOnNodes &&
		    htkinfo.var != HTK_undef_uint) {
		    fprintf(file, "\tv=%u", htkinfo.var);
		}
		if (!htkheader.wordsOnNodes &&
		    htkinfo.div != 0)
		{
		    fprintf(file, "\td=%s", htkinfo.div);
		}
		if (scoreMapping != mapHTKngram &&
		    htkinfo.ngram != HTK_undef_float)
		{
		    fprintf(file, "\tn=%g", htkinfo.ngram * logscale);
		}
		if (scoreMapping != mapHTKlanguage &&
		    htkinfo.language != HTK_undef_float)
		{
		    fprintf(file, "\tl=%g", htkinfo.language * logscale);
		}
	    }

	    /*
	     * map transition weight to one of the standard HTK scores
	     */
	    if (scoreMapping != mapHTKnone) {
		fprintf(file, "\t%c=%g",
			    (scoreMapping == mapHTKacoustic ? 'a' :
			     (scoreMapping == mapHTKngram ? 'n' :
			      (scoreMapping == mapHTKlanguage ? 'l' : '?'))),
			    trans->weight * logscale);
	    }

	    fprintf(file, "\n");
	}
    }

    return true;
}


/* 
 * Compute pronunciation scores
 * 	(for nodes with HTKLink information that have phone backtraces)
 */
Boolean
Lattice::scorePronunciations(VocabMultiMap &dictionary, Boolean intlogs)
{
    if (debug(DebugPrintFunctionality)) {
      dout() << "Lattice::scorePronunciations: starting\n";
    }

    Vocab &phoneVocab = dictionary.vocab2;

    /*
     * Go through all HTKLink structures, extract the phone sequences,
     * and look up their probabilities in the dictionary
     */
    for (unsigned i = 0; i < htkinfos.size(); i ++) {
	HTKLink *info = htkinfos[i];

	/*
	 * only rescore words that have pronunciations
	 * (e.g., don't include NULL nodes)
	 */
	if (info->div != 0) {
	    assert(info->word != Vocab_None);

	    /*
	     * parse the phone sequence from the string
	     * example:
	     *	d=:#[s]t,0.12:s[t]r,0.03:t[r]ay,0.05:r[ay]k,0.09:ay[k]#,0.09:
	     * and convert into an index string
	     */
	    char phoneString[strlen(info->div) + 1];
	    strcpy(phoneString, info->div);

	    Array<VocabIndex> phones;
	    unsigned numPhones = 0;

	    for (char *s = strtok(phoneString, phoneSeparator);
		 s != 0;
		 s = strtok(NULL, phoneSeparator))
	    {
		// skip empty components (at beginning and end)
		if (s[0] == '\0') continue;

		// strip duration part
		char *e = strchr(s, ',');
		if (e != 0) *e = '\0';

		// strip context from triphone labels
		e = strchr(s, '[');
		if (e != 0) s = e + 1;

		e = strrchr(s, ']');
		if (e != 0) *e = '\0';

		phones[numPhones ++] = phoneVocab.addWord(s);
	    }
	    phones[numPhones] = Vocab_None;

	    // find pronunciation prob
	    Prob p = dictionary.get(info->word, phones.data());

	    if (p == 0.0) {
		// missing pronunciation gets log score 0 (probability 1)
		info->pron = LogP_One;
	    } else {
		if (intlogs) {
		    info->pron = IntlogToLogP(p);
		} else {
		    info->pron = ProbToLogP(p);
		}
	    }
	}
    }

    return true;
}


From stolcke at speech.sri.com  Tue Mar  2 16:15:46 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 02 Mar 2004 16:15:46 PST
Subject: SRILM 1.4 
In-Reply-To: Your message of Tue, 02 Mar 2004 15:36:05 -0800.
             <OF42132E19.478316A4-ON88256E4B.0081849E-88256E4B.0081A4E8@mohomine.com> 
Message-ID: <200403030015.QAA10968@tonga>


ngram -prune-lowprobs does a -renorm implicitly AFTER eliminating pruned 
N-gram probabilities.  However, if you do specify both 
options the renormalization is done FIRST, then the pruning.

What this could mean is that your original model is not properly
normalized (so the -renorm operation changes the backoff weights before
pruning).  Even if the model is normalized (as it should be if produced
by SRILM) you might see small differences due to rounding or loss of
precision when writing/reading the log probabilities, or other numerical
inaccuracies.  Note that even small differences in values might affect the
pruning decisions in some cases, so you probably will end up with
slightly different sets of N-grams.  Again, the differences would be
small and the resulting models should perform equivalently in
practice.

As a sanity check, compute perplexity of the two models.  They should be 
essentially identical.
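
As a rough sketch of what that comparison amounts to: perplexity is derived
from the total log probability of the test set.  The helper below assumes
base-10 log probabilities, one end-of-sentence token per sentence, and OOV
words excluded from the normalization.  The function names and the tolerance
parameter are illustrative assumptions, not part of SRILM.

```cpp
#include <cmath>

// Hypothetical helper (not an SRILM function).
// logprob10:    total log10 probability of the test set
// numWords:     number of word tokens in the test set
// numOOVs:      number of out-of-vocabulary tokens (excluded from scoring)
// numSentences: number of sentences (each adds one end-of-sentence token)
double perplexity(double logprob10, long numWords, long numOOVs,
		  long numSentences)
{
    long denom = numWords - numOOVs + numSentences;
    if (denom <= 0) {
	return NAN;		// nothing was scored
    }
    return std::pow(10.0, -logprob10 / denom);
}

// True if two perplexities differ by at most relTol, relative to the
// larger of the two (assumed tolerance, choose to taste).
bool essentiallyIdentical(double ppl1, double ppl2, double relTol)
{
    return std::fabs(ppl1 - ppl2) <= relTol * std::fmax(ppl1, ppl2);
}
```

Feeding the totals obtained for each of the two models into perplexity() and
comparing the results with essentiallyIdentical() is one way to make
"essentially identical" concrete.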

--Andreas

In message <OF42132E19.478316A4-ON88256E4B.0081849E-88256E4B.0081A4E8 at mohomine.com> you wrote:
> Hello,
> 
> If I want to export a LM to an FSM, such as the AT&T FSM library, then I 
> need to do -prune-lowprobs... but what about -renorm?  I notice that if I 
> do/don't add this flag on the command line... it makes a different LM... 
> but I'm not sure which one is right.  I was assuming I needed both 
> -prune-lowprobs and -renorm, but the LM looks a little funny... so now I'm 
> not sure.
> 
> Thanks,
> Chris
> 


From thomae at ei.tum.de  Thu Mar  4 06:48:04 2004
From: thomae at ei.tum.de (Matthias Thomae)
Date: Thu, 04 Mar 2004 15:48:04 +0100
Subject: make-ngram-pfsg: bad results with new gawk version
Message-ID: <404741A4.2090104@ei.tum.de>

Hello Andreas,

make-ngram-pfsg gives me different results with different versions of 
gawk. The header and the links are the same, but the weights differ 
substantially.

I see the old behaviour with gawk 3.1.0 (on debian) and 3.1.1 (on suse), 
and the differing one with 3.1.3-1 and 3.1.3-2 (on debian). The newly 
created PFSGs cause some ASR error degradation...

Any clues?

Regards.
Matthias


From thomae at ei.tum.de  Thu Mar  4 08:13:13 2004
From: thomae at ei.tum.de (Matthias Thomae)
Date: Thu, 04 Mar 2004 17:13:13 +0100
Subject: make-ngram-pfsg: bad results with new gawk version
In-Reply-To: <404741A4.2090104@ei.tum.de>
References: <404741A4.2090104@ei.tum.de>
Message-ID: <40475599.9070700@ei.tum.de>

Hello again,

forgot to say that I tested this with srilm 1.3.3 and 1.3.1.

Matthias

Matthias Thomae wrote:
> Hello Andreas,
> 
> make-ngram-pfsg gives me different results with different versions of 
> gawk. The header and the links are the same, but the weights differ 
> substantially.
> 
> I see the old behaviour with gawk 3.1.0 (on debian) and 3.1.1 (on suse), 
> and the differing one with 3.1.3-1 and 3.1.3-2 (on debian). The newly 
> created PFSGs cause some ASR error degradation...
> 
> Any clues?
> 
> Regards.
> Matthias


From stolcke at speech.sri.com  Thu Mar  4 12:53:00 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 04 Mar 2004 12:53:00 PST
Subject: SRILM 1.4 
In-Reply-To: Your message of Thu, 04 Mar 2004 08:07:01 -0800.
             <OF2145503D.B96019FD-ON88256E4D.005803E1-88256E4D.005888D3@mohomine.com> 
Message-ID: <200403042053.MAA09988@huge>


In message <OF2145503D.B96019FD-ON88256E4D.005803E1-88256E4D.005888D3 at mohomine.com> you wrote:
> Ok... that seemed to be fine... they did perform similarly.  I just wanted 
> to make sure everything was ok.
> 
> If I wanted to change the backoff order of the LM... is there an easy way 
> to do this...?  I looked into the NgramLM.cc file... and it seems kind of 
> tricky... because I need to know how the trie is used...
> 
> ... is there some other code that I should be looking in?
> 
> In particular... if the ngram is: p(a|b,c,d) I would prefer the backoff to 
> be:
> p(a|b,c,d) => p(a|b,c)bo(b,c,d)   // This is normal
>            => p(a|c)bo(b,c)       // BO normal, p context is not...
>            => p(a)bo(c)           // This is normal...
> 
> Or, even better would be:
> p(a|b,c,d) => p(a|b,c)bo(b,c,d)                 // This is normal
>            => p(a|b,c)bo(b,c) + p(a|c)bo(b,c)   // ... is something like this possible?
>            => p(a)bo(c)                         // This is normal...
> 
> I was also thinking that maybe I could write a script to output a counts 
> file given the text file that would somehow "trick" the LM to generate the 
> backoff order I'm interested in... is that an option?

This would be one solution.  Use ngram-count -read to train on the counts
and then ngram -counts to test.  Just reorder the words in the N-grams to
reflect the backoff order you want.

Note that the factored LM stuff in the latest version (courtesy of Jeff Bilmes)
gives you complete flexibility in specifying the backoff order (and many other
things, such as parallel backoff paths and their combination).
Look in $SRILM/flm/doc for details.

--Andreas 


From stolcke at speech.sri.com  Thu Mar  4 13:35:58 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 04 Mar 2004 13:35:58 PST
Subject: SRILM 1.4 
In-Reply-To: Your message of Thu, 04 Mar 2004 12:57:51 -0800.
             <OF55C68357.7592A8C7-ON88256E4D.0072EF62-88256E4D.00732909@mohomine.com> 
Message-ID: <200403042135.NAA07730@cuatro>


> 
> > This would be one solution.  Use ngram-count -read 
> > and then ngram -counts.   Just reorder the words in the N-grams to 
> > reflect the backoff order you want.
> > 
> 
> So how exactly would I reorder them supposing I wanted to do the backoff 
> as I explained earlier?  Can you just give a concrete example of 
> reordering them...?

This works only if each backoff level drops exactly one of the history
elements.  So if you want to back off

	p(a|b,c,d) -> p(a|b,c) -> p(a|c)

you are dropping history words in the order 3 (farthest), then 1 (nearest),
then 2.
To achieve this, extract N-grams (d c b a) from your data and prepare a count
file with 

	d b c a	<count>
	
For training (ngram-count) you also need to generate the lower-order counts,
i.e.,

	b c a	<count>
	c a	<count>
	a	<count>

For testing (ngram -counts) you only need the highest-order counts
(except at the start of a sentence, where the length of the N-grams is 
limited by the <s> tag).
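
The reordering can be sketched in Python.  This is only a minimal
illustration, not part of SRILM: the function names are hypothetical, the
tab-separated output layout is an assumption about the count file, and only
positions with a full three-word history are handled (N-grams shortened by
<s> at the start of a sentence are left out).

```python
from collections import Counter

def reordered_counts(tokens):
    """Count reordered 4-grams (d b c a) and their suffixes for one sentence.

    Text order is d c b a, with a the predicted word.  Writing the history
    as d b c means the leftmost word is dropped first when backing off:
    first d, then b, then c.
    """
    counts = Counter()
    for i in range(3, len(tokens)):
        d, c, b, a = tokens[i - 3:i + 1]
        reordered = (d, b, c, a)
        # Emit the full N-gram plus the lower orders: b c a, c a, a.
        for start in range(len(reordered)):
            counts[reordered[start:]] += 1
    return counts

def count_file_lines(counts):
    """Format counts as 'w1 w2 ...<TAB>count' lines (assumed layout)."""
    return ["%s\t%d" % (" ".join(ngram), n)
            for ngram, n in sorted(counts.items())]
```

For the token sequence "w x y z" this yields the 4-gram "w y x z" together
with "y x z", "x z" and "z", each with count 1.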

--Andreas 


From john at newington.f9.co.uk  Fri Mar  5 03:53:30 2004
From: john at newington.f9.co.uk (john at newington.f9.co.uk)
Date: 5 Mar 2004 11:53:30 -0000
Subject: decode lattice
Message-ID: <20040305115330.newington+john@force9>


Andreas,

I changed our lattice files so that the words were not enclosed in double quotes. This fixed the initial problem and enabled me to get an output from lattice-tool. However, I then realised that I needed to scale the output from my classifier by subtracting the log prior probabilities for each class before building the lattice. Now, when I try the rescaling and decoding using lattice-tool it predicts the same (low frequency) label for almost every token.

Am I wrong to scale my 'acoustic' probabilities before building the lattice? Does lattice-tool do this for me when I call:

./lattice-tool -read-htk -in-lattice lattice.slf -write-htk \
    -out-lattice lattice.out -lm DAgrammar -no-nulls

Hope you can shed some light on this.

Regards,

John Ferguson

 > > Dear Andreas,
 > >
 > > I am replying on behalf of my colleague who emailed you earlier
 > > regarding correct use of the SRILM lattice-tool.
 > >
 > > Based on your previous advice I have tried to decode our lattice using
 > > our bigram model. All files seem to be in the correct format, so far as
 > > I can tell. However, when lattice-tool rescores the lattice, all the
 > > newly added LM probabilities "l=..." come out as "-inf". I tried 1-best
 > > decoding using viterbi on the rescored lattice and the output is simply:
 > >
 > > lattice.out </s>
 > >
 > > where lattice.out is the utterance name inserted by lattice-tool.
 > >
 > > Do you have any idea why we're experiencing behaviour like this? Can you
 > > suggest any alterations?
 >
 > John,
 >
 > the problem is that your lattices use double-quotes around the word
 > strings, but the released version of SRILM doesn't yet implement the HTK
 > quoting mechanism (an oversight on my part).
 >
 > You can replace the file lattice/src/HTKLattice.cc with the attached
 > version and rebuild lattice-tool to make it work. Or, you can just strip
 > the double quotes in your lattice files and keep using the old software.
 >


From thomae at ei.tum.de  Fri Mar  5 05:26:00 2004
From: thomae at ei.tum.de (Matthias Thomae)
Date: Fri, 05 Mar 2004 14:26:00 +0100
Subject: make-ngram-pfsg: bad results with new gawk version
In-Reply-To: <200403042048.MAA09526@huge>
References: <200403042048.MAA09526@huge>
Message-ID: <40487FE8.3020708@ei.tum.de>

Hi Andreas,

Andreas Stolcke wrote:
> This is quite odd.

I think so, too :)

> make-ngram-pfsg doesn't perform much arithmetic on the log probabilities
> in the LM.  It only scales and rounds them.
> 
> Can you apply the scale_log() function in make-ngram-pfsg to your LM
> probabilities and backoff weights, and extract the cases where the output
> differs?

old awk:
	add_trans BO  -> </s> -0.314718
	scale_log(prob) = -7247
	add_trans <s> -> BO  -2.596963
	scale_log(prob) = -59800

new awk:
	logscale = 23027
	add_trans BO  -> </s> -0.314718
	scale_log(prob) = 0
	add_trans <s> -> BO  -2.596963
	scale_log(prob) = -46054

Note that I printed the logscale which seems to be correct.
...
I think I found the problem:

The float log-probs (x) seem to be converted to integers when 
multiplying them with the logscale:

function scale_log(x) {
	return rint(x * logscale);
}

This seems to be related to the locale settings
http://mail.gnu.org/archive/html/bug-gnu-utils/2002-07/msg00196.html

If I set LC_ALL="C" in my shell, it also works as expected. So the bad 
behaviour seems to occur with gawk 3.1.3 AND LC_ALL=""...
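
The numbers above can be reproduced with a small Python sketch.  It assumes
(this is a guess at the mechanism, not something taken from the gawk source)
that under the non-"C" locale the number parser stops at the "." because it
expects a different decimal separator; parse_stopping_at_point is a
hypothetical stand-in for that behaviour, and LOGSCALE is simply the value
printed above.

```python
LOGSCALE = 23027  # value printed by the script in the message above

def scale_log(x):
    """Scale a log10 probability and round it to the nearest integer."""
    return round(x * LOGSCALE)

def parse_stopping_at_point(text):
    """Hypothetical model of the misparse: read the number only up to '.'."""
    return float(text.split(".")[0])

for text in ("-0.314718", "-2.596963"):
    expected = scale_log(float(text))                  # old awk behaviour
    observed = scale_log(parse_stopping_at_point(text))  # new awk behaviour
    print(text, expected, observed)
```

Under that assumption the sketch gives -7247 and -59800 for correct parsing,
and 0 and -46054 when the fractional part is lost, matching both listings.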


Regards.
Matthias


> --Andreas
> 
> In message <40475599.9070700 at ei.tum.de>you wrote:
> 
>>Hello again,
>>
>>forgot to say that I tested this with srilm 1.3.3 and 1.3.1.
>>
>>Matthias
>>
>>Matthias Thomae wrote:
>>
>>>Hello Andreas,
>>>
>>>make-ngram-pfsg gives me different results with different versions of 
>>>gawk. The header and the links are the same, but the weights differ 
>>>substantially.
>>>
>>>I see the old behaviour with gawk 3.1.0 (on debian) and 3.1.1 (on suse), 
>>>and the differing one with 3.1.3-1 and 3.1.3-2 (on debian). The newly 
>>>created PFSGs cause some ASR error degradation...
>>>
>>>Any clues?
>>>
>>>Regards.
>>>Matthias
>>
> 
> 


From stolcke at speech.sri.com  Fri Mar  5 07:52:32 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 05 Mar 2004 07:52:32 PST
Subject: make-ngram-pfsg: bad results with new gawk version 
In-Reply-To: Your message of Fri, 05 Mar 2004 14:26:00 +0100.
             <40487FE8.3020708@ei.tum.de> 
Message-ID: <200403051552.HAA01764@huge>


Thanks for tracking this down.  I'll add a note somewhere that one had better
set LC_NUMERIC=C or LC_ALL=C for gawk scripts to do proper arithmetic.

--Andreas



From stolcke at speech.sri.com  Fri Mar  5 08:24:38 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 05 Mar 2004 08:24:38 PST
Subject: decode lattice 
In-Reply-To: Your message of 05 Mar 2004 11:53:30 +0000.
             <20040305115330.newington+john@force9> 
Message-ID: <200403051624.IAA04571@huge>


In message <20040305115330.newington+john at force9>you wrote:
> 
> Andreas,
> 
> I changed our lattice files so that the words were not enclosed in double
> quotes. This fixed the initial problem and enabled me to get an output from
> lattice-tool. However, I then realised that I needed to scale the output
> from my classifier by subtracting the log prior probabilities for each
> class before building the lattice. Now, when I try the rescaling and
> decoding using lattice-tool it predicts the same (low frequency) label for
> almost every token.

I'm a little confused by your description.  I gather you have a classifier that
operates on word hypotheses and outputs posterior probabilities, which you
scale by the priors to obtain pseudo-likelihoods, giving you your acoustic 
scores.  That part sounds reasonable (correct me if I got it wrong).

Does the unigram LM you are using encode the priors ?

What do you mean by "token" in your last sentence? 

> 
> Am I wrong to scale my 'acoustic' probabilities before building the lattice?
> Does lattice-tool do this for me when I call:
> 
> ./lattice-tool -read-htk -in-lattice lattice.slf -write-htk \
>     -out-lattice lattice.out -lm DAgrammar -no-nulls

lattice-tool only performs global scaling of the scores in the lattice.
By default the scores are interpreted as being natural logs (base e).
If you add a header field

	base=B

then scores are taken to be logs base B.
So, if your acoustic scores are not natural logs you should either convert
them, or insert the "base=" spec in the lattice header.
(You can also use straight probabilities as scores by setting base=0.)

The default log base for output lattices is 10 (so that LM scores can 
be more easily inspected and compared to LM files),
so the header of an output lattice will contain "base=10"
regardless of the input.  However, you can change that with the "-htk-logbase"
option.  That won't change your result, though, because when the lattice
is read back in everything is converted to log base 10 internally.
The important thing is that the acoustic scores have the right base 
in the original lattice so that the LM scores generated by rescoring
are compatible.
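
If you prefer to convert the scores yourself rather than set "base=" in the
header, the conversion is a single multiplication.  A minimal Python sketch
(the function name is hypothetical, and the base=0 case of straight
probabilities is not handled):

```python
import math

def convert_log_base(score, from_base, to_base):
    """Re-express a log score given in one base as a log score in another.

    Uses log_to(p) = log_from(p) * ln(from_base) / ln(to_base).
    """
    return score * math.log(from_base) / math.log(to_base)

# A probability of 0.01 stored as a natural log, converted to base 10.
natural = math.log(0.01)
print(convert_log_base(natural, math.e, 10))
```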

When you decode from the lattice (lattice-tool -viterbi-decode) you can
choose to scale the acoustic and LM scores differently to give different weights
to these knowledge sources.  This is controlled by the options

 -htk-acscale
 -htk-lmscale

So you might want to play with those.
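
To see why the two scales matter, here is a small Python sketch.  It assumes
the scales act as linear weights on the log scores of a path, which is the
usual convention but is stated here as an assumption rather than read from
the lattice-tool source; the function names and the scores are made up.

```python
def combined_score(acoustic, lm, acscale=1.0, lmscale=1.0):
    """Weighted sum of the acoustic and LM log scores of one path."""
    return acscale * acoustic + lmscale * lm

def best_label(paths, acscale=1.0, lmscale=1.0):
    """paths: list of (label, acoustic_log_score, lm_log_score) tuples."""
    best = max(paths,
               key=lambda p: combined_score(p[1], p[2], acscale, lmscale))
    return best[0]

# Two hypothetical competing paths.
paths = [("A", -5.0, -1.0), ("B", -3.0, -4.0)]
print(best_label(paths))               # LM and acoustics weighted equally
print(best_label(paths, lmscale=0.2))  # LM weight reduced
```

With equal weights the LM-favoured path "A" wins; with the LM weight reduced
to 0.2 the acoustically better path "B" wins.  A decoder that outputs the
same label everywhere may be a sign that one knowledge source is swamping
the other.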

--Andreas 


From thomae at ei.tum.de  Mon Mar  8 02:46:23 2004
From: thomae at ei.tum.de (Matthias Thomae)
Date: Mon, 08 Mar 2004 11:46:23 +0100
Subject: make-ngram-pfsg: bad results with new gawk version
In-Reply-To: <200403051552.HAA01764@huge>
References: <200403051552.HAA01764@huge>
Message-ID: <404C4EFF.9020507@ei.tum.de>

Andreas Stolcke wrote:
> Thanks for tracking this down.

You're welcome.

>  I'll add a note somewhere that one better
> set LC_NUMERIC=C or LC_ALL=C for gawk scripts to do proper arithmetic.

Good. Maybe you would even want to set LC_NUMERIC temporarily (from a 
wrapper script?) or print a warning if it is not set to "C".

Regards.
Matthias


From stolcke at speech.sri.com  Fri Mar 12 09:03:09 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 12 Mar 2004 09:03:09 PST
Subject: question about SRILM 
In-Reply-To: Your message of Fri, 12 Mar 2004 16:43:36 +0100.
             <4051DAA8.5080700@irisa.fr> 
Message-ID: <200403121703.JAA04805@huge>


In message <4051DAA8.5080700 at irisa.fr>you wrote:
> Hi.
> I have one question about SRILM. I don't understand how is computed the 
> log-probability of an unigram.
> Isn't it log[P(w)] = log[c(w)] - log[|V|], where c(w) is the frequency 
> of the word w in the training set and |V| the size of the vocabulary ?
> And, if this formula is used, are the tokens <s> and </s> considered to 
> be part of the vocabulary or not (i.e. are they counted in |V| ?) ?
> 
> Thank you for answering.
> Solen Quiniou.
> 

The formula for unigram probabilities (modulo smoothing) is 

	log[P(w)] = log[c(w)] - log[N]

where N is the number of word TOKENS in the training corpus (not the 
vocabulary).

End-of-sentence tags are included in the count, since they are among the
events that are predicted by the LM, but beginning-of-sentence tags are not.
You will notice that the log probability of <s> is set to -99 (a
stand-in for minus infinity).
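
As an illustration, the unsmoothed estimate can be written out in Python.
This is a sketch of the formula only, with a hypothetical function name; it
ignores smoothing, so its values will not match what ngram-count writes.

```python
import math
from collections import Counter

def unigram_logprobs(sentences):
    """Unsmoothed log10 unigram probabilities from tokenized sentences.

    </s> is counted once per sentence; <s> is not counted at all and gets
    the -99 stand-in for minus infinity.
    """
    counts = Counter()
    for words in sentences:
        counts.update(words)
        counts["</s>"] += 1
    total = sum(counts.values())  # N: number of tokens, including </s>
    logprobs = {w: math.log10(c / total) for w, c in counts.items()}
    logprobs["<s>"] = -99.0
    return logprobs

# Two sentences, "a b" and "a": N = 5 tokens (a, b, </s>, a, </s>).
print(unigram_logprobs([["a", "b"], ["a"]]))
```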

--Andreas 

PS. Please send your questions to "srilm-user at speech.sri.com" in the 
future.


From solen.quiniou at irisa.fr  Mon Mar 15 00:36:49 2004
From: solen.quiniou at irisa.fr (Solen Quiniou)
Date: Mon, 15 Mar 2004 09:36:49 +0100
Subject: singleton counts warning
Message-ID: <40556B21.8080706@irisa.fr>

Hi !
I use SRILM to build a language model on letters. I get a warning that 
I don't understand: "warning: no singleton counts
GT discounting disabled"
So the computed model is wrong, since some back-off weights are positive 
(in log-probability)! Do you know what this warning means? I 
thought no counts on single letters were being computed, but they were, so I 
can't find an explanation!

I've got another question, about the computation of the unigram 
log-probability. When I use the formula log[P(w)] = log[c(w)] - 
log[N], where N is the number of word TOKENS in the training corpus, I 
don't get exactly the value given by SRILM. Is there smoothing on 
unigrams? And if so, how is it done?

Thank you for answering.

Solen.


From stolcke at speech.sri.com  Mon Mar 15 16:24:53 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 15 Mar 2004 16:24:53 PST
Subject: singleton counts warning 
In-Reply-To: Your message of Mon, 15 Mar 2004 09:36:49 +0100.
             <40556B21.8080706@irisa.fr> 
Message-ID: <200403160024.QAA22852@huge>


In message <40556B21.8080706 at irisa.fr>you wrote:
> Hi !
> I use SRILM to build a language model on letters. I get a warning that 
> I don't understand: "warning: no singleton counts
> GT discounting disabled"
> So the computed model is wrong, since some back-off weights are positive 
> (in log-probability)! Do you know what this warning means? I 
> thought no counts on single letters were being computed, but they were, so I 
> can't find an explanation!

GT (and also KN) discounting needs the number of words that appear only 
once (singletons) in the training corpus.  If that number is 0, the 
discounting formulae for those methods cannot be applied.

Please try applying a different smoothing method, such as 
Witten-Bell, to your letter LM, at least for the unigrams.
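
You can check for this condition before training by computing the
count-of-counts yourself.  A minimal Python sketch with hypothetical helper
names; for reference, the textbook Good-Turing adjusted count is
r* = (r+1) n_{r+1} / n_r, where n_r is the number of distinct events seen
exactly r times (the exact variant in Discount.cc may differ).

```python
from collections import Counter

def count_of_counts(counts):
    """Map r -> number of distinct events that occur exactly r times."""
    return Counter(counts.values())

def has_singletons(counts):
    """True if at least one event occurs exactly once (n_1 > 0)."""
    return count_of_counts(counts)[1] > 0

# Letter unigrams: with enough text every letter occurs more than once.
print(has_singletons(Counter("abracadabra")))  # c and d occur once
print(has_singletons(Counter("aabbcc")))       # no letter occurs once
```

With a small alphabet and a reasonably large corpus, n_1 = 0 for unigrams is
the normal case, which is why the warning shows up for letter models.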

> 
> I've got another question, about the computation of the unigram 
> log-probability. When I use the formula log[P(w)] = log[c(w)] - 
> log[N], where N is the number of word TOKENS in the training corpus, I 
> don't get exactly the value given by SRILM. Is there smoothing on 
> unigrams? And if so, how is it done?

Of course there is smoothing.  I don't have time to elaborate on
the different smoothing algorithms implemented in SRILM, but you can either
study the code in Discount.cc, or refer to the excellent survey paper 
by Chen & Goodman (SEE ALSO section of the ngram-count(1) man page).

--Andreas 


From solen.quiniou at irisa.fr  Thu Mar 18 02:52:02 2004
From: solen.quiniou at irisa.fr (Solen Quiniou)
Date: Thu, 18 Mar 2004 11:52:02 +0100
Subject: positive backoff weight
Message-ID: <40597F52.4050803@irisa.fr>

Thank you for the past answers to my questions.

I've got another question. Sometimes, when I use Good-Turing 
discounting, some of the unigram backoff weights (I compute a 
bigram model) are positive log-probabilities. How is this possible? Is it 
because Good-Turing discounting is disabled on unigrams since there are 
no unigrams whose frequency is 1? And, more generally, how are 
backoff weights for unigrams computed in the case of a bigram model?

Thanks a lot for your answers.
Solen.


From solen.quiniou at irisa.fr  Thu Mar 25 09:21:49 2004
From: solen.quiniou at irisa.fr (Solen Quiniou)
Date: Thu, 25 Mar 2004 18:21:49 +0100
Subject: pfsg-format
Message-ID: <4063152D.3060201@irisa.fr>

Hi !
I've got one question about the pfsg format : is the transition cost, 
between 2 states, considered to be 10000.5 times the log-probability of 
the bigram corresponding to the 2 states ?
Because, when I use a language model made from an ARPA file (by using 
the NgramLM class) to compute the probability of a word (my language 
model is based on letters) and when I use a language model made from a 
PFSG file (I convert the ARPA thanks to the make-ngram-pfsg script and 
then by using the LatticeLM class), I don't have the same 
log-probability from both representations. Why is there a difference ? 
Since I convert the ARPA file into a PFSG file, it should be the same.

Thanks for answering.

Solen.


From stolcke at speech.sri.com  Thu Mar 25 09:33:16 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 25 Mar 2004 09:33:16 PST
Subject: pfsg-format 
In-Reply-To: Your message of Thu, 25 Mar 2004 18:21:49 +0100.
             <4063152D.3060201@irisa.fr> 
Message-ID: <200403251733.JAA19358@huge>


In message <4063152D.3060201 at irisa.fr>you wrote:
> Hi !
> I've got one question about the pfsg format : is the transition cost, 
> between 2 states, considered to be 10000.5 times the log-probability of 
> the bigram corresponding to the 2 states ?

correct.

> Because, when I use a language model made from an ARPA file (by using 
> the NgramLM class) to compute the probability of a word (my language 
> model is based on letters) and when I use a language model made from a 
> PFSG file (I convert the ARPA thanks to the make-ngram-pfsg script and 
> then by using the LatticeLM class), I don't have the same 
> log-probability from both representations. Why is there a difference ? 
> Since I convert the ARPA file into a PFSG file, it should be the same.

How big are the differences?  There will be some discrepancy due to
rounding the scaled log probabilities to integers, but it should 
be a small error.
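
To get a feeling for the size of the rounding error, here is a rough Python
sketch.  It assumes the ARPA file holds log10 probabilities and that the
factor of 10000.5 is applied after converting to natural logs, which is
consistent with the logscale of 23027 reported in the make-ngram-pfsg
thread; the function names are hypothetical.

```python
import math

# Assumed scale: log10 -> natural log, then times 10000.5 (about 23027).
LOGSCALE = math.log(10) * 10000.5

def to_pfsg_weight(log10_prob):
    """Scale a log10 probability and round it to an integer weight."""
    return round(log10_prob * LOGSCALE)

def from_pfsg_weight(weight):
    """Map an integer weight back to an approximate log10 probability."""
    return weight / LOGSCALE

def roundtrip_error(log10_prob):
    """Absolute error (in log10 units) caused by the integer rounding."""
    return abs(from_pfsg_weight(to_pfsg_weight(log10_prob)) - log10_prob)

print(roundtrip_error(-2.596963), 0.5 / LOGSCALE)
```

Under these assumptions each transition is off by at most 0.5 / LOGSCALE,
roughly 2e-5 in log10 units, so anything much larger than that per
transition must have another cause.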

--Andreas 


From stolcke at speech.sri.com  Thu Mar 25 09:58:57 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 25 Mar 2004 09:58:57 PST
Subject: pfsg-format 
In-Reply-To: Your message of Thu, 25 Mar 2004 09:52:30 -0800.
             <C7C4A42E1B9A2740B0BDF39D9BB22B4B037C2019@RED-MSG-30.redmond.corp.microsoft.com> 
Message-ID: <200403251758.JAA21661@huge>


Ciprian raises a good point.  Before comparing results you should
process the LM with ngram -prune-lowprobs. (Otherwise the PFSG may not 
be an accurate representation of the LM.)

--Andreas

In message <C7C4A42E1B9A2740B0BDF39D9BB22B4B037C2019 at RED-MSG-30.redmond.corp.microsoft.com>you wrote:
> Hi Andreas,
> 
> I am following these threads since they sometimes contain useful
> information.
> 
> > > Because, when I use a language model made from an ARPA file (by using
> > > the NgramLM class) to compute the probability of a word (my language
> > > model is based on letters) and when I use a language model made from a
> > > PFSG file (I convert the ARPA thanks to the make-ngram-pfsg script and
> > > then by using the LatticeLM class), I don't have the same
> > > log-probability from both representations. Why is there a difference ?
> > > Since I convert the ARPA file into a PFSG file, it should be the same.
> > 
> > How big are the differences?  there will be some discrepancy due to
> > rounding the scaled log probabilities to an integer, but it should
> > be a small error.
> 
> [Ciprian] I assume PFSG is Probabilistic Finite State Grammar. I do not
> know how exactly the conversion is done in the SRILM toolkit, but the
> difference could also come from the standard hack used in representing
> ARPA back-off models in FSM format --- having a common back-off state
> that forgets what higher order n-gram state we arrived there from. Am I
> wrong?
> 
> -Ciprian
> 


From barhaim at cs.technion.ac.il  Tue Mar 30 07:49:43 2004
From: barhaim at cs.technion.ac.il (Roy Bar Haim)
Date: Tue, 30 Mar 2004 17:49:43 +0200
Subject: Disambig n-best scores
Message-ID: <009501c4166e$a0b50cd0$34284484@cs.technion.ac.il>

Hi,

How is path score in disambig with n-best option calculated?

For example, suppose that I have the sentence:

W1 W2 
Which is tagged with T1 T2

Then I calculated the path probability as follows:

Log10 [ P(T1|<s>)*P(T2|T1)*P(</s>|T2)*P(W1|T1)*P(W2|T2) ]

I got it "almost right" . I checked for two paths:
For one I got -20.549 (while disambig returned -120.549)
For the other I got -20.837 (while disambig returned -120.837)

What is the reason for this difference? Should I always ignore the "1"
after the "-"?
Thanks,
Roy.


From stolcke at speech.sri.com  Tue Mar 30 15:58:02 2004
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 30 Mar 2004 15:58:02 PST
Subject: Disambig n-best scores 
In-Reply-To: Your message of Tue, 30 Mar 2004 17:49:43 +0200.
             <009501c4166e$a0b50cd0$34284484@cs.technion.ac.il> 
Message-ID: <200403302358.i2UNw3Z02903@conga.speech.sri.com>


In message <009501c4166e$a0b50cd0$34284484 at cs.technion.ac.il>you wrote:
> Hi,
> 
> How is path score in disambig with n-best option calculated?
> 
> For example, suppose that I have the sentence:
> 
> W1 W2 
> Which is tagged with T1 T2
> 
> Then I calculated the path probability as follows:
> 
> Log10 [ P(T1|<s>)*P(T2|T1)*P(</s>|T2)*P(W1|T1)*P(W2|T2) ]
> 
> I got it "almost right" . I checked for two paths:
> For one I got -20.549 (while disambig returned -120.549)
> For the other I got -20.837 (while disambig returned -120.837)
> 
> What is the reason for this difference? Should I always ignore the "1"
> after the "-"?

The -100 comes from an OOV word.  When the LM returns a probability of 0
AND the word is not in the LM, it is considered an OOV.  To allow the 
probability computation to go on, a large negative but finite log probability
of -100 is substituted (cf. the constant LogP_PseudoZero in disambig.cc).
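
In other words, the score is the usual sum of log10 probabilities with one
term replaced.  A small Python sketch of that arithmetic (path_score is a
hypothetical helper, not the disambig code, and the individual terms are
made up so that the in-vocabulary part sums to the -20.549 from your
example):

```python
import math

LOGP_PSEUDO_ZERO = -100.0  # value quoted above for LogP_PseudoZero

def path_score(log10_probs):
    """Sum log10 probabilities, substituting -100 for log(0) terms."""
    total = 0.0
    for lp in log10_probs:
        total += LOGP_PSEUDO_ZERO if lp == float("-inf") else lp
    return total

# Made-up terms summing to -20.549, plus one OOV term with probability 0.
print(path_score([-8.0, -12.549, float("-inf")]))
```

So the leading "1" should not be ignored: it means that one event on the
path was treated as having zero probability.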

--Andreas