From lambert at jhu.edu Fri Jul 8 08:36:50 2005 From: lambert at jhu.edu (lambert mathias) Date: Fri, 08 Jul 2005 11:36:50 -0400 Subject: Modifying ngram.cc In-Reply-To: <200506242323.j5ONNU022974@huge> Message-ID: Hi, This is kind of a C++ question I wrote the following copy constructor in Ngram.h Ngram(const Ngram &ng):LM(ng.vocab),contexts(ng.contexts),order(ng.order),_skipOOVs(ng._skipOO Vs),_trustTotals(ng._trustTotals){}; I have the following declarations Ngram ngramLM(vocab*,order); While(//) Ngram useLM(ngramLM); // Do some stuff with useLM ......... ........ } The problem is that the assignment useLM=ngramLM doesn't assign the original ngramLM. Any changes I make to useLM shows up in ngramLM too. I just want to make a copy (useLM) of the original ngramLM, work on that copy and then reinitialize another useLM with the original ngramLM. Any thoughts? Lambert >> . 1 >> /PT 1 >> >> Why is the slash considered as part of the tag? > > The / in front of a token signifies that it's a tag, as opposed to a > word. It's just a way to encode word/tags, as well as > word and tags individually, without ambiguity. > >> >> b) as can be seen in the example, the n-grams with tags are only built >> left-to-right, e.g. there is no bigram "la /N5", as I would have expected >> (and needed). > > The program collects only those N-gram statistics that are required > by the underlying model. Since the goal is to use the tags in backoff > the statistics needed are asymmetrical. > > If you want a different set of N-grams you can probably write a simple > perl script to do the job. > > --Andreas > From stolcke at speech.sri.com Fri Jul 8 15:33:07 2005 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Fri, 08 Jul 2005 15:33:07 PDT Subject: Modifying ngram.cc In-Reply-To: Your message of Fri, 08 Jul 2005 11:36:50 -0400. Message-ID: <200507082233.PAA27190@tonga> In message you wrote: > Hi, > > This is kind of a C++ question > I wrote the following copy constructor in Ngram.h > Ngram(const Ngram > &ng):LM(ng.vocab),contexts(ng.contexts),order(ng.order),_skipOOVs(ng._skipOO > Vs),_trustTotals(ng._trustTotals){}; > > I have the following declarations > > Ngram ngramLM(vocab*,order); > > While(//) > Ngram useLM(ngramLM); > > // Do some stuff with useLM > ......... > ........ > } > > The problem is that the assignment useLM=ngramLM doesn't assign the original > ngramLM. Any changes I make to useLM shows up in ngramLM too. > I just want to make a copy (useLM) of the original ngramLM, work on that > copy and then reinitialize another useLM with the original ngramLM. > > Any thoughts? That's because there is not copy constructor defined for the Trie (and all component classes) class. So what happens is that pointers to tries are copied, and the datastructures end up shared between orignal and copy. Adding these is not difficult, just tedious, and wasn't needed up to this point. --Andreas > From stolcke at speech.sri.com Sun Jul 17 17:47:57 2005 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Sun, 17 Jul 2005 17:47:57 PDT Subject: Modifying ngram.cc In-Reply-To: Your message of Fri, 08 Jul 2005 11:36:50 -0400. Message-ID: <200507180047.j6I0lvd29146@huge> The problem was that copy constructors (and assignment operators) for the container data structures (Array, SArray, LHash) weren't implemented. I fixed that in the current development version. No changes to any of the higher-level classes (like Ngram) is needed, as C++ automatically (and recursively) defines copy constructors based on the member copy constructors. You can either replace the attached source files (in dstruct/src), or download the "1.4.5 (Beta)" version from the web page. --Andreas In message you wrote: > Hi, > > This is kind of a C++ question > I wrote the following copy constructor in Ngram.h > Ngram(const Ngram > &ng):LM(ng.vocab),contexts(ng.contexts),order(ng.order),_skipOOVs(ng._skipOO > Vs),_trustTotals(ng._trustTotals){}; > > I have the following declarations > > Ngram ngramLM(vocab*,order); > > While(//) > Ngram useLM(ngramLM); > > // Do some stuff with useLM > ......... > ........ > } > > The problem is that the assignment useLM=ngramLM doesn't assign the original > ngramLM. Any changes I make to useLM shows up in ngramLM too. > I just want to make a copy (useLM) of the original ngramLM, work on that > copy and then reinitialize another useLM with the original ngramLM. > > Any thoughts? > > > Lambert > -------------- next part -------------- /* * Array.h -- * Extensible array class * * Copyright (c) 1995-2005 SRI International. All Rights Reserved. * * @(#)$Header: /home/srilm/devel/dstruct/src/RCS/Array.h,v 1.11 2005/07/17 22:12:21 stolcke Exp $ * */ #ifndef _Array_h_ #define _Array_h_ #include #include "MemStats.h" template class Array { public: Array(int base = 0, unsigned int size = 0) : _base(base), _size(size), _data(0), alloc_size(0) { if (size > 0) { alloc(size-1); } }; Array(Array &source) : _base(source._base), _size(0), _data(0), alloc_size(0) { *this = source; } ~Array() { delete [] _data; } ; DataT &operator[](int index) { int offset = index - _base; assert(offset >= 0); if (offset >= _size) { _size = offset + 1; if (offset >= alloc_size) { alloc(offset); } } return _data[offset]; }; Array & operator= (const Array &other); DataT *data() const { return _data - _base; }; int base() const { return _base; } unsigned int size() const { return _size; } void memStats(MemStats &stats) const; private: unsigned int _size; /* used size */ int _base; DataT *_data; unsigned int alloc_size; /* allocated size */ void alloc(unsigned int size); }; #endif /* _Array_h_ */ -------------- next part -------------- /* * LHash.cc -- * Linear search hash table implementation * */ #ifndef _LHash_cc_ #define _LHash_cc_ #ifndef lint static char LHash_Copyright[] = "Copyright (c) 1995-2005 SRI International. All Rights Reserved."; static char LHash_RcsId[] = "@(#)$Header: /home/srilm/devel/dstruct/src/RCS/LHash.cc,v 1.46 2005/07/17 22:01:31 stolcke Exp $"; #endif #include #include #include #include #include #include "LHash.h" #undef INSTANTIATE_LHASH #define INSTANTIATE_LHASH(KeyT, DataT) \ template <> DataT *LHash< KeyT, DataT >::removedData = 0; \ template class LHash< KeyT, DataT >; \ template class LHashIter< KeyT, DataT > #ifndef __GNUG__ template DataT *LHash::removedData = 0; #endif /* __GNUG__ */ const unsigned minHashBits = 3; /* minimum no. bits for hashing * tables smaller than this use linear * search to save space */ const float fillRatio = 0.8; /* fill ration at which the table is * expanded and rehashed */ #define BODY(b) ((LHashBody *)b) /* * Dump the entire hash array to cerr. Unused slots are printed as "FREE". */ template void LHash::dump() const { if (body) { unsigned maxEntries = hashSize(BODY(body)->maxBits); unsigned i; for (i = 0; i < maxEntries; i++) { cerr << " " <data[i].key)) { cerr << "FREE"; } else { cerr << BODY(body)->data[i].key #ifdef DUMP_VALUES /* * Only certain types can be printed. */ << "->" << BODY(body)->data[i].value #endif /* DUMP_VALUES */ ; } } } else { cerr << "EMPTY"; } cerr << endl; } template void LHash::memStats(MemStats &stats) const { stats.total += sizeof(*this); if (body) { unsigned maxEntries = hashSize(BODY(body)->maxBits); stats.total += sizeof(*BODY(body)) + sizeof(BODY(body)->data[0]) * (maxEntries - 1); stats.wasted += sizeof(BODY(body)->data[0]) * (maxEntries - BODY(body)->nEntries); } } /* * Compute the actual minimum size required for a given number of entries */ static inline unsigned roundSize(unsigned size) { if (size < hashSize(minHashBits)) { return size; } else { return (unsigned)((size + 1)/ fillRatio); } } template void LHash::alloc(unsigned size) { unsigned maxBits, maxEntries; unsigned i; /* * round up to power of two */ maxBits = 0; while (hashSize(maxBits) < size) { assert(maxBits < LHash_maxBitLimit); maxBits++; } maxEntries = hashSize(maxBits); //cerr << "LHash::alloc: current " << (body ? BODY(body)->nEntries : 0) // << ", requested " << size // << ", allocating " << maxEntries << " (" << maxBits << ")\n"; body = (LHashBody *)malloc(sizeof(*BODY(body)) + (maxEntries - 1) * sizeof(BODY(body)->data[0])); assert(body != 0); BODY(body)->maxBits = maxBits; BODY(body)->nEntries = 0; for (i = 0; i < maxEntries; i++) { Map_noKey(BODY(body)->data[i].key); } //cerr << "memory for header = " << // ((void *)&(BODY(body)->data[0]) - (void *)BODY(body)) << endl; //cerr << "memory per entry = " << // ((void *)&(BODY(body)->data[3]) - (void *)&(BODY(body)->data[2])) << endl; } template LHash::LHash(unsigned size) : body(0) { if (size != 0) { /* * determine actual initial size */ alloc(roundSize(size)); } } template void LHash::clear(unsigned size) { if (body) { unsigned maxEntries = hashSize(BODY(body)->maxBits); unsigned i; for (i = 0; i < maxEntries; i++) { if (! Map_noKeyP(BODY(body)->data[i].key)) { Map_freeKey(BODY(body)->data[i].key); } } free(body); body = 0; } if (size != 0) { alloc(roundSize(size)); } } template LHash::~LHash() { clear(0); } template LHash::LHash(const LHash &source) : body(0) { #ifdef DEBUG cerr << "warning: LHash copy constructor called\n"; #endif *this = source; } template LHash & LHash::operator= (const LHash &other) { #ifdef DEBUG cerr << "warning: LHash::operator= called\n"; #endif if (&other == this) { return *this; } /* * copy hash entries from old to new */ if (other.body) { unsigned maxEntries = hashSize(BODY(other.body)->maxBits); clear(maxEntries); for (unsigned i = 0; i < maxEntries; i++) { KeyT thisKey = BODY(other.body)->data[i].key; if (!Map_noKeyP(thisKey)) { /* * Copy key */ BODY(body)->data[i].key = Map_copyKey(thisKey); /* * Initialize data, required for = operator */ new (&(BODY(body)->data[i].value)) DataT; /* * Copy data */ BODY(body)->data[i].value = BODY(other.body)->data[i].value; } } BODY(body)->nEntries = BODY(other.body)->nEntries; } else { clear(0); } return *this; } /* * Returns index into data array that has the key which is either * equal to key, or indicates the place where key would be placed if it * occurred in the array. */ template Boolean LHash::locate(KeyT key, unsigned &index) const { assert(!Map_noKeyP(key)); if (body) { unsigned maxBits = BODY(body)->maxBits; register MapEntry *data = BODY(body)->data; if (maxBits < minHashBits) { /* * Do a linear search */ unsigned nEntries = BODY(body)->nEntries; register unsigned i; for (i = 0; i < nEntries; i++) { if (LHash_equalKey(data[i].key, key)) { index = i; return true; } } index = i; return false; } else { /* * Do a hashed search */ unsigned hash = LHash_hashKey(key, maxBits); unsigned i; for (i = hash; ; i = (i + 1) & hashMask(maxBits)) { if (Map_noKeyP(data[i].key)) { index = i; return false; } else if (LHash_equalKey(data[i].key, key)) { index = i; return true; } } } } else { return false; } } template DataT * LHash::find(KeyT key, Boolean &foundP) const { unsigned index; if (foundP = locate(key, index)) { return &(BODY(body)->data[index].value); } else { return 0; } } template KeyT LHash::getInternalKey(KeyT key, Boolean &foundP) const { unsigned index; static KeyT zeroKey; if (foundP = locate(key, index)) { return BODY(body)->data[index].key; } else { return zeroKey; } } template DataT * LHash::insert(KeyT key, Boolean &foundP) { unsigned index; assert(!(Map_noKeyP(key))); /* * Make sure there is room for at least one entry */ if (body == 0) { alloc(1); } if (foundP = locate(key, index)) { return &(BODY(body)->data[index].value); } else { unsigned maxEntries = hashSize(BODY(body)->maxBits); unsigned nEntries = BODY(body)->nEntries; /* * Rehash table if necessary */ unsigned minSize = roundSize(nEntries + 1); if (minSize > maxEntries) { LHashBody *oldBody = BODY(body); unsigned i; /* * Since LHash_maxEntriesLimit is a power of two minus 1 * we need to check this only when the array is enlarged */ assert(nEntries < LHash_maxEntriesLimit); alloc(minSize); BODY(body)->nEntries = nEntries; if (BODY(body)->maxBits < minHashBits) { /* * Just copy old entries to new storage, no reindexing * required. */ memcpy(BODY(body)->data, oldBody->data, sizeof(oldBody->data[0]) * nEntries); } else { /* * Rehash */ for (i = 0; i < maxEntries; i++) { KeyT key = oldBody->data[i].key; if (! Map_noKeyP(key)) { (void)locate(key, index); memcpy(&(BODY(body)->data[index]), &(oldBody->data[i]), sizeof(oldBody->data[0])); } } } free(oldBody); /* * Entries have been moved, so have to re-locate key */ (void)locate(key, index); } BODY(body)->data[index].key = Map_copyKey(key); /* * Initialize data to zero, but also call constructors, if any */ memset(&(BODY(body)->data[index].value), 0, sizeof(BODY(body)->data[index].value)); new (&(BODY(body)->data[index].value)) DataT; BODY(body)->nEntries++; return &(BODY(body)->data[index].value); } } template DataT * LHash::remove(KeyT key, Boolean &foundP) { unsigned index; /* * Allocate pseudo-static memory for removed objects (returned by * remove()). We cannot make this static because otherwise * the destructor for DataT would be called on it at program exit. */ if (removedData == 0) { removedData = (DataT *)malloc(sizeof(DataT)); assert(removedData != 0); } if (foundP = locate(key, index)) { Map_freeKey(BODY(body)->data[index].key); Map_noKey(BODY(body)->data[index].key); memcpy(removedData, &BODY(body)->data[index].value, sizeof(DataT)); if (BODY(body)->maxBits < minHashBits) { /* * Linear search mode -- Just move all entries above the * the one removed to fill the gap. */ unsigned nEntries = BODY(body)->nEntries; memmove(&BODY(body)->data[index], &BODY(body)->data[index+1], (nEntries - index - 1) * sizeof(BODY(body)->data[0])); Map_noKey(BODY(body)->data[nEntries - 1].key); } else { /* * The entry after the one being deleted could actually * belong to a prior spot in the table, but was bounced forward due * to a collision. The invariant used in lookup is that * all locations between the hash index and the actual index are * filled. Hence we examine all entries between the deleted * position and the next free position as whether they need to * be moved backward. */ while (1) { unsigned newIndex; index = (index + 1) & hashMask(BODY(body)->maxBits); if (Map_noKeyP(BODY(body)->data[index].key)) { break; } /* * If find returns false that means the deletion has * introduced a hole in the table that would prevent * us from finding the next entry. Luckily, find * also tells us where the hole is. We move the * entry to its rightful place and continue filling * the hole created by this move. */ if (!locate(BODY(body)->data[index].key, newIndex)) { memcpy(&(BODY(body)->data[newIndex]), &(BODY(body)->data[index]), sizeof(BODY(body)->data[0])); Map_noKey(BODY(body)->data[index].key); } } } BODY(body)->nEntries--; return removedData; } else { return 0; } } #ifdef __GNUG__ static #endif int (*LHash_thisKeyCompare)(); /* used by LHashIter() to communicate * with compareIndex() */ #ifdef __GNUG__ static #endif void *LHash_thisBody; /* ditto */ template void LHashIter::sortKeys() { /* * Store keys away and sort them to the user's orders. */ unsigned maxEntries = hashSize(myLHashBody->maxBits); unsigned *sortedIndex = new unsigned[numEntries]; assert(sortedIndex != 0); unsigned i; unsigned j = 0; for (i = 0; i < maxEntries; i++) { if (!Map_noKeyP(myLHashBody->data[i].key)) { sortedIndex[j++] = i; } } assert(j == numEntries); /* * Due to the limitations of the qsort interface we have to * pass extra information to compareIndex in these global * variables - yuck. */ LHash_thisKeyCompare = (int(*)())sortFunction; LHash_thisBody = myLHashBody; qsort(sortedIndex, numEntries, sizeof(*sortedIndex), compareIndex); /* * Save the keys for enumeration. The reason we save the keys, * not the indices, is that indices may change during enumeration * due to deletions. */ sortedKeys = new KeyT[numEntries]; assert(sortedKeys != 0); for (i = 0; i < numEntries; i++) { sortedKeys[i] = myLHashBody->data[sortedIndex[i]].key; } delete [] sortedIndex; } template LHashIter::LHashIter(const LHash &lhash, int (*keyCompare)(KeyT, KeyT)) : myLHashBody(BODY(lhash.body)), current(0), numEntries(lhash.numEntries()), sortFunction(keyCompare) { /* * Note: we access the underlying LHash through the body pointer, * not the top-level LHash itself. This allows the top-level object * to be moved without affecting an ongoing iteration. * XXX: This only works because * - it is illegal to insert while iterating * - deletions don't cause reallocation of the data */ if (sortFunction && myLHashBody) { sortKeys(); } else { sortedKeys = 0; } } /* * This is the callback function passed to qsort for comparing array * entries. It is passed the indices into the data array, which are * compared according to the user-supplied function applied to the * keys found at those locations. */ template int LHashIter::compareIndex(const void *idx1, const void *idx2) { return (*(compFnType)LHash_thisKeyCompare) (BODY(LHash_thisBody)->data[*(unsigned *)idx1].key, BODY(LHash_thisBody)->data[*(unsigned *)idx2].key); } template LHashIter::~LHashIter() { delete [] sortedKeys; } template void LHashIter::init() { delete [] sortedKeys; current = 0; { /* * XXX: fake LHash object to access numEntries() */ LHash myLHash(0); myLHash.body = myLHashBody; numEntries = myLHash.numEntries(); myLHash.body = 0; } if (sortFunction && myLHashBody) { sortKeys(); } else { sortedKeys = 0; } } template DataT * LHashIter::next(KeyT &key) { if (myLHashBody == 0) { return 0; } else { unsigned index; if (sortedKeys) { /* * Sorted enumeration -- advance to next key in sorted list */ if (current == numEntries) { return 0; } /* * XXX: fake LHash object to access locate() */ LHash myLHash(0); myLHash.body = myLHashBody; myLHash.locate(sortedKeys[current++], index); myLHash.body = 0;; } else { /* * Detect deletions by comparing old and current number of * entries. * A legal deletion can only affect the current entry, so * adjust the current position to reflect the next entry was * moved. */ if (myLHashBody->nEntries != numEntries) { assert(myLHashBody->nEntries == numEntries - 1); numEntries --; current --; } /* * Random enumeration mode */ unsigned maxBits = myLHashBody->maxBits; if (maxBits < minHashBits) { /* * Linear search mode - advance one to the next entry */ if (current == numEntries) { return 0; } } else { unsigned maxEntries = hashSize(maxBits); while (current < maxEntries && Map_noKeyP(myLHashBody->data[current].key)) { current++; } if (current == maxEntries) { return 0; } } index = current ++; } key = myLHashBody->data[index].key; return &(myLHashBody->data[index].value); } } #undef BODY #endif /* _LHash_cc_ */ -------------- next part -------------- /* * SArray.cc -- * Sorted array implementation * */ #ifndef _SArray_cc_ #define _SArray_cc_ #ifndef lint static char SArray_Copyright[] = "Copyright (c) 1995-2005 SRI International. All Rights Reserved."; static char SArray_RcsId[] = "@(#)$Header: /home/srilm/devel/dstruct/src/RCS/SArray.cc,v 1.37 2005/07/17 22:01:31 stolcke Exp $"; #endif #include #include #include #include #include #include "SArray.h" #undef INSTANTIATE_SARRAY #define INSTANTIATE_SARRAY(KeyT, DataT) \ template <> DataT *SArray< KeyT, DataT >::removedData = 0; \ template class SArray< KeyT, DataT >; \ template class SArrayIter< KeyT, DataT > #ifndef __GNUG__ template DataT *SArray::removedData = 0; #endif /* __GNUG__ */ #define BODY(b) ((SArrayBody *)b) const double growSize = 1.1; template void SArray::dump() const { if (body == 0) { cerr << "EMPTY" << endl; } else { unsigned nEntries = numEntries(); cerr << "maxEntries = " << BODY(body)->maxEntries; cerr << " nEntries = " << nEntries; for (unsigned i = 0; i < nEntries; i++) { cerr << " " <data[i].key #ifdef DUMP_VALUES /* * Only certain types can be printed. */ << "->" << BODY(body)->data[i].value #endif /* DUMP_VALUES */ ; } } } template void SArray::memStats(MemStats &stats) const { stats.total += sizeof(*this); if (body) { unsigned maxEntries = BODY(body)->maxEntries; stats.total += sizeof(*BODY(body)) + sizeof(BODY(body)->data[0]) * (maxEntries - 1); stats.wasted += sizeof(BODY(body)->data[0]) * (maxEntries - numEntries()); } } template void SArray::alloc(unsigned size) { body = (SArrayBody *)malloc(sizeof(*BODY(body)) + (size - 1) * sizeof(BODY(body)->data[0])); assert(body != 0); BODY(body)->deleted = 0; BODY(body)->maxEntries = size; /* * Fill the array with dummy keys, marking the unused space */ for (unsigned i = 0; i < size; i ++) { Map_noKey(BODY(body)->data[i].key); } } template SArray::SArray(unsigned size) : body(0) { if (size != 0) { alloc(size); } } template void SArray::clear(unsigned size) { if (body) { unsigned maxEntries = BODY(body)->maxEntries; for (unsigned i = 0; i < maxEntries; i++) { KeyT thisKey = BODY(body)->data[i].key; if (Map_noKeyP(thisKey)) { break; } Map_freeKey(thisKey); } free(body); body = 0; } if (size != 0) { alloc(size); } } template SArray::~SArray() { clear(0); } template unsigned SArray::numEntries() const { if (body == 0) { return 0; } else if (Map_noKeyP(BODY(body)->data[0].key)) { return 0; } else { /* * Do a binary search for the end of the used memory * lower always points to a filled entry * upper always points to a free entry beyond the end of used entries */ unsigned lower, upper; lower = 0; /* minimum index (inclusive) */ upper = BODY(body)->maxEntries; /* maximum index (exclusive) */ while (lower + 1 < upper) { unsigned middle = (lower + upper)/2; if (Map_noKeyP(BODY(body)->data[middle].key)) { upper = middle; } else { lower = middle; } } /* lower + 1 == upper */ return upper; } } template SArray::SArray(const SArray &source) : body(0) { #ifdef DEBUG cerr << "warning: SArray copy constructor called\n"; #endif *this = source; } template SArray & SArray::operator= (const SArray &other) { #ifdef DEBUG cerr << "warning: SArray::operator= called\n"; #endif if (&other == this) { return *this; } /* * copy array entries from old to new */ if (other.body) { unsigned maxEntries = BODY(other.body)->maxEntries; clear(maxEntries); for (unsigned i = 0; i < maxEntries; i++) { KeyT thisKey = BODY(other.body)->data[i].key; if (Map_noKeyP(thisKey)) { break; } /* * Copy key */ BODY(body)->data[i].key = Map_copyKey(thisKey); /* * Initialize data, required for = operator */ new (&(BODY(body)->data[i].value)) DataT; /* * Copy data */ BODY(body)->data[i].value = BODY(other.body)->data[i].value; } } else { clear(0); } return *this; } /* * Returns index into data array that has the key which is either * equal to key, or indicates the place where key would be placed if it * occurred in the array. */ template Boolean SArray::locate(KeyT key, unsigned &index) const { assert(!Map_noKeyP(key)); if (body) { unsigned lower, upper; lower = 0; /* minimum index (inclusive) */ upper = BODY(body)->maxEntries; /* maximum index (exclusive) */ while (lower < upper) { unsigned middle = (lower + upper)/2; KeyT thisKey = BODY(body)->data[middle].key; if (Map_noKeyP(thisKey)) { upper = middle; continue; } int compare = SArray_compareKey(key, thisKey); if (compare < 0) { upper = middle; } else if (compare > 0) { lower = middle + 1; } else { index = middle; return true; } } /* we have lower == upper */ if (lower == BODY(body)->maxEntries || Map_noKeyP(BODY(body)->data[lower].key) || SArray_compareKey(key, BODY(body)->data[lower].key) < 0) { index = lower; } else { index = lower + 1; } return false; } else { return false; } } template DataT * SArray::find(KeyT key, Boolean &foundP) const { unsigned index; if (foundP = locate(key, index)) { return &(BODY(body)->data[index].value); } else { return 0; } } template KeyT SArray::getInternalKey(KeyT key, Boolean &foundP) const { unsigned index; static KeyT zeroKey; if (foundP = locate(key, index)) { return BODY(body)->data[index].key; } else { return zeroKey; } } template DataT * SArray::insert(KeyT key, Boolean &foundP) { unsigned index; /* * Make sure there is room for at least one entry */ if (body == 0) { alloc(1); } if (foundP = locate(key, index)) { return &(BODY(body)->data[index].value); } else { unsigned nEntries = numEntries(); unsigned maxEntries = BODY(body)->maxEntries; /* * Need to add an element. * First, enlarge array if necessary */ if (nEntries == maxEntries) { maxEntries = (int)(maxEntries * growSize); if (maxEntries == nEntries) { maxEntries ++; } body = (SArrayBody *)realloc(body, sizeof(*BODY(body)) + (maxEntries - 1) * sizeof(BODY(body)->data[0])); assert(body != 0); /* * Fill new space with dummy keys */ for (unsigned i = BODY(body)->maxEntries; i < maxEntries; i ++) { Map_noKey(BODY(body)->data[i].key); } BODY(body)->maxEntries = maxEntries; } memmove(&(BODY(body)->data[index + 1]), &(BODY(body)->data[index]), (nEntries - index) * sizeof(BODY(body)->data[0])); BODY(body)->data[index].key = Map_copyKey(key); /* * Initialize data to zero, but also call constructors, if any */ memset(&(BODY(body)->data[index].value), 0, sizeof(BODY(body)->data[index].value)); new (&(BODY(body)->data[index].value)) DataT; return &(BODY(body)->data[index].value); } } template DataT * SArray::remove(KeyT key, Boolean &foundP) { unsigned index; /* * Allocate pseudo-static memory for removed objects (returned by * remove()). We cannot make this static because otherwise * the destructor for DataT would be called on it at program exit. */ if (removedData == 0) { removedData = (DataT *)malloc(sizeof(DataT)); assert(removedData != 0); } if (foundP = locate(key, index)) { unsigned nEntries = numEntries(); Map_freeKey(BODY(body)->data[index].key); memcpy(removedData, &BODY(body)->data[index].value, sizeof(DataT)); memmove(&(BODY(body)->data[index]), &(BODY(body)->data[index + 1]), (nEntries - index - 1) * sizeof(BODY(body)->data[0])); /* * mark previous last slot as free */ Map_noKey(BODY(body)->data[nEntries-1].key); /* * Indicate that deletion occurred */ BODY(body)->deleted = 1; return removedData; } else { return 0; } } #ifdef __GNUG__ static #endif int (*SArray_thisKeyCompare)(); /* used by SArrayIter() to communicate * with compareIndex() */ #ifdef __GNUG__ static #endif void *SArray_thisBody; /* ditto */ template void SArrayIter::sortKeys() { /* * Store keys away and sort them to the user's orders. */ unsigned *sortedIndex = new unsigned[numEntries]; assert(sortedIndex != 0); unsigned i; for (i = 0; i < numEntries; i++) { sortedIndex[i] = i; } /* * Due to the limitations of the qsort interface we have to * pass extra information to compareIndex in these global * variables - yuck. */ SArray_thisKeyCompare = (int (*)())sortFunction; SArray_thisBody = mySArrayBody; qsort(sortedIndex, numEntries, sizeof(*sortedIndex), compareIndex); /* * Save the keys for enumeration. The reason we save the keys, * not the indices, is that indices may change during enumeration * due to deletions. */ sortedKeys = new KeyT[numEntries]; assert(sortedKeys != 0); for (i = 0; i < numEntries; i++) { sortedKeys[i] = mySArrayBody->data[sortedIndex[i]].key; } delete [] sortedIndex; } template SArrayIter::SArrayIter(const SArray &sarray, int (*keyCompare)(KeyT, KeyT)) : mySArrayBody(BODY(sarray.body)), current(0), numEntries(sarray.numEntries()), sortFunction(keyCompare) { /* * Note: we access the underlying SArray through the body pointer, * not the top-level SArray itself. This allows the top-level object * to be moved without affecting an ongoing iteration. * XXX: This only works because * - it is illegal to insert while iterating * - deletions don't cause reallocation of the data */ if (sortFunction && mySArrayBody) { sortKeys(); } else { sortedKeys = 0; } if (mySArrayBody) { mySArrayBody->deleted = 0; } } /* * This is the callback function passed to qsort for comparing array * entries. It is passed the indices into the data array, which are * compared according to the user-supplied function applied to the * keys found at those locations. */ template int SArrayIter::compareIndex(const void *idx1, const void *idx2) { return (*(compFnType)SArray_thisKeyCompare) (BODY(SArray_thisBody)->data[*(unsigned *)idx1].key, BODY(SArray_thisBody)->data[*(unsigned *)idx2].key); } template SArrayIter::~SArrayIter() { delete [] sortedKeys; } template void SArrayIter::init() { delete [] sortedKeys; sortedKeys = 0; current = 0; { /* * XXX: fake SArray object to access numEntries() */ SArray mySArray(0); mySArray.body = mySArrayBody; numEntries = mySArray.numEntries(); mySArray.body = 0; } if (mySArrayBody) { if (sortFunction) { sortKeys(); } mySArrayBody->deleted = 0; } } template DataT * SArrayIter::next(KeyT &key) { if (mySArrayBody == 0) { return 0; } else { unsigned index; if (sortedKeys == 0) { /* * Detect deletion while iterating. * A legal deletion can only affect the current entry, so * adjust the current position to reflect the next entry was * moved. */ if (mySArrayBody->deleted) { numEntries --; current --; } if (current == numEntries) { return 0; } index = current++; } else { if (current == numEntries) { return 0; } /* * XXX: fake an SArray object to access locate() */ SArray mySArray(0); mySArray.body = mySArrayBody; (void)mySArray.locate(sortedKeys[current++], index); mySArray.body = 0; } mySArrayBody->deleted = 0; key = mySArrayBody->data[index].key; return &(mySArrayBody->data[index].value); } } #undef BODY #endif /* _SArray_cc_ */ From stolcke at speech.sri.com Mon Jul 18 07:28:40 2005 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Mon, 18 Jul 2005 07:28:40 PDT Subject: Log-linear interpolation In-Reply-To: Your message of Sun, 11 Apr 2004 09:57:55 +0100. <32809.213.58.88.69.1081673875.squirrel@mail.crb.ucp.pt> Message-ID: <200507181428.HAA02188@tonga> In message <32809.213.58.88.69.1081673875.squirrel at mail.crb.ucp.pt>you wrote: > > Hi! > > Does anyone know a program or toolkit allowing to do log-linear > interpolation of different language models? since SRILM only permit to do > linear interpolation. > Thanks for your help, > > Ciro Martins Ciro, sorry for the late response ;-) There is now, in the current development version of SRILM, an implementation of log-linear interpolation. The class name is LoglinearMix, and the ngram -loglinear-mix option triggers its use. Note that log-linear interpolation is much slower to evaluate than linear interpolation, due to the need to normalize the combined LM. This is done somewhat efficiently in SRILM by caching the normalizers for previously seen contexts. You might also want to try using log-linear combination of LM scores without normalization. This can be done in the nbest or lattice rescoring framework implemented by the toolkit, simply by computing scores from multiple LMs. The latest version of the toolkit can by downloaded in the usual way by choosing the "1.4.5 (Beta)" version in the web form. --Andreas From patryale at iro.umontreal.ca Mon Jul 18 12:15:16 2005 From: patryale at iro.umontreal.ca (Alexandre Patry) Date: Mon, 18 Jul 2005 15:15:16 -0400 Subject: bug? Message-ID: <1121714116.4619.10.camel@florali3.iro.umontreal.ca> Hi, I began to use your excellent library and noticed that in prob.h, in the functions similar to this one : inline Prob LogPtoProb(LogP2 prob) { if (prob == LogP_Zero) { return 0; } else { return exp(prob * M_LN10); } } the base of the logarithm is changed via a multiplication. From what I understand of the function and of the logarithms (http://en.wikipedia.org/wiki/Logarithm#Change_of_base), I think that a division should be used instead. Thank you, Alexandre Patry -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: Ceci est une partie de message num?riquement sign?e. URL: From anand at speech.sri.com Mon Jul 18 12:36:02 2005 From: anand at speech.sri.com (Anand Venkataraman) Date: Mon, 18 Jul 2005 12:36:02 -0700 (PDT) Subject: bug? In-Reply-To: <1121714116.4619.10.camel@florali3.iro.umontreal.ca> (message from Alexandre Patry on Mon, 18 Jul 2005 15:15:16 -0400) Message-ID: <200507181936.j6IJa2r03482@clara> Alexander, No, the code is correct. Let x = log_10 p. Then e^(x ln 10) = 10^x = p. > inline Prob LogPtoProb(LogP2 prob) > { > if (prob == LogP_Zero) { > return 0; > } else { > return exp(prob * M_LN10); > } > } & From lambert at jhu.edu Tue Jul 19 14:04:19 2005 From: lambert at jhu.edu (lambert mathias) Date: Tue, 19 Jul 2005 17:04:19 -0400 Subject: Modifying ngram.cc In-Reply-To: <200507180047.j6I0lvd29146@huge> Message-ID: Hi, I tried using the copy constructors in srilm-1.4.5. I initialized my LM as Ngram useLM(ngramLM); However, when I do a useLM->write(), the output shows some of the entried as (null). \data\ ngram 1=6 ngram 2=24 ngram 3=96 \1-grams: 0 (null) -99 0 (null) -99 -0.6083246 3 0 -0.6093718 4 0 -1.78533 -99 ~~-99 \2-grams: 0 3 (null) -99 0 3 (null) -99 -0.6108742 3 3 0 -0.6078538 3 4 3.818703 -1.778024 3~~ ............ I am trying to figure out if there is a bug in the copy constructor initialization. Lambert On 7/17/05 8:47 PM, "Andreas Stolcke" wrote: > The problem was that copy constructors (and assignment operators) > for the container data structures (Array, SArray, LHash) weren't implemented. > I fixed that in the current development version. > No changes to any of the higher-level classes (like Ngram) is needed, > as C++ automatically (and recursively) defines copy constructors based on the > member copy constructors. > > You can either replace the attached source files (in dstruct/src), or > download the "1.4.5 (Beta)" version from the web page. > > --Andreas > > In message you wrote: >> Hi, >> >> This is kind of a C++ question >> I wrote the following copy constructor in Ngram.h >> Ngram(const Ngram >> &ng):LM(ng.vocab),contexts(ng.contexts),order(ng.order),_skipOOVs(ng._skipOO >> Vs),_trustTotals(ng._trustTotals){}; >> >> I have the following declarations >> >> Ngram ngramLM(vocab*,order); >> >> While(//) >> Ngram useLM(ngramLM); >> >> // Do some stuff with useLM >> ......... >> ........ >> } >> >> The problem is that the assignment useLM=ngramLM doesn't assign the original >> ngramLM. Any changes I make to useLM shows up in ngramLM too. >> I just want to make a copy (useLM) of the original ngramLM, work on that >> copy and then reinitialize another useLM with the original ngramLM. >> >> Any thoughts? >> >> >> Lambert >> > > /* > * Array.h -- > * Extensible array class > * > * Copyright (c) 1995-2005 SRI International. All Rights Reserved. > * > * @(#)$Header: /home/srilm/devel/dstruct/src/RCS/Array.h,v 1.11 2005/07/17 > 22:12:21 stolcke Exp $ > * > */ > > #ifndef _Array_h_ > #define _Array_h_ > > #include > > #include "MemStats.h" > > template > class Array > { > public: > Array(int base = 0, unsigned int size = 0) > : _base(base), _size(size), _data(0), alloc_size(0) > { if (size > 0) { alloc(size-1); } }; > Array(Array &source) > : _base(source._base), _size(0), _data(0), alloc_size(0) > { *this = source; } > > ~Array() { delete [] _data; } ; > > DataT &operator[](int index) > { int offset = index - _base; assert(offset >= 0); > if (offset >= _size) { > _size = offset + 1; > if (offset >= alloc_size) { alloc(offset); } > } > return _data[offset]; > }; > > Array & operator= (const Array &other); > > DataT *data() const { return _data - _base; }; > int base() const { return _base; } > unsigned int size() const { return _size; } > > void memStats(MemStats &stats) const; > > private: > unsigned int _size; /* used size */ > int _base; > DataT *_data; > unsigned int alloc_size; /* allocated size */ > void alloc(unsigned int size); > }; > > #endif /* _Array_h_ */ > /* > * LHash.cc -- > * Linear search hash table implementation > * > */ > > #ifndef _LHash_cc_ > #define _LHash_cc_ > > #ifndef lint > static char LHash_Copyright[] = "Copyright (c) 1995-2005 SRI International. > All Rights Reserved."; > static char LHash_RcsId[] = "@(#)$Header: > /home/srilm/devel/dstruct/src/RCS/LHash.cc,v 1.46 2005/07/17 22:01:31 stolcke > Exp $"; > #endif > > #include > #include > #include > #include > #include > > #include "LHash.h" > > #undef INSTANTIATE_LHASH > #define INSTANTIATE_LHASH(KeyT, DataT) \ > template <> DataT *LHash< KeyT, DataT >::removedData = 0; \ > template class LHash< KeyT, DataT >; \ > template class LHashIter< KeyT, DataT > > > #ifndef __GNUG__ > template > DataT *LHash::removedData = 0; > #endif /* __GNUG__ */ > > const unsigned minHashBits = 3; /* minimum no. bits for hashing > * tables smaller than this use linear > * search to save space */ > const float fillRatio = 0.8; /* fill ration at which the table is > * expanded and rehashed */ > > #define BODY(b) ((LHashBody *)b) > > /* > * Dump the entire hash array to cerr. Unused slots are printed as "FREE". > */ > template > void > LHash::dump() const > { > if (body) { > unsigned maxEntries = hashSize(BODY(body)->maxBits); > unsigned i; > > for (i = 0; i < maxEntries; i++) { > cerr << " " < if (Map_noKeyP(BODY(body)->data[i].key)) { > cerr << "FREE"; > } else { > cerr << BODY(body)->data[i].key > #ifdef DUMP_VALUES > /* > * Only certain types can be printed. > */ > << "->" << BODY(body)->data[i].value > #endif /* DUMP_VALUES */ > ; > } > } > } else { > cerr << "EMPTY"; > } > cerr << endl; > } > > template > void > LHash::memStats(MemStats &stats) const > { > stats.total += sizeof(*this); > if (body) { > unsigned maxEntries = hashSize(BODY(body)->maxBits); > > stats.total += sizeof(*BODY(body)) + > sizeof(BODY(body)->data[0]) * > (maxEntries - 1); > stats.wasted += sizeof(BODY(body)->data[0]) * > (maxEntries - BODY(body)->nEntries); > } > } > > /* > * Compute the actual minimum size required for a given number of entries > */ > static inline unsigned > roundSize(unsigned size) > { > if (size < hashSize(minHashBits)) { > return size; > } else { > return (unsigned)((size + 1)/ fillRatio); > } > } > > template > void > LHash::alloc(unsigned size) > { > unsigned maxBits, maxEntries; > unsigned i; > > /* > * round up to power of two > */ > maxBits = 0; > while (hashSize(maxBits) < size) { > assert(maxBits < LHash_maxBitLimit); > maxBits++; > } > > maxEntries = hashSize(maxBits); > > //cerr << "LHash::alloc: current " << (body ? BODY(body)->nEntries : 0) > // << ", requested " << size > // << ", allocating " << maxEntries << " (" << maxBits << ")\n"; > > body = (LHashBody *)malloc(sizeof(*BODY(body)) + > (maxEntries - 1) * sizeof(BODY(body)->data[0])); > assert(body != 0); > > BODY(body)->maxBits = maxBits; > BODY(body)->nEntries = 0; > > for (i = 0; i < maxEntries; i++) { > Map_noKey(BODY(body)->data[i].key); > } > //cerr << "memory for header = " << > // ((void *)&(BODY(body)->data[0]) - (void *)BODY(body)) << endl; > //cerr << "memory per entry = " << > // ((void *)&(BODY(body)->data[3]) - (void *)&(BODY(body)->data[2])) << > endl; > } > > template > LHash::LHash(unsigned size) > : body(0) > { > if (size != 0) { > /* > * determine actual initial size > */ > alloc(roundSize(size)); > } > } > > template > void > LHash::clear(unsigned size) > { > if (body) { > unsigned maxEntries = hashSize(BODY(body)->maxBits); > unsigned i; > > for (i = 0; i < maxEntries; i++) { > if (! Map_noKeyP(BODY(body)->data[i].key)) { > Map_freeKey(BODY(body)->data[i].key); > } > } > free(body); > body = 0; > } > if (size != 0) { > alloc(roundSize(size)); > } > } > > template > LHash::~LHash() > { > clear(0); > } > > template > LHash::LHash(const LHash &source) > : body(0) > { > #ifdef DEBUG > cerr << "warning: LHash copy constructor called\n"; > #endif > *this = source; > } > > template > LHash & > LHash::operator= (const LHash &other) > { > #ifdef DEBUG > cerr << "warning: LHash::operator= called\n"; > #endif > > if (&other == this) { > return *this; > } > > /* > * copy hash entries from old to new > */ > if (other.body) { > unsigned maxEntries = hashSize(BODY(other.body)->maxBits); > clear(maxEntries); > > for (unsigned i = 0; i < maxEntries; i++) { > KeyT thisKey = BODY(other.body)->data[i].key; > > if (!Map_noKeyP(thisKey)) { > /* > * Copy key > */ > BODY(body)->data[i].key = Map_copyKey(thisKey); > > /* > * Initialize data, required for = operator > */ > new (&(BODY(body)->data[i].value)) DataT; > > /* > * Copy data > */ > BODY(body)->data[i].value = BODY(other.body)->data[i].value; > } > } > BODY(body)->nEntries = BODY(other.body)->nEntries; > } else { > clear(0); > } > > return *this; > } > > /* > * Returns index into data array that has the key which is either > * equal to key, or indicates the place where key would be placed if it > * occurred in the array. > */ > template > Boolean > LHash::locate(KeyT key, unsigned &index) const > { > assert(!Map_noKeyP(key)); > > if (body) { > unsigned maxBits = BODY(body)->maxBits; > register MapEntry *data = BODY(body)->data; > > if (maxBits < minHashBits) { > /* > * Do a linear search > */ > unsigned nEntries = BODY(body)->nEntries; > register unsigned i; > > for (i = 0; i < nEntries; i++) { > if (LHash_equalKey(data[i].key, key)) { > index = i; > return true; > } > } > index = i; > return false; > } else { > /* > * Do a hashed search > */ > unsigned hash = LHash_hashKey(key, maxBits); > unsigned i; > > for (i = hash; ; i = (i + 1) & hashMask(maxBits)) > { > if (Map_noKeyP(data[i].key)) { > index = i; > return false; > } else if (LHash_equalKey(data[i].key, key)) { > index = i; > return true; > } > } > } > } else { > return false; > } > } > > template > DataT * > LHash::find(KeyT key, Boolean &foundP) const > { > unsigned index; > > if (foundP = locate(key, index)) { > return &(BODY(body)->data[index].value); > } else { > return 0; > } > } > > template > KeyT > LHash::getInternalKey(KeyT key, Boolean &foundP) const > { > unsigned index; > static KeyT zeroKey; > > if (foundP = locate(key, index)) { > return BODY(body)->data[index].key; > } else { > return zeroKey; > } > } > > template > DataT * > LHash::insert(KeyT key, Boolean &foundP) > { > unsigned index; > > assert(!(Map_noKeyP(key))); > > /* > * Make sure there is room for at least one entry > */ > if (body == 0) { > alloc(1); > } > > if (foundP = locate(key, index)) { > return &(BODY(body)->data[index].value); > } else { > unsigned maxEntries = hashSize(BODY(body)->maxBits); > unsigned nEntries = BODY(body)->nEntries; > > /* > * Rehash table if necessary > */ > unsigned minSize = roundSize(nEntries + 1); > > if (minSize > maxEntries) { > LHashBody *oldBody = BODY(body); > unsigned i; > > /* > * Since LHash_maxEntriesLimit is a power of two minus 1 > * we need to check this only when the array is enlarged > */ > assert(nEntries < LHash_maxEntriesLimit); > > alloc(minSize); > BODY(body)->nEntries = nEntries; > > if (BODY(body)->maxBits < minHashBits) { > /* > * Just copy old entries to new storage, no reindexing > * required. > */ > memcpy(BODY(body)->data, oldBody->data, > sizeof(oldBody->data[0]) * nEntries); > } else { > /* > * Rehash > */ > for (i = 0; i < maxEntries; i++) { > KeyT key = oldBody->data[i].key; > > if (! Map_noKeyP(key)) { > (void)locate(key, index); > memcpy(&(BODY(body)->data[index]), &(oldBody->data[i]), > sizeof(oldBody->data[0])); > } > } > } > free(oldBody); > > /* > * Entries have been moved, so have to re-locate key > */ > (void)locate(key, index); > } > > BODY(body)->data[index].key = Map_copyKey(key); > > /* > * Initialize data to zero, but also call constructors, if any > */ > memset(&(BODY(body)->data[index].value), 0, > sizeof(BODY(body)->data[index].value)); > new (&(BODY(body)->data[index].value)) DataT; > > BODY(body)->nEntries++; > > return &(BODY(body)->data[index].value); > } > } > > template > DataT * > LHash::remove(KeyT key, Boolean &foundP) > { > unsigned index; > > /* > * Allocate pseudo-static memory for removed objects (returned by > * remove()). We cannot make this static because otherwise > * the destructor for DataT would be called on it at program exit. > */ > if (removedData == 0) { > removedData = (DataT *)malloc(sizeof(DataT)); > assert(removedData != 0); > } > > if (foundP = locate(key, index)) { > Map_freeKey(BODY(body)->data[index].key); > Map_noKey(BODY(body)->data[index].key); > memcpy(removedData, &BODY(body)->data[index].value, sizeof(DataT)); > > if (BODY(body)->maxBits < minHashBits) { > /* > * Linear search mode -- Just move all entries above the > * the one removed to fill the gap. > */ > unsigned nEntries = BODY(body)->nEntries; > > memmove(&BODY(body)->data[index], > &BODY(body)->data[index+1], > (nEntries - index - 1) * sizeof(BODY(body)->data[0])); > Map_noKey(BODY(body)->data[nEntries - 1].key); > } else { > /* > * The entry after the one being deleted could actually > * belong to a prior spot in the table, but was bounced forward due > * to a collision. The invariant used in lookup is that > * all locations between the hash index and the actual index are > * filled. Hence we examine all entries between the deleted > * position and the next free position as whether they need to > * be moved backward. > */ > while (1) { > unsigned newIndex; > > index = (index + 1) & hashMask(BODY(body)->maxBits); > > if (Map_noKeyP(BODY(body)->data[index].key)) { > break; > } > > /* > * If find returns false that means the deletion has > * introduced a hole in the table that would prevent > * us from finding the next entry. Luckily, find > * also tells us where the hole is. We move the > * entry to its rightful place and continue filling > * the hole created by this move. > */ > if (!locate(BODY(body)->data[index].key, newIndex)) { > memcpy(&(BODY(body)->data[newIndex]), > &(BODY(body)->data[index]), > sizeof(BODY(body)->data[0])); > Map_noKey(BODY(body)->data[index].key); > } > } > } > BODY(body)->nEntries--; > return removedData; > } else { > return 0; > } > } > > #ifdef __GNUG__ > static > #endif > int (*LHash_thisKeyCompare)(); /* used by LHashIter() to communicate > * with compareIndex() */ > #ifdef __GNUG__ > static > #endif > void *LHash_thisBody; /* ditto */ > > template > void > LHashIter::sortKeys() > { > /* > * Store keys away and sort them to the user's orders. > */ > unsigned maxEntries = hashSize(myLHashBody->maxBits); > > unsigned *sortedIndex = new unsigned[numEntries]; > assert(sortedIndex != 0); > > unsigned i; > > unsigned j = 0; > for (i = 0; i < maxEntries; i++) { > if (!Map_noKeyP(myLHashBody->data[i].key)) { > sortedIndex[j++] = i; > } > } > assert(j == numEntries); > > /* > * Due to the limitations of the qsort interface we have to > * pass extra information to compareIndex in these global > * variables - yuck. > */ > LHash_thisKeyCompare = (int(*)())sortFunction; > LHash_thisBody = myLHashBody; > qsort(sortedIndex, numEntries, sizeof(*sortedIndex), compareIndex); > > /* > * Save the keys for enumeration. The reason we save the keys, > * not the indices, is that indices may change during enumeration > * due to deletions. > */ > sortedKeys = new KeyT[numEntries]; > assert(sortedKeys != 0); > > for (i = 0; i < numEntries; i++) { > sortedKeys[i] = myLHashBody->data[sortedIndex[i]].key; > } > > delete [] sortedIndex; > } > > template > LHashIter::LHashIter(const LHash &lhash, > int (*keyCompare)(KeyT, KeyT)) > : myLHashBody(BODY(lhash.body)), current(0), > numEntries(lhash.numEntries()), sortFunction(keyCompare) > { > /* > * Note: we access the underlying LHash through the body pointer, > * not the top-level LHash itself. This allows the top-level object > * to be moved without affecting an ongoing iteration. > * XXX: This only works because > * - it is illegal to insert while iterating > * - deletions don't cause reallocation of the data > */ > if (sortFunction && myLHashBody) { > sortKeys(); > } else { > sortedKeys = 0; > } > } > > /* > * This is the callback function passed to qsort for comparing array > * entries. It is passed the indices into the data array, which are > * compared according to the user-supplied function applied to the > * keys found at those locations. > */ > template > int > LHashIter::compareIndex(const void *idx1, const void *idx2) > { > return (*(compFnType)LHash_thisKeyCompare) > (BODY(LHash_thisBody)->data[*(unsigned *)idx1].key, > BODY(LHash_thisBody)->data[*(unsigned *)idx2].key); > } > > template > LHashIter::~LHashIter() > { > delete [] sortedKeys; > } > > > template > void > LHashIter::init() > { > delete [] sortedKeys; > > current = 0; > > { > /* > * XXX: fake LHash object to access numEntries() > */ > LHash myLHash(0); > > myLHash.body = myLHashBody; > numEntries = myLHash.numEntries(); > myLHash.body = 0; > } > > if (sortFunction && myLHashBody) { > sortKeys(); > } else { > sortedKeys = 0; > } > } > > template > DataT * > LHashIter::next(KeyT &key) > { > > if (myLHashBody == 0) { > return 0; > } else { > unsigned index; > > if (sortedKeys) { > /* > * Sorted enumeration -- advance to next key in sorted list > */ > if (current == numEntries) { > return 0; > } > > /* > * XXX: fake LHash object to access locate() > */ > LHash myLHash(0); > > myLHash.body = myLHashBody; > myLHash.locate(sortedKeys[current++], index); > myLHash.body = 0;; > } else { > /* > * Detect deletions by comparing old and current number of > * entries. > * A legal deletion can only affect the current entry, so > * adjust the current position to reflect the next entry was > * moved. > */ > if (myLHashBody->nEntries != numEntries) { > assert(myLHashBody->nEntries == numEntries - 1); > > numEntries --; > current --; > } > > /* > * Random enumeration mode > */ > unsigned maxBits = myLHashBody->maxBits; > > if (maxBits < minHashBits) { > /* > * Linear search mode - advance one to the next entry > */ > if (current == numEntries) { > return 0; > } > } else { > unsigned maxEntries = hashSize(maxBits); > > while (current < maxEntries && > Map_noKeyP(myLHashBody->data[current].key)) > { > current++; > } > > if (current == maxEntries) { > return 0; > } > } > > index = current ++; > } > > key = myLHashBody->data[index].key; > return &(myLHashBody->data[index].value); > } > } > > #undef BODY > > #endif /* _LHash_cc_ */ > /* > * SArray.cc -- > * Sorted array implementation > * > */ > > #ifndef _SArray_cc_ > #define _SArray_cc_ > > #ifndef lint > static char SArray_Copyright[] = "Copyright (c) 1995-2005 SRI International. > All Rights Reserved."; > static char SArray_RcsId[] = "@(#)$Header: > /home/srilm/devel/dstruct/src/RCS/SArray.cc,v 1.37 2005/07/17 22:01:31 stolcke > Exp $"; > #endif > > #include > #include > #include > #include > #include > > #include "SArray.h" > > #undef INSTANTIATE_SARRAY > #define INSTANTIATE_SARRAY(KeyT, DataT) \ > template <> DataT *SArray< KeyT, DataT >::removedData = 0; \ > template class SArray< KeyT, DataT >; \ > template class SArrayIter< KeyT, DataT > > > #ifndef __GNUG__ > template > DataT *SArray::removedData = 0; > #endif /* __GNUG__ */ > > #define BODY(b) ((SArrayBody *)b) > > const double growSize = 1.1; > > template > void > SArray::dump() const > { > if (body == 0) { > cerr << "EMPTY" << endl; > } else { > unsigned nEntries = numEntries(); > > cerr << "maxEntries = " << BODY(body)->maxEntries; > cerr << " nEntries = " << nEntries; > > for (unsigned i = 0; i < nEntries; i++) { > cerr << " " < cerr << BODY(body)->data[i].key > #ifdef DUMP_VALUES > /* > * Only certain types can be printed. > */ > << "->" << BODY(body)->data[i].value > #endif /* DUMP_VALUES */ > ; > } > } > } > > template > void > SArray::memStats(MemStats &stats) const > { > stats.total += sizeof(*this); > > if (body) { > unsigned maxEntries = BODY(body)->maxEntries; > > stats.total += sizeof(*BODY(body)) + > sizeof(BODY(body)->data[0]) * > (maxEntries - 1); > stats.wasted += sizeof(BODY(body)->data[0]) * > (maxEntries - numEntries()); > } > } > > template > void > SArray::alloc(unsigned size) > { > body = (SArrayBody *)malloc(sizeof(*BODY(body)) + > (size - 1) * sizeof(BODY(body)->data[0])); > assert(body != 0); > > BODY(body)->deleted = 0; > BODY(body)->maxEntries = size; > > /* > * Fill the array with dummy keys, marking the unused space > */ > for (unsigned i = 0; i < size; i ++) { > Map_noKey(BODY(body)->data[i].key); > } > } > > template > SArray::SArray(unsigned size) > : body(0) > { > if (size != 0) { > alloc(size); > } > } > > template > void > SArray::clear(unsigned size) > { > if (body) { > unsigned maxEntries = BODY(body)->maxEntries; > > for (unsigned i = 0; i < maxEntries; i++) { > KeyT thisKey = BODY(body)->data[i].key; > > if (Map_noKeyP(thisKey)) { > break; > } > Map_freeKey(thisKey); > } > free(body); > body = 0; > } > if (size != 0) { > alloc(size); > } > } > > template > SArray::~SArray() > { > clear(0); > } > > template > unsigned > SArray::numEntries() const > { > if (body == 0) { > return 0; > } else if (Map_noKeyP(BODY(body)->data[0].key)) { > return 0; > } else { > /* > * Do a binary search for the end of the used memory > * lower always points to a filled entry > * upper always points to a free entry beyond the end of used entries > */ > unsigned lower, upper; > > lower = 0; /* minimum index (inclusive) */ > upper = BODY(body)->maxEntries; /* maximum index (exclusive) */ > > while (lower + 1 < upper) { > unsigned middle = (lower + upper)/2; > > if (Map_noKeyP(BODY(body)->data[middle].key)) { > upper = middle; > } else { > lower = middle; > } > } > > /* lower + 1 == upper */ > return upper; > } > } > > template > SArray::SArray(const SArray &source) > : body(0) > { > #ifdef DEBUG > cerr << "warning: SArray copy constructor called\n"; > #endif > *this = source; > } > > template > SArray & > SArray::operator= (const SArray &other) > { > #ifdef DEBUG > cerr << "warning: SArray::operator= called\n"; > #endif > > if (&other == this) { > return *this; > } > > /* > * copy array entries from old to new > */ > if (other.body) { > unsigned maxEntries = BODY(other.body)->maxEntries; > clear(maxEntries); > > for (unsigned i = 0; i < maxEntries; i++) { > KeyT thisKey = BODY(other.body)->data[i].key; > > if (Map_noKeyP(thisKey)) { > break; > } > > /* > * Copy key > */ > BODY(body)->data[i].key = Map_copyKey(thisKey); > > /* > * Initialize data, required for = operator > */ > new (&(BODY(body)->data[i].value)) DataT; > > /* > * Copy data > */ > BODY(body)->data[i].value = BODY(other.body)->data[i].value; > > } > } else { > clear(0); > } > > return *this; > } > > /* > * Returns index into data array that has the key which is either > * equal to key, or indicates the place where key would be placed if it > * occurred in the array. > */ > template > Boolean > SArray::locate(KeyT key, unsigned &index) const > { > assert(!Map_noKeyP(key)); > > if (body) { > unsigned lower, upper; > > lower = 0; /* minimum index (inclusive) */ > upper = BODY(body)->maxEntries; /* maximum index (exclusive) */ > > while (lower < upper) { > unsigned middle = (lower + upper)/2; > > KeyT thisKey = BODY(body)->data[middle].key; > > if (Map_noKeyP(thisKey)) { > upper = middle; > continue; > } > > int compare = SArray_compareKey(key, thisKey); > > if (compare < 0) { > upper = middle; > } else if (compare > 0) { > lower = middle + 1; > } else { > index = middle; > return true; > } > } > > /* we have lower == upper */ > if (lower == BODY(body)->maxEntries || > Map_noKeyP(BODY(body)->data[lower].key) || > SArray_compareKey(key, BODY(body)->data[lower].key) < 0) > { > index = lower; > } else { > index = lower + 1; > } > return false; > } else { > return false; > } > } > > template > DataT * > SArray::find(KeyT key, Boolean &foundP) const > { > unsigned index; > > if (foundP = locate(key, index)) { > return &(BODY(body)->data[index].value); > } else { > return 0; > } > } > > template > KeyT > SArray::getInternalKey(KeyT key, Boolean &foundP) const > { > unsigned index; > static KeyT zeroKey; > > if (foundP = locate(key, index)) { > return BODY(body)->data[index].key; > } else { > return zeroKey; > } > } > > template > DataT * > SArray::insert(KeyT key, Boolean &foundP) > { > unsigned index; > > /* > * Make sure there is room for at least one entry > */ > if (body == 0) { > alloc(1); > } > > if (foundP = locate(key, index)) { > return &(BODY(body)->data[index].value); > } else { > unsigned nEntries = numEntries(); > unsigned maxEntries = BODY(body)->maxEntries; > > /* > * Need to add an element. > * First, enlarge array if necessary > */ > if (nEntries == maxEntries) { > maxEntries = (int)(maxEntries * growSize); > if (maxEntries == nEntries) { > maxEntries ++; > } > body = (SArrayBody *)realloc(body, > sizeof(*BODY(body)) + > (maxEntries - 1) * sizeof(BODY(body)->data[0])); > assert(body != 0); > > /* > * Fill new space with dummy keys > */ > for (unsigned i = BODY(body)->maxEntries; i < maxEntries; i ++) { > Map_noKey(BODY(body)->data[i].key); > } > > BODY(body)->maxEntries = maxEntries; > } > > memmove(&(BODY(body)->data[index + 1]), > &(BODY(body)->data[index]), > (nEntries - index) * sizeof(BODY(body)->data[0])); > > BODY(body)->data[index].key = Map_copyKey(key); > > /* > * Initialize data to zero, but also call constructors, if any > */ > memset(&(BODY(body)->data[index].value), 0, > sizeof(BODY(body)->data[index].value)); > new (&(BODY(body)->data[index].value)) DataT; > > return &(BODY(body)->data[index].value); > } > } > > template > DataT * > SArray::remove(KeyT key, Boolean &foundP) > { > unsigned index; > > /* > * Allocate pseudo-static memory for removed objects (returned by > * remove()). We cannot make this static because otherwise > * the destructor for DataT would be called on it at program exit. > */ > if (removedData == 0) { > removedData = (DataT *)malloc(sizeof(DataT)); > assert(removedData != 0); > } > > if (foundP = locate(key, index)) { > unsigned nEntries = numEntries(); > > Map_freeKey(BODY(body)->data[index].key); > memcpy(removedData, &BODY(body)->data[index].value, sizeof(DataT)); > > memmove(&(BODY(body)->data[index]), > &(BODY(body)->data[index + 1]), > (nEntries - index - 1) * sizeof(BODY(body)->data[0])); > > /* > * mark previous last slot as free > */ > Map_noKey(BODY(body)->data[nEntries-1].key); > > /* > * Indicate that deletion occurred > */ > BODY(body)->deleted = 1; > > return removedData; > } else { > return 0; > } > } > > #ifdef __GNUG__ > static > #endif > int (*SArray_thisKeyCompare)(); /* used by SArrayIter() to communicate > * with compareIndex() */ > #ifdef __GNUG__ > static > #endif > void *SArray_thisBody; /* ditto */ > > > template > void > SArrayIter::sortKeys() > { > /* > * Store keys away and sort them to the user's orders. > */ > unsigned *sortedIndex = new unsigned[numEntries]; > assert(sortedIndex != 0); > > unsigned i; > for (i = 0; i < numEntries; i++) { > sortedIndex[i] = i; > } > > /* > * Due to the limitations of the qsort interface we have to > * pass extra information to compareIndex in these global > * variables - yuck. > */ > SArray_thisKeyCompare = (int (*)())sortFunction; > SArray_thisBody = mySArrayBody; > qsort(sortedIndex, numEntries, sizeof(*sortedIndex), compareIndex); > > /* > * Save the keys for enumeration. The reason we save the keys, > * not the indices, is that indices may change during enumeration > * due to deletions. > */ > sortedKeys = new KeyT[numEntries]; > assert(sortedKeys != 0); > > for (i = 0; i < numEntries; i++) { > sortedKeys[i] = mySArrayBody->data[sortedIndex[i]].key; > } > > delete [] sortedIndex; > } > > template > SArrayIter::SArrayIter(const SArray &sarray, > int (*keyCompare)(KeyT, KeyT)) > : mySArrayBody(BODY(sarray.body)), current(0), > numEntries(sarray.numEntries()), sortFunction(keyCompare) > { > /* > * Note: we access the underlying SArray through the body pointer, > * not the top-level SArray itself. This allows the top-level object > * to be moved without affecting an ongoing iteration. > * XXX: This only works because > * - it is illegal to insert while iterating > * - deletions don't cause reallocation of the data > */ > if (sortFunction && mySArrayBody) { > sortKeys(); > } else { > sortedKeys = 0; > } > > if (mySArrayBody) { > mySArrayBody->deleted = 0; > } > } > > /* > * This is the callback function passed to qsort for comparing array > * entries. It is passed the indices into the data array, which are > * compared according to the user-supplied function applied to the > * keys found at those locations. > */ > template > int > SArrayIter::compareIndex(const void *idx1, const void *idx2) > { > return (*(compFnType)SArray_thisKeyCompare) > (BODY(SArray_thisBody)->data[*(unsigned *)idx1].key, > BODY(SArray_thisBody)->data[*(unsigned *)idx2].key); > } > > template > SArrayIter::~SArrayIter() > { > delete [] sortedKeys; > } > > template > void > SArrayIter::init() > { > delete [] sortedKeys; > sortedKeys = 0; > > current = 0; > > { > /* > * XXX: fake SArray object to access numEntries() > */ > SArray mySArray(0); > > mySArray.body = mySArrayBody; > numEntries = mySArray.numEntries(); > mySArray.body = 0; > } > > if (mySArrayBody) { > if (sortFunction) { > sortKeys(); > } > mySArrayBody->deleted = 0; > } > } > > template > DataT * > SArrayIter::next(KeyT &key) > { > if (mySArrayBody == 0) { > return 0; > } else { > unsigned index; > > if (sortedKeys == 0) { > /* > * Detect deletion while iterating. > * A legal deletion can only affect the current entry, so > * adjust the current position to reflect the next entry was > * moved. > */ > if (mySArrayBody->deleted) { > numEntries --; > current --; > } > > if (current == numEntries) { > return 0; > } > > index = current++; > } else { > if (current == numEntries) { > return 0; > } > > /* > * XXX: fake an SArray object to access locate() > */ > SArray mySArray(0); > > mySArray.body = mySArrayBody; > (void)mySArray.locate(sortedKeys[current++], index); > mySArray.body = 0; > } > mySArrayBody->deleted = 0; > > key = mySArrayBody->data[index].key; > return &(mySArrayBody->data[index].value); > } > } > > #undef BODY > > #endif /* _SArray_cc_ */ From andyf at umich.edu Wed Jul 27 10:44:09 2005 From: andyf at umich.edu (andyf at umich.edu) Date: Wed, 27 Jul 2005 13:44:09 -0400 Subject: No subject Message-ID: <20050727134409.bqfql9j8mcco0gk0@web.mail.umich.edu> Hello SRILM-user group, I am having difficulties building the toolkit in 3 different environnments. The closest I come to getting a good build is on my Redhat WorkStation version 3 system. I have installed the latest version of tcl, and edited the file Makefile.common.i686 in the following way TCL_INCLUDE = /usr/local/ActiveTcl8.4/include TCL_LIBRARY = /usr/local/ActiveTcl8.4 -ltcl As near as I can tell the binary which I am most interested "ngram" is part of the lm directory. None of the files in that directory build, even though I do get what looks like a succesful build in some of the other directories. I get one of two errors: 1) No include path in which to find tcl.h 2) g++ cannot specify -o with -c or -S and multiple compilation My version of gcc is 3.something Has anybody on the list done a succesful build on Redhat Workstaion version 3 and if so, what am I doing wrong? Andy Freeman PhD, linguistics and Arabic BS computer science From stolcke at speech.sri.com Wed Jul 27 11:33:17 2005 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Wed, 27 Jul 2005 11:33:17 PDT Subject: In-Reply-To: Your message of Wed, 27 Jul 2005 13:44:09 -0400. <20050727134409.bqfql9j8mcco0gk0@web.mail.umich.edu> Message-ID: <200507271833.j6RIXHC23647@huge> In message <20050727134409.bqfql9j8mcco0gk0 at web.mail.umich.edu>you wrote: > Hello SRILM-user group, > > I am having difficulties building the toolkit in 3 different > environnments. > > The closest I come to getting a good build is on my Redhat WorkStation > version 3 system. I have installed the latest version of tcl, and edited the > file Makefile.common.i686 in the following way > TCL_INCLUDE = /usr/local/ActiveTcl8.4/include > TCL_LIBRARY = /usr/local/ActiveTcl8.4 -ltcl Instead use TCL_INCLUDE = -I/usr/local/ActiveTcl8.4/include TCL_LIBRARY = -L/usr/local/ActiveTcl8.4 -ltcl If your tcl library is dynamic you might also have to add a flag to specify the run-time location of it: TCL_LIBRARY = -L/usr/local/ActiveTcl8.4 -ltcl -R/usr/local/ActiveTcl8.4 --Andreas From cs_kumar at ettimadai.amrita.edu Mon Aug 1 03:48:45 2005 From: cs_kumar at ettimadai.amrita.edu (C. Santhosh Kumar) Date: Mon, 1 Aug 2005 16:18:45 +0530 (IST) Subject: On Fedora core I Message-ID: <43404.192.168.9.9.1122893325.squirrel@192.168.9.9> Good evening, May I know if anybody could compile SRILM on Fedora core I. I got it compiled on Redhat Linux 9, but got problems with Fedora core 1. I would greatly appreciate your help. Thanks, -- Santhosh From hassan at mimos.my Fri Aug 12 00:02:44 2005 From: hassan at mimos.my (Hassan Mohamed) Date: Fri, 12 Aug 2005 15:02:44 +0800 (MYT) Subject: New to SRILM - cornfirmation on the fuctionality. Message-ID: <3891852.1123830164136.SLOX.WebMail.wwwrun@openx.mimos.my> Hello, I'm new to the SRILM. Can someone help me on the installation step! According to the installation manual I stuck at step no 7 which is to test the functionality of the system. I see only two files are ?IDENTICAL?, where the others are ?DIFFERS?. The identical files are make-unigram-pfsg.stdout and make-unigram-pfsg.stderr. My question is does the system functionality of the system partly failed? From stolcke at speech.sri.com Fri Aug 12 09:44:44 2005 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Fri, 12 Aug 2005 09:44:44 PDT Subject: New to SRILM - cornfirmation on the fuctionality. In-Reply-To: Your message of Fri, 12 Aug 2005 15:02:44 +0800. <3891852.1123830164136.SLOX.WebMail.wwwrun@openx.mimos.my> Message-ID: <200508121644.j7CGiic22431@huge> It looks like none of the binary program were built. I suggest contacting a local expert who can help in setting up the compiler and linker flags correctly. --Andreas In message <3891852.1123830164136.SLOX.WebMail.wwwrun at openx.mimos.my>you wrote: > Hello, I'm new to the SRILM. Can someone help me on the installation step! Ac > cording to the installation manual I stuck at step no 7 which is to test the > functionality of the system. I see only two files are ???IDENTICAL???, where the ot > hers are ???DIFFERS???. The identical files are make-unigram-pfsg.stdout and make- > unigram-pfsg.stderr. My question is does the system functionality of the syst > em partly failed? > > From james.waterhouse at elf.mcgill.ca Wed Aug 24 12:37:33 2005 From: james.waterhouse at elf.mcgill.ca (james.waterhouse at elf.mcgill.ca) Date: Wed, 24 Aug 2005 15:37:33 -0400 Subject: Lidia Mangu's Thesis Message-ID: <1124912253.430ccc7d74da0@www.webmail.mcgill.ca> Hello, This is a question for the developers of srilm... is there any plan to include a complete implimentation of Lidia Mangu's thesis in lattice-tools at any point in the future? Thanks, James From stolcke at speech.sri.com Wed Aug 24 17:24:28 2005 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Wed, 24 Aug 2005 17:24:28 PDT Subject: Lidia Mangu's Thesis In-Reply-To: Your message of Wed, 24 Aug 2005 15:37:33 -0400. <1124912253.430ccc7d74da0@www.webmail.mcgill.ca> Message-ID: <200508250024.RAA08329@tonga> In message <1124912253.430ccc7d74da0 at www.webmail.mcgill.ca>you wrote: > Hello, > > This is a question for the developers of srilm... is there any plan to includ > e a > complete implimentation of Lidia Mangu's thesis in lattice-tools at any point > in > the future? > > Thanks, > James James, the bad news is that there are no plans to reimplement Lidia's work in SRILM. The good news is that it shouldn't be necessary. The SRILM lattice-tool program incorporates a new algorithm for confusion network construction from lattices. The algorithm operates in time O(N*L), where N is the number of nodes in the lattice and L is the length of the confusion network. The original algorithm can take time O(N^3) and is therefore much slower in most cases. This function is enabled by the lattice-tool -posterior-decode and -write-mesh options. I should note that several fast (linear-time) algorithms for CN construction have been published in recent years. --Andreas From james.waterhouse at elf.mcgill.ca Thu Aug 25 08:19:30 2005 From: james.waterhouse at elf.mcgill.ca (james.waterhouse at elf.mcgill.ca) Date: Thu, 25 Aug 2005 11:19:30 -0400 Subject: Lidia Mangu's Thesis In-Reply-To: <200508250024.RAA08329@tonga> References: <200508250024.RAA08329@tonga> Message-ID: <1124983170.430de18227afc@www.webmail.mcgill.ca> Andreas, Is there a paper that describes the algorithm used in SRILM? Also, the reason I asked about Lidia Mangu's thesis and about the possibility of a future reimplementation is because the current algorithm does not take into account timing information. The research I am conducting is dealing with recognition at the phone level. In creating sausages from the resulting lattices it is uncertain how much timing information could effect the results. Perhaps timing information may be more of an issue in phone recognition then in word recognition. Do you happen to know of any sausage creation software that takes into account the timing information? Thanks, James Quoting Andreas Stolcke : > > In message <1124912253.430ccc7d74da0 at www.webmail.mcgill.ca>you wrote: > > Hello, > > > > This is a question for the developers of srilm... is there any plan to > includ > > e a > > complete implimentation of Lidia Mangu's thesis in lattice-tools at any > point > > in > > the future? > > > > Thanks, > > James > > James, > > the bad news is that there are no plans to reimplement Lidia's work in > SRILM. The good news is that it shouldn't be necessary. > The SRILM lattice-tool program incorporates a new algorithm > for confusion network construction from lattices. The algorithm > operates in time O(N*L), where N is the number of nodes in the lattice > and L is the length of the confusion network. The original algorithm > can take time O(N^3) and is therefore much slower in most cases. > > This function is enabled by the lattice-tool -posterior-decode and > -write-mesh > options. > > I should note that several fast (linear-time) algorithms for > CN construction have been published in recent years. > > --Andreas > > From stolcke at speech.sri.com Fri Aug 26 13:28:34 2005 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Fri, 26 Aug 2005 13:28:34 PDT Subject: Lidia Mangu's Thesis In-Reply-To: Your message of Thu, 25 Aug 2005 11:19:30 -0400. <1124983170.430de18227afc@www.webmail.mcgill.ca> Message-ID: <200508262028.j7QKSYZ06125@huge> In message <1124983170.430de18227afc at www.webmail.mcgill.ca>you wrote: > Andreas, > > Is there a paper that describes the algorithm used in SRILM? No, never got around to it. But, there is a rather detailed comment in lattice/src/LatticeAlign.cc (around line 3500) that describes the algorithm. > Also, the reason I asked about Lidia Mangu's thesis and about the possibility > of > a future reimplementation is because the current algorithm does not take into > account timing information. The research I am conducting is dealing with > recognition at the phone level. In creating sausages from the resulting > lattices it is uncertain how much timing information could effect the results > . > Perhaps timing information may be more of an issue in phone recognition then > in > word recognition. The goal of my algorithm is precisely not to rely on time information, for two reasons: - time information may not be available, thereby limiting the applicability of the algorithm. - word error scoring is oblivious to time alignment, therefore you should not need time information to compute the hypothesis with minimal expected word error. Of course for certain applications time information might be desirable to use. > Do you happen to know of any sausage creation software that takes into accoun > t > the timing information? Well, Lidia's algorithm actually does take time into acount when evaluating the merging of word hypotheses into equivalence classes. Two recent linear-time algorithms that do use time information are described in A general algorithm for word graph matrix decomposition D Hakkani-Tur, G Riccardi - ICASSP 2003 http://akpublic.research.att.com/~dsp3/icassp03.pdf Improved Confusion Network Algorithm and Shortest Path Search from Word Lattice Jian Xue, Yunxin Zhao, ICASSP 2005 http://www.icassp2005.com/Papers/ViewPapers_MS.asp?papernum=3537 (i don't have a link to the full paper) As for modifying SRILM to use time information: it should not be too hard because time information is already optionally maintained in the hypotheses and confusion networks (enabled by lattice-tool -acoustic-mesh). Internally it is encoded in the NBestWordInfo structures (NBest.h). The cost of adding a word to a confusion set is computed by WordMesh::alignError(). You would have to modify it to take time overlap into account, and change its interface and calling functions to pass the time information along with the word identifies. --Andreas From kermorvant at gmail.com Wed Aug 31 07:51:53 2005 From: kermorvant at gmail.com (Christopher Kermorvant) Date: Wed, 31 Aug 2005 16:51:53 +0200 Subject: language model adaptation Message-ID: HelloSriLM users, I am looking for a way to save language models adapted with ngram -adapt-marginals. Is it possible ? As the write method for AdaptiveMarginals is not implemented, I tried to write one (copied from writeWithOrder in NgramLM.cc) , by calling wordProb(pword,context+1) instead of iterating on the tree, but it doesn't seem to work. The probabilities I get in the file, are not the same as the one I get if I parse a text file with ppl option (precisely, alphas are the same, but denominators are different). Any clue or solution to save an adapted model are welcome ! Thank you, -- Christopher Kermorvant void AdaptiveMarginals::write(File &file) { this->running(true) ; Ngram * lm = (Ngram*)&baseLM; unsigned i; unsigned howmanyNgrams[maxNgramOrder + 1]; VocabIndex context[maxNgramOrder + 2]; VocabString scontext[maxNgramOrder + 1]; unsigned order=lm->setorder(); if (order > maxNgramOrder) { order = maxNgramOrder; } fprintf(file, "\n\\data\\\n"); for (i = 1; i <= order; i++ ) { howmanyNgrams[i] = lm->numNgrams(i); fprintf(file, "ngram %d=%d\n", i, howmanyNgrams[i]); } for (i = 1; i <= order; i++ ) { fprintf(file, "\n\\%d-grams:\n", i); if (debug(DEBUG_WRITE_STATS)) { dout() << "writing " << howmanyNgrams[i] << " " << i << "-grams\n"; } NgramBOsIter iter(*lm, context + 1, i - 1, vocab.compareIndex()); BOnode *node; while (node = iter.next()) { vocab.getWords(context + 1, scontext, maxNgramOrder + 1); Vocab::reverse(scontext); NgramProbsIter piter(*node, vocab.compareIndex()); VocabIndex pword; LogP *prob; LogP probAd ; while (prob = piter.next(pword)) { if (file.error()) { return; } context[0] = pword; probAd = wordProb(pword,context+1); *prob = probAd ; fprintf(file, "%.*lg\t", LogP_Precision, (double)(*prob == LogP_Zero ? LogP_PseudoZero : *prob)); Vocab::write(file, scontext); fprintf(file, "%s%s", (i > 1 ? " " : ""), vocab.getWord(pword)); if (i < order) { context[0] = pword; LogP *bow = lm->findBOW(context); if (bow) { fprintf(file, "\t%.*lg", LogP_Precision, (double)(*bow == LogP_Zero ? LogP_PseudoZero : *bow)); } } fprintf(file, "\n"); } } } fprintf(file, "\n\\end\\\n"); } -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at i6.informatik.rwth-aachen.de Wed Aug 31 11:31:45 2005 From: hahn at i6.informatik.rwth-aachen.de (Stefan Hahn) Date: Wed, 31 Aug 2005 20:31:45 +0200 Subject: Perplexity calculation: Strange behavior Message-ID: <200508312031.45859.hahn@i6.informatik.rwth-aachen.de> Hi! During some language modeling using the SRI Toolkit (V.1.4.3 and V.1.4.5) on i686 Intel GNU/Linux I encountered some strange behavior concerning perplexity calculation: For any order greater than 3, the perplexity calculated with ngram seems to be fixed and wrong. For example, I used Defoe's "Robinson Crusoe" to create modified Kneser-Ney discounted Language Models for orders 1 up to 6 and calculated the perplexity for the same text using "ngram" and our own software: +------------------------+ I perplexity I +-------+-------------+----------+ I order | SRI-Toolkit I our Tool I +-------+-------------+----------+ I 1 I 394.79 I 394.794 I +-------+-------------+----------+ I 2 I 68.0706 I 68.071 I +-------+-------------+----------+ I 3 I 54.29 I 54.2903 I +-------+-------------+----------+ I 4 I 57.1554 I 52.6306 I +-------+-------------+----------+ I 5 I 57.1554 I 52.6502 I +-------+-------------+----------+ I 6 I 57.1554 I 52.7033 I +-------+-------------+----------+ The script I used to download "Robinson Crusoe", create the LMs and SRI-results: wget "http://www-i6.informatik.rwth-aachen.de/~gollan/make-lm-01.sh" chmod a+x make-lm-01.sh ./make-lm-01.sh Is there any error in my script? Thanks, Stefan From stolcke at speech.sri.com Wed Aug 31 13:06:53 2005 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Wed, 31 Aug 2005 13:06:53 PDT Subject: Perplexity calculation: Strange behavior In-Reply-To: Your message of Wed, 31 Aug 2005 20:31:45 +0200. <200508312031.45859.hahn@i6.informatik.rwth-aachen.de> Message-ID: <200508312006.j7VK6rs23399@huge> In message <200508312031.45859.hahn at i6.informatik.rwth-aachen.de>you wrote: > Hi! > > During some language modeling using the SRI Toolkit (V.1.4.3 and V.1.4.5) on > i686 Intel GNU/Linux I encountered some strange behavior concerning perplexit > y > calculation: > For any order greater than 3, the perplexity calculated with ngram seems to b > e > fixed and wrong. > For example, I used Defoe's "Robinson Crusoe" to create modified Kneser-Ney > discounted Language Models for orders 1 up to 6 and calculated the perplexity > > for the same text using "ngram" and our own software: > > +------------------------+ > I perplexity I > +-------+-------------+----------+ > I order | SRI-Toolkit I our Tool I > +-------+-------------+----------+ > I 1 I 394.79 I 394.794 I > +-------+-------------+----------+ > I 2 I 68.0706 I 68.071 I > +-------+-------------+----------+ > I 3 I 54.29 I 54.2903 I > +-------+-------------+----------+ > I 4 I 57.1554 I 52.6306 I > +-------+-------------+----------+ > I 5 I 57.1554 I 52.6502 I > +-------+-------------+----------+ > I 6 I 57.1554 I 52.7033 I > +-------+-------------+----------+ I haven't looked at your script, but my guess is that you didn't specify the -order option when evaluating the LM. The default is to only use up to trigram probabilities regardless of what is in the LM file. (That's for historical reasons.) So of course you get same result for any LM order >=4 . Also, because of KN, you are getting a degradation relative to the trigram, as the lower-order probabilities are optimized to minimize the higher-order estimates. If this is not the case then we may have a bug, but I can assure you that we use order >= 4 all the time. --Andreas > > The script I used to download "Robinson Crusoe", create the LMs and > SRI-results: > > wget "http://www-i6.informatik.rwth-aachen.de/~gollan/make-lm-01.sh" > chmod a+x make-lm-01.sh > ./make-lm-01.sh > > Is there any error in my script? > Thanks, > Stefan From stolcke at speech.sri.com Wed Aug 31 13:19:44 2005 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Wed, 31 Aug 2005 13:19:44 PDT Subject: language model adaptation In-Reply-To: Your message of Wed, 31 Aug 2005 16:51:53 +0200. Message-ID: <200508312019.j7VKJmA24208@huge> There is a better method (no coding required). Use the ngram -rescore-ngram option. You feed it the unadapted LM, and the probs will be recomputed by the adapted model (and renormalized). I haven't tried this specifically with -adapt-marginals, but it should work for any LM supported by ngram. --Andreas In message you wrote: > ------=_Part_4844_29044224.1125499913119 > Content-Type: text/plain; charset=ISO-8859-1 > Content-Transfer-Encoding: quoted-printable > Content-Disposition: inline > > HelloSriLM users, > > I am looking for a way to save language models adapted with=20 > > ngram -adapt-marginals. > > Is it possible ? > > As the write method for AdaptiveMarginals is not implemented, I tried to=20 > write one (copied from writeWithOrder in NgramLM.cc) , by calling=20 > wordProb(pword,context+1) instead of iterating on the tree, but it doesn't= > =20 > seem to work. The probabilities I get in the file, are not the same as the= > =20 > one I get if I parse a text file with ppl option (precisely, alphas are the= > =20 > same, but denominators are different).=20 > > Any clue or solution to save an adapted model are welcome ! > > Thank you, > > -- > Christopher Kermorvant=20 > > > void > AdaptiveMarginals::write(File &file) > { > this->running(true) ; > Ngram * lm =3D (Ngram*)&baseLM; > unsigned i; > unsigned howmanyNgrams[maxNgramOrder + 1]; > VocabIndex context[maxNgramOrder + 2]; > VocabString scontext[maxNgramOrder + 1]; > unsigned order=3Dlm->setorder();=20 > if (order > maxNgramOrder) { > order =3D maxNgramOrder; > } > > fprintf(file, "\n\\data\\\n"); > > for (i =3D 1; i <=3D order; i++ ) { > howmanyNgrams[i] =3D lm->numNgrams(i); > fprintf(file, "ngram %d=3D%d\n", i, howmanyNgrams[i]); > } > > for (i =3D 1; i <=3D order; i++ ) { > fprintf(file, "\n\\%d-grams:\n", i); > > if (debug(DEBUG_WRITE_STATS)) { > dout() << "writing " << howmanyNgrams[i] << " " > < } > > NgramBOsIter iter(*lm, context + 1, i - 1, vocab.compareIndex()); > BOnode *node; > > while (node =3D iter.next()) { > > vocab.getWords(context + 1, scontext, maxNgramOrder + 1); > Vocab::reverse(scontext); > > NgramProbsIter piter(*node, vocab.compareIndex()); > VocabIndex pword; > LogP *prob; > LogP probAd ; > while (prob =3D piter.next(pword)) { > if (file.error()) { > return; > } > context[0] =3D pword; > probAd =3D wordProb(pword,context+1); > *prob =3D probAd ; > fprintf(file, "%.*lg\t", LogP_Precision, > (double)(*prob =3D=3D LogP_Zero ? > LogP_PseudoZero : *prob)); > Vocab::write(file, scontext); > fprintf(file, "%s%s", (i > 1 ? " " : ""), vocab.getWord(pword)); > > if (i < order) { > context[0] =3D pword; > > LogP *bow =3D lm->findBOW(context); > if (bow) { > fprintf(file, "\t%.*lg", LogP_Precision, > (double)(*bow =3D=3D LogP_Zero ? > LogP_PseudoZero : *bow)); > } > } > > fprintf(file, "\n"); > } > } > } > > fprintf(file, "\n\\end\\\n"); > } > > ------=_Part_4844_29044224.1125499913119 > Content-Type: text/html; charset=ISO-8859-1 > Content-Transfer-Encoding: quoted-printable > Content-Disposition: inline > > HelloSriLM users,
>
> I am looking for a way to save language models adapted with
>
> ngram -adapt-marginals.
>
> Is it possible ?
>
> As the write method for AdaptiveMarginals is not implemented, I tried > to write one (copied from writeWithOrder in NgramLM.cc) , by calling > wordProb(pword,context+1) instead of iterating on the tree, but it > doesn't seem to work. The probabilities I get in the file, are not the > same as the one I get if I parse a text file with ppl option > (precisely, alphas are the same, but denominators are different).
>
> Any clue or solution to save an adapted model are welcome !
>
> Thank you,
>
> --
> Christopher Kermorvant
>
>
> void
> AdaptiveMarginals::write(File &file)
> {
> this->running(true) ;
> Ngram * lm =3D (Ngram*)&baseLM;
> unsigned i;
> unsigned howmanyNgrams[maxNgramOrder + 1];
> VocabIndex context[maxNgramOrder + 2];
> VocabString scontext[maxNgramOrder + 1];
> unsigned order=3Dlm->setorder();
> if (order > maxNgramOrder) {
> order =3D maxNgramOrder;
> }
>
> fprintf(file, "\n\\data\\\n");
>
> for (i =3D 1; i <=3D order; i++ ) {
> howmanyNgrams[i] =3D lm->numNgrams(i);
> fprintf(file, "ngram %d=3D%d\n", i, howmanyNgr= > ams[i]);
> }
>
> for (i =3D 1; i <=3D order; i++ ) {
> fprintf(file, "\n\\%d-grams:\n", i);
>
> if (debug(DEBUG_WRITE_STATS)) {<= > br> > dout() << "writing " = > << howmanyNgrams[i] << " "
> < t;-grams\n";
> }
>
> NgramBOsIter iter(*lm, context + 1, i - 1, vocab.compare= > Index());
> BOnode *node;
>
> while (node =3D iter.next()) {
>
> vocab.getWords(context + 1, scontext,= > maxNgramOrder + 1);
> Vocab::reverse(scontext);
>
> NgramProbsIter piter(*node, vocab.com= > pareIndex());
> VocabIndex pword;
> LogP *prob;
> LogP probAd ;
> while (prob =3D piter.next(pword)) {<= > br> > if (file.error()) {
> return;
> }
> context[0] =3D pword;
> probAd =3D wordProb(pword,context+1);= >
> *prob =3D probAd ;
> fprintf(file, "%.*lg\t", Lo= > gP_Precision,
> = > (double)(*prob =3D=3D LogP_Zero ?
> > > LogP_PseudoZero : *prob));
> Vocab::write(file, scontext);
> fprintf(file, "%s%s", (i &g= > t; 1 ? " " : ""), vocab.getWord(pword));
>
> if (i < order) {
> context[0] =3D pwo= > rd;
>
> LogP *bow =3D lm-&= > gt;findBOW(context);
> if (bow) {
> fprintf(file, &quo= > t;\t%.*lg", LogP_Precision,
> = > (double)(*bow =3D=3D LogP_Zero ?
> > > LogP_PseudoZero : *bow));
> }
> }
>
> fprintf(file, "\n");
> }
> }
> }
>
> fprintf(file, "\n\\end\\\n");
> }
>
>
> > ------=_Part_4844_29044224.1125499913119-- From hahn at i6.informatik.rwth-aachen.de Thu Sep 1 02:52:29 2005 From: hahn at i6.informatik.rwth-aachen.de (Stefan Hahn) Date: Thu, 1 Sep 2005 11:52:29 +0200 Subject: Perplexity calculation: Strange behavior In-Reply-To: <200508312006.j7VK6rs23399@huge> References: <200508312006.j7VK6rs23399@huge> Message-ID: <200509011152.30373.hahn@i6.informatik.rwth-aachen.de> Hi again! Your guess was perfectly right, I simply overlooked to specify the -order option for perplexity calculation.... Thanks again, Stefan > In message <200508312031.45859.hahn at i6.informatik.rwth-aachen.de>you wrote: > > Hi! > > > > During some language modeling using the SRI Toolkit (V.1.4.3 and V.1.4.5) > > on i686 Intel GNU/Linux I encountered some strange behavior concerning > > perplexit y > > calculation: > > For any order greater than 3, the perplexity calculated with ngram seems > > to b e > > fixed and wrong. > > For example, I used Defoe's "Robinson Crusoe" to create modified > > Kneser-Ney discounted Language Models for orders 1 up to 6 and calculated > > the perplexity > > > > for the same text using "ngram" and our own software: > > > > +------------------------+ > > I perplexity I > > +-------+-------------+----------+ > > I order | SRI-Toolkit I our Tool I > > +-------+-------------+----------+ > > I 1 I 394.79 I 394.794 I > > +-------+-------------+----------+ > > I 2 I 68.0706 I 68.071 I > > +-------+-------------+----------+ > > I 3 I 54.29 I 54.2903 I > > +-------+-------------+----------+ > > I 4 I 57.1554 I 52.6306 I > > +-------+-------------+----------+ > > I 5 I 57.1554 I 52.6502 I > > +-------+-------------+----------+ > > I 6 I 57.1554 I 52.7033 I > > +-------+-------------+----------+ > > I haven't looked at your script, but my guess is that you didn't specify > the -order option when evaluating the LM. The default is to only use > up to trigram probabilities regardless of what is in the LM file. > (That's for historical reasons.) So of course you get same result for > any LM order >=4 . Also, because of KN, you are getting a degradation > relative to the trigram, as the lower-order probabilities are optimized > to minimize the higher-order estimates. > > If this is not the case then we may have a bug, but I can assure you that > we use order >= 4 all the time. > > --Andreas > > > The script I used to download "Robinson Crusoe", create the LMs and > > SRI-results: > > > > wget "http://www-i6.informatik.rwth-aachen.de/~gollan/make-lm-01.sh" > > chmod a+x make-lm-01.sh > > ./make-lm-01.sh > > > > Is there any error in my script? > > Thanks, > > Stefan From stolcke at speech.sri.com Fri Sep 2 02:36:28 2005 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Fri, 02 Sep 2005 02:36:28 PDT Subject: SRILM 1.4.5 released In-Reply-To: Your message of Fri, 02 Sep 2005 14:19:34 +0900. <006601c5af7d$e7ff2cc0$7a40ba85@pcn122> Message-ID: <200509020936.CAA16765@tonga> In message <006601c5af7d$e7ff2cc0$7a40ba85 at pcn122>you wrote: > Dear Dr.Stolcke, > > Thanks for the reply. Recently I have made an experiment on loglinear > integration of multiple LMs. I found if the input LMs are in Arpa format, > the output loglinear LM can be turned into Arpa format. I think you could > give me some comments on the work because it seems you are thinking in the > same way as I. I have made a program to implement the idea. Could you take > some time to have a look at the page? > http://www.slc.atr.jp/~rzhang/lma.html Ruiqiang, Thank you for this pointer to your web page. I haven't had time to understand your code in detail, but I believe you can achieve a similar effect using SRILM, maybe not as efficiently. You can use the following steps to construct a merged N-gram LM incorporating log-linear mixture probabilities. Suppose your input LMs are called LM1, LM2, LM3, etc. 1. Use the standard linear mixture LM to create a new LM that has the union of all the ngrams: ngram -lm LM1 -mix-lm LM2 ... -write-lm MIXLM 2. Recompute the probabilities in the merged LM using the log-linear model: ngram -lm LM1 -mix-lm LM2 ... -rescore-ngram MIXLM -write-lm LOGMIXLM (I omitted the lambda weight options in the last command.) Note this will be rather slow, since the probability normalization has to be carried out for each ngram context in the LM. --Andreas > Thanks, > -Ruiqiang > > > ----- Original Message ----- > From: "Andreas Stolcke" > To: "Ruiqiang Zhang" > Sent: Tuesday, August 30, 2005 3:05 PM > Subject: Re: SRILM 1.4.5 released > > > > > > In message <016301c5ad28$30ce7120$7a40ba85 at pcn122>you wrote: > > > Dear Dr.Stolcke, > > > > > > Thanks for informing me the latest version. I tested the new features, > > > however, i got no output when I was using the "loglinear-mix". The usage > and > > > errors are shown below. Please give me a reply in your free time. > > > > > > [rzhang at pxn147 test]$ ngram -lm lm1 -mix-lm lm2 -lambda > > > 0.5 -loglinear-mix -write-lm lm.o > > > lm1: line 1034: warning: non-zero probability for in > closed-vocabulary > > > LM > > > lm2: line 1191: warning: non-zero probability for in > closed-vocabulary > > > LM > > > write() method not implemented > > > > -loglinear-mix does not support merging of LMs (static interpolation). > > you can only evaluate the mixture LM using -ppl and other options that > compute > > LM probabilies (dynamic interpolation). > > > > the first two messages are just warnings. Use the -unk option to avoid > > them. > > > > --Andreas > > > > > > From ereiter at csd.abdn.ac.uk Tue Sep 6 06:07:16 2005 From: ereiter at csd.abdn.ac.uk (Ehud Reiter) Date: Tue, 06 Sep 2005 14:07:16 +0100 Subject: Problems finding best path (to choose synonynm) Message-ID: <6.0.1.1.1.20050906134955.036d7238@pigeon.csd.abdn.ac.uk> I'm trying to use srilm for a Natural Language Generation application, to choose between synonymns of a word. The input to a system is a structure such as you OR[answered,got] 4 questions OR[correctly,correct,right] The system needs to make a choice at each OR point, with the goal of producing the easiest-to-read final sentence. There are preference weights for the choices, for example, "answered" gets a preference weight of 0.2 and "got" gets 0.8, this reflects the fact that even ignoring LM issues we expect "got" to be easier to read (shorter, simple phoneme->letter mapping) I represent the above as a "wlat" format file, which I convert to pfsg and then run lattice-tool on. However, I can't get lattice-tool to find the best path through the mesh taking into account both the language model and the preference weights. If I specify -viterbi-decode I get the best path based on the LM (but ignoring the preference scores), while if I specify -posterior-decode I get the best path based on preference scores (but ignoring the LM). I'd also like to see the actual scores, I thought I would get this with -nbest-decode but the nbest file has 0 for all the scores. Is there any way to find the best path taking both LM and preference weights into consideration, and giving actual scores? Many thanks Ehud Reiter From stolcke at speech.sri.com Tue Sep 6 09:24:24 2005 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Tue, 06 Sep 2005 09:24:24 PDT Subject: Problems finding best path (to choose synonynm) In-Reply-To: Your message of Tue, 06 Sep 2005 14:07:16 +0100. <6.0.1.1.1.20050906134955.036d7238@pigeon.csd.abdn.ac.uk> Message-ID: <200509061624.j86GOOQ06298@huge> In message <6.0.1.1.1.20050906134955.036d7238 at pigeon.csd.abdn.ac.uk>you wrote: > I'm trying to use srilm for a Natural Language Generation > application, to choose between synonymns of a word. The input > to a system is a structure such as > > you OR[answered,got] 4 questions OR[correctly,correct,right] > > The system needs to make a choice at each OR point, with the > goal of producing the easiest-to-read final sentence. There are > preference weights for the choices, for example, "answered" > gets a preference weight of 0.2 and "got" gets 0.8, this reflects > the fact that even ignoring LM issues we expect "got" > to be easier to read (shorter, simple phoneme->letter mapping) > > I represent the above as a "wlat" format file, which I convert > to pfsg and then run lattice-tool on. However, I can't get > lattice-tool to find the best path through the mesh taking into > account both the language model and the preference weights. > If I specify -viterbi-decode I get the best path based on the > LM (but ignoring the preference scores), while if I specify > -posterior-decode I get the best path based on preference scores > (but ignoring the LM). I'd also like to see the actual scores, > I thought I would get this with -nbest-decode but the nbest file > has 0 for all the scores. > > Is there any way to find the best path taking both LM and > preference weights into consideration, and giving actual > scores? I think you would have to directly encode your problem as an HTK-style lattice, where you can have a number of scores associated with each word. The HTK format is not documented as part of SRILM, but as part of the HTK documentation (which is available online at http://htk.eng.cam.ac.uk/ That said, it seems like your problem is more straightforwardly encoded as a HMM tagging problem. Have a look at the disambig tool, especially the -text-map option. The preference values would be encoded in the map file, and the unamiguous words are mapped to themselves. --Andreas > k > Many thanks > Ehud Reiter > From tl.martin at qut.edu.au Tue Sep 6 17:55:09 2005 From: tl.martin at qut.edu.au (tl.martin at qut.edu.au) Date: Wed, 7 Sep 2005 10:55:09 +1000 (EST) Subject: Interpolating Lang Models for Indonesian ASR Message-ID: <20050907105509.BSN53884@mail-msgstore01.qut.edu.au> Hi All, At present I am trying to use the SRI tools to improve the LM for an Indonesian ASR system we are building. We have just over ten hours of Australian Broadcast Commision training data and at present the system gets just over 80% on a heldout test set, with a bigram LM trained on the training data. However, we also have approximately 12 million words of Text from Indonesian papers Kompass and Tempo and were hoping that we could interpolate these with the existing ABC LM to improve the ngram estimates and subsequent perplexity. Evaluating the ABC data PPL on a separate dev transcript provides a ppl=297 Noting the advice given in the package notes when using limited vocabs(vocab is 11000 words) I computed discount coefficients first on unlimited vocab.and then subsequently used these as input to a second pass of n-gram-count to get the LM. I used good-turing. I then ran ngram -lm $sPATH_OUTPUT/lm/arpa.bo.lm -order 2 -vocab $DESIRED_VOCAB -limit-vocab -ppl output/sr\ i_trans/$PPL_CORPUS.dev.sri.trans to get perplexity score Using a similar technique using the much larger set of Kompass text produces a ppl score of 808 when evaluated on the ABC dev set. All is well and good, until I try and interpolate the 2. I have trialled two approaches. The first uses the dynamic interpolation capabality incorporated in ngram.Using ngram -bayes 0 -lm ./ABC/lm/arpa.bo.lm -mix- lm ./Kompass/lm/arpa.bo.lm -debug 2 - ppl ./sri_trans/ABC.dev.sri.trans gives a ppl of 342 ie much worse than the original 297. I then tried using the "compute-best-mix" utility which starts of as expected at lambda values 0.5 and 0.5 and iterates to 0.66 and 0.33.Plugging these vals into ngram -lm ./ABC/lm/arpa.bo.lm -lambda 0.66 -mix- lm ./Kompass/lm/arpa.bo.lm -debug 1 -ppl output/sri_trans/ $PPL_CORPUS.dev.sri.tran|tail yields s ppl= 331.8 ppl1= 608.504 still worse. I would expect it to perhaps stay the same, and iterate to lambda values which excluded the Kompass data, but these seem to be at odds with the ppl score. I then trialled the same technique using Switchboard and Gigaword and got the normal expected behaviour ie improvement Unsure of whether this was because the Kompass data was unsuitable or I was just making a foolish error somewhere I trialled CMU LM toolkit. Agian using gt discounting to build a lm and evaluate on ABC devset gives a ppl of 268 which was a little surprising. More surprising was when I used their interpolation tools. To cut the story short it produces: weights: 0.547 0.453 (7843 items) - -> PP=152.624029 =============> TOTAL PP = 152.624 No doubt the devil is in the detail, but has anyone got some suggestions. Cheers Terry Martin QUT Speech Lab Australia From stolcke at speech.sri.com Wed Sep 7 09:32:09 2005 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Wed, 07 Sep 2005 09:32:09 PDT Subject: Interpolating Lang Models for Indonesian ASR In-Reply-To: Your message of Wed, 07 Sep 2005 10:55:09 +1000. <20050907105509.BSN53884@mail-msgstore01.qut.edu.au> Message-ID: <200509071632.j87GW9c10411@huge> My first suggestion is to make sure that all LMs you are comparing and interpolating use the same vocabulary. In SRILM you can enforce this by using the -vocab option when building the LM. PPLs over different vocabularies are of course not comparable, and it is easy to mess this up when building LMs from different data sets. --Andreas In message <20050907105509.BSN53884 at mail-msgstore01.qut.edu.au>you wrote: > Hi All, > > At present I am trying to use the SRI tools to improve the > LM for an Indonesian ASR system we are building. We have > just over ten hours of Australian Broadcast Commision > training data and at present the system gets just over 80% > on a heldout test set, with a bigram LM trained on the > training data. However, we also have approximately 12 > million words of Text from Indonesian papers Kompass and > Tempo and were hoping that we could interpolate these with > the existing ABC LM to improve the ngram estimates and > subsequent perplexity. > > Evaluating the ABC data PPL on a separate dev transcript > provides a ppl=297 > > Noting the advice given in the package notes when using > limited vocabs(vocab is 11000 words) I computed discount > coefficients first on unlimited vocab.and then subsequently > used these as input to a second pass of n-gram-count to get > the LM. I used good-turing. > > > I then ran > > ngram -lm $sPATH_OUTPUT/lm/arpa.bo.lm -order 2 -vocab > $DESIRED_VOCAB -limit-vocab -ppl output/sr\ > i_trans/$PPL_CORPUS.dev.sri.trans > > to get perplexity score > > Using a similar technique using the much larger set of > Kompass text produces a ppl score of 808 when evaluated on > the ABC dev set. > > > All is well and good, until I try and interpolate the 2. I > have trialled two approaches. The first uses the dynamic > interpolation capabality incorporated in ngram.Using > > ngram -bayes 0 -lm ./ABC/lm/arpa.bo.lm -mix- > lm ./Kompass/lm/arpa.bo.lm -debug 2 - > ppl ./sri_trans/ABC.dev.sri.trans gives a ppl of 342 ie > much worse than the original 297. > > I then tried using the "compute-best-mix" utility which > starts of as expected at lambda values 0.5 and 0.5 and > iterates to 0.66 and 0.33.Plugging these vals into > > ngram -lm ./ABC/lm/arpa.bo.lm -lambda 0.66 -mix- > lm ./Kompass/lm/arpa.bo.lm -debug 1 -ppl output/sri_trans/ > $PPL_CORPUS.dev.sri.tran|tail yields > s > > ppl= 331.8 ppl1= 608.504 > > > still worse. I would expect it to perhaps stay the same, and > iterate to lambda values which excluded the Kompass data, > but these seem to be at odds with the ppl score. I then > trialled the same technique using Switchboard and Gigaword > and got the normal expected behaviour ie improvement > > Unsure of whether this was because the Kompass data was > unsuitable or I was just making a foolish error somewhere I > trialled CMU LM toolkit. Agian using gt discounting to build > a lm and evaluate on ABC devset gives a ppl of 268 which was > a little surprising. More surprising was when I used their > interpolation tools. To cut the story short it produces: > > > weights: 0.547 0.453 (7843 items) - > -> PP=152.624029 > > =============> TOTAL PP = 152.624 > > No doubt the devil is in the detail, but has anyone got some > suggestions. > > Cheers > > Terry Martin > QUT Speech Lab > Australia From stolcke at speech.sri.com Wed Sep 7 09:49:40 2005 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Wed, 07 Sep 2005 09:49:40 PDT Subject: Problems finding best path (to choose synonynm) In-Reply-To: Your message of Wed, 07 Sep 2005 10:28:28 +0100. <6.0.1.1.1.20050907095331.036a0a60@pigeon.csd.abdn.ac.uk> Message-ID: <200509071649.JAA28602@tonga> Ehud, I just realized the latest version of SRILM (1.4.5) might solve your problem. You can read word confusion networks directly into lattice-tool. lattice-tool -read-mesh -in-lattice test.wlat -split-multiwords -lm ... -write-htk -out-lattice - will convert your confusion network to HTK format, and rescore with an LM (all the usual LM options apply). The original scores are encoded as "x1" scores. You can then decode the lattices wiht Viterbi, using a weighting specified by the -htk-lmscale and -htk-x1scale options. I think you need to do these two steps in two separate invocations of lattices-tool. --Andreas In message <6.0.1.1.1.20050907095331.036a0a60 at pigeon.csd.abdn.ac.uk>you wrote: > Andreas - Thanks very much for suggesting the disambig tool, > this works very well for simple cases when synonyms are the > same length. Unfortunately (as I didn't say in my first message) > I also have cases where synonyms are of different lengths, eg > > you should not be OR[discouraged,put off] by > OR[the results of,NULL] this OR[assessment,test] > > I can't see how to get disambig to handle such cases. Thats > why wlat format and lattice-tool seemed promising, since they handle > the above with *DELETE* pseudo-words and -split-multiwords option > > Is there any way to tell lattice-tool that the scores on the > input lattice are purely "acoustic" scores (that is, my > preference weights), and that it should > compute language model scores separately and combine these with > the existing "acoustic" scores in the lattice? > > I suspect there is some simple thing I need to do, or simple > mistake which I am making > > An example wlat file that I have created (for the above example) is > > numaligns 9 > name ex9 > posterior 1 > align 0 you 1 align 1 should 1 > align 2 not 1 > align 3 be 1 > align 4 discouraged 0.33 put_off 0.67 > align 5 by 1 > align 6 the_results_of 0.33 *DELETE* 0.67 > align 7 this 1 > align 8 assessment 0.25 test 0.75 > > I convert the above to pfsg using wlat-to-pfsg, and then invoke > > lattice-tool -viterbi-decode -in-lattice ex9.pfsg -lm LM -split-multiwords > > If there isn't a way to do the above, I'll try to create HTK > formal files as you suggest > > Thanks for you help! > Ehud > > At 17:24 06/09/2005, you wrote: > > >In message <6.0.1.1.1.20050906134955.036d7238 at pigeon.csd.abdn.ac.uk>you wrot > e: > >> I'm trying to use srilm for a Natural Language Generation > >> application, to choose between synonymns of a word. The input > >> to a system is a structure such as > >> > >> you OR[answered,got] 4 questions OR[correctly,correct,right] > >> > >> The system needs to make a choice at each OR point, with the > >> goal of producing the easiest-to-read final sentence. There are > >> preference weights for the choices, for example, "answered" > >> gets a preference weight of 0.2 and "got" gets 0.8, this reflects > >> the fact that even ignoring LM issues we expect "got" > >> to be easier to read (shorter, simple phoneme->letter mapping) > >> > >> I represent the above as a "wlat" format file, which I convert > >> to pfsg and then run lattice-tool on. However, I can't get > >> lattice-tool to find the best path through the mesh taking into > >> account both the language model and the preference weights. > >> If I specify -viterbi-decode I get the best path based on the > >> LM (but ignoring the preference scores), while if I specify > >> -posterior-decode I get the best path based on preference scores > >> (but ignoring the LM). I'd also like to see the actual scores, > >> I thought I would get this with -nbest-decode but the nbest file > >> has 0 for all the scores. > >> > >> Is there any way to find the best path taking both LM and > >> preference weights into consideration, and giving actual > >> scores? > > > >I think you would have to directly encode your problem as an HTK-style > >lattice, where you can have a number of scores associated with each word. > >The HTK format is not documented as part of SRILM, but as part of > >the HTK documentation (which is available online at > >http://htk.eng.cam.ac.uk/ > > > >That said, it seems like your problem is more straightforwardly encoded > >as a HMM tagging problem. Have a look at the disambig tool, especially > >the -text-map option. The preference values would be encoded in the > >map file, and the unamiguous words are mapped to themselves. > > > >--Andreas > > > >> > >k > >> Many thanks > >> Ehud Reiter > >> > From joonki74 at etri.re.kr Wed Sep 7 10:01:18 2005 From: joonki74 at etri.re.kr (=?ks_c_5601-1987?B?w9bB2LHi?=) Date: Thu, 8 Sep 2005 02:01:18 +0900 Subject: Interpolating Lang Models for Indonesian ASR Message-ID: Hi, I wonder your texts have sentence boundary symbols such as ", ". In CMU toolkit, you should define the symbols explictly in context cue file to control these symbols. According to my experience, with same discounting scheme, same cut-off value and careful usage of back-off options(from unknown words and sentence boundary), the two LM toolkit generate almost same perplexities. (except some round-off errors) On interpolation problem, IMHO, I suggest that you compare the N-gram probability stream on same text. (using fprobs files in CMU LM toolkit and -debug 2 option in SRI LM toolkit) regards, Joon Ki Choi > -----Original Message----- > From: owner-srilm-user at speech.sri.com > [mailto:owner-srilm-user at speech.sri.com] On Behalf Of > tl.martin at qut.edu.au > Sent: Wednesday, September 07, 2005 9:55 AM > To: srilm-user at speech.sri.com > Subject: Interpolating Lang Models for Indonesian ASR > > > Hi All, > > At present I am trying to use the SRI tools to improve the > LM for an Indonesian ASR system we are building. We have > just over ten hours of Australian Broadcast Commision > training data and at present the system gets just over 80% > on a heldout test set, with a bigram LM trained on the > training data. However, we also have approximately 12 > million words of Text from Indonesian papers Kompass and > Tempo and were hoping that we could interpolate these with > the existing ABC LM to improve the ngram estimates and > subsequent perplexity. > > Evaluating the ABC data PPL on a separate dev transcript > provides a ppl=297 > > Noting the advice given in the package notes when using > limited vocabs(vocab is 11000 words) I computed discount > coefficients first on unlimited vocab.and then subsequently > used these as input to a second pass of n-gram-count to get > the LM. I used good-turing. > > > I then ran > > ngram -lm $sPATH_OUTPUT/lm/arpa.bo.lm -order 2 -vocab > $DESIRED_VOCAB -limit-vocab -ppl output/sr\ > i_trans/$PPL_CORPUS.dev.sri.trans > > to get perplexity score > > Using a similar technique using the much larger set of > Kompass text produces a ppl score of 808 when evaluated on > the ABC dev set. > > > All is well and good, until I try and interpolate the 2. I > have trialled two approaches. The first uses the dynamic > interpolation capabality incorporated in ngram.Using > > ngram -bayes 0 -lm ./ABC/lm/arpa.bo.lm -mix- > lm ./Kompass/lm/arpa.bo.lm -debug 2 - > ppl ./sri_trans/ABC.dev.sri.trans gives a ppl of 342 ie > much worse than the original 297. > > I then tried using the "compute-best-mix" utility which > starts of as expected at lambda values 0.5 and 0.5 and > iterates to 0.66 and 0.33.Plugging these vals into > > ngram -lm ./ABC/lm/arpa.bo.lm -lambda 0.66 -mix- > lm ./Kompass/lm/arpa.bo.lm -debug 1 -ppl output/sri_trans/ > $PPL_CORPUS.dev.sri.tran|tail yields > s > > ppl= 331.8 ppl1= 608.504 > > > still worse. I would expect it to perhaps stay the same, and > iterate to lambda values which excluded the Kompass data, > but these seem to be at odds with the ppl score. I then > trialled the same technique using Switchboard and Gigaword > and got the normal expected behaviour ie improvement > > Unsure of whether this was because the Kompass data was > unsuitable or I was just making a foolish error somewhere I > trialled CMU LM toolkit. Agian using gt discounting to build > a lm and evaluate on ABC devset gives a ppl of 268 which was > a little surprising. More surprising was when I used their > interpolation tools. To cut the story short it produces: > > > weights: 0.547 0.453 (7843 items) - > -> PP=152.624029 > > =============> TOTAL PP = 152.624 > > No doubt the devil is in the detail, but has anyone got some > suggestions. > > Cheers > > Terry Martin > QUT Speech Lab > Australia > From dilekh at yahoo.com Thu Sep 8 20:29:01 2005 From: dilekh at yahoo.com (Dilek Hakkani) Date: Thu, 8 Sep 2005 20:29:01 -0700 (PDT) Subject: converting ngram format model to AT&T FSM format Message-ID: <20050909032901.81882.qmail@web40506.mail.yahoo.com> Hi, I'm trying to convert an n-gram model (e.g., a.lm) into AT&T FSM format. I have first used make-ngram-pfsg (e.g., make-ngram-pfsg a.lm > a.pfsg), then I used pfsg-to-fsm (e.g., pfsg-fsm a.pfsg > a.fsm). I have some questions regarding the interpretation of the transition probabilities and labels: 1. words are represented as themselves in the n-gram format, but in the FSM format model, the transitions seem to have an index. Which word is represented with which index? Can it be extracted from the order of the unigrams in the ngram format file? Is 0 representing an epsilon? 2. Are the transition probabilities -10000.5*logprobs? 3. What do the state potentials represent? Also, is there a better way of doing these? I appreciate any help about these. Thanks, Dilek Hakkani-T?r --------------------------------- Click here to donate to the Hurricane Katrina relief effort. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at speech.sri.com Tue Sep 13 18:50:30 2005 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Tue, 13 Sep 2005 18:50:30 PDT Subject: converting ngram format model to AT&T FSM format In-Reply-To: Your message of Thu, 08 Sep 2005 20:29:01 -0700. <20050909032901.81882.qmail@web40506.mail.yahoo.com> Message-ID: <200509140150.j8E1oUb01649@huge> In message <20050909032901.81882.qmail at web40506.mail.yahoo.com>you wrote: > Hi, > I'm trying to convert an n-gram model (e.g., a.lm) into AT&T FSM format. > I have first used make-ngram-pfsg (e.g., make-ngram-pfsg a.lm > a.pfsg), then > I used pfsg-to-fsm (e.g., pfsg-fsm a.pfsg > a.fsm). I have some questions re > garding the interpretation of the transition probabilities and labels: > 1. words are represented as themselves in the n-gram format, but in the FSM f > ormat model, the transitions seem to have an index. Which word is represented > with which index? Can it be extracted from the order of the unigrams in the > ngram format file? Is 0 representing an epsilon? Use pfsg-to-fsm symbolfile=FILE to dump the index-to-word mapping to FILE. FILE can then be used with the FSM tool options -i and -o (this is explained in the pfsg-scripts man page). > 2. Are the transition probabilities -10000.5*logprobs? They are, because that's what make-ngram-pfsg outputs, and pfsg-to-fsm doesn't change the scaling except changing the sign. But you can undo this scaling by using the pfsg-to-fsm scale=S option and setting S=1/-10000.5. Note this will give you back log-base-10, not log-base-e. > 3. What do the state potentials represent? They are the costs of ending a path in a given state. I don't think they're used in the encoding of PFSGs. > Also, is there a better way of doing these? Probably, but not in SRILM ;-) --Andreas From yannick.esteve at lium.univ-lemans.fr Thu Sep 15 02:50:01 2005 From: yannick.esteve at lium.univ-lemans.fr (=?ISO-8859-1?Q?Yannick_Est=E8ve?=) Date: Thu, 15 Sep 2005 11:50:01 +0200 Subject: converting ngram format model to AT&T FSM format In-Reply-To: <200509140150.j8E1oUb01649@huge> References: <200509140150.j8E1oUb01649@huge> Message-ID: <8533EFD3-2DB0-40E8-BE74-E6694F90EA94@lium.univ-lemans.fr> Le 14 sept. 05 ? 03:50, Andreas Stolcke a ?crit : > > In message <20050909032901.81882.qmail at web40506.mail.yahoo.com>you > wrote: > >> Hi, >> I'm trying to convert an n-gram model (e.g., a.lm) into AT&T FSM >> format. > > >> Also, is there a better way of doing these? >> > > Probably, but not in SRILM ;-) > Hi, Not a better way, but simply another one: I have made a tool which converts TRIGRAM (only) language model from ARPA format to AT&T FSM format. You can find it here (under open-source license): http://www-lium.univ-lemans.fr/speechtools/index.html in the section 'Additional resources and miscellaneous tools'. If you want to convert bigram or quadrigram LMs, contact me: I don't still use FSM, but I can improve this program if someone is interested. - Yannick From stolcke at speech.sri.com Sat Sep 17 11:16:50 2005 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Sat, 17 Sep 2005 11:16:50 PDT Subject: compiling SRILM In-Reply-To: Your message of Fri, 16 Sep 2005 16:40:06 -0500. <3b97280005091614402c704325@mail.gmail.com> Message-ID: <200509171816.j8HIGoH23247@huge> In message <3b97280005091614402c704325 at mail.gmail.com>you wrote: > I'm not sure if this email should be sent to srilm-user at speech.sri.com > instead, let me know if you would prefer that. > > I just managed to compile SRILM 1.4.5 after a day of troubleshooting. > There are a couple of minor points that could be helpful to other > people. I'm compiling on a pentium running Linux (Debian 1:3.3.4-12). > > They are all edits in common/Makefile.machine.i686. As will become > clear, compiling C++ isn't my strong point. > > (1) When editing the TCL variables, I didn't initially realize I had > to prepend -I and -L to the path (as below). > > TCL_INCLUDE = -I/usr/include/tcl8.3 > TCL_LIBRARY = -L/usr/include/tcl8.3 > > If the above is actually correct (well, it compiles) then perhaps a > note in the documentation file (INSTALL) would help. > > (2) Currently common/Makefile.machine.i686 has > > GCC_FLAGS = -mtune=pentium3 -Wreturn-type -Wimplicit > > My gcc (3.3.4) and g++ don't like the "tune" option. They prefer > what was in the same file in SRILM 1.4.4, so that this makes > everything compile: > > GCC_FLAGS = -mcpu=pentium3 -Wreturn-type -Wimplicit > > ... > > Ok, that's it. Thanks very much for writing SRILM - I heard that > implements Chen-Goodman-modified Kneser-Ney smoothing, which is > exactly what I was looking for. Now to learn how to use it! > > Dinoj Surendran > Dinoj, thanks for pointing out these issues. All the changes you made are as expected. You are right they could be documented better. --Andreas From james.waterhouse at elf.mcgill.ca Fri Sep 30 10:58:31 2005 From: james.waterhouse at elf.mcgill.ca (james.waterhouse at elf.mcgill.ca) Date: Fri, 30 Sep 2005 13:58:31 -0400 Subject: lattice-tool Message-ID: <1128103111.433d7cc79d9a8@www.webmail.mcgill.ca> Hello, I have a question concerning lattice-tool... Why, if I convert a htk file to pfsg and then back again, are the weights missing in the resulting htk file? James From stolcke at speech.sri.com Fri Sep 30 12:05:58 2005 From: stolcke at speech.sri.com (Andreas Stolcke) Date: Fri, 30 Sep 2005 12:05:58 PDT Subject: lattice-tool In-Reply-To: Your message of Fri, 30 Sep 2005 13:58:31 -0400. <1128103111.433d7cc79d9a8@www.webmail.mcgill.ca> Message-ID: <200509301905.j8UJ5wM14262@huge> In message <1128103111.433d7cc79d9a8 at www.webmail.mcgill.ca>you wrote: > Hello, > > I have a question concerning lattice-tool... Why, if I convert a htk file to > pfsg and then back again, are the weights missing in the resulting htk file? It's either an oversight or a feature, depending on your point of view. The problem is that HTK lattices encode potentially many kinds of scores, but PFSGs only one. So, when outputting a PFSGs as an HTK lattice is is not clear what to map the PFSG weights to. However, PFSG weights should probably not just be discarded in this situation. A useful default behavior would be to map the PFSG weights to HTK "acoustic" scores. The following small change in lattice-tool.cc will have this effect. *** /tmp/T0uymDr_ Fri Sep 30 12:01:40 2005 --- lattice-tool.cc Fri Sep 30 11:48:26 2005 *************** *** 613,618 **** --- 613,623 ---- // by default we leave HTK lattices scores alone HTKScoreMapping htkScoreMapping = mapHTKnone; + if (!readHTK) { + // preserve PFSG weights as HTK acoustic scores + htkScoreMapping = mapHTKacoustic; + } + if (lmFile) { // remove pause and NULL nodes prior to LM application, It will be in the next release unless someone objects. It is still important to remember that converting HTK lattice to PFSGs and back will not preserve all information (such as times, multple scores, pronunciations, etc.). --Andreas From james.waterhouse at elf.mcgill.ca Fri Sep 30 13:05:46 2005 From: james.waterhouse at elf.mcgill.ca (james.waterhouse at elf.mcgill.ca) Date: Fri, 30 Sep 2005 16:05:46 -0400 Subject: lattice-tool: HTK to PFSG In-Reply-To: <200509301905.j8UJ5wM14262@huge> References: <200509301905.j8UJ5wM14262@huge> Message-ID: <1128110746.433d9a9ac3aad@www.webmail.mcgill.ca> Andreas, Thanks for the response to my question about lattice-tool and fsm and pfsg lattices. I have another question.... Could you perhaps tell me how the PFSG scores are calculated from the HTK scores? I thought it was straight forward... weight = amscale*a + lmscale*l + word_penalty but when I looked at the PFSG output for a simple example it was obviously not that simple. I am supposed to look at the code and try and figure things out but if you can tell me in a line or two what is happening then it would be greatly apreciated. Thanks, James p.s. here is the simple example and the result... VERSION=1.0 UTTERANCE=/opt/asrtools/apps/parya/ttimit/test/dr1/faks0/si1573.mfc lmname=/opt/asrtools/apps/parya/ttimit/wdnet lmscale=8.10 wdpenalty=-9.40 vocab=/opt/asrtools/apps/parya/ttimit/dict N=6 L=5 I=0 t=0.00 W=!NULL I=1 t=0.02 W=sent-start v=1 I=2 t=0.04 W=sil4 v=1 I=3 t=0.09 W=sil4 v=1 I=4 t=0.12 W=t v=1 I=5 t=0.15 W=s v=1 J=0 S=0 E=1 a=-249.59 l=0.000 J=1 S=1 E=2 a=-204.42 l=0.000 J=2 S=2 E=3 a=-354.02 l=-6.250 J=3 S=3 E=4 a=-218.17 l=-3.900 J=4 S=4 E=5 a=-215.62 l=-2.440 Resulting PFSG.... name /opt/asrtools/apps/parya/ttimit/test/dr1/faks0/si1573.mfc nodes 11 NULL NULL NULL NULL NULL NULL sent-start sil4 sil4 t s initial 0 final 5 transitions 10 0 6 -2712479 1 7 -2260756 2 8 -4263106 3 9 -2714179 4 10 -2570411 6 1 0 7 2 0 8 3 0 9 4 0 10 5 0