From lakshmi at lantana.tenet.res.in  Tue Oct  3 22:27:51 2006
From: lakshmi at lantana.tenet.res.in (Lakshmi A)
Date: Wed, 4 Oct 2006 10:57:51 +0530 (IST)
Subject: query regarding usage of SRILM toolkit
In-Reply-To: <200609291604.JAA05487@tonga>
References: <200609291604.JAA05487@tonga>
Message-ID: <Pine.LNX.4.60.0610041052040.5435@lantana.tenet.res.in>


Greetings!!!

Thanks for the prompt reply. But the ideas you mentioned seem to be for 
boundary marking when the whole sequence is correct. Our recognition 
output is only 50% correct. That is, we have a sequence of syllables that 
are just 50% correct, from which we need to extract the words. The n-best 
results of the recognizer could be used to improve the performance. We can 
have a lattice of syllable sequences where each syllable has an n-best list.
Now, the task is to find the best word sequence from this n-best lattice. 
Do you have any programs for this? Please do reply.

Thanks in Advance.
Regards,
Lakshmi

On Fri, 29 Sep 2006, Andreas Stolcke wrote:

>
> In message <Pine.LNX.4.60.0609291425390.5866 at lantana.tenet.res.in> you wrote:
>>
>> Greetings!!!
>>
>> We are developing a syllable-based, isolated-style continuous speech
>> recognizer for Indian languages. Currently, our recognizer output is just
>> a sequence of syllables. We want to extract the sequence of words from
>> this syllable sequence using statistical language models and a lexicon.
>> I thought maybe one of the programs in this toolkit does something
>> similar (sub-word sequence to word sequence conversion). But all the
>> programs seem to use word lattices.
>>
>> Is there any program in this toolkit that extracts the word sequence from
>> the sub-word sequence using an LM and a lexicon?
>
> Lakshmi,
>
> First, you have to remember that when the documentation of a program says
> 'words' it doesn't mean you have to use words in the conventional sense.
> You can use any kind of token (phones, syllables, etc.) in your lattices
> etc.
>
> The task you describe sounds like a boundary tagging problem, i.e., given
> a sequence of tokens, you want to label each transition between tokens as
> either a "boundary" or a "non-boundary".  There are two tools in SRILM
> that can do this, using different kinds of models.  One is
> "hidden-ngram", which performs boundary tagging explicitly.
> The other is "disambig", which tags the tokens themselves, not the boundaries
> between them.  But by assigning tags that denote "first token in a unit",
> "token inside a unit", etc. you can perform boundary tagging implicitly.
> (The tokens in your case are the syllables, the units would be the words.)
> Both tools use ngram language models to disambiguate the input.
> The model can be trained from syllabified training data, in your case.
>
> I suggest you look up papers on "word segmentation", "sentence segmentation",
> "Mandarin tokenization", "chunk parsing" and "shallow parsing" to
> get a good idea of the existing models for this type of task,
> then study the manual pages for the programs.
>
> --Andreas
>
>
>>
>> Thanks in Advance.
>> Regards,
>> Lakshmi
>


From stolcke at speech.sri.com  Wed Oct  4 12:41:47 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 04 Oct 2006 12:41:47 -0700
Subject: query regarding usage of SRILM toolkit
In-Reply-To: <Pine.LNX.4.60.0610041052040.5435@lantana.tenet.res.in>
References: <200609291604.JAA05487@tonga> <Pine.LNX.4.60.0610041052040.5435@lantana.tenet.res.in>
Message-ID: <45240E7B.9000509@speech.sri.com>

Lakshmi A wrote:
>
> Greetings!!!
>
> Thanks for the prompt reply. But the ideas you mentioned seem to be 
> for boundary marking when the whole sequence is correct. Our 
> recognition output is only 50% correct. That is, we have a sequence of 
> syllables that are just 50% correct, from which we need to extract the 
> words. The n-best results of the recognizer could be used to improve 
> the performance. We can have a lattice of syllable sequences where each 
> syllable has an n-best list.
> Now, the task is to find the best word sequence from this n-best 
> lattice. Do you have any programs for this? Please do reply.
>
> Thanks in Advance.
> Regards,
> Lakshmi
>
> On Fri, 29 Sep 2006, Andreas Stolcke wrote:
>
If your output is n-best, you can apply the disambig or hidden-ngram 
taggers to each of the hypotheses, and then extract the 1-best 
segmentation by some criterion. 

If your output is in lattice format, things are more involved. You'd 
have to edit the lattices to insert nodes representing the different 
tagging choices (e.g., boundary/no-boundary), then rescore the lattice 
with the tagging LM to extract the best hypothesis.

Andreas
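
A minimal sketch of the n-best strategy described above, written in plain
Python rather than with SRILM. It swaps the disambig/hidden-ngram taggers for
a much simpler stand-in: a dynamic-programming split of each syllable
hypothesis into lexicon words scored by a unigram model. The lexicon, the
scores, the function names and the "recognizer score plus weighted LM score"
selection criterion are all assumptions made for illustration.

```python
import math

# Hypothetical toy lexicon: syllable tuple -> (word, unigram log10 probability).
LEXICON = {
    ("ta", "mil"): ("tamil", -1.0),
    ("na", "du"): ("nadu", -1.2),
    ("ta",): ("ta", -3.0),
    ("mil",): ("mil", -3.0),
    ("na",): ("na", -2.5),
    ("du",): ("du", -3.0),
}
MAX_WORD_SYLLABLES = max(len(key) for key in LEXICON)


def segment(syllables):
    """Best split of a syllable list into lexicon words (Viterbi-style DP).

    Returns (log10 score, word list); the score is -inf if no split exists.
    """
    n = len(syllables)
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_SYLLABLES), end):
            entry = LEXICON.get(tuple(syllables[start:end]))
            prev_score, prev_words = best[start]
            if entry is None or prev_score == -math.inf:
                continue
            word, logp = entry
            score = prev_score + logp
            if score > best[end][0]:
                best[end] = (score, prev_words + [word])
    return best[n]


def best_over_nbest(nbest, lm_weight=1.0):
    """Pick the hypothesis maximizing recognizer score + weighted LM score."""
    scored = []
    for syllables, recognizer_logp in nbest:
        lm_logp, words = segment(syllables)
        scored.append((recognizer_logp + lm_weight * lm_logp, words))
    return max(scored, key=lambda item: item[0])


# Two hypotheses with made-up recognizer scores; the first contains a
# syllable ("tu") that cannot be covered by any lexicon word.
nbest = [(["ta", "mil", "na", "tu"], -4.5), (["ta", "mil", "na", "du"], -5.0)]
print(best_over_nbest(nbest))
```

A hypothesis that cannot be segmented at all gets a score of -inf, so a
lower-ranked hypothesis that does yield words can win.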


From chiateek at comp.nus.edu.sg  Sat Oct 21 00:54:40 2006
From: chiateek at comp.nus.edu.sg (chiateek at comp.nus.edu.sg)
Date: Sat, 21 Oct 2006 15:54:40 +0800
Subject: Implementation details of -write-ngrams?
Message-ID: <20061021075440.GA3056@localhost.localdomain>

Hello,

Where can I find a detailed description of the algorithm for computing
n-gram counts (-write-ngrams) in SRILM? Thanks!


From stolcke at speech.sri.com  Sat Oct 21 09:38:49 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 21 Oct 2006 09:38:49 PDT
Subject: Implementation details of -write-ngrams? 
In-Reply-To: Your message of Sat, 21 Oct 2006 15:54:40 +0800.
             <20061021075440.GA3056@localhost.localdomain> 
Message-ID: <200610211638.k9LGcos06332@huge>


In message <20061021075440.GA3056 at localhost.localdomain> you wrote:
> Hello,
> 
> Where can I find a detailed description of the algorithm for computing
> n-gram counts (-write-ngrams) in SRILM? Thanks!

The concept of posterior ngram counts is explained in section 3.3.2
of the paper

A. O. Hatch, B. Peskin, and A. Stolcke (2005), Improved Phonetic
Speaker Recognition Using Lattice Decoding, Proc. IEEE ICASSP,
Philadelphia, vol. 1, pp. 169-172.
http://www.speech.sri.com/cgi-bin/run-distill?papers/icassp2005-spkr-phonelats.ps.gz

(where you have to replace "phone" with "word" since the default is to
compute word ngrams).  Note this is not a new concept.

The algorithm is a forward-backward computation with on-the-fly lattice
expansion.  For further details you'll have to read the source code.

Andreas 
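
To make the idea of posterior (expected) n-gram counts concrete, here is a
small forward-backward sketch in Python. It is not the SRILM implementation:
it assumes words sit on the lattice edges, that node ids are integers in
topological order, that the end node is reachable, and it stops at bigrams
so that no lattice expansion is needed. All names are hypothetical.

```python
from collections import defaultdict


def posterior_ngram_counts(edges, start, end):
    """Expected unigram and bigram counts over all paths of a word lattice.

    edges: list of (from_node, to_node, word, probability), where every
    edge satisfies from_node < to_node (topological numbering).
    """
    edges = sorted(edges, key=lambda e: e[0])

    # Forward pass: alpha[v] = total probability of all paths start -> v.
    alpha = defaultdict(float)
    alpha[start] = 1.0
    for src, dst, _, p in edges:
        alpha[dst] += alpha[src] * p

    # Backward pass: beta[v] = total probability of all paths v -> end.
    beta = defaultdict(float)
    beta[end] = 1.0
    for src, dst, _, p in sorted(edges, key=lambda e: -e[1]):
        beta[src] += p * beta[dst]

    total = alpha[end]  # probability mass of the whole lattice
    outgoing = defaultdict(list)
    for edge in edges:
        outgoing[edge[0]].append(edge)

    unigrams = defaultdict(float)
    bigrams = defaultdict(float)
    for src, dst, word, p in edges:
        # Posterior of an edge = mass of all paths through it / total mass.
        unigrams[word] += alpha[src] * p * beta[dst] / total
        # A bigram is a pair of consecutive edges.
        for _, dst2, word2, p2 in outgoing[dst]:
            bigrams[(word, word2)] += alpha[src] * p * p2 * beta[dst2] / total
    return dict(unigrams), dict(bigrams)


lattice = [(0, 1, "a", 0.3), (0, 1, "b", 0.1), (1, 2, "c", 0.5)]
print(posterior_ngram_counts(lattice, start=0, end=2))
```

Higher-order n-grams need either longer edge chains or an expanded lattice
in which each node remembers its word history, which is where the on-the-fly
expansion mentioned above comes in.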


From ioparin at yahoo.co.uk  Sun Oct 22 10:50:56 2006
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Sun, 22 Oct 2006 18:50:56 +0100 (BST)
Subject: [SRILM] FLM model training on large data
Message-ID: <20061022175056.33004.qmail@web25403.mail.ukl.yahoo.com>

Hi, everybody!

Does anyone have any experience building a Factored Language Model on large data? Processing a file in FLM format containing 5 million entries is still no problem, but as soon as I try to feed in a 50 million entry FLM corpus, it needs an infeasible amount of memory (since it loads everything into memory). 

Does anyone know any tricks for training an FLM in this case? Something like building partial LMs and then merging with standard ngram-count... What would you suggest as a solution?


best regards,
Ilya
 		

From ioparin at yahoo.co.uk  Mon Oct 23 04:58:52 2006
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Mon, 23 Oct 2006 12:58:52 +0100 (BST)
Subject: [SRILM] Some more FLM questions
In-Reply-To: <453C481E.3000204@ee.washington.edu>
Message-ID: <20061023115852.65010.qmail@web25410.mail.ukl.yahoo.com>

Dear Katrin, thanks for the reply

There are a couple of other questions for those
concerned with FLM development:

1) Is there any possibility to interpolate FLMs with
normal LMs?
I tried to do this with "ngram" using "-factored" and
then the "-mix-lm" option, but it didn't work since it
expected even the general (standard) word model to be
factored as well, and I couldn't see how to tell the
system that the first of the interpolated models is
conventional while the others are factored. In "fngram"
there is no such option either, as far as I can tell.

2) Could you please specify how you work with large
data?
When I was training the model on 5M data, it was
taking 1.2G of memory. Actually, I work with
inflectional languages (Russian and Czech) so the
factors are really "rich": features for each word
include its stem, inflection, detailed morphological
tag and lemma. Maybe that's why it takes so much
space? Otherwise I cannot see how you managed to run
it for 30M words in English: in my case if I want to
enlarge the data it seems like I'll have to switch to
a 64-bit architecture. Do SRILM and FLM support 64-bit
somehow?
If I am the only one so unlucky with memory load, what
could you suggest to reduce it?

3) Which parameters does the training time depend on?

Thanks in advance,
regards,
ilya

--- Katrin Kirchhoff <katrin at ee.washington.edu> wrote:

> 
> Ilya,
> 
> We have trained FLMs with ~30M words without problems, but yes,
> beyond that it becomes a problem. We are currently working
> on updates to the code that make it possible to use larger
> corpora - these haven't been publicly released yet but
> I'll let you know when they become available.
> 
> best,
> Katrin
> 


From stolcke at speech.sri.com  Mon Oct 23 15:46:53 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 23 Oct 2006 15:46:53 -0700
Subject: [SRILM] Some more FLM questions
In-Reply-To: <20061023115852.65010.qmail@web25410.mail.ukl.yahoo.com>
References: <20061023115852.65010.qmail@web25410.mail.ukl.yahoo.com>
Message-ID: <453D465D.9020105@speech.sri.com>

ilya oparin wrote:
>
> 2) Could you please specify how you work with large
> data?
> When I was training the model on 5M data, it was
> taking 1.2G of memory. Actually, I work with
> inflectional languages (Russian and Czech) so the
> factors are really  "rich": features for each word
> include its stem, inflection, detailed morphological
> tag and lemma. Maybe that's why it takes so much
> space? Otherwise I cannot see how you managed to run
> it for 30M words in English: in my case if I want to
> enlarge the data it seems like I'll have to switch to
> a 64-bit architecture. Do SRILM and FLM support 64-bit
> somehow?
> If I am the only one so unlucky with memory load, what
> could you suggest to reduce it?
>   
Yes, SRILM supports 64-bit Linux (and other) platforms.  For Linux 
running on AMD64-compatible machines use
   
    make MACHINE_TYPE=i686-m64

To reduce memory consumption use the strategies described in doc/FAQ.  
I'm copying here the relevant bits, many of which apply to
FLMs as well.

> Topic: Large data / too little memory issues
>
> 1) I'm getting a message saying (among other things)
>
>         Assertion `body != 0' failed.
>
> A: You are running out of memory.  See subsequent questions depending on
>    what you are trying to do.  Note: the above message means you are
>    running out of "virtual" memory on your computer, which could be
>    because of limits in swap space, administrative resource limits, or
>    limitations of the machine architecture (a 32-bit machine cannot
>    address more than 4GB no matter how many resources your system has).
>    Another symptom of not enough memory is that your program runs, but
>    very, very slowly, i.e., it is "paging" or "swapping" as it tries to
>    use more memory than the machine has RAM installed.
>
> 2) I am trying to count N-grams in a text file and running out of memory.
>
> A: Don't use ngram-count directly to count N-grams.  Instead, use the
>    make-batch-counts and merge-batch-counts scripts described in
>    training-scripts(1).  That way you can create N-gram counts limited
>    only by the maximum file size on your system.
>
> 3) I am trying to build an N-gram LM and ngram-count runs out of memory.
>
> A: You are running out of memory either because of the size of the ngram
>    counts, or of the LM being built. The following are strategies for
>    reducing the memory requirements for training LMs.
>
>      a) Assuming you are using Good-Turing or Kneser-Ney discounting,
>         don't use ngram-count in "raw" form.  Instead, use the make-big-lm
>         wrapper script described in the training-scripts(1) man page.
>
>      b) Switch to using the "_c" or "_s" versions of the SRI binaries.  For
>         instructions on how to build them, see the INSTALL file.
>         Once built, set your executable search path accordingly, and try
>         make-big-lm again.
>
>      c) Raise the minimum counts for N-grams included in the LM, i.e., the
>         values of the options -gt2min, -gt3min, -gt4min, etc.  The higher
>         order N-grams typically get higher minimum counts.
>
>      d) Get a machine with more memory.  If you are hitting the
>         limitations of a 32-bit machine architecture, get a 64-bit machine
>         and recompile SRILM to take advantage of the expanded address
>         space. (The "i686-m64" MACHINE_TYPE setting is for systems based
>         on 64-bit AMD processors.)
>         Note that the 64-bit pointers will require a memory overhead in
>         themselves, so you will need a machine with significantly, not
>         just a little, more memory than 4GB.
>
> 4) I am trying to apply a large LM to some data and am running out of
>    memory.
>
> A: Again, there are several strategies to reduce memory requirements.
>
>      a) Use the "_c" or "_s" versions of the SRI binaries.  See 3b) above.
>
>      b) Precompute the vocabulary of your test data and use the
>         ngram -limit-vocab option to load only the N-gram parameters
>         relevant to your data.  This approach should allow you to use
>         arbitrarily large LMs provided the data is divided into small
>         enough chunks.
>
>      c) If the LM can be built on a large machine, but then is to be
>         used on machines with limited memory, use ngram -prune to remove
>         the less important parameters of the model.  This usually gives
>         huge size reductions with relatively modest performance
>         degradation.  The tradeoff is adjustable by varying the pruning
>         parameter.
>

Andreas
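
The batch strategy in FAQ item 2 can be pictured with the following Python
sketch. It is only an illustration of the general count-in-batches, then
merge-sorted-streams idea, and is not how make-batch-counts and
merge-batch-counts are actually implemented. The function names are
hypothetical, and unlike ngram-count the sketch adds no sentence-boundary
tokens.

```python
import heapq
from collections import Counter
from itertools import groupby


def count_ngrams(lines, order):
    """Count all n-grams up to `order` in one batch of text lines."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def merge_sorted_counts(*streams):
    """Merge sorted (ngram, count) streams, summing counts of equal n-grams.

    heapq.merge is lazy, so only one pending entry per stream is held in
    memory; with streams read from files, memory use stays small.
    """
    merged = heapq.merge(*streams, key=lambda item: item[0])
    for ngram, group in groupby(merged, key=lambda item: item[0]):
        yield ngram, sum(count for _, count in group)


# Each batch is counted separately; in a real setting every sorted batch
# would be written to its own file and read back as a stream.
batch1 = sorted(count_ngrams(["a b a"], order=2).items())
batch2 = sorted(count_ngrams(["a b"], order=2).items())
totals = dict(merge_sorted_counts(batch1, batch2))
print(totals)
```

Only the counting of a single batch has to fit in memory, which is why the
final count file is limited by disk space rather than by RAM.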


From pclouds at gmail.com  Tue Oct 31 10:01:42 2006
From: pclouds at gmail.com (Nguyen Thai Ngoc Duy)
Date: Wed, 1 Nov 2006 01:01:42 +0700
Subject: SRILM and GCC 4.1.1
Message-ID: <fcaeb9bf0610311001h78f3f4d6k1ec42fcf608f600c@mail.gmail.com>

Hi all,
I tried to compile SRILM with GCC 4.1.1 and got lots of errors (mostly
undefined references). Has anyone tried it with GCC 4.1?
Best regards,
-- 
Duy


From stolcke at speech.sri.com  Tue Oct 31 10:13:39 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 31 Oct 2006 10:13:39 PST
Subject: SRILM and GCC 4.1.1 
In-Reply-To: Your message of Wed, 01 Nov 2006 01:01:42 +0700.
             <fcaeb9bf0610311001h78f3f4d6k1ec42fcf608f600c@mail.gmail.com> 
Message-ID: <200610311813.k9VIDdu12157@huge>


In message <fcaeb9bf0610311001h78f3f4d6k1ec42fcf608f600c at mail.gmail.com> you wrote:
> Hi all,
> I tried to compile SRILM with GCC 4.1.1 and got lots of errors (mostly
> undefined references). Has anyone tried it with GCC 4.1?

It compiles cleanly with gcc 4.1.0 and the right compiler options.
See 

$SRILM/common/Makefile.machine.i686-gcc4

Andreas 


From pclouds at gmail.com  Tue Oct 31 11:05:17 2006
From: pclouds at gmail.com (Nguyen Thai Ngoc Duy)
Date: Wed, 1 Nov 2006 02:05:17 +0700
Subject: SRILM and GCC 4.1.1
In-Reply-To: <200610311813.k9VIDdu12157@huge>
References: <fcaeb9bf0610311001h78f3f4d6k1ec42fcf608f600c@mail.gmail.com>
	 <200610311813.k9VIDdu12157@huge>
Message-ID: <fcaeb9bf0610311105x32325a33tc910a5c4a19460cd@mail.gmail.com>

On 11/1/06, Andreas Stolcke <stolcke at speech.sri.com> wrote:
> $SRILM/common/Makefile.machine.i686-gcc4
Thank you. After setting MACHINE_TYPE=i686-gcc4, I still got errors:

make[2]: Entering directory `/home/pclouds/tmp/srilm/lm/src'
/usr/bin/g++    -I. -I/home/pclouds/tmp/srilm/include   -u matherr
-L/home/pclouds/tmp/srilm/lib/i686-gcc4  -g -O3 -o
../bin/i686-gcc4/ngram ../obj/i686-gcc4/ngram.o
../obj/i686-gcc4/liboolm.a -lm -ldl
/home/pclouds/tmp/srilm/lib/i686-gcc4/libflm.a
/home/pclouds/tmp/srilm/lib/i686-gcc4/libdstruct.a
/home/pclouds/tmp/srilm/lib/i686-gcc4/libmisc.a -ltcl -lm 2>&1 |
c++filt
../obj/i686-gcc4/liboolm.a(Vocab.o): In function `LHash<char const*,
unsigned int>::remove(char const*, bool&)':
/home/pclouds/tmp/srilm/include/LHash.cc:416: undefined reference to
`LHash<char const*, unsigned int>::removedData'
/home/pclouds/tmp/srilm/include/LHash.cc:417: undefined reference to
`LHash<char const*, unsigned int>::removedData'
/home/pclouds/tmp/srilm/include/LHash.cc:424: undefined reference to
`LHash<char const*, unsigned int>::removedData'
/home/pclouds/tmp/srilm/include/LHash.cc:473: undefined reference to
`LHash<char const*, unsigned int>::removedData'

I'm using SRILM 1.5.0

> Andreas
>
>


-- 
Duy


From stolcke at speech.sri.com  Tue Oct 31 11:36:14 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 31 Oct 2006 11:36:14 PST
Subject: SRILM and GCC 4.1.1 
In-Reply-To: Your message of Wed, 01 Nov 2006 02:05:17 +0700.
             <fcaeb9bf0610311105x32325a33tc910a5c4a19460cd@mail.gmail.com> 
Message-ID: <200610311936.k9VJaEm19959@huge>


Make sure the C++ compiler is invoked with

 -DINSTANTIATE_TEMPLATES

If it is, then there seems to be a strange problem with your linker or 
compiler installation that I cannot reproduce.

--Andreas



From pclouds at gmail.com  Tue Oct 31 22:22:00 2006
From: pclouds at gmail.com (Nguyen Thai Ngoc Duy)
Date: Wed, 1 Nov 2006 13:22:00 +0700
Subject: SRILM and GCC 4.1.1
In-Reply-To: <200610311936.k9VJaEm19959@huge>
References: <fcaeb9bf0610311105x32325a33tc910a5c4a19460cd@mail.gmail.com>
	 <200610311936.k9VJaEm19959@huge>
Message-ID: <fcaeb9bf0610312222q18729049p559e3fafaed7540c@mail.gmail.com>

On 11/1/06, Andreas Stolcke <stolcke at speech.sri.com> wrote:
>
> make sure the c++ compiler is invoked with
>
>  -DINSTANTIATE_TEMPLATES

Obviously I blindly overwrote the CC and CXX variables without looking
into the Makefiles. It works now. Thank you and sorry for the noise.

>
> If it is then there seems to be a strange problem with your linker or
> compiler installation that I cannot reproduce.
>
> --Andreas
-- 
Duy


From stolcke at speech.sri.com  Wed Nov  1 09:27:32 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 01 Nov 2006 09:27:32 PST
Subject: -gt1min 
In-Reply-To: Your message of Wed, 01 Nov 2006 08:28:21 +0100.
             <45484C95.4030401@web.de> 
Message-ID: <200611011727.kA1HRWr04897@huge>


In message <45484C95.4030401 at web.de> you wrote:
> Andreas Stolcke wrote:
> > In message <45475E03.4040105 at web.de> you wrote:
> >> 	Hi Andreas,
> >>
> >> ngram-count effectively ignores the -gt1min option, i.e. the cutoff
> >> value for unigrams. Is that the desired behavior?
> > 
> > How do you reach this conclusion?
> > 
> > Andreas 
> > 
> > 
> e.g.,
> ngram-count -order 1 -gt1min 1 -text <text> -lm lm1
> ngram-count -order 1 -gt1min 5 -text <text> -lm lm5
> both produce the same list of unigrams (same length), just the logprob
> changes. I would have expected unigrams below gt1min to be pruned (as
> are ngrams of higher order) and hence the list in lm5 to be shorter...
> 
> Ronny
> 
> -- 
> ------------------------------------
> Ronny Melz
> IfI, NLP Dept, University of Leipzig
> Augustusplatz 10/11
> 04109 Leipzig, Germany
> ------------------------------------
> 

Ronny,

The fact that all words appear in the unigrams does not mean that -gt1min
doesn't work.  For historical reasons the unigram list also serves the 
purpose of listing the vocabulary of the LM.  Therefore SRILM always 
includes all words in the unigrams.  However, those words that are excluded
by -gt1min get a probability that corresponds to the zero-order backoff
probability.  Zero-order backoff probabilities are obtained by distributing 
the probability mass left over from unigram discounting over all 
words.

If you want to exclude certain words from the LM altogether use the 
-vocab option.

Andreas 
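
A toy numeric sketch of the behaviour described above, in Python. It follows
the wording of this message literally and is not taken from the SRILM
source: there is no real discounting (kept words get their plain relative
frequency), words below the cutoff lose their own estimate, and the leftover
mass is spread evenly over every vocabulary word. The function name and the
counts are made up.

```python
def unigram_probs(counts, gt1min):
    """Unigram probabilities with a minimum-count cutoff (simplified)."""
    total = sum(counts.values())
    # Words below the cutoff contribute no probability of their own.
    kept = {w: c / total for w, c in counts.items() if c >= gt1min}
    # Mass not claimed by the kept words is shared by the whole vocabulary;
    # this share plays the role of the zero-order backoff probability.
    leftover = 1.0 - sum(kept.values())
    zero_order = leftover / len(counts)
    return {w: kept.get(w, 0.0) + zero_order for w in counts}


counts = {"the": 6, "cat": 3, "zebra": 1}
lm1 = unigram_probs(counts, gt1min=1)
lm5 = unigram_probs(counts, gt1min=5)
print(sorted(lm1), sorted(lm5))
print(lm1["cat"], lm5["cat"])
```

Both models list the same three words, which matches the observation that
only the log probabilities change: with the cutoff at 5, "cat" and "zebra"
end up with the same leftover-share probability.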


From ioparin at yahoo.co.uk  Wed Nov  8 00:28:43 2006
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Wed, 8 Nov 2006 08:28:43 +0000 (GMT)
Subject: bug in lattice-tool?
Message-ID: <20061108082843.47119.qmail@web25401.mail.ukl.yahoo.com>

Andreas,

We've possibly found a bug in lattice-tool. Here, in
Brno, we work with the Czech language, which has
diacritized letters. lattice-tool does all the
calculations well until it comes to matching the best
path against the reference file to get the number of
del, subs and ins - and finally the WER. It appears
that if both files are in ISO encoding and there is a
word with a diacritized letter in the reference, it can
be matched to a non-diacritized word in the output
that is actually a different word. So the WER comes
out significantly lower than it really is (and than
what is correctly output by HResults in HTK).

best regards,
Ilya



From stolcke at speech.sri.com  Wed Nov  8 06:30:00 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 08 Nov 2006 06:30:00 PST
Subject: bug in lattice-tool? 
In-Reply-To: Your message of Wed, 08 Nov 2006 08:28:43 +0000.
             <20061108082843.47119.qmail@web25401.mail.ukl.yahoo.com> 
Message-ID: <200611081430.kA8EU0Z05065@huge>


SRILM uses the strcmp() C library function to compare strings.
I suspect what you're seeing is a function of locale settings 
by way of environment variables such as LANG and LC_COLLATE.
This is almost certainly an OS-dependent issue.
First, I would try setting $LANG to "C" and unsetting any of the LC_* variables.

I would write a little test program that invokes strcmp() and 
observe the effect of locale settings on the result.

BTW, I have used SRILM with Spanish, which also has diacritics in
the vocabulary, and it works fine.  

--Andreas
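
One possible form of such a test program, sketched in Python with ctypes
instead of C. It assumes a POSIX system where the C library can be loaded
this way, uses ISO-8859-2 as the "ISO encoding" (an assumption), and also
calls strcoll(), which is the locale-aware counterpart of strcmp(). The
word pair and the helper name are chosen for illustration only.

```python
import ctypes
import ctypes.util
import locale

# Load the C library. On POSIX systems find_library("c") may return None,
# in which case CDLL(None) exposes the symbols of the running process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
for name in ("strcmp", "strcoll"):
    func = getattr(libc, name)
    func.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
    func.restype = ctypes.c_int


def compare(a, b, encoding="iso-8859-2"):
    """Return the (strcmp, strcoll) results for two words as encoded bytes."""
    raw_a, raw_b = a.encode(encoding), b.encode(encoding)
    return libc.strcmp(raw_a, raw_b), libc.strcoll(raw_a, raw_b)


# Force the plain "C" locale for this process, then compare two different
# Czech words that differ only in a diacritic.
locale.setlocale(locale.LC_ALL, "C")
print(compare("být", "byt"))
```

A result of 0 from either call would mean the two words are treated as
equal. Rerunning the script with a different argument to setlocale(), or
with other LANG and LC_* values in the environment, shows whether the
locale changes the outcome on a given system.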



From ioparin at yahoo.co.uk  Tue Nov 14 08:17:44 2006
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Tue, 14 Nov 2006 16:17:44 +0000 (GMT)
Subject: [SRILM] lattice-tool LM interpolarion
Message-ID: <20061114161745.10337.qmail@web25415.mail.ukl.yahoo.com>

Hi,

Could anyone give me any hints on the following: when
I interpolate different LMs (with different vocabs) to
rescore lattices with lattice-tool (in HTK format),
several links in the output lattice get an l=-inf
probability, which makes it impossible to compute the
Viterbi best path etc. 
To me it looks like a loglinear mix is performed,
which gives -inf whenever at least one of the
LMs gives this output (which is possible due to the
different vocabs). If so, is there any way to
interpolate LMs with different vocabs in lattice-tool,
or must all the LMs have the same vocab beforehand?
Or am I just missing something crucial about the way
the whole thing works?

lattice-tool -in-lattice-list lat.list -read-htk
-no-htk-nulls -htk-words-on-nodes -lm LM1 -mix-lm LM2
-write-htk -htk-logbase 2.71828 -out-lattice-dir
out_dir

BTW, Andreas, thanks for the comments on accented
characters, it works.


best regards,
Ilya



From stolcke at speech.sri.com  Tue Nov 14 08:58:08 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 14 Nov 2006 08:58:08 PST
Subject: [SRILM] lattice-tool LM interpolarion 
In-Reply-To: Your message of Tue, 14 Nov 2006 16:17:44 +0000.
             <20061114161745.10337.qmail@web25415.mail.ukl.yahoo.com> 
Message-ID: <200611141658.kAEGw8L03438@huge>


In message <20061114161745.10337.qmail at web25415.mail.ukl.yahoo.com> you wrote:
> Hi,
> 
> Could anyone give me any hints on the following: when
> I interpolate different LMs (with different vocabs) to
> rescore lattices with lattice-tool (in HTK format),
> several links in the output lattice get an l=-inf
> probability, which makes it impossible to compute the
> Viterbi best path etc. 
> To me it looks like a loglinear mix is performed,
> which gives -inf whenever at least one of the
> LMs gives this output (which is possible due to the
> different vocabs). If so, is there any way to
> interpolate LMs with different vocabs in lattice-tool,
> or must all the LMs have the same vocab beforehand?
> Or am I just missing something crucial about the way
> the whole thing works?
> 
> lattice-tool -in-lattice-list lat.list -read-htk
> -no-htk-nulls -htk-words-on-nodes -lm LM1 -mix-lm LM2
> -write-htk -htk-logbase 2.71828 -out-lattice-dir
> out_dir

This produces a linear (not loglinear) mixture of models.
The vocabulary of such a model is the union of the component models.

The -inf scores must be due to words that are not in the union
of the vocabularies of LM1 and LM2, or probabilities that are 
explicitly 0 in the LMs.

> 
> BTW, Andreas, thanks for the comments on accented
> characters, it works.

Glad to hear it.

Andreas 
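
The difference between a linear and a log-linear mixture, and why only the
latter yields -inf as soon as a single component assigns zero probability,
can be checked with a few lines of Python. This is a sketch of the
arithmetic only, not of lattice-tool itself; the function names and the
interpolation weight of 0.5 are assumptions, and the log-linear version is
left unnormalized.

```python
import math


def linear_mix_logprob(logp1, logp2, lam=0.5):
    """log10 of lam * p1 + (1 - lam) * p2, given log10 inputs (-inf allowed)."""
    p = lam * 10.0 ** logp1 + (1.0 - lam) * 10.0 ** logp2
    return math.log10(p) if p > 0.0 else -math.inf


def loglinear_mix_logprob(logp1, logp2, lam=0.5):
    """Unnormalized log-linear mix: lam * log p1 + (1 - lam) * log p2."""
    return lam * logp1 + (1.0 - lam) * logp2


# A word known only to the second model: log p1 = -inf, log p2 = -2.
print(linear_mix_logprob(-math.inf, -2.0))
print(loglinear_mix_logprob(-math.inf, -2.0))
# A word unknown to both models stays at -inf even in the linear mixture.
print(linear_mix_logprob(-math.inf, -math.inf))
```

In the linear case the word keeps a finite score because the second model
still contributes half of its probability; -inf survives only when every
component gives zero.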


From stolcke at speech.sri.com  Wed Dec  6 14:07:11 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 06 Dec 2006 14:07:11 PST
Subject: deleted estimation using SRILM 
In-Reply-To: Your message of Wed, 06 Dec 2006 14:44:53 -0500.
             <Pine.LNX.4.62L.0612061422120.13078@athena.lcs.mit.edu> 
Message-ID: <200612062207.kB6M7BY25643@huge>


In message <Pine.LNX.4.62L.0612061422120.13078 at athena.lcs.mit.edu> you wrote:
> Hello Andreas,
> 
> I have the latest SRILM toolkit version and I am trying to implement 
> deleted interpolation using ngram/ngram-count but I cannot seem to get it 
> to work. Would it be possible to get a sample of what the command(s) would 
> look like?

The latest version of SRILM implements deleted interpolation as part 
of the "count-LM" LM class.  Look up the -count-lm option in both the
ngram-count and the ngram man pages.
Then look at $SRILM/test/tests/ngram-count-lm/run-test for an example
of how it all fits together.

Deleted interpolation is not typically as good as other schemes such 
as modified Kneser-Ney smoothing, but has some practical advantages 
(memory-efficient implementation) when applied to very large count sets.

Andreas 


From stolcke at speech.sri.com  Thu Dec  7 14:08:01 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 07 Dec 2006 14:08:01 PST
Subject: deleted estimation using SRILM 
In-Reply-To: Your message of Wed, 06 Dec 2006 17:21:11 -0500.
             <Pine.LNX.4.62L.0612061719580.13078@athena.lcs.mit.edu> 
Message-ID: <200612072208.kB7M81305989@huge>


In message <Pine.LNX.4.62L.0612061719580.13078 at athena.lcs.mit.edu> you wrote:
> Quick question.
> Is there a way to get the deleted-interpolation LM in arpa or fstn format?
> 
> thanks,
> -Ghinwa

No, because a deleted-interpolation LM cannot exactly be represented as a 
backoff LM in general (short of listing all ngrams).
What you can do, however, is define a set of ngrams and then create 
a backoff LM whose probabilities match exactly those of the
deleted-interpolation LM for those ngrams (and use backoff for all others).
This way, most SRILM LM classes can be approximated by backoff LMs.

To do this use the ngram -rescore-ngram option (see man page).

	ngram -rescore-ngram BACKOFF-LM \
		OTHER-LM-OPTIONS \
		-write-lm NEW-BACKOFF-LM

where OTHER-LM-OPTIONS specifies the LM from which the new probabilities 
are taken.  By choosing the set of ngrams in BACKOFF-LM to be large or small
you control the goodness of the approximation.

Andreas 
