From stolcke at speech.sri.com  Wed Jan  2 09:49:07 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 02 Jan 2008 09:49:07 PST
Subject: SRILM 1.5.6 released
Message-ID: <200801021749.m02Hn7B14128@huge>


Happy New Year!

The latest version of SRILM is available from 
http://www.speech.sri.com/projects/srilm/download.html .

This release features much enhanced support for server-based LMs and 
improved documentation.

Enjoy,

Andreas

-----------------------------------------------------------------------------

1.5.6   2 January 2008

        Functionality:

        * New ngram -use-server option to run the client side of a network LM
        server as implemented by ngram -server-port.  Optionally, probabilities
        may be cached in the client (option -cache-served-ngrams).
        Mixtures of one or more network and file-based LMs are also possible.

        * Likewise, disambig, hidden-gram, and lattice-tool understand the
        -use-server option.

        * New LMClient class to implement the above (a stub LM subclass that
        queries a server for LM probabilities).

        * ngram -server-port now behaves like a true server daemon: it handles
        multiple simultaneous or sequential clients, and never exits (unless
        killed).  The number of simultaneous clients may be limited with the
        -server-maxclients option.

        * Support for 7-zip compressed files (suggested by Alexy Khrabrov).

        * lattice-tool -split-multiwords will now print a warning message
        about multiwords that were not split because their LM probability was
        non-zero.

        * LoglinearMix LM class supports n-way mixtures directly, giving more
        efficient implementation for n > 2 than recursive object construction
        in ngram (contributed by Tanel Alumae).

        Bug fixes:

        * MultiwordLM now implicitly adds all words to the vocabulary, so that
        previously unseen multiwords get split.  This has the side effect that
        OOVs will appear as zeroprob words.

        Documentation:

        * The doc/FAQ file has been expanded and reformatted as a man page.
        It can be viewed with "man srilm-faq" or online at
        http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.html .
        The major content additions are questions about the build
        process, how to build a "Google N-gram LM", smoothing issues,
        and OOV-handling (the latter by Deniz Yuret).  Corrections and
        additions to this document are most welcome!

        * A new manual page ngram-discount(7) gives a detailed overview of
        smoothing methods found in SRILM (contributed by Deniz Yuret).

        * The conversion of man pages to html has been enhanced to better
        handle code samples and nested itemized lists.


From gelbart at icsi.berkeley.edu  Mon Jan 14 17:23:42 2008
From: gelbart at icsi.berkeley.edu (David Gelbart)
Date: Mon, 14 Jan 2008 17:23:42 -0800 (PST)
Subject: SRILM BOW denominator warning
Message-ID: <Pine.LNX.4.63.0801141250460.26472@lamb.ICSI.Berkeley.EDU>

Hello,

I am trying to build a trigram LM for the OGI Numbers corpus, in which 
utterances are spoken strings of numbers such as 'eighty nine eighty 
eight'.  Since there are no singletons, I am using Witten-Bell 
discounting instead of Good-Turing.  ngram-count displays "BOW 
denominator for context... is zero" warnings.  Does this mean the LM 
is broken?  If I try adding "-gt3min 1 -gt2min 1" to the ngram-count 
options, I still see these warnings.  Here is the ngram-count output:

$ ngram-count -wbdiscount -text /u/gelbart/tmp/train.trans -order 3 \
   -lm /u/gelbart/tmp/numbers-wb.lm
BOW denominator for context "seven" is zero; scaling probabilities to sum to 1
BOW denominator for context "six" is zero; scaling probabilities to sum to 1
BOW denominator for context "four" is zero; scaling probabilities to sum to 1
BOW denominator for context "two" is zero; scaling probabilities to sum to 1

In the generated language model, the log BOWs are zero for those four 
words:

-1.156247       four    0
-1.09725        seven   0
-1.203041       six     0
-1.029482       two     0

Thanks,
David


From stolcke at speech.sri.com  Thu Jan 17 18:25:11 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 17 Jan 2008 18:25:11 -0800
Subject: SRILM BOW denominator warning
In-Reply-To: <Pine.LNX.4.63.0801141250460.26472@lamb.ICSI.Berkeley.EDU>
References: <Pine.LNX.4.63.0801141250460.26472@lamb.ICSI.Berkeley.EDU>
Message-ID: <47900E07.8000904@speech.sri.com>

David Gelbart wrote:
> Hello,
>
> I am trying to build a trigram LM for the OGI Numbers corpus, in which 
> utterances are spoken strings of numbers such as 'eighty nine eighty 
> eight'.  Since there are no singletons, I am using Witten-Bell 
> discounting instead of Good-Turing.  ngram-count displays "BOW 
> denominator for context... is zero" warnings.  Does this mean the LM 
> is broken?  If I try adding "-gt3min 1 -gt2min 1" to the ngram-count 
> options, I still see these warnings.  Here is the ngram-count output:
>
> $ ngram-count -wbdiscount -text /u/gelbart/tmp/train.trans -order 3 \
>   -lm /u/gelbart/tmp/numbers-wb.lm
> BOW denominator for context "seven" is zero; scaling probabilities to 
> sum to 1
> BOW denominator for context "six" is zero; scaling probabilities to 
> sum to 1
> BOW denominator for context "four" is zero; scaling probabilities to 
> sum to 1
> BOW denominator for context "two" is zero; scaling probabilities to 
> sum to 1
>
> In the generated language model, the log BOWs are zero for those four 
> words:
>
> -1.156247       four    0
> -1.09725        seven   0
> -1.203041       six     0
> -1.029482       two     0
>
This happens when you have a small vocabulary and all words are observed 
in a given context, so there is no backoff mass to distribute over 
unseen words.

There is no need to do anything; the LM will work just fine.

This should probably be included in the FAQ for smoothing issues.

Andreas
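
The condition described above can be illustrated with a minimal sketch. It assumes the usual back-off weight formula, bow(context) = (1 - sum of discounted probabilities of the seen words) / (1 - sum of lower-order probabilities of those same words). The function name and the toy probabilities are hypothetical, and the fallback only mirrors what the warning text says (scale the probabilities to sum to 1); it is not SRILM's own code.

```python
def backoff_weight(seen_probs, lower_order_probs, eps=1e-9):
    """Return (bow, probs) for one context.

    seen_probs:        word -> discounted P(word | context), seen words only
    lower_order_probs: word -> lower-order P(word), whole vocabulary
    """
    numerator = 1.0 - sum(seen_probs.values())
    denominator = 1.0 - sum(lower_order_probs[w] for w in seen_probs)

    if abs(denominator) < eps:
        # Every vocabulary word was seen in this context, so there are no
        # unseen words to back off to. Assumed fallback: rescale the seen
        # probabilities to sum to 1 and use bow = 1 (log bow = 0).
        total = sum(seen_probs.values())
        scaled = {w: p / total for w, p in seen_probs.items()}
        return 1.0, scaled

    return numerator / denominator, dict(seen_probs)


# Toy three-word vocabulary (hypothetical numbers).
lower = {"one": 0.5, "two": 0.3, "</s>": 0.2}
all_seen = {"one": 0.4, "two": 0.3, "</s>": 0.2}   # all words follow the context
some_seen = {"one": 0.4}                           # only one word follows it

bow_all, probs_all = backoff_weight(all_seen, lower)
bow_some, probs_some = backoff_weight(some_seen, lower)
```

In the first case the denominator is zero and the weight comes out as 1, which corresponds to the log BOW of 0 shown in the LM file.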

> Thanks,
> David


From sai_tang_huang at hotmail.com  Sun Jan 20 18:37:51 2008
From: sai_tang_huang at hotmail.com (SAI TANG HUANG)
Date: Mon, 21 Jan 2008 03:37:51 +0100
Subject: new to srilm, first ngram-count with discount coeff out of range
 warning
Message-ID: <BAY108-W17A2F78AB39F3199C7D6ABDD3D0@phx.gbl>


Hi SRILM team,
 
My name is Sai Tang and I am a student at the University of Brighton. My project involves using SRILM to create language models as well as calculating entropies. I have just managed to get my first LM file after running ngram-count on a text file as follows:
 
sai at uk-notebook:~/Desktop$ ngram-count -order 2 -text Merged_File.txt -lm file
 
file is my lm file.
 
The file was created, however I got a warning message during the run of the ngram-count:
 
warning: discount coeff 1 is out of range: -2.26158e-17
 
My experience in NLP is limited, and this is also the first time I have used SRILM. I would appreciate it a lot if someone could help me with this. Also, is this the address I have to write to in order to post to the mailing list?
 
Kind regards,
 
Sai


From sai_tang_huang at hotmail.com  Sun Jan 20 18:41:45 2008
From: sai_tang_huang at hotmail.com (SAI TANG HUANG)
Date: Mon, 21 Jan 2008 03:41:45 +0100
Subject: how to use ngram -counts-entropy
Message-ID: <BAY108-W400969B0C36509F0DD872FDD3D0@phx.gbl>


Hiya,

My name is Sai and I'm trying to get the entropy of my LM. I have managed to run ngram -counts successfully, but when I type ngram -lm myLMfile -counts-entropy nothing happens.

I have read the manual of ngram but I can't seem to understand why this command isn't working.

Thanks,

Sai


From sai_tang_huang at hotmail.com  Mon Jan 21 04:59:49 2008
From: sai_tang_huang at hotmail.com (SAI TANG HUANG)
Date: Mon, 21 Jan 2008 13:59:49 +0100
Subject: can't get right counts-entropy
Message-ID: <BAY108-W30DBDC567488F2C176BC4BDD3D0@phx.gbl>


Hi,

I have created a counts file and a back-off LM file from a text file with sentences with the following command:

sai at uk-notebook:~/Desktop$ ngram-count -text Merged_File.txt -lm lm_file -write count_file 

Then I ran the ngram program with -counts; here is the output:

sai at uk-notebook:~/Desktop$ ngram -lm lm_file -counts count_file 
file count_file: 23640 sentences, 460074 words, 0 OOVs
7880 zeroprobs, logprob= -1.03103e+06 ppl= 146.821 ppl1= 190.575
sai at uk-notebook:~/Desktop$ 

I fail to understand the output. I read that the -counts option does something with a counts file (that would be my count_file). I don't understand why there are 7880 zeroprobs. When I run ngram with -ppl I get:

sai at uk-notebook:~/Desktop$ ngram -lm lm_file -debug 0 -ppl Merged_File.txt 
file Merged_File.txt: 7880 sentences, 153358 words, 0 OOVs
0 zeroprobs, logprob= -270778 ppl= 47.7932 ppl1= 58.2985
sai at uk-notebook:~/Desktop$ 

Why does the -ppl yield 0 zeroprobs and the -counts give me 7880 zeroprobs? Also why are the ppl and ppl1 values different from the -ppl ?

If there is a more detailed manual or document describing these values then I'm willing to read it.

Thanks a lot,

Sai
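
As a rough guide to reading these summary lines, the two perplexity figures can be reproduced from the other numbers printed with them. The sketch below assumes the relation described in the SRILM FAQ: logprob is a base-10 log, ppl normalizes by words plus end-of-sentence tokens, and ppl1 normalizes by words only, with OOVs and zeroprobs excluded in both cases. The function name is hypothetical.

```python
def perplexities(logprob, sentences, words, oovs=0, zeroprobs=0):
    """Recompute (ppl, ppl1) from the figures in an ngram summary line."""
    scored_words = words - oovs - zeroprobs
    # ppl also counts one end-of-sentence token per sentence.
    ppl = 10 ** (-logprob / (scored_words + sentences))
    # ppl1 leaves the end-of-sentence tokens out of the normalization.
    ppl1 = 10 ** (-logprob / scored_words)
    return ppl, ppl1


# Figures from the -ppl run quoted above.
ppl, ppl1 = perplexities(logprob=-270778, sentences=7880, words=153358)

# Figures from the -counts run quoted above (7880 zeroprobs).
ppl_c, ppl1_c = perplexities(logprob=-1.03103e6, sentences=23640,
                             words=460074, zeroprobs=7880)
```

The printed logprob is rounded, so the recomputed values agree with the reported ones only to a few significant digits.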


From stolcke at speech.sri.com  Tue Jan 22 15:36:32 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 23 Jan 2008 01:36:32 +0200
Subject: can't get right counts-entropy
In-Reply-To: <BAY108-W30DBDC567488F2C176BC4BDD3D0@phx.gbl>
References: <BAY108-W30DBDC567488F2C176BC4BDD3D0@phx.gbl>
Message-ID: <47967E00.9080500@speech.sri.com>

SAI TANG HUANG wrote:
> Hi,
>
> I have created a counts file and a back-off LM file from a text file with sentences with the following command:
>
> sai at uk-notebook:~/Desktop$ ngram-count -text Merged_File.txt -lm lm_file -write count_file 
>
> Then I ran the ngram program with -counts; here is the output:
>
> sai at uk-notebook:~/Desktop$ ngram -lm lm_file -counts count_file 
> file count_file: 23640 sentences, 460074 words, 0 OOVs
> 7880 zeroprobs, logprob= -1.03103e+06 ppl= 146.821 ppl1= 190.575
> sai at uk-notebook:~/Desktop$ 
>
> I fail to understand the output. I read that the -counts option does something with a counts file (that would be my count_file). I don't understand why there are 7880 zeroprobs. When I run ngram with -ppl I get:
>   
The 7880 zeroprobs are probably due to the <s> tokens output by the 
ngram-count program.
You cannot use the ngram-count output directly as input to ngram 
-counts. See below.
> sai at uk-notebook:~/Desktop$ ngram -lm lm_file -debug 0 -ppl Merged_File.txt 
> file Merged_File.txt: 7880 sentences, 153358 words, 0 OOVs
> 0 zeroprobs, logprob= -270778 ppl= 47.7932 ppl1= 58.2985
> sai at uk-notebook:~/Desktop$ 
>
> Why does the -ppl yield 0 zeroprobs and the -counts give me 7880 zeroprobs? Also why are the ppl and ppl1 values different from the -ppl ?
>
> If there is a more detailed manual or document describing these values then I'm willing to read it.
>   
This is not yet well documented.    To use ngram -counts correctly you 
must feed it only those N-grams that correspond to "events" in the LM, not 
those that only appear as "context".   That means you need to filter the 
ngram-count output and retain only ngrams that

- are of the highest order (e.g., trigrams for a trigram LM), OR
- start with <s> (but not the <s> unigram, see above).

For example, the sentence "a b c" in conjunction with a trigram LM 
should generate only the ngrams

<s> a
<s> a b
a b c
b c </s>

You can do this filtering with a small perl or gawk script.

Sounds like another topic for the FAQ file.

Andreas
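
The filtering rule above can be sketched in a few lines. This is a minimal illustration rather than a tested SRILM utility; it assumes each line of the count file holds the n-gram words followed by the count as the last whitespace-separated field, and the function names are hypothetical.

```python
def is_event_ngram(words, order):
    """Keep n-grams of the highest order, or shorter ones starting with <s>
    (but not the <s> unigram itself)."""
    return len(words) == order or (len(words) > 1 and words[0] == "<s>")


def filter_count_lines(lines, order):
    """Return only the count lines that are valid input for ngram -counts."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue                      # skip blank or malformed lines
        words = fields[:-1]               # last field is the count
        if is_event_ngram(words, order):
            kept.append(line.rstrip("\n"))
    return kept


# Counts for the single sentence "a b c" with a trigram LM.
raw = ["<s>\t1", "<s> a\t1", "<s> a b\t1", "a\t1", "a b\t1", "a b c\t1",
       "b\t1", "b c\t1", "b c </s>\t1", "c\t1", "c </s>\t1", "</s>\t1"]
filtered = filter_count_lines(raw, order=3)
```

For the "a b c" example this keeps exactly the four n-grams listed above.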


> Thanks a lot,
>
> Sai


From amantrac at ulb.ac.be  Wed Jan 23 09:20:40 2008
From: amantrac at ulb.ac.be (Amin Mantrach)
Date: Wed, 23 Jan 2008 18:20:40 +0100
Subject: From logproba on sentences to logproba on words
Message-ID: <FDD15D0F-61E1-4140-B2EC-846742B6A348@ulb.ac.be>

Hi,

I wonder whether it is possible to initialize an LM directly from 
log probabilities of sentences, rather than from n-gram counts or a 
text file, and if so, which command to use.  (I don't see in the help 
how to do that with ngram or ngram-count.)

Alternatively, how can I obtain log probabilities of n-grams given the 
log probabilities of all the sentences in a set of documents?

Thanks a lot.


From stolcke at speech.sri.com  Mon Jan 28 14:44:41 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 28 Jan 2008 14:44:41 -0800
Subject: can't get right counts-entropy
In-Reply-To: <BAY108-W5494994256D984AC1ABD1DD390@phx.gbl>
References: <BAY108-W30DBDC567488F2C176BC4BDD3D0@phx.gbl> <47967E00.9080500@speech.sri.com> <BAY108-W5494994256D984AC1ABD1DD390@phx.gbl>
Message-ID: <479E5AD9.5060802@speech.sri.com>

SAI TANG HUANG wrote:
> Hi,
>
> Thanks for your answer. However I'm still a bit confused. I ran the abc example you used in your last email and I got a count file of:
>
> sai at uk-notebook:~/Desktop$ ngram-count -text abc.txt
>      1
>  a   1
>  a b 1
> a       1
> a b     1
> a b c   1
> b       1
> b c     1
> b c         1
> c       1
> c   1
>     1
>
> According to you, I should only keep  
>
>  a
>  a b
> a b c
> b c 
>
> because the rest of them are "context". I understand that all the unigrams are contexts, but why do I keep " a" and chuck away all the other bigrams?
>
> Also, in what context do you mean by "context" and "event"? In P(a|b) = P(a,b)/P(b).
>
> Is P(b) the context and P(a,b) the event then?
>   
It's hard to make sense of this because somehow the <s> and </s> tags 
got deleted in the mail.   I will therefore use SB for "sentence begin" 
and SE for "sentence end".
I will try to explain better which ngrams are relevant for computing 
perplexity.  By "events" I mean the tokens that the LM predicts.  Those 
are all the tokens except the SB:

    a
    b
    c
    SE

Now, you add the context for each of those tokens, as used by the LM 
to condition its predictions.  These are the preceding two tokens, 
except near the beginning of the sentence, since you cannot go 
beyond the SB token.  So the contexts together with the predicted 
tokens are

    SB a
    SB a b
    a b c
    b c SE

Those are the ngrams that you need to feed to ngram -counts to get a 
valid perplexity.
I hope this makes more sense now.

Andreas
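
The same construction can be written directly from the text instead of by filtering ngram-count output. The sketch below is only an illustration of the description above, with hypothetical function names; it uses the real <s> and </s> tags in place of SB and SE.

```python
from collections import Counter


def event_ngrams(sentence, order=3):
    """List each predicted token together with up to order-1 preceding tokens."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    ngrams = []
    for i in range(1, len(tokens)):           # <s> itself is never predicted
        start = max(0, i - order + 1)          # cannot reach back beyond <s>
        ngrams.append(tokens[start:i + 1])
    return ngrams


def event_counts(sentences, order=3):
    """Aggregate the event n-grams of several sentences into counts."""
    counts = Counter()
    for sentence in sentences:
        for ngram in event_ngrams(sentence, order):
            counts[tuple(ngram)] += 1
    return counts


example = event_ngrams("a b c", order=3)
totals = event_counts(["a b c", "a b c"], order=3)
```

Each entry of `totals` could then be written out as one line of a counts file (words, then the count).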

> Thanks a lot,
>
> Sai
>


From amantrac at ulb.ac.be  Tue Jan 29 10:35:25 2008
From: amantrac at ulb.ac.be (Amin Mantrach)
Date: Tue, 29 Jan 2008 19:35:25 +0100
Subject: Fw: From logproba on sentences to logproba on words
In-Reply-To: <FDD15D0F-61E1-4140-B2EC-846742B6A348@ulb.ac.be>
References: <FDD15D0F-61E1-4140-B2EC-846742B6A348@ulb.ac.be>
Message-ID: <29269D3B-A655-4ACA-B160-6A805F5AF58E@ulb.ac.be>

Apparently my question has not received an answer, so I'll reformulate 
it to be clearer.

What I want is to create an LM, as with the command

    # ngram-count -text textfile -lm lmfile

In my case, I have the log probability of occurrence of every 
sentence, the same kind of value that you can obtain from 
(# ngram -lm lm_file -debug 1 -ppl testfile).

What do I want to do?  Create a new LM file built from the sentence 
probabilities I have.

Current ideas:

1 / Produce a text file with the sentences.  Each sentence can appear 
in the file multiple times; in fact it will appear exactly n times, where 
n = exp(log-proba of the sentence) * 1000 (rounded to an integer).

And then simply:  ngram-count -text newtextsentences -lm new_lm

2 / Produce a count file (with only the counts needed, i.e. of the highest 
order, etc.) and for each n-gram multiply the number of occurrences by the 
sum of the probabilities of the sentences it belongs to.
This method is clearly not fair.


Can you tell me whether either of these ideas is correct?  If not, how 
should I proceed?


I hope the question is now clear enough.

Thanks a lot for your help.
Amin.
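
Idea 1 can be sketched as follows. The scale factor of 1000 comes from the message; the log base is left as a parameter because the message writes exp() while ngram reports base-10 log probabilities, so base 10 is assumed as the default here. Function names and the toy sentences are hypothetical.

```python
def replication_counts(sentence_logprobs, scale=1000, log_base=10.0):
    """Map each sentence to n = round(scale * P(sentence))."""
    return {sentence: round(scale * log_base ** logprob)
            for sentence, logprob in sentence_logprobs.items()}


def build_corpus_lines(sentence_logprobs, scale=1000, log_base=10.0):
    """Repeat every sentence n times to form the lines of a training text."""
    lines = []
    counts = replication_counts(sentence_logprobs, scale, log_base)
    for sentence, n in counts.items():
        lines.extend([sentence] * n)
    return lines


# Toy log10 probabilities (hypothetical).
logprobs = {"a b": -1.0, "c": -2.0, "d": -5.0}
reps = replication_counts(logprobs)
corpus = build_corpus_lines(logprobs)
```

Note that a sentence whose scaled probability rounds to 0 disappears from the corpus entirely, so the scale factor limits how small a probability can be represented.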


From amantrac at ulb.ac.be  Wed Jan 30 04:49:32 2008
From: amantrac at ulb.ac.be (Amin Mantrach)
Date: Wed, 30 Jan 2008 13:49:32 +0100
Subject: From logproba on sentences to logproba on words
In-Reply-To: <C921C2D3-A286-4CAE-931C-BB6A988A74AE@fbk.eu>
References: <FDD15D0F-61E1-4140-B2EC-846742B6A348@ulb.ac.be> <29269D3B-A655-4ACA-B160-6A805F5AF58E@ulb.ac.be> <C921C2D3-A286-4CAE-931C-BB6A988A74AE@fbk.eu>
Message-ID: <BEC0100E-11C7-4325-ADB1-CD41E4211BB9@ulb.ac.be>


On 30 Jan 2008, at 09:32, Marcello Federico wrote:

> I will try to answer. You basically want to generate a corpus from
> some prior statistics over sentences, then estimate an n-gram LM
> over  such corpus. Correct?
>
Yes, I want to estimate or re-estimate an n-gram LM over that corpus, 
using prior probabilities on sentences.
> I do not see anything wrong with that, but you have to keep in mind
> that:
>
> - your corpus could not match the typical properties of real life  
> texts
>  (namely the distribution of ngrams could be very different);
>
That means that having each sentence appear in the corpus a number of 
times proportional to its prior probability does not preserve a correct 
distribution over the n-grams.

> - you could not be able to apply all smoothing methods, such as
>  kneser-ney, just because your corpus will not generate proper
>  statistics of 'rare' ngrams, for the reasons stated above.
>
> Again, I do not see anything wrong with that, just pay attention to
> the smoothing method you use. My suggestion would be to use a
> simple technique like witten-bell smoothing.
>
>
> Greetings,
> Marcello
Thanks for your answer.


From amantrac at ulb.ac.be  Wed Jan 30 06:45:09 2008
From: amantrac at ulb.ac.be (Amin Mantrach)
Date: Wed, 30 Jan 2008 15:45:09 +0100
Subject: From logproba on sentences to logproba on words
In-Reply-To: <E4D07AB09F5F044299333C8D0FEB45E903AD08AC@nrccenexb1.nrc.ca>
References: <E4D07AB09F5F044299333C8D0FEB45E903AD08AC@nrccenexb1.nrc.ca>
Message-ID: <D6DA8DFD-0A34-44A8-8386-65258E27ED6A@ulb.ac.be>

Thanks Eric for your response.

The problem with doing that is that it assumes the probability is 
distributed equally over all n-grams of a sentence.  Adding 1 for a 
unigram and 1 for a bigram means that the two n-grams contribute 
equally to the sentence probability, which is not true.
Maybe I should first compute the probability of each word:

1/ ngram-count -text corpus.txt -lm wordmodel.lm

2/ ngram -lm wordmodel.lm -ppl corpus.txt -debug 2

That way I obtain the log probability of each word in the corpus 
(without taking OOV words into account).
Then, to take the prior probabilities of the sentences into account, 
simply multiply each word's probability by the sum of the probabilities 
of the sentences it appears in.


Do you agree with that idea?


On 29 Jan 2008, at 20:37, Joanis, Eric wrote:

> Dear Amin,
>
> I would use a variant of 2):  produce a count file, and *replace* the
> counts by the sum of probabilities of the sentences where a given n- 
> gram
> occurs.
>
> The default way to count adds 1 for each occurrence, which makes sense
> when the distribution is assumed to be uniform over the observed data.
> With your data, you can replace these 1's by the actual probability
> figures you have.  You may have to worry about underflow issues when
> tallying small numbers, but otherwise the process should be simple
> enough.  You may also need to renormalize all the counts so that the
> smallest count be equal to 1, depending on which discounting scheme  
> you
> use.  Not all discounting methods take float counts, so rounding may
> also be necessary.
>
> By the way, with your modified definition of the problem, I would
> probably write my own program to build the count file, and then invoke
> the SRILM utilities afterwards for building the LM from the counts.
>
> Cheers,
>
> Eric
>
> ____________________________________________________
> Eric Joanis
> CNRC - ITI - GTLI | NRC - IIT - ILT
>


From stolcke at speech.sri.com  Wed Jan 30 23:11:44 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 30 Jan 2008 23:11:44 -0800
Subject: Another detail about the ngram-count
In-Reply-To: <BAY144-DS22DBF2D73423FE0BAD9EFDD360@phx.gbl>
References: <BAY144-DS22DBF2D73423FE0BAD9EFDD360@phx.gbl>
Message-ID: <47A174B0.8060504@speech.sri.com>

Sai Tang Huang wrote:
> Hi Andreas,
>  
> Another detail probably worth mentioning is that when I run 
> ngram-count to get the counts and create the LM I get a coeff out of 
> range warning:
>  
> warning: discount coeff 1 is out of range: -3.33329e-17
>  
> I read that this was a bug somewhere in the mailing list archive.
It's not a bug (there was a bug related to this message back in 2003, but 
it's long fixed).
What it means is that your corpus statistics are such that Good Turing 
discounting is not applicable, specifically, leading to a discounting 
factor that is effectively 0.
The effect is that discounting is disabled for this order of n-gram.
For reasons and countermeasures please check the FAQ man page or web page.
>  
> Could this be affecting the ngram -counts?
Only indirectly, in that the LM will be suboptimal.

Andreas
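
To see how a discount coefficient can come out as effectively 0, here is a minimal sketch of the textbook Katz form of Good-Turing discounting, computed from count-of-counts. It is an illustration only: the range check (a coefficient must be above 0 and at most 1) and the function name are assumptions, and SRILM's own implementation may differ in detail.

```python
def good_turing_coeffs(n, max_count, eps=1e-12):
    """Return (coeffs, out_of_range) for counts 1..max_count.

    n: count-of-counts, n[r] = number of distinct n-grams seen exactly r times.
    """
    if n.get(1, 0) == 0:
        raise ValueError("no singletons: Good-Turing estimate is undefined")

    # Term shared by all coefficients in the Katz formulation.
    common = (max_count + 1) * n.get(max_count + 1, 0) / n[1]
    denom = 1.0 - common

    coeffs, out_of_range = {}, []
    for r in range(1, max_count + 1):
        n_r = n.get(r, 0)
        if n_r == 0 or denom == 0:
            coeffs[r] = None
            out_of_range.append(r)
            continue
        coeff0 = (r + 1) * n.get(r + 1, 0) / (r * n_r)
        coeff = (coeff0 - common) / denom
        coeffs[r] = coeff
        if not (eps < coeff <= 1.0):       # assumed validity range
            out_of_range.append(r)
    return coeffs, out_of_range


# Hypothetical count-of-counts with well-behaved statistics.
ok_coeffs, ok_bad = good_turing_coeffs({1: 120, 2: 40, 3: 24, 4: 10}, max_count=3)

# Degenerate case: the two terms cancel and coefficient 1 comes out as 0.
bad_coeffs, bad_bad = good_turing_coeffs({1: 120, 2: 40}, max_count=1)
```

A corpus with no singletons at all, like the one in the earlier BOW thread, makes the formula undefined, which is why the sketch raises an error in that case.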


From bplank at science.uva.nl  Sat Mar  1 13:08:19 2008
From: bplank at science.uva.nl (B. Plank)
Date: Sat, 1 Mar 2008 22:08:19 +0100 (CET)
Subject: problems compiling 
Message-ID: <2124.82.73.146.219.1204405699.squirrel@webmail.science.uva.nl>

Hi to all,

Sorry, I have a small question. When trying to install/compile SRILM 1.5.6
I keep having problems compiling the lm files (I cannot compile
ngram, ngram-count, etc.). In more detail, when running "make World"
I get:

ERROR:  File to be installed (../bin/i686/fngram-count) does not exist.
ERROR:  File to be installed (../bin/i686/fngram-count) is not a plain file.
Usage:  decipher-install <mode> <file1> ... <fileN> <directory>
        mode:                 file permission mode, in octal
        file1 ... fileN:      files to be installed
        directory:            where the files should be installed

I found some older postings guessing it was the TCL library. But now I
have tried both 1) leaving them empty (TCL_INCLUDE and TCL_LIBRARY) and
setting NO_TCL=X, and 2) setting the include and library to

TCL_INCLUDE = -I/usr/include/tcl8.4
TCL_LIBRARY = -L/usr/lib/tcl8.4 -ltcl8.4

I'm running Debian Linux (i686). I also tried compiling the files in the
subdirectory itself, but then I get undefined-reference errors.

Thanks in advance,
Barbara


From stolcke at speech.sri.com  Sun Mar  2 23:14:38 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 02 Mar 2008 23:14:38 -0800
Subject: problems compiling
In-Reply-To: <2124.82.73.146.219.1204405699.squirrel@webmail.science.uva.nl>
References: <2124.82.73.146.219.1204405699.squirrel@webmail.science.uva.nl>
Message-ID: <47CBA55E.5000103@speech.sri.com>

B. Plank wrote:
> Hi to all,
>
> Sorry, I have a small question. When trying to install/compile SRILM 1.5.6
> I keep having problems compiling the lm files (I cannot compile
> ngram, ngram-count, etc.). In more detail, when running "make World"
> I get:
>
> ERROR:  File to be installed (../bin/i686/fngram-count) does not exist.
> ERROR:  File to be installed (../bin/i686/fngram-count) is not a plain file.
> Usage:  decipher-install <mode> <file1> ... <fileN> <directory>
>         mode:                 file permission mode, in octal
>         file1 ... fileN:      files to be installed
>         directory:            where the files should be installed
>
> I found some older postings guessing it was the TCL library. But now I
> have tried both 1) leaving them empty (TCL_INCLUDE and TCL_LIBRARY) and
> setting NO_TCL=X, and 2) setting the include and library to
>
> TCL_INCLUDE = -I/usr/include/tcl8.4
> TCL_LIBRARY = -L/usr/lib/tcl8.4 -ltcl8.4
>
> I'm running Debian Linux (i686). I also tried compiling the files in the
> subdirectory itself, but then I get undefined-reference errors.
>   
You need to send me the output of the make command.  There is no way 
of telling what went wrong otherwise.

Andreas


From sopheap.seng at gmail.com  Mon Mar  3 04:54:15 2008
From: sopheap.seng at gmail.com (Sopheap SENG)
Date: Mon, 3 Mar 2008 13:54:15 +0100
Subject: Rescore HTK lattice
Message-ID: <3b7711ea0803030454j46fce354r9816ac5b538cac13@mail.gmail.com>

Hello,

I need HTK lattice in my experiments but the sphinx3 decoder I used, could
not generate HTK lattice. So I have to convert sphinx lattice to HTK
lattice.

My problem is : the lattice generated by sphinx3 decoder provides only the
acoustic score of word transitions, I did not find the option to obtain the
lmscore in sphninx lattice.

In order to obtain HTK lattice with lmscore, first I converted sphinx
lattice to HTK SLF lattice format (I added l=0 as lmscore, the acoustic
score is  kept as it is)

Then I used lattice tool (Srilm V 1.5.2) to rescore the lattice by giving a
LM :

> lattice-tool  -in-lattice in.slf  -read-htk -lm LM.BO <http://lm.bo/>-htk-lmscale
9.5 -htk-wdpenalty 0.7 -htk-logbase 1.0003 -out-lattice out.slf  -write-htk

(the lmscale, wdpenalty and logbase are the values that I used during
lattice generation with sphninx3, the LM is the same as in sphinx3)

I obtained in the output a lattice with acoustic score and new lmscore. What
I observed is that the acoustic score in the output lattice is recalculated
using the logbase.

To verify that the output lattice in HTK format is equivalent to the
original sphinx lattice, I generated 200-best lists from the two
lattices.

- for the sphinx lattice, I used sphinx3_astar to generate the N-best list
- for the rescored HTK lattice, I used lattice-tool:

    lattice-tool -in-lattice out.slf -read-htk -lm lm.BO \
        -htk-lmscale 9.5 -htk-wdpenalty 0.7 -htk-logbase 1.0003 \
        -nbest-decode 200 -out-nbest-dir OUT/

The problem is that the order of the hypotheses in the two N-best lists is
not the same. The 1-best given by sphinx3_astar can usually be found in the
200-best list given by lattice-tool, but at a much lower rank, and sometimes
it is not found at all. However, I always find the 1-best of sphinx3_astar
in a larger N-best list from lattice-tool (N=2000).

I am convinced that this is a problem of score normalization between
sphinx and lattice-tool. If the scores were correctly normalized, I should
get the same N-best list on both sides.

Could you please give me any clues on this issue?

Thanks in advance.

Sopheap

-- 
---------------------------------------------
Sopheap SENG

Laboratoire d'Informatique de Grenoble (LIG)
Equipe GETALP Bureau C118
220, avenue de la Chimie
Campus Scientifique, BP53
38041 GRENOBLE Cedex 9, FRANCE
Tél : (33)-4-76-63-55-81
Télécopie : (33)-4-76-63-55-52
Courriel : sopheap.seng at imag.f
URL : http://www-geod.imag.fr
---------------------------------------------
Enseignant
Institut de Technologie du Cambodge
BP 86, Bd de Pochentong
Phnom Penh - Cambodge
Tél : (855)-23-88-03-70/98-24-45
Télécopie : (855)-23-88-03-69
Courriel : sopheap.seng at itc.edu.kh
URL : http://www.itc.edu.kh
---------------------------------------------

From stolcke at speech.sri.com  Mon Mar  3 09:38:17 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Mon, 03 Mar 2008 09:38:17 -0800
Subject: Rescore HTK lattice
In-Reply-To: <3b7711ea0803030454j46fce354r9816ac5b538cac13@mail.gmail.com>
References: <3b7711ea0803030454j46fce354r9816ac5b538cac13@mail.gmail.com>
Message-ID: <47CC3789.4060006@speech.sri.com>

Sopheap SENG wrote:
> Hello,
>
> I need HTK lattices for my experiments, but the sphinx3 decoder I use
> cannot generate them, so I have to convert sphinx lattices to HTK
> lattices.
>
> My problem is that the lattice generated by the sphinx3 decoder provides
> only the acoustic scores of word transitions; I did not find an option
> to obtain the LM score in the sphinx lattice.
>
> To obtain an HTK lattice with LM scores, I first converted the sphinx
> lattice to HTK SLF format (I added l=0 as the lmscore and kept the
> acoustic score as it is).
>
> Then I used lattice-tool (SRILM 1.5.2) to rescore the lattice with
> an LM:
>
>     lattice-tool -in-lattice in.slf -read-htk -lm LM.BO \
>         -htk-lmscale 9.5 -htk-wdpenalty 0.7 -htk-logbase 1.0003 \
>         -out-lattice out.slf -write-htk
>
> (the lmscale, wdpenalty, and logbase are the values that I used during
> lattice generation with sphinx3; the LM is the same as in sphinx3)
>
> I obtained in the output a lattice with acoustic score and new 
> lmscore. What I observed is that the acoustic score in the output 
> lattice is recalculated using the logbase.
My first thought is that you have to make sure the logbase specified in 
the header of your converted sphinx lattices reflects the base used by 
the actual scores.
This should be obvious, but maybe not.
>
> To verify that the output lattice in HTK format is equivalent to the
> original sphinx lattice, I generated 200-best lists from the two
> lattices.
>
> - for the sphinx lattice, I used sphinx3_astar to generate the N-best list
> - for the rescored HTK lattice, I used lattice-tool:
>
>     lattice-tool -in-lattice out.slf -read-htk -lm lm.BO \
>         -htk-lmscale 9.5 -htk-wdpenalty 0.7 -htk-logbase 1.0003 \
>         -nbest-decode 200 -out-nbest-dir OUT/
Do you have a way of generating the total (combined acoustic and LM) 
scores of the sphinx system? If so, try comparing them to the 
lattice-tool output and make sure they are the same (or nearly so, up to 
numerical issues). If they are not, repeat the comparison for each 
component score, setting the weights of all but one score (including 
wdpenalty) to zero. That way you should be able to pinpoint the source 
of any discrepancy.

Note that wdpenalty is also sensitive to logbase.
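
To illustrate why a log base mismatch changes the ranking, here is a
minimal Python sketch. It assumes the total score of a hypothesis is
acoustic + lmscale * lm + wdpenalty * (number of words), with every term
in natural log; this is an assumption for illustration, not a statement
of how sphinx3 or lattice-tool combine scores internally. The function
names and the two hypotheses with their scores are made up.

```python
import math

def to_natural_log(score, logbase):
    """Convert a log score expressed in base `logbase` to natural log."""
    return score * math.log(logbase)

def total_score(acoustic_nat, lm_nat, n_words, lmscale, wdpenalty_nat):
    """Assumed combination (illustrative only): all terms in natural log."""
    return acoustic_nat + lmscale * lm_nat + wdpenalty_nat * n_words

def rank(hyps, lmscale, wdpenalty_nat, acoustic_logbase=None):
    """Return hypothesis names sorted best-first.

    If acoustic_logbase is given, acoustic scores are converted from that
    base to natural log first; otherwise they are used unchanged, which
    mimics a lattice whose header declares the wrong base.
    """
    scored = []
    for name, acoustic, lm_nat, n_words in hyps:
        if acoustic_logbase is not None:
            acoustic = to_natural_log(acoustic, acoustic_logbase)
        score = total_score(acoustic, lm_nat, n_words, lmscale, wdpenalty_nat)
        scored.append((score, name))
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical data: (name, acoustic score in base 1.0003, LM score in ln, words)
HYPS = [("A", -300000.0, -20.0, 5), ("B", -310000.0, -19.5, 6)]

print(rank(HYPS, lmscale=9.5, wdpenalty_nat=-0.5, acoustic_logbase=1.0003))
print(rank(HYPS, lmscale=9.5, wdpenalty_nat=-0.5))
```

With the conversion applied, the LM term matters and the ranking is B
before A; without it, the unconverted acoustic scores swamp everything
else and A comes first. The same conversion would have to be applied to
a word penalty that was specified in the decoder's log base.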

Andreas


From sai_tang_huang at hotmail.com  Mon Mar 10 12:36:21 2008
From: sai_tang_huang at hotmail.com (SAI TANG HUANG)
Date: Mon, 10 Mar 2008 20:36:21 +0100
Subject: Entropy going smaller as corpus goes smaller.
Message-ID: <BAY108-W3CAAC2374C4FC2F22DAB5DD0E0@phx.gbl>


Hi everyone,

I have computed the entropy for my model with the following command:

ngram -lm small_1.lm -counts small_1.cnt -counts-entropy

where small_1.lm is a trigram model with wbdiscount created from ngram-count and where small_1.cnt is a count file only including the events we want to predict.

The output is this:

file small_1.cnt: 0 sentences, 1 words, 0 OOVs
0 zeroprobs, logprob= -4.69264 ppl= 49276.1 ppl1= 49276.1

This model was actually trained on a subset of my TRAIN.txt corpus. It also gives the following ppl on an unseen test set:

file ../TEST.txt: 840 sentences, 15700 words, 1289 OOVs
0 zeroprobs, logprob= -38595.4 ppl= 339.378 ppl1= 476.644

On the other hand, I have another model (small_2.lm), also trained on a subset of TRAIN.txt but a different subset from the one used for small_1.lm, with entropy as follows:

file small_2.cnt: 0 sentences, 1 words, 0 OOVs
0 zeroprobs, logprob= -4.03253 ppl= 10777.8 ppl1= 10777.8

and a perplexity against the same unseen test set of:

file ../TEST.txt: 840 sentences, 15700 words, 1289 OOVs
0 zeroprobs, logprob= -38792.5 ppl= 349.627 ppl1= 491.891

So my question is: why is the entropy larger for the model whose ppl is actually the smaller of the two? I thought that both measures could be used to assess the quality of a language model. How can the two numbers be so inconsistent? By the way, my TRAIN.lm (the model created from the whole training corpus) has an entropy of:

file TRAIN_EVENTS.cnt: 0 sentences, 1 words, 0 OOVs
0 zeroprobs, logprob= -11.5557 ppl= 3.59464e+11 ppl1= 3.59464e+11

which is humongous!
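
As a cross-check, the ppl and ppl1 figures in the outputs above are
consistent with ppl = 10^(-logprob / (words - OOVs - zeroprobs +
sentences)) and ppl1 = 10^(-logprob / (words - OOVs - zeroprobs)), with
logprob in base 10. A minimal Python sketch that reproduces them (the
function name `ppl_from_logprob` is just an illustrative helper, not
part of SRILM):

```python
def ppl_from_logprob(logprob, words, sentences=0, oovs=0, zeroprobs=0):
    """Recompute (ppl, ppl1) from a base-10 logprob and the token counts."""
    scored_words = words - oovs - zeroprobs
    ppl = 10 ** (-logprob / (scored_words + sentences))   # counts end-of-sentence tokens
    ppl1 = 10 ** (-logprob / scored_words)                # words only
    return ppl, ppl1

# Values copied from the outputs above.
print(ppl_from_logprob(-38595.4, words=15700, sentences=840, oovs=1289))
print(ppl_from_logprob(-38792.5, words=15700, sentences=840, oovs=1289))
print(ppl_from_logprob(-4.69264, words=1))
```

In the -counts-entropy outputs the denominator is 1 ("0 sentences, 1
words"), so the reported ppl there is simply 10^(-logprob).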

I am a complete beginner in this field and this is really not making any sense.

Any help will be greatly appreciated.

Regards to all,

Sai




From syaman at ece.gatech.edu  Tue Mar 25 12:29:08 2008
From: syaman at ece.gatech.edu (Sibel Yaman)
Date: Tue, 25 Mar 2008 15:29:08 -0400
Subject: Optimizing Weights in Log-Linear Interpolation 
Message-ID: <006301c88eae$820cf4c0$4b95d78f@ece.gatech.edu>

Hello,
I was wondering how I can train the weights in log-linear interpolation of
several language models (as in Klakow's paper).
I have successfully used the "compute-best-mix" script to optimize linear
interpolation weights, but I do not see how to modify the process to
optimize log-linear interpolation weights so that the perplexity is
minimized on a cross-validation set.
Thank you,
Sibel Yaman
From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Mon, 18 Jul 2005 07:28:40 PDT
In message <32809.213.58.88.69.1081673875.squirrel at ADDRESS HIDDEN> you
wrote:
> 
> Hi!
> 
> Does anyone know of a program or toolkit that can do log-linear
> interpolation of different language models? SRILM only permits
> linear interpolation.
> Thanks for your help,
> 
> Ciro Martins

Ciro,

sorry for the late response ;-)

There is now, in the current development version of SRILM, an
implementation of log-linear interpolation. The class name is
LoglinearMix, and the ngram -loglinear-mix option triggers its use.
Note that log-linear interpolation is much slower to evaluate than
linear interpolation, due to the need to normalize the combined LM.
This is done somewhat efficiently in SRILM by caching the normalizers
for previously seen contexts.

You might also want to try using log-linear combination of LM scores
without normalization. This can be done in the nbest or lattice
rescoring framework implemented by the toolkit, simply by computing
scores from multiple LMs.
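
A minimal Python sketch of that idea, assuming each N-best hypothesis
already carries one log score per LM. The function name, the weights,
and the toy scores are all made up for illustration; this is not the
toolkit's rescoring code.

```python
def rescore_nbest(nbest, weights):
    """Sort hypotheses best-first by a weighted sum of per-LM log scores.

    nbest   : list of (hypothesis, [log score from each LM])
    weights : one weight per LM; no normalization term is computed
    """
    def combined(scores):
        return sum(w * s for w, s in zip(weights, scores))
    return sorted(nbest, key=lambda item: combined(item[1]), reverse=True)

# Hypothetical two-LM scores for two hypotheses.
NBEST = [("the cat sat", [-12.0, -15.0]),
         ("the cat sad", [-13.0, -11.0])]

print(rescore_nbest(NBEST, weights=(1.0, 0.0))[0][0])
print(rescore_nbest(NBEST, weights=(0.5, 0.5))[0][0])
```

Because only the relative order of hypotheses matters in rescoring, the
missing normalizer does not need to be computed here.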

The latest version of the toolkit can be downloaded in the usual way
by choosing the "1.4.5 (Beta)" version in the web form.

--Andreas 


From stolcke at speech.sri.com  Wed Mar 26 10:39:14 2008
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 26 Mar 2008 10:39:14 -0700
Subject: Optimizing Weights in Log-Linear Interpolation
In-Reply-To: <006301c88eae$820cf4c0$4b95d78f@ece.gatech.edu>
References: <006301c88eae$820cf4c0$4b95d78f@ece.gatech.edu>
Message-ID: <47EA8A42.30509@speech.sri.com>

Sibel Yaman wrote:
>
> Hello,
> I was wondering how I can train the weights in log-linear 
> interpolation of several language models (as in Klakow's paper).
>
There isn't such a tool in SRILM, sorry. You would have to implement a 
gradient descent type optimization, but it's going to be slow due to the 
normalization term.

Optimizing the linear interpolation weight and then using that in the 
log-linear model might give decent results in many cases.
I'd be interested to hear from others on this list what they do.
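
For concreteness, here is a toy Python sketch of what such an
optimization has to compute. It uses context-independent (unigram)
component models over a three-word vocabulary, ties the two weights to
(lam, 1 - lam), and replaces gradient descent with a plain grid search;
all of these are simplifying assumptions for illustration, and none of
the names below exist in SRILM. With real n-gram models the normalizer
has to be recomputed for every history, which is where the cost comes
from.

```python
import math

VOCAB = ["a", "b", "c"]

# Toy component models (made-up probabilities).
P1 = {"a": 0.7, "b": 0.2, "c": 0.1}
P2 = {"a": 0.2, "b": 0.5, "c": 0.3}
HELDOUT = ["a", "b", "b", "c", "a", "b"]

def loglinear_mix(models, weights):
    """P(w) = (1/Z) * prod_i P_i(w)^weight_i, normalized over VOCAB."""
    unnorm = {
        w: math.exp(sum(lam * math.log(m[w]) for m, lam in zip(models, weights)))
        for w in VOCAB
    }
    z = sum(unnorm.values())  # the normalization term
    return {w: p / z for w, p in unnorm.items()}

def perplexity(dist, data):
    logprob = sum(math.log(dist[w]) for w in data)
    return math.exp(-logprob / len(data))

def grid_search(models, data, steps=20):
    """Try weights (lam, 1 - lam) on a grid; return (best lam, best ppl)."""
    best = None
    for i in range(steps + 1):
        lam = i / steps
        ppl = perplexity(loglinear_mix(models, (lam, 1.0 - lam)), data)
        if best is None or ppl < best[1]:
            best = (lam, ppl)
    return best

print(grid_search([P1, P2], HELDOUT))
```

The grid includes lam = 0 and lam = 1, so the best mixture found can
never be worse on the held-out data than either component alone.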

Andreas

> I have successfully used "compute-best-mix" script to use linear 
> interpolation weights but do not see how to modify the process to 
> optimize log-linear interpolation weights so that the perplexity is 
> minimized on a cross-validation set.
>
> Thank you,
> Sibel Yaman
> *From:* Andreas Stolcke <stolcke at ADDRESS HIDDEN>
> *Date:* Mon, 18 Jul 2005 07:28:40 PDT
> In message <32809.213.58.88.69.1081673875.squirrel at ADDRESS 
> HIDDEN>you wrote:
> >
> > Hi!
> >
> > Does anyone know of a program or toolkit that can do log-linear
> > interpolation of different language models? SRILM only permits
> > linear interpolation.
> > Thanks for your help,
> >
> > Ciro Martins
>
> Ciro,
>
> sorry for the late response ;-)
>
> There is now, in the current development version of SRILM, an
> implementation of log-linear interpolation. The class name is
> LoglinearMix, and the ngram -loglinear-mix option triggers its use.
> Note that log-linear interpolation is much slower to evaluate than
> linear interpolation, due to the need to normalize the combined LM.
> This is done somewhat efficiently in SRILM by caching the normalizers
> for previously seen contexts.
>
> You might also want to try using log-linear combination of LM scores
> without normalization. This can be done in the nbest or lattice
> rescoring framework implemented by the toolkit, simply by computing
> scores from multiple LMs.
>
> The latest version of the toolkit can be downloaded in the usual way
> by choosing the "1.4.5 (Beta)" version in the web form.
>
> --Andreas
>


From mlease at cs.brown.edu  Fri Mar 28 12:03:25 2008
From: mlease at cs.brown.edu (Matt Lease)
Date: Fri, 28 Mar 2008 15:03:25 -0400
Subject: ngram-class with -incremental + -save-maxclasses
In-Reply-To: <46B7608D.6080002@speech.sri.com>
References: <d4929ad00705211043v78272000odaef19023c4f1e41@mail.gmail.com>	 <200705300356.l4U3u3R26372@huge> <d4929ad00708051130nd386560od20f6c022802f27d@mail.gmail.com> <46B7608D.6080002@speech.sri.com>
Message-ID: <47ED40FD.7020408@cs.brown.edu>

What is the behavior of -save-maxclasses for ngram-class when 
-incremental is used?  My understanding of -incremental is that C, as 
specified by -numclasses, determines the number of classes for the entire 
run (i.e., C+1 while the new word is being merged into the existing C 
classes), in which case -save-maxclasses would seem not to add anything 
(i.e., perhaps it is only intended for V^3 clustering).

If one wanted to get different clusterings with the greedy algorithm 
without re-running each from scratch, it looks like you can use the 
-class-counts option and then feed this counts file into a subsequent 
invocation of ngram-class.  For example, run it initially with C=1000, 
then feed the output class counts into a second invocation with C=500, 
say.  Is this the correct procedure?

Thanks!