From logproba on sentences to logproba on words

Amin Mantrach amantrac at ulb.ac.be
Wed Jan 30 04:49:32 PST 2008


Le 30-janv.-08 à 09:32, Marcello Federico a écrit :

> I will try to answer. You basically want to generate a corpus from
> some prior statistics over sentences, then estimate an n-gram LM
> over that corpus. Correct?
>
Yes, I want to estimate (or re-estimate) an n-gram LM over that corpus,
using prior probabilities on the sentences.
> I do not see anything wrong with that, but you have to keep in mind
> that:
>
> - your corpus might not match the typical properties of real-life
>   texts (namely, the distribution of n-grams could be very different);
>
It means that including each sentence in the corpus a number of times
proportional to its prior probability does not preserve a correct
distribution over the n-grams.

> - you might not be able to apply all smoothing methods, such as
>   Kneser-Ney, because your corpus will not generate proper
>   statistics for 'rare' n-grams, for the reasons stated above.
>
> Again, I do not see anything wrong with that, just pay attention to
> the smoothing method you use. My suggestion would be to use a
> simple technique like Witten-Bell smoothing.
>
>
> Greetings,
> Marcello
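Marcello's suggestion maps to ngram-count's -wbdiscount option; a minimal invocation might look like the following (the file names and the trigram order are placeholders, not from the original message):

```shell
# Estimate a trigram LM with Witten-Bell discounting instead of the
# default Good-Turing discounting, which can fail on a generated corpus
# with unusual counts of 'rare' n-grams.
ngram-count -text newtextsentences -order 3 -wbdiscount -lm new_lm
```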
Thanks for your answer.
>
>
>
>
> On Jan 29, 2008, at 7:35 PM, Amin Mantrach wrote:
>
>> Apparently my question hasn't received any answer, so I'll
>> reformulate it to be clearer.
>>
>> Actually, I want to create an LM with the command:
>>
>>   ngram-count -text textfile -lm lmfile
>>
>>
>> In the case I'm concerned with, I have the log-probability of
>> occurrence for every sentence, the same values you can obtain from:
>>
>>   ngram -lm lm_file -debug 1 -ppl testfile
>>
>> What do I want to do? Create a new LM file built from the sentence
>> probabilities I have.
>>
>> Current ideas:
>>
>> 1/ Produce a text file with the sentences. Each sentence can appear
>> in the file multiple times; in fact, it will appear exactly n times,
>> where n = round(exp(log-proba of the sentence) * 1000).
>>
>> And then simply: ngram-count -text newtextsentences -lm new_lm
>>
>> 2/ Produce a count file (with only the counts needed, i.e. of the
>> highest order, etc.), and for each n-gram multiply the number of
>> occurrences by the sum of the probabilities of the sentences it
>> belongs to.
>> This method is clearly not sound.
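Idea 1 above can be sketched in a few lines of Python. The (sentence, log-probability) pairs below are hypothetical, and natural log is assumed; SRILM's `ngram -ppl` output actually reports log base 10, so you would use `10 ** logprob` with real SRILM output.

```python
import math

# Hypothetical (sentence, log-probability) pairs, standing in for the
# values one would extract from `ngram -lm lm_file -debug 1 -ppl testfile`.
sentences = [
    ("the cat sat on the mat", math.log(0.002)),
    ("a dog ran in the park", math.log(0.001)),
]

SCALE = 1000  # the arbitrary scaling factor from idea 1

with open("newtextsentences", "w") as out:
    for sentence, logprob in sentences:
        # Replication count: n = round(exp(log-proba) * 1000)
        n = round(math.exp(logprob) * SCALE)
        out.write((sentence + "\n") * n)
```

The resulting file is then fed to `ngram-count -text newtextsentences -lm new_lm`. Note that rounding discards any sentence whose scaled probability falls below 0.5, which is one way the generated corpus distorts the n-gram distribution.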
>>
>>
>> Could you tell me whether either of these ideas is correct? If not,
>> how should I proceed?
>>
>>
>> I hope the question is now clear enough.
>>
>> Thanks a lot for your help.
>> Amin.
>>
>>
>





More information about the SRILM-User mailing list