[SRILM User List] Fwd: Fwd: Fwd: ngram-count

Manuel Alves beleira at gmail.com
Mon Jan 11 04:00:00 PST 2010


---------- Forwarded message ----------
From: Manuel Alves <beleira at gmail.com>
Date: Mon, Jan 11, 2010 at 11:49 AM
Subject: Re: [SRILM User List] Fwd: Fwd: ngram-count
To: Andreas Stolcke <stolcke at speech.sri.com>


Hi Andreas.
The output of the ngram-count run was:

    [root at localhost Corporas]# ../srilm/bin/i686/ngram-count -order 3 -text CETEMPublico1.7 -lm LM
    warning: discount coeff 1 is out of range: 1.44451e-17

I don't know if there is any problem with the GT discounting method.
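
(If there is, I suppose I could retrain with a different discounting method instead of the GT default, for example something like

    ../srilm/bin/i686/ngram-count -order 3 -text CETEMPublico1.7 -wbdiscount -lm LM

for Witten-Bell discounting, or -kndiscount if the count-of-count statistics allow it. Would that be a reasonable thing to try here?)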


On Fri, Jan 8, 2010 at 9:52 PM, Andreas Stolcke <stolcke at speech.sri.com>wrote:

>  On 1/8/2010 3:57 AM, Manuel Alves wrote:
>
>
>
> ---------- Forwarded message ----------
> From: Manuel Alves <beleira at gmail.com>
> Date: Fri, Jan 8, 2010 at 10:40 AM
> Subject: Re: Fwd: ngram-count
> To: Andreas Stolcke <stolcke at speech.sri.com>
>
>
> 1. ngram-count -text CETEMPublico1.7 -lm LM
> 2. I test it in this way:
>    I use the client-server architecture of SRILM:
>    SERVER : ngram -lm ../$a -server-port 100 -order 3
>    CLIENT : ngram -use-server 100\@localhost -cache-served-ngrams -ppl $ficheiro -debug 2 2>&1
>    where $ficheiro is this:
>
>
>
>
>
>     p( observássemos | que ...)     =  0 [ -inf ]
>
>
>  file final.txt: 6 sentences, 126 words, 0 OOVs
> 6 zeroprobs, logprob= -912.981 ppl= 1.7615e+07 ppl1= 4.05673e+07
>
>
> It looks to me like everything is working as intended.   You are getting
> zeroprobs, but not a large number of them.
> They are low-frequency words (like the one above), so this makes sense: they
> are probably just not contained in the training corpus.
>
> The perplexity is quite high, but that could be because of a small or
> mismatched training corpus.   You didn't include the output of the
> ngram-count program; it's possible that the GT (default) discounting method
> reported some problems that are not evident from your mail.
>
> One thing to note is that with network-server LMs you don't get OOVs,
> because all words are implicitly added to the vocabulary. Consequently, OOVs
> are counted as zeroprobs instead, but both types of tokens are equivalent
> for perplexity computation.
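>
> (As a quick sanity check, assuming the usual SRILM convention
>          ppl  = 10^(-logprob / (words - OOVs - zeroprobs + sentences))
>          ppl1 = 10^(-logprob / (words - OOVs - zeroprobs)),
> your numbers are consistent: 10^(912.981 / (126 - 0 - 6 + 6)) is about 1.76e+07
> and 10^(912.981 / (126 - 0 - 6)) is about 4.06e+07, matching the reported ppl
> and ppl1, so the zeroprob tokens are indeed being left out of the normalization.)
>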
> Still, you could run
>          ngram -lm ../$a -order 3  -ppl $ficheiro -debug 2
> just to make sure you're getting the same result.
>
> Andreas
>
>
>  Manuel Alves.
>
> On Thu, Jan 7, 2010 at 8:35 PM, Andreas Stolcke <stolcke at speech.sri.com>wrote:
>
>>  On 1/6/2010 10:34 AM, Manuel Alves wrote:
>>
>>
>>
>> ---------- Forwarded message ----------
>> From: Manuel Alves <beleira at gmail.com>
>> Date: Wed, Jan 6, 2010 at 6:33 PM
>> Subject: ngram-count
>> To: srilm-user at speech.sri.com
>>
>>
>> Hi people.
>> I need help with ngram-count: I am training a model, but when I try to use it
>> on a test example, it gives me zeroprobs in the output.
>> Does this mean that the model is badly trained?
>> Please answer me.
>> Best regards,
>> Manuel Alves.
>>
>>
>>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
>

