[SRILM User List] Fwd: Fwd: ngram-count

Andreas Stolcke stolcke at speech.sri.com
Mon Jan 18 10:21:18 PST 2010


On 1/15/2010 11:07 AM, Manuel Alves wrote:
> Hi people.
> 1. The LM is strange because of the filtering options, since in the
> training corpus the sentences begin with <s> and end with </s>;
> perhaps it is because of this.
I'm not sure what filtering options you are referring to, but having <s> 
and </s> around every sentence is not a problem.
If you don't put them in yourself, ngram-count will add them, so it 
doesn't make a difference.
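A quick way to see this for yourself (a sketch; SRILM tools accept "-"
for stdin/stdout) is to count a sentence that has no tags and inspect
the resulting counts, which will include <s> and </s>:

    echo "hello world" | ngram-count -text - -order 2 -write -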

> 2. The training corpus has 224884192 words.
> 3.
> reading 2534558 1-grams
> reading 5070525 2-grams
> reading 514318 3-grams
You have a good-sized corpus but also a huge vocabulary (the number of 
unique words seems to grow quickly with text length), so it is no 
wonder you get some OOVs.
You might be able to reduce your vocabulary by mapping all words to 
lower case, or by other text-conditioning steps, such as eliminating 
sources that are likely to contain non-textual data (e.g., tables, 
numbers) or misspellings.
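For illustration, here is a minimal sketch of the lower-casing step
(file names are placeholders): either lower-case the corpus before
counting, or let ngram-count's -tolower option map the vocabulary to
lower case for you.

    # lower-case the corpus externally before counting ...
    tr '[:upper:]' '[:lower:]' < train.txt > train.lc.txt
    # ... or let ngram-count map the vocabulary to lower case itself
    ngram-count -text train.txt -tolower -order 3 -lm lm.gz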

> 4. What do you suspect about the training data?
I'm not sure what you mean here.

> 5. I am working on a translation system and I want to know if it 
> makes sense for a word to have zero probability (prob=0) just because 
> the word does not exist in the training corpus but exists in the test 
> corpus, and whether the -unk option to ngram-count solves the problem?
In that case you really want to use -unk in both training and test.  
This will assign some non-zero probability to previously unseen words.  
However, you need to take steps to ensure that the training corpus 
contains words NOT in your vocabulary, so that actual instances of 
<unk> occur for estimation purposes.  Please read the FAQ items 
relating to open-vocabulary LMs.
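As a rough sketch of that recipe (file names and the count cutoff are
hypothetical), you would fix a vocabulary that deliberately does not
cover all training words, then train with -vocab and -unk so the
leftover words are mapped to <unk>:

    # collect unigram counts, then keep words occurring at least twice
    ngram-count -text train.txt -order 1 -write train.1cnt
    awk '$2 >= 2 { print $1 }' train.1cnt > vocab.txt
    # words outside vocab.txt become <unk> and get probability mass
    ngram-count -text train.txt -order 3 -vocab vocab.txt -unk -lm lm.gz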
> 6. If the -unk option and the discounting methods do not solve this 
> problem, how do I solve it?

A good sanity check is to compute the perplexity of (a sample of) your 
training data.  This should be much lower than your test-set 
perplexity.  If it is not, then you have a problem in your LM training 
and/or test procedure.  If the training perplexity is low but the test 
perplexity is high, then your test data is simply poorly matched to 
your training data.
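For example, assuming the LM and file names from the sketches above:

    # perplexity on (a sample of) the training data
    ngram -lm lm.gz -order 3 -unk -ppl train.sample.txt
    # perplexity on the test data; expect it to be higher,
    # but not drastically so
    ngram -lm lm.gz -order 3 -unk -ppl test.txt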

Andreas

>
>
> Best Regards,
> Manuel.
>
>
>
> On Thu, Jan 14, 2010 at 6:01 PM, Andreas Stolcke 
> <stolcke at speech.sri.com <mailto:stolcke at speech.sri.com>> wrote:
>
>     On 1/14/2010 8:49 AM, Manuel Alves wrote:
>
>            p( </s> | . ...)     =  0.999997 [ -1.32346e-06 ]
>
>
>     You have a very strange LM since almost all the probability mass
>     in your LM is on the end-of-sentence tag.
>     How many words are in your training corpus?
>     How many unigrams, bigrams, and trigrams are in your LM?
>     I suspect some basic problem with the preparation of your
>     training data.
>
>     Andreas
>
>
