<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html; charset=ISO-8859-1"

 http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

On 1/15/2010 11:07 AM, Manuel Alves wrote:

<blockquote

 cite="mid:495c9ccd1001151107r43a3d551hcc7de8707d359f96@mail.gmail.com"

 type="cite">Hi people.<br>

1. The LM is strange because of the filtering options since in the

training corpus the sentences begin with &lt;s&gt; and end with

&lt;/s&gt;,<br>

perhaps it is because of this.<br>

</blockquote>

I'm not sure what filtering options you are referring to, but having

&lt;s&gt; and &lt;/s&gt; around every sentence is not a problem.<br>

If you don't put them in yourself, ngram-count will add them, so it

doesn't make a difference.<br>

<br>

<blockquote

 cite="mid:495c9ccd1001151107r43a3d551hcc7de8707d359f96@mail.gmail.com"

 type="cite">2. The training corpus has 224884192 words.<br>

3.<br>

reading 2534558 1-grams<br>

reading 5070525 2-grams<br>

reading 514318 3-grams<br>

</blockquote>

You have a good-sized corpus, but also a huge vocabulary, so no wonder

you get some OOVs (i.e., the number of unique words seems to grow fast

as a function of text length).<br>

You might be able to reduce your vocabulary by mapping all words to

lower-case, or by other text conditioning steps, such as eliminating

sources that might contain non-textual data (e.g., tables, numbers) or

misspellings.<br>
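<br>
As a rough illustration of the lower-casing point, here is a minimal Python sketch (not an SRILM tool) that compares vocabulary size before and after case folding. The function name vocab_sizes and the whitespace tokenization are assumptions made for the example only.<br>
<pre>
```python
from collections import Counter


def vocab_sizes(lines):
    """Return (case-sensitive vocabulary size, lower-cased vocabulary size).

    Tokens are split on whitespace, which is an assumption of this sketch.
    """
    raw = Counter()
    lowered = Counter()
    for line in lines:
        tokens = line.split()
        raw.update(tokens)
        lowered.update(tok.lower() for tok in tokens)
    return len(raw), len(lowered)


if __name__ == "__main__":
    sample = ["The cat sat", "the Cat SAT on the mat"]
    print(vocab_sizes(sample))
```
</pre>
Running something like this over the real training text would show how much of the 2.5M-word vocabulary is due to case variants alone.<br>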

<br>

<blockquote

 cite="mid:495c9ccd1001151107r43a3d551hcc7de8707d359f96@mail.gmail.com"

 type="cite">4.You suspect of what in the training data.<br>

</blockquote>

I'm not sure what you mean here.<br>

<br>

<blockquote

 cite="mid:495c9ccd1001151107r43a3d551hcc7de8707d359f96@mail.gmail.com"

 type="cite">5. I am working on a translation system, and I want to know

whether it makes sense for a word to have zero probability (zeroprob, prob=0) just because

it does not occur in the training corpus but does occur in the test

corpus, and whether the -unk option of the ngram-count command solves the

problem.<br>

</blockquote>

In that case you really want to use -unk in both training and test.&nbsp;

This will assign some non-zero probability to previously unseen words.&nbsp;

However, you need to take steps to ensure that the training corpus

contains words NOT in your vocabulary, so actual instances of

&lt;unk&gt; occur for estimation purposes.&nbsp; Please read the items

relating to open-vocabulary LM in the FAQ.<br>
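<br>
One possible way to make sure the training text contains out-of-vocabulary words is to limit the vocabulary to the most frequent words. The Python sketch below is only an illustration under that assumption; the function names top_n_vocab and oov_tokens are hypothetical, and tokens are split on whitespace.<br>
<pre>
```python
from collections import Counter


def top_n_vocab(lines, n):
    """Return the n most frequent whitespace-separated tokens."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return [word for word, _ in counts.most_common(n)]


def oov_tokens(lines, vocab):
    """Count training tokens that fall outside the given vocabulary."""
    known = set(vocab)
    return sum(1 for line in lines for tok in line.split() if tok not in known)


if __name__ == "__main__":
    train = ["a b a c", "a b d"]
    vocab = top_n_vocab(train, 2)
    print(vocab, oov_tokens(train, vocab))
```
</pre>
The resulting word list could be written one word per line and given to ngram-count with -vocab together with -unk, so that the remaining training tokens are counted as the unknown-word token. If oov_tokens returns 0, there is nothing to estimate the unknown-word probability from.<br>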

<blockquote

 cite="mid:495c9ccd1001151107r43a3d551hcc7de8707d359f96@mail.gmail.com"

 type="cite">6. If the -unk option and the discounting methods do not solve

this problem, how can I solve it?<br>

</blockquote>

<br>

A good sanity check is to compute the perplexity of (a sample of) your

training data.&nbsp; This should be much lower than your test set

perplexity.&nbsp; If not, then you have a problem in your LM training and/or

test procedure.&nbsp; If the training ppl is low but the test ppl is high,

then your test data is just poorly matched to your training data.<br>
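<br>
For reference, the perplexity can be recomputed from the totals that ngram -ppl prints. The Python sketch below assumes the usual definition (base-10 log probability, normalized by the number of scored tokens, with OOVs and zeroprobs excluded and one end-of-sentence token per sentence); the function name perplexity is hypothetical.<br>
<pre>
```python
def perplexity(logprob10, words, oovs, zeroprobs, sentences):
    """Perplexity from a total base-10 log probability.

    Assumed normalization: every word that received a non-zero
    probability, plus one end-of-sentence token per sentence.
    """
    scored = words - oovs - zeroprobs + sentences
    if scored == 0:
        raise ValueError("no scored tokens")
    return 10 ** (-logprob10 / scored)


if __name__ == "__main__":
    # Made-up totals, for illustration only.
    print(perplexity(-6.0, words=2, oovs=0, zeroprobs=0, sentences=1))
```
</pre>
Applying the same calculation to the totals from a training sample and from the test set makes the comparison described above explicit.<br>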

<br>

Andreas<br>

<br>

<blockquote

 cite="mid:495c9ccd1001151107r43a3d551hcc7de8707d359f96@mail.gmail.com"

 type="cite"><br>

  <br>

Best Regards,<br>

Manuel.<br>

  <br>

  <br>

  <br>

  <div class="gmail_quote">On Thu, Jan 14, 2010 at 6:01 PM, Andreas

Stolcke <span dir="ltr">&lt;<a moz-do-not-send="true"

 href="mailto:stolcke@speech.sri.com">stolcke@speech.sri.com</a>&gt;</span>

wrote:<br>

  <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

    <div class="im">On 1/14/2010 8:49 AM, Manuel Alves wrote:<br>

    <blockquote class="gmail_quote"

 style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

&nbsp; &nbsp;p( &lt;/s&gt; | . ...) &nbsp; &nbsp; = &nbsp;0.999997 [ -1.32346e-06 ]<br>

    </blockquote>

    <br>

    </div>

You have a very strange LM since almost all the probability mass in

your LM is on the end-of-sentence tag.<br>

How many words are in your training corpus?<br>

How many unigrams, bigrams, and trigrams are in your LM?<br>

I suspect some basic problem with the preparation of your training data.<br>

    <font color="#888888"><br>

Andreas<br>

    <br>

    </font></blockquote>

  </div>

  <br>

</blockquote>

<br>

</body>

</html>