Re: Open-vocabulary LM

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Tue, 25 Feb 2003 09:02:59 PST

Amelie,

Yes, this can happen if there are no unknown words in your training
data, or if you didn't specify a vocabulary file (because then all
words are added to the vocabulary implicitly).  It is also possible
that you set ngram cutoffs such that all ngrams involving <unk> fall
below the cutoffs and are therefore excluded from the LM.
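
The three cases above can be illustrated with a small toy sketch. This is not how SRILM is implemented; the function names, the `min_count` parameter, and the sample sentences are made up for illustration, and sentence-boundary tags are left out for brevity.

```python
from collections import Counter

UNK = "<unk>"

def count_ngrams(sentences, vocab, order):
    """Count n-grams of one order. With vocab=None every word is kept
    as-is (no vocabulary file); otherwise out-of-vocabulary words are
    mapped to <unk> before counting."""
    counts = Counter()
    for sentence in sentences:
        words = [w if vocab is None or w in vocab else UNK
                 for w in sentence.split()]
        for i in range(len(words) - order + 1):
            counts[tuple(words[i:i + order])] += 1
    return counts

def unk_ngrams(counts, min_count=1):
    """Keep only n-grams that contain <unk> and occur at least
    min_count times (a hypothetical stand-in for a count cutoff)."""
    return {ng: c for ng, c in counts.items()
            if UNK in ng and c >= min_count}

# Toy data: "dog" and "emu" are outside the toy vocabulary.
sentences = ["the cat sat", "the dog sat", "the emu sat"]
vocab = {"the", "cat", "sat"}

# No vocabulary given: nothing is ever mapped to <unk>.
no_vocab = unk_ngrams(count_ngrams(sentences, None, 2))
# Vocabulary given: bigrams containing <unk> appear.
with_vocab = unk_ngrams(count_ngrams(sentences, vocab, 2))
# Same counts, but a cutoff of 3 discards every <unk> bigram.
with_cutoff = unk_ngrams(count_ngrams(sentences, vocab, 2), min_count=3)

print(no_vocab)
print(with_vocab)
print(with_cutoff)
```

Only the second case leaves bigrams with <unk> in the result; the first and third come out empty, matching the situations described above.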

To understand what's going on, run ngram-count with

-write COUNTFILE

(in addition to the other options you use) and check what ngrams are
generated containing <unk>.
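
One way to do that check is a small script that groups the <unk> entries of the count file by ngram order. This is a rough sketch that assumes each line of the count file holds the ngram's words followed by an integer count, all separated by whitespace; the function name and the sample lines are made up for illustration.

```python
from collections import defaultdict

def unk_ngrams_by_order(lines, unk="<unk>"):
    """Group ngrams containing the unknown-word token by their order.

    Assumes each line is 'w1 w2 ... wN count' separated by whitespace.
    """
    by_order = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        *words, count = fields
        if unk in words:
            by_order[len(words)].append((" ".join(words), int(count)))
    return dict(by_order)

# Made-up sample lines standing in for the contents of a count file.
sample = [
    "the\t3",
    "<unk>\t2",
    "the <unk>\t2",
    "the cat\t1",
    "<unk> sat\t2",
]

result = unk_ngrams_by_order(sample)
for order in sorted(result):
    print(order, result[order])

# With a real file written via -write COUNTFILE:
#   with open("COUNTFILE") as f:
#       result = unk_ngrams_by_order(f)
```

If the result has entries for order 1 only, no higher-order ngrams with <unk> were counted in the first place; if higher orders do show up in the counts but not in the LM, the cutoffs are the more likely cause.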

--Andreas

In message <3E5B960C.6010704 at ADDRESS HIDDEN> you wrote:
> Hi,
> Is it normal that in an open-vocabulary LM (built with the "-unk"
> option) the <unk> token is present as unigram, but not in bigrams and
> trigrams?
> (Sorry if this is a silly question, but I am not so familiar with
> language models, and I was told that it would not be the case with other
> toolkits).
> Thanks again,
>
> Amélie
>
> --
> --------------------------------------------------------------------
> Amélie DELTOUR
> ENSIMAG / Universität Karlsruhe
> E-mail : amelie.deltour at ADDRESS HIDDEN
> --------------------------------------------------------------------
>
>
