Open-vocabulary LM

Andreas Stolcke stolcke at speech.sri.com
Tue Feb 25 09:02:59 PST 2003


Amélie,

It is possible if there are no unknown words in your data, or if
you didn't specify a vocabulary file (because then all words are
added to the vocabulary implicitly, so nothing is mapped to <unk>).
It is also possible that you set ngram count cutoffs such that all
ngrams involving <unk> fall below the cutoffs and are therefore
excluded from the LM.
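
For illustration, here is a hypothetical training command (the file
names and cutoff values are made up) where this would happen: with
-gt2min 2 and -gt3min 2, any bigram or trigram containing <unk> that
occurs only once in the training data is discarded, so <unk> survives
only as a unigram.

  ngram-count -order 3 -text train.txt -vocab vocab.txt -unk \
      -gt2min 2 -gt3min 2 -lm open.lm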

To understand what's going on, run ngram-count with

-write COUNTFILE

(in addition to the other options you use) and check which ngrams
containing <unk> are generated.
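
For example (file names are again placeholders), something like:

  ngram-count -order 3 -text train.txt -vocab vocab.txt -unk \
      -write counts.txt -lm open.lm
  grep '<unk>' counts.txt

If <unk> bigrams and trigrams show up in counts.txt but not in the
resulting LM, the count cutoffs are the likely culprit.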

--Andreas

In message <3E5B960C.6010704 at ira.uka.de> you wrote:
> Hi,
> Is it normal that in an open-vocabulary LM (built with the "-unk"
> option) the <unk> token is present as a unigram, but not in bigrams
> or trigrams?
> (Sorry if this is a silly question, but I am not very familiar with
> language models, and I was told that this would not be the case with
> other toolkits.)
> Thanks again,
> 
> Amélie
> 
> -- 
> --------------------------------------------------------------------
> Amélie DELTOUR
> ENSIMAG / Universität Karlsruhe
> E-mail : amelie.deltour at ira.uka.de
> --------------------------------------------------------------------