Re: SRILM beginning and end tokens?

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Tue, 20 Mar 2007 20:27:00 -0800

In message <20070320233327.E8AD478B51 at ADDRESS HIDDEN> you wrote:
> Dear Andreas,
>
> I am very grateful to benefit from your work by using this toolkit.  It's
> great!
>
> I noticed it adds <s> and </s> tokens if they aren't there.  However, I'm
> modelling with trigrams, and it seems to add only one begin/end pair per
> sentence.  Is there an option I missed, or do I need to insert them myself?

For </s>, there is never a reason to add more than one such token, since
the last n-gram probability that goes into the sentence probability is

p( </s> | ... )

For <s>, you also need no more than one token, since the backoff will
establish that

p( w1 | ... <s> ) = p( w1 | <s> )
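
The backoff argument can be checked on a toy model. The sketch below is not SRILM code: the tables, the made-up log10 values, and the names logprob and sentence_logprob are all hypothetical. It assumes the usual backoff convention that a context never seen in training contributes a backoff weight of 1 (log 0).

```python
# Toy backoff trigram model. All values are invented log10 numbers,
# chosen only to illustrate the lookup; this is not SRILM's implementation.
LOGPROB = {
    ("<s>",): -99.0,
    ("a",): -0.5,
    ("b",): -0.7,
    ("</s>",): -0.6,
    ("<s>", "a"): -0.3,
    ("a", "b"): -0.4,
    ("b", "</s>"): -0.2,
    ("<s>", "a", "b"): -0.1,
}
BACKOFF = {
    ("<s>",): -0.2,
    ("a",): -0.3,
    ("b",): -0.1,
    ("<s>", "a"): -0.05,
    ("a", "b"): -0.15,
}


def logprob(word, context):
    """Return log10 p(word | context) using backoff to shorter contexts."""
    ngram = context + (word,)
    if ngram in LOGPROB:
        return LOGPROB[ngram]
    if not context:
        return float("-inf")  # word unknown to the toy model
    # Assumption: an unseen context has backoff weight 1, i.e. log10 = 0.0.
    return BACKOFF.get(context, 0.0) + logprob(word, context[1:])


def sentence_logprob(words, order=3, n_start=1):
    """Score a sentence padded with n_start copies of <s> and one </s>."""
    tokens = ["<s>"] * n_start + list(words) + ["</s>"]
    total = 0.0
    # Only real words and the single </s> are predicted, never <s> itself.
    for i in range(n_start, len(tokens)):
        context = tuple(tokens[max(0, i - order + 1):i])
        total += logprob(tokens[i], context)
    return total


one_start = sentence_logprob(["a", "b"], n_start=1)
two_starts = sentence_logprob(["a", "b"], n_start=2)
```

Under these assumptions the two scores agree: the context "<s> <s>" is never in the model, so the lookup for the first word falls through to p( w1 | <s> ) with no penalty.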

I know that some other implementations add additional higher-order ngrams
by filling in multiple copies of <s>, but I believe that is not well motivated.
It could also lead to unnatural count-of-count statistics for KN and GT
smoothing.
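
To see how extra padding changes the statistics that KN and GT smoothing start from, here is a small sketch on an invented three-sentence corpus. The helper names ngram_counts and count_of_counts are hypothetical and this is not how SRILM counts internally. It only shows that the count-of-count table changes, not how much that matters for a real model.

```python
from collections import Counter


def ngram_counts(sentences, order=3, n_start=1):
    """Count n-grams of the given order after padding each sentence."""
    counts = Counter()
    for words in sentences:
        tokens = ["<s>"] * n_start + list(words) + ["</s>"]
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts


def count_of_counts(counts):
    """Map each frequency r to the number of n-gram types seen r times."""
    return Counter(counts.values())


# Invented corpus, for illustration only.
corpus = [["a", "b"], ["a", "c"], ["b", "c"]]

single = count_of_counts(ngram_counts(corpus, n_start=1))
double = count_of_counts(ngram_counts(corpus, n_start=2))
```

With one <s> every trigram type in this corpus occurs once. With two, the padded trigrams "<s> <s> a" and "<s> <s> b" are added, which shifts the count-of-count table even though the text itself has not changed.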

Andreas

>
> Thank you!
> -Amber
>
>
> \   L. Amber Wilcox-O'Hearn * http://www.cs.toronto.edu/~amber/   /
> -\  Graduate student * Computational Linguistics Research Group  /-
> --\   Department of Computer Science * University of Toronto    /--