
Re: tolower option

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Wed, 14 Mar 2007 10:32:08 -0700

B. Plank wrote:
> Dear SRILM mailing list,
>
> When I try to train a language model with ngram-count and the -tolower
> option, I get the following error:
>
> assertion "i < maxWordLength" failed: file "Vocab.cc", line 97
>
> The input corpus (-text) is a UTF-8 file. Might this cause the problem?
>
> I am grateful for any suggestion.
>
>
-tolower is implemented simply by calling the C library tolower() function,
whose behavior is controlled by the OS's locale settings.
I am not sure whether tolower() works correctly on UTF-8 input; if it does,
you probably have to set LC_CTYPE to an appropriate locale. In other words,
this is all beyond the scope of what the SRILM code itself handles.

I would write a little test program that calls tolower() on some test
data to make sure it does what you want.

Andreas
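
A minimal sketch of the kind of test program suggested above, not taken from
SRILM itself. It assumes the input is lowercased one byte at a time with
tolower(), which is how a byte-oriented implementation would typically do it;
the helper names lower_bytes(), dump_bytes() and check_sample() are made up
for this illustration. It prints the bytes of a sample string before and after
lowercasing under a given LC_CTYPE locale, so you can see whether any bytes of
a multibyte UTF-8 character get altered.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Lowercase a NUL-terminated buffer in place, one byte at a time.
 * ASSUMPTION: this mimics a byte-oriented use of tolower(); it is not
 * copied from the SRILM sources. Returns the number of bytes changed. */
size_t lower_bytes(char *s)
{
    size_t changed = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* tolower() requires a value representable as unsigned char. */
        unsigned char before = (unsigned char)*s;
        unsigned char after = (unsigned char)tolower(before);

        if (after != before)
            changed++;
        *s = (char)after;
    }
    return changed;
}

/* Print every byte of s in hex so multibyte sequences are visible. */
void dump_bytes(const char *label, const char *s)
{
    printf("%-8s", label);
    for (; *s != '\0'; s++)
        printf(" %02x", (unsigned int)(unsigned char)*s);
    printf("\n");
}

/* Lowercase a copy of sample under the given LC_CTYPE locale and show
 * the bytes before and after. Returns the number of bytes changed, or
 * -1 if the locale is not installed or the sample is too long. */
long check_sample(const char *locale_name, const char *sample)
{
    char buf[256];
    size_t changed;

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    if (strlen(sample) >= sizeof buf)
        return -1;

    strcpy(buf, sample);
    dump_bytes("before:", buf);
    changed = lower_bytes(buf);
    dump_bytes("after:", buf);
    return (long)changed;
}
```

To try it, call check_sample() from a small main() with a locale name
installed on your system (or "" to pick up the environment's setting) and a
few words from your corpus. If the byte count changes for non-ASCII characters,
or the "after" bytes are no longer valid UTF-8, byte-wise tolower() is not
safe for that data under that locale.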
