Re: SRILM for unicode

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Fri, 23 Jan 2004 09:47:31 PST

I'm not familiar with Unicode, unfortunately. However, SRILM does
not "interpret" characters other than for parsing lines of text into
words. It assumes that words are separated by spaces. So if your Unicode
encoding represents space characters the same way ASCII does, you should
be fine.
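
As a rough illustration of that condition (not SRILM's actual code), here is
a minimal Python sketch. It assumes the tool splits raw bytes on ASCII
whitespace; the helper name split_on_ascii_space is hypothetical. UTF-8
passes the test because every byte of a multi-byte character is 0x80 or
above, while UTF-16 does not, since ordinary characters can contain the
byte 0x20.

```python
def split_on_ascii_space(data: bytes) -> list[bytes]:
    """Split raw bytes on runs of ASCII whitespace, ignoring the encoding."""
    # bytes.split() with no argument splits on ASCII whitespace only.
    return data.split()


text = "größe 言語 モデル"

# UTF-8: the space is the single byte 0x20 and no multi-byte character
# contains it, so byte-level splitting recovers the original words.
utf8_words = split_on_ascii_space(text.encode("utf-8"))
decoded = [w.decode("utf-8") for w in utf8_words]
print(decoded)

# UTF-16: U+2020 (dagger) is encoded as the bytes 0x20 0x20, so the single
# word "a†b" is wrongly cut into two pieces.
utf16_pieces = split_on_ascii_space("a\u2020b".encode("utf-16-le"))
print(utf16_pieces)
```

Note that under this assumption a non-ASCII separator such as the
ideographic space (U+3000) would not be treated as a word boundary, so
such text would need to be converted to ASCII spaces beforehand.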

The case mapping functions (-tolower option) in various tools will
probably not work correctly for multi-byte character sets.
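
One way to sidestep this is to lowercase the corpus before handing it to
the tools. The Python sketch below contrasts a byte-wise, ASCII-only case
mapping with a Unicode-aware one. Whether -tolower behaves exactly like the
byte-wise version is an assumption (it may depend on the locale), and both
function names are hypothetical.

```python
def ascii_lower(data: bytes) -> bytes:
    """Lowercase only the ASCII letters A-Z, byte by byte."""
    # bytes.lower() leaves every byte >= 0x80 untouched.
    return data.lower()


def unicode_lower(data: bytes, encoding: str = "utf-8") -> bytes:
    """Decode, lowercase with Unicode case rules, and re-encode."""
    return data.decode(encoding).lower().encode(encoding)


line = "ÉCOLE Ärger".encode("utf-8")

# Byte-wise mapping misses the accented capitals.
print(ascii_lower(line).decode("utf-8"))

# Unicode-aware mapping lowercases them as well.
print(unicode_lower(line).decode("utf-8"))
```

Running the Unicode-aware version over the training text as a
preprocessing step means the -tolower option is not needed at all.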

--Andreas

In message <40113180.4030109 at ADDRESS HIDDEN> you wrote:
> Dear All,
> Is it possible for me to use SRILM for a text corpus that is encoded in
> Unicode format?
> Best regards.
>
