Re: Problem with language-specific characters in segment

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Sun, 13 Oct 2002 08:20:53 -0700

Hi,

Sorry to hear about the problems.  I think it has to do with the fact
that the locale is never set in segment.cc.  Try putting

    setlocale(LC_CTYPE, "");
    setlocale(LC_COLLATE, "");

right at the beginning of main() in segment.cc.  (This applies to
several other programs as well, and will be fixed in the next release.)
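
As a rough illustration of why the missing call matters (a sketch only, not SRILM code): in the default "C" locale the <cctype> functions recognize only ASCII letters, so a byte such as 0xE1 (the letter á in ISO Latin 2) is not classified as alphabetic until the locale is initialized from the environment. The helper names init_locale_from_environment and count_alpha_bytes are hypothetical, and it is an assumption that the affected code paths depend on this kind of character classification.

```cpp
#include <cctype>
#include <clocale>
#include <string>

// Hypothetical helper (not part of SRILM): adopt the character-classification
// and collation rules of the user's environment (LANG, LC_ALL, ...).
// Returns false if the environment names a locale the system cannot provide.
bool init_locale_from_environment()
{
    const char *ctype = std::setlocale(LC_CTYPE, "");
    const char *collate = std::setlocale(LC_COLLATE, "");
    return ctype != nullptr && collate != nullptr;
}

// Count the bytes of `word` that the current LC_CTYPE locale treats as letters.
// Iterating as unsigned char matters: passing a negative char value to
// std::isalpha is undefined behavior.
int count_alpha_bytes(const std::string &word)
{
    int n = 0;
    for (unsigned char c : word) {
        if (std::isalpha(c)) {
            ++n;
        }
    }
    return n;
}
```

Without the locale call, count_alpha_bytes("k\xE1va") counts only the three ASCII letters. Once init_locale_from_environment() has run as the first statement of main() under a Latin-2 locale, the accented byte would be expected to count as a letter too, though the exact result depends on which locales are installed on the system.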

BTW, the -unk option only makes sense if your LM was trained with
instances of <unk> (or with the ngram-count -unk option).  Otherwise
unknown words will get zero probability either way.

--Andreas

Jáchym Kolář wrote:

> Hi to all!
> I have the following problem with the segment tool. In the output of
> segment, the <unk> token appears instead of words that include
> language-specific characters, although they are stored correctly in the
> language model file and the input text file has the same encoding
> (ISO Latin 2) as the training text.
>  Does anybody know what's the problem?
>  
> The language model was built using:
> ngram-count -write-vocab vocabulary -text train2.txt -write probs -lm lmfile2
>  
> The segment tool was run with these options:
> segment -lm lmfile2 -text test3.txt -unk -posteriors -continuous
>  
> With the -unk option disabled, I get the right words in the output, but
> the posteriors are probably not correct.
>  
> Jachym Kolar
> Department of Cybernetics
> University of West-Bohemia
> Pilsen, Czech Republic
>  
