<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META content="text/html; charset=iso-8859-2" http-equiv=Content-Type>
</HEAD>
<BODY bgColor=#ffffff>
<DIV><FONT face="Arial CE" size=2>Hi to all!</FONT></DIV>
<DIV><FONT face="Arial CE" size=2>I have the following problem with the
<EM>segment</EM> tool. In its output, the &lt;unk&gt; token appears
in place of words containing language-specific characters, although these words
are stored correctly in the language model file and the input text file uses the same
encoding (ISO Latin 2) as the training text. </FONT></DIV>
<DIV><FONT face="Arial CE" size=2>Does anybody know what the
problem is?</FONT></DIV>
<DIV><FONT face="Arial CE" size=2></FONT> </DIV>
<DIV><FONT face="Arial CE" size=2>The language model was built using:</FONT></DIV>
<DIV><FONT face="Arial CE" size=2>ngram-count -write-vocab vocabulary -text
train2.txt -write probs -lm lmfile2</FONT></DIV>
<DIV><FONT face="Arial CE" size=2></FONT> </DIV>
<DIV><FONT face="Arial CE" size=2>The segment tool was run with the following
options:</FONT></DIV>
<DIV><FONT face="Arial CE" size=2>segment -lm lmfile2 -text test3.txt -unk
-posteriors -continuous</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face="Arial CE" size=2>With the -unk option disabled, I get the right words
in the output, but the posteriors are probably not correct.</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face="Arial CE" size=2>Jachym Kolar</FONT></DIV>
<DIV><FONT face="Arial CE" size=2>Department of Cybernetics</FONT></DIV>
<DIV><FONT face="Arial CE" size=2>University of West-Bohemia</FONT></DIV>
<DIV><FONT face="Arial CE" size=2>Pilsen, Czech Republic</FONT></DIV>
<DIV> </DIV></BODY></HTML>