
Problem with language-specific characters in segment

From: Jáchym Kolář <jachym at ADDRESS HIDDEN>
Date: Fri, 11 Oct 2002 13:47:17 +0200

Hi all,
I have the following problem with the segment tool. In its output, the
<unk> token appears in place of words that contain language-specific
characters, even though those words are stored correctly in the language
model file and the input text file uses the same encoding (ISO Latin 2)
as the training text.
Does anybody know what the problem is?

The language model was built using:
ngram-count -write-vocab vocabulary -text train2.txt -write probs -lm lmfile2

The segment tool was run with these options:
segment -lm lmfile2 -text test3.txt -unk -posteriors -continuous

With the -unk option disabled, I get the right words in the output, but
the posteriors are probably not correct.
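
One way to narrow this down is to check whether the affected words in test3.txt are byte-for-byte identical to the entries in the vocabulary file. The following is only a rough sketch in Python, not part of SRILM. It assumes that the file written by -write-vocab holds one word per line and that tokens in the text are separated by ASCII whitespace. The function names load_vocab and find_oov are made up for this illustration. It cannot explain what segment does internally; it only shows whether the two files really agree on the encoding of each word.

```python
def load_vocab(path):
    """Read a vocabulary file (assumed: one word per line) as raw bytes."""
    with open(path, "rb") as f:
        return {line.strip() for line in f if line.strip()}


def find_oov(text_path, vocab):
    """Return the set of tokens in the text file that are not in the vocabulary.

    Tokens are compared as raw bytes, so a word stored in a different
    encoding than the vocabulary entry is reported as missing.
    """
    missing = set()
    with open(text_path, "rb") as f:
        for line in f:
            # bytes.split() with no argument splits on ASCII whitespace only
            for token in line.split():
                if token not in vocab:
                    missing.add(token)
    return missing


def describe(token):
    """Show the raw bytes next to an ISO-8859-2 decoding for inspection."""
    return "%r -> %s" % (token, token.decode("iso-8859-2", errors="replace"))
```

For the files named above, this could be used as: for each token in sorted(find_oov("test3.txt", load_vocab("vocabulary"))), print describe(token). If words with accented characters show up in that list, the test text and the vocabulary do not contain the same bytes for them. If the list is empty, the files agree and the cause has to be looked for elsewhere.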

Jachym Kolar
Department of Cybernetics
University of West-Bohemia
Pilsen, Czech Republic

