Re: SRILM transcription format

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Tue, 19 Mar 2002 15:15:25 PST

Ben,

SRILM does not rely on any fancy transcription conventions.
It tokenizes the input using the strtok() function from the C library.
It doesn't know about XML or any other tagging schemes.

What this boils down to is:

Everything that is separated by whitespace (spaces, tabs, newlines) is
considered a word.  Case distinctions are preserved unless you use the
"-tolower" option in various tools.  Punctuation is treated as just another
non-whitespace character.  So you would have to strip punctuation if you
wanted to ignore it in your modeling, or surround punctuation marks with
whitespace if you wanted to model them as word tokens of their own.
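
As a rough illustration of the behavior described above, here is a small
Python sketch. It is not SRILM code: the function names, the delimiter set
(space, tab, carriage return, newline), and the list of punctuation marks are
all assumptions made for the example, and the regular-expression split only
imitates what strtok() does with whitespace delimiters.

```python
import re

# Assumed whitespace delimiters; the exact set SRILM passes to strtok()
# is not given in this message.
WHITESPACE_RUN = re.compile(r"[ \t\r\n]+")

# Hypothetical choice of marks to handle; the apostrophe is left out on
# purpose so that words like "Don't" stay intact.
PUNCTUATION = ".,?!"


def tokenize(text, tolower=False):
    """Split text into words on runs of whitespace, strtok()-style."""
    words = [w for w in WHITESPACE_RUN.split(text) if w]
    # Mimics the effect of a "-tolower" style option: fold case if asked.
    return [w.lower() for w in words] if tolower else words


def strip_punctuation(text, marks=PUNCTUATION):
    """Delete the given marks so they are ignored in modeling."""
    return text.translate(str.maketrans("", "", marks))


def separate_punctuation(text, marks=PUNCTUATION):
    """Surround each mark with spaces so it becomes its own word token."""
    return re.sub("([" + re.escape(marks) + "])", r" \1 ", text)


line = "Hello, world. Don't stop!"

print(tokenize(line))
# ['Hello,', 'world.', "Don't", 'stop!']
print(tokenize(line, tolower=True))
# ['hello,', 'world.', "don't", 'stop!']
print(tokenize(strip_punctuation(line)))
# ['Hello', 'world', "Don't", 'stop']
print(tokenize(separate_punctuation(line)))
# ['Hello', ',', 'world', '.', "Don't", 'stop', '!']
```

Without any preprocessing, "Hello," and "Hello" would count as two different
words, which is why the stripping or separating step has to happen before the
text reaches the tools.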

--Andreas

In message <000701c1cf99$5416a280$dd00a8c0 at ADDRESS HIDDEN> you wrote:
> Andreas,
>
> Hello, could you point me to a document describing in detail the
> transcription conventions for SRILM tools?
>
> For example, can words be capitalized? What punctuation is permitted
> (apostrophe? period? comma?)
>
> Thank you,
>
> ________________________________
> Ben Reaves      benreaves at ieee.org
>
>
