Ben,
SRILM does not rely on any fancy transcription conventions.
It tokenizes the input using the strtok() function from the C library.
It doesn't know about XML or any other tagging schemes.
What this boils down to is:
Everything that is separated by whitespace (spaces, tabs, newlines) is
considered a word. Case distinctions are preserved unless you use the
"-tolower" option in various tools. Punctuation is treated as just another
non-whitespace character. So you would have to strip punctuation if you
wanted to ignore it in your modeling, or surround punctuation marks with
whitespace if you wanted to model them as word tokens of their own.
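
To make the two preprocessing options concrete, here is a minimal sketch in Python. It is not part of SRILM: the function names and the PUNCT set are hypothetical, chosen only for illustration, and str.split() only approximates the whitespace splitting described above. The PUNCT set deliberately leaves out the apostrophe so that words like "It's" stay intact; adjust it to taste.

```python
import re

# Hypothetical set of punctuation marks to handle; the apostrophe is
# left out on purpose so contractions survive as single tokens.
PUNCT = ".,;:!?"
_PUNCT_CLASS = "[" + re.escape(PUNCT) + "]"


def tokenize(line):
    """Split on runs of whitespace, approximating whitespace-only tokenization."""
    return line.split()


def strip_punctuation(line):
    """Delete the marks in PUNCT so they are ignored in modeling."""
    return re.sub(_PUNCT_CLASS, "", line)


def separate_punctuation(line):
    """Surround each mark in PUNCT with spaces so it becomes its own token."""
    return re.sub("(" + _PUNCT_CLASS + ")", r" \1 ", line)


if __name__ == "__main__":
    text = "Hello, world.\tIt's fine\n"
    print(tokenize(text))                        # punctuation stays glued to words
    print(tokenize(strip_punctuation(text)))     # punctuation removed
    print(tokenize(separate_punctuation(text)))  # punctuation as separate tokens
```

You would run something like this over your training text before handing it to the tools, and apply the same preprocessing to any test data so that the vocabularies match.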
--Andreas
In message <000701c1cf99$5416a280$dd00a8c0 at ADDRESS HIDDEN> you wrote:
> Andreas,
>
> Hello, could you point me to a document describing in detail the
> transcription conventions for SRILM tools?
>
> For example, can words be capitalized? What punctuation is permitted
> (apostrophe? period? comma?)
>
> Thank you,
>
> ________________________________
> Ben Reaves benreaves at ieee.org
>
>