segment - segment text using N-gram language model
segment [ -help ] option ...
segment infers the most likely segmentation (location of segment boundaries)
of a text, based on a segment language model.
The language model is a standard backoff N-gram model in ARPA format,
modeling segmentation using the boundary tags <s> and </s>.
The program reads in a word sequence, finds the most likely locations
of segment boundaries according to the language model, and
outputs the word sequence with segment boundaries marked by <s> tags.
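
The search described above can be sketched as a small two-state Viterbi
pass. This is a minimal illustration under stated assumptions, not the
program's actual implementation: the scorer logprob() is a hypothetical
stand-in for a real backoff N-gram model, contexts are limited to trigram
order, all log-probabilities are assumed finite, and the start of the
sequence is always marked as a boundary.

```python
import math

def viterbi_segment(words, logprob, stag="<s>", etag="</s>"):
    """Return words with stag inserted at the most likely boundaries.

    logprob(word, context) is a hypothetical scorer returning
    log P(word | context), with context a tuple of up to two tokens.
    """
    n = len(words)
    if n == 0:
        return []
    neg = -math.inf
    # State 1: a boundary immediately precedes the current word; state 0: none.
    # The first word always starts a segment, so only state 1 is reachable.
    score = [neg, logprob(words[0], (stag,))]
    back = []  # back[i - 1][state] = best predecessor state for word i
    for i in range(1, n):
        new_score = [neg, neg]
        ptr = [0, 0]
        for prev in (0, 1):
            if score[prev] == neg:
                continue
            # The history of the previous word depends on its own state.
            ctx = (stag, words[i - 1]) if prev else (words[i - 2], words[i - 1])
            cont = score[prev] + logprob(words[i], ctx)
            brk = score[prev] + logprob(etag, ctx) + logprob(words[i], (stag,))
            for state, s in ((0, cont), (1, brk)):
                if s > new_score[state]:
                    new_score[state], ptr[state] = s, prev
        score = new_score
        back.append(ptr)
    # Close the final segment with the end tag and pick the better end state.
    finals = []
    for state in (0, 1):
        if score[state] == neg:
            finals.append(neg)
            continue
        ctx = (stag, words[-1]) if state else (words[-2], words[-1])
        finals.append(score[state] + logprob(etag, ctx))
    state = 0 if finals[0] > finals[1] else 1
    # Trace back the best state sequence.
    boundary = [0] * n
    for i in range(n - 1, -1, -1):
        boundary[i] = state
        if i > 0:
            state = back[i - 1][state]
    out = []
    for word, is_boundary in zip(words, boundary):
        if is_boundary:
            out.append(stag)
        out.append(word)
    return out

def toy_logprob(word, context):
    """Toy scorer for illustration only: looks at the last context token."""
    table = {
        ("<s>", "hi"): -0.1, ("hi", "there"): -0.1, ("there", "</s>"): -0.2,
        ("<s>", "how"): -0.1, ("how", "are"): -0.1, ("are", "you"): -0.1,
        ("you", "</s>"): -0.1,
    }
    return table.get((context[-1], word), -5.0)
```

With the toy scorer, viterbi_segment("hi there how are you".split(),
toy_logprob) places a single interior boundary before "how".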
Each filename argument can be an ASCII file, or a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
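
The filename convention can be mimicked in a script that prepares input
for the program. The helper below is a hypothetical sketch, not part of
the tool: open_text() is an invented name, ``-'' is assumed to mean
standard input or output, and .Z files are rejected because Python's
standard library has no decoder for compress(1) output.

```python
import gzip
import sys

def open_text(name, mode="r"):
    """Open a plain, gzip-compressed, or standard-stream text file.

    Hypothetical helper: mode is "r" or "w".
    """
    if mode not in ("r", "w"):
        raise ValueError("mode must be 'r' or 'w'")
    if name == "-":
        # Assumption: "-" selects stdin when reading and stdout when writing.
        return sys.stdin if mode == "r" else sys.stdout
    if name.endswith(".gz"):
        return gzip.open(name, mode + "t")
    if name.endswith(".Z"):
        raise NotImplementedError("compress(1) files are not handled here")
    return open(name, mode)
```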
- -help
Print option summary.
- -version
Print version information.
- -order n
Set the maximal N-gram order to be used, by default 3.
NOTE: The order of the model is not set automatically when a model
file is read, so the same file can be used at various orders.
- -debug level
Set the debugging output level (0 means no debugging output).
Debugging messages are sent to stderr.
- -lm file
Read the N-gram model from file.
- -text file
Find the text to be segmented in file.
Default input is stdin.
- -continuous
Process all words in the input as one sequence of words, irrespective of
line breaks.
Normally each line is processed separately as a word sequence.
- -posteriors
Use a forward-backward algorithm to compute the posterior probabilities
of a segment boundary at each word transition, and hypothesize a boundary
whenever the probability exceeds 0.5.
By default a Viterbi algorithm is used that computes
the globally most likely segmentation.
is specified as well,
then this option will produce one line of output per word, containing,
respectively, the <s> tag (if appropriate), the word itself, and the
posterior probability for a boundary preceding the word.
- -unk
Output the unknown word token <unk> for each input word not in the
language model vocabulary.
The default is to output the input word unchanged.
- -stag string
Use string to mark segment boundaries in the output.
Default is the start-of-sentence symbol defined in the language model (<s>).
- -bias b
Make a segment boundary a priori more likely by a factor of b.
This allows balancing of false detection/rejection errors.
The default is 1.
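
The interplay between the 0.5 posterior threshold and the bias factor can
be illustrated with a simplified calculation. This sketch is an
assumption, not the program's internal computation: it treats each word
transition in isolation and scales the odds of a boundary by b, and the
function names are invented for illustration.

```python
def biased_posterior(p, b=1.0):
    """Rescale a boundary probability p after multiplying its odds by b.

    Simplifying assumption: each transition is treated independently.
    """
    if b <= 0:
        raise ValueError("bias must be positive")
    return b * p / (b * p + (1.0 - p))

def mark_boundaries(words, posteriors, b=1.0, stag="<s>", threshold=0.5):
    """Insert stag before each word whose biased boundary posterior
    exceeds the threshold."""
    out = []
    for word, p in zip(words, posteriors):
        if biased_posterior(p, b) > threshold:
            out.append(stag)
        out.append(word)
    return out
```

Under this reading, a boundary with posterior 0.4 is rejected at b = 1 but
accepted at b = 2, since its odds rise from 2:3 to 4:3. Raising b trades
missed boundaries for false detections, and lowering it does the opposite.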
A. Stolcke and E. Shriberg, ``Automatic Linguistic Segmentation of
Spontaneous Speech,'' Proc. ICSLP, 1005-1008, 1996.
Only N-gram models up to trigram order are used accurately.
For higher-order models use the more general
Andreas Stolcke <email@example.com>.
Copyright 1997-2004 SRI International