segment - segment text using N-gram language model


segment [ -help ] option ...


segment infers a most likely segmentation (location of segment boundaries) from a text, based on a segment language model. The language model is a standard backoff N-gram model in ARPA ngram-format(5), modeling segmentation using the boundary tags <s> and </s>. The program reads in a word sequence, finds the most likely locations of segment boundaries according to the language model, and outputs the word sequence with segment boundaries marked by <s> tags.
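
The following Python sketch illustrates the idea for the simplest case, a bigram model. There the choice made at one word transition does not affect the scores of the others, so the globally most likely segmentation can be found by comparing two scores per transition. This is not the implementation used by segment: the probability table, the FLOOR constant, and the function names are hypothetical, backoff is replaced by a fixed floor, and only boundaries between words are marked.

```python
# Hypothetical toy bigram log10 probabilities, keyed by (history, word).
# The numbers are invented for illustration and do not come from a real model.
LOGPROB = {
    ("<s>", "hello"): -0.3,
    ("<s>", "how"): -0.5,
    ("hello", "there"): -0.4,
    ("hello", "</s>"): -1.5,
    ("there", "how"): -2.0,
    ("there", "</s>"): -0.2,
    ("how", "are"): -0.1,
    ("are", "you"): -0.1,
    ("you", "</s>"): -0.3,
}
FLOOR = -2.5  # assumed score for unseen bigrams (stands in for real backoff)


def logp(history, word):
    """Return the toy log10 probability of `word` following `history`."""
    return LOGPROB.get((history, word), FLOOR)


def segment_bigram(words, stag="<s>"):
    """Insert `stag` at every transition where a boundary scores higher.

    With a bigram model, the score of a whole segmentation is a sum of
    per-transition terms, so maximizing each term maximizes the total.
    """
    if not words:
        return []
    out = [words[0]]
    for prev, cur in zip(words, words[1:]):
        no_boundary = logp(prev, cur)
        # A boundary ends the segment after `prev` and starts a new one at `cur`.
        boundary = logp(prev, "</s>") + logp("<s>", cur)
        if boundary > no_boundary:
            out.append(stag)
        out.append(cur)
    return out


print(" ".join(segment_bigram("hello there how are you".split())))
```

With the toy table above, the only transition where the boundary hypothesis wins is between "there" and "how" (-0.7 against -2.0), so the tag is inserted there. For trigram models the decisions are no longer independent, which is why a dynamic programming search is needed in general.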


Each filename argument can be an ASCII file, or a compressed file (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

-help
Print option summary.
-version
Print version information.
-order n
Set the maximal N-gram order to be used, by default 3. NOTE: The order of the model is not set automatically when a model file is read, so the same file can be used at various orders.
-debug level
Set the debugging output level (0 means no debugging output). Debugging messages are sent to stderr.
-lm file
Read the N-gram model from file.
-text file
Find the text to be segmented in file. Default input is stdin.
-continuous
Process all words in the input as one sequence of words, irrespective of line breaks. Normally each line is processed separately as a word sequence.
-posteriors
Use a forward-backward algorithm to compute the posterior probabilities of a segment boundary at each word transition, and hypothesize a boundary whenever the probability exceeds 0.5. By default a Viterbi algorithm is used that computes the globally most likely segmentation.
If -continuous is specified as well, then this option will produce one line of output per word, containing, in order, the <s> tag (if appropriate), the word itself, and the posterior probability for a boundary preceding the word.
-unk
Output the unknown word token <unk> for each input word not in the language model vocabulary. The default is to output the input word unchanged.
-stag string
Use string to mark segment boundaries in the output. Default is the start-of-sentence symbol defined in the language model (<s>).
-bias b
Make a segment boundary a priori more likely by a factor of b. This allows balancing of false detection/rejection errors. The default is 1.
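
As a rough illustration of how a bias factor can shift decisions made against the 0.5 posterior threshold, the Python sketch below considers a single word transition with just two competing hypotheses. It assumes that the bias multiplies the likelihood of the boundary hypothesis before normalization. How segment applies b internally is not specified here, and the function name and the numbers are hypothetical.

```python
def boundary_posterior(p_boundary, p_no_boundary, bias=1.0):
    """Posterior probability of a boundary at one word transition.

    `p_boundary` and `p_no_boundary` are the likelihoods of the two
    hypotheses. The boundary likelihood is scaled by `bias` (an assumed
    reading of the -bias option) and the result is normalized.
    """
    weighted = bias * p_boundary
    return weighted / (weighted + p_no_boundary)


# Invented likelihoods for one transition.
unbiased = boundary_posterior(0.02, 0.05)           # 0.02 / 0.07
biased = boundary_posterior(0.02, 0.05, bias=4.0)   # 0.08 / 0.13

print(unbiased > 0.5)  # no boundary hypothesized without bias
print(biased > 0.5)    # boundary hypothesized with bias 4
```

A bias above 1 trades missed boundaries for more false detections, and a bias below 1 does the opposite.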


ngram-count(1), ngram-format(5).
A. Stolcke and E. Shriberg, ``Automatic Linguistic Segmentation of Spontaneous Speech,'' Proc. ICSLP, 1005-1008, 1996.


Only N-gram models up to trigram order are used accurately. For higher-order models use the more general hidden-ngram(1).


Andreas Stolcke.
Copyright 1997-2004 SRI International