segment

NAME

segment - segment text using N-gram language model

SYNOPSIS

segment [ -help ] option ...

DESCRIPTION

segment infers the most likely segmentation (the locations of segment boundaries) of a text, based on a segment language model. The language model is a standard backoff N-gram model in ARPA ngram-format(5), modeling segmentation using the boundary tags <s> and </s>. The program reads in a word sequence, finds the most likely locations of segment boundaries according to the language model, and outputs the word sequence with segment boundaries marked by <s> tags.
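As an illustration of the search described above, the following Python sketch segments a word sequence with a toy bigram model. It is not the implementation used by segment: the probability table, the UNSEEN floor (standing in for real backoff), and the function names are all hypothetical. With a bigram model the choice at each word transition is independent of the others, so the globally best segmentation reduces to a local comparison; higher orders need a genuine Viterbi search over boundary histories.

```python
# Hypothetical toy bigram log10 probabilities, keyed by (context, word).
# A real ARPA model also carries backoff weights; this sketch replaces
# backoff with a flat floor for unseen bigrams.
TOY_LOGPROBS = {
    ("<s>", "hello"): -0.3,
    ("hello", "world"): -0.5,
    ("world", "</s>"): -0.2,
    ("<s>", "good"): -0.4,
    ("good", "morning"): -0.3,
    ("morning", "</s>"): -0.2,
    ("world", "good"): -3.0,
}
UNSEEN = -5.0  # assumed floor, not part of any real model


def logprob(context, word):
    """Return the toy log10 probability of word given a one-word context."""
    return TOY_LOGPROBS.get((context, word), UNSEEN)


def find_boundaries(words):
    """Return indices i (i >= 1) where a boundary is placed before words[i]."""
    boundaries = []
    for i in range(1, len(words)):
        prev, word = words[i - 1], words[i]
        # Hypothesis 1: the segment continues across this transition.
        no_boundary = logprob(prev, word)
        # Hypothesis 2: the segment ends after prev and a new one starts.
        boundary = logprob(prev, "</s>") + logprob("<s>", word)
        if boundary > no_boundary:
            boundaries.append(i)
    return boundaries


def render(words, boundaries, stag="<s>"):
    """Join the words, inserting stag before each boundary position."""
    marked = set(boundaries)
    out = []
    for i, word in enumerate(words):
        if i in marked:
            out.append(stag)
        out.append(word)
    return " ".join(out)


words = "hello world good morning".split()
print(render(words, find_boundaries(words)))
```

The sketch marks only boundaries inside the sequence; whether the real program also emits a tag at the start of each line is not something this example tries to reproduce.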

OPTIONS

Each filename argument can be an ASCII file, or a compressed file (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

-help: Print option summary.
-version: Print version information.
-order n: Set the maximum N-gram order to be used; the default is 3. NOTE: The order of the model is not set automatically when a model file is read, so the same file can be used at various orders.
-debug level: Set the debugging output level (0 means no debugging output). Debugging messages are sent to stderr.
-lm file: Read the N-gram model from file.
-text file: Read the text to be segmented from file. The default input is stdin.
-continuous: Process all words in the input as one sequence of words, irrespective of line breaks. Normally each line is processed separately as a word sequence.
-posteriors: Use a forward-backward algorithm to compute the posterior probabilities of a segment boundary at each word transition, and hypothesize a boundary whenever the probability exceeds 0.5. By default a Viterbi algorithm is used that computes the globally most likely segmentation.
If -continuous is also specified, -posteriors produces one line of output per word, containing, in order, the <s> tag (if appropriate), the word itself, and the posterior probability of a boundary preceding the word.
-unk: Output the unknown word token <unk> for each input word not in the language model vocabulary. The default is to output the input word unchanged.
-stag string: Use string to mark segment boundaries in the output. Default is the start-of-sentence symbol defined in the language model (<s>).
-bias b: Make a segment boundary a priori more likely by a factor of b. This allows balancing of false detection/rejection errors. The default is 1.
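
The interaction of -posteriors and -bias can be sketched for a single word transition as follows. This is a simplified illustration, not the program's code: it assumes that the bias multiplies the score of the boundary hypothesis before normalization (one reading of "a priori more likely by a factor of b"), and that the transition can be scored in isolation, which holds only in simple cases such as a bigram model. The function names and input scores are hypothetical.

```python
def boundary_posterior(log10_boundary, log10_no_boundary, bias=1.0):
    """Posterior probability of a boundary at one word transition.

    The two arguments are log10 scores of the competing hypotheses.
    Assumption: bias scales the boundary hypothesis before normalizing.
    """
    boundary = bias * 10.0 ** log10_boundary
    no_boundary = 10.0 ** log10_no_boundary
    return boundary / (boundary + no_boundary)


def hypothesize(posteriors, threshold=0.5):
    """Mark a boundary wherever the posterior exceeds the threshold."""
    return [p > threshold for p in posteriors]


# Two equally likely hypotheses: the posterior sits at the threshold,
# so no boundary is hypothesized until the bias tips the balance.
neutral = boundary_posterior(-1.0, -1.0)
biased = boundary_posterior(-1.0, -1.0, bias=3.0)
print(neutral, biased, hypothesize([neutral, biased]))
```

Under these assumptions, a bias above 1 trades missed boundaries for false detections, and a bias below 1 does the opposite.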

BUGS

Only N-gram models up to trigram order are handled accurately. For higher-order models, use the more general hidden-ngram(1).