wlat-format

NAME

wlat-format - File format for SRILM word posterior lattices

SYNOPSIS

Word lattices:
version 2
name s
initial i
final f
node n w a p n1 p1 n2 p2 ...
...

Word meshes (confusion networks):

name s
numaligns N
posterior P
align a w1 p1 w2 p2 ...
reference a w
hyps a w h1 h2 ...
info a w start dur ascore gscore phones phonedurs
time a t
...

DESCRIPTION

Word posterior lattices and meshes are lattices generated by aligning N-best hypotheses with nbest-lattice(1), or by aligning PFSG or HTK lattices with lattice-tool(1). They compactly encode possible word hypothesis sequences and their posterior probabilities. (Word meshes have become generally known as ``confusion networks'' or ``sausages.'')

A word lattice is a partially ordered directed graph with nodes representing word hypotheses. Nodes are identified by non-negative integers. The file format specifies the initial node i, the final node f, and any number of additional nodes n. For each node n the following associated information is given on the same line: the word identity w (the string ``NULL'' is used with initial and final nodes), the alignment position a (identical values in this field identify hypotheses that occur at the same position), and the word posterior probability p. Following these values, zero or more transitions to successor nodes are specified, each given by a successor node index (n1, n2, ...) and the corresponding transition posterior probability (p1, p2, ...). In a properly normalized word lattice the transition posteriors sum to the node posterior p.
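To make the layout of a node line concrete, the following is a minimal Python sketch that parses one such line and checks the normalization property described above. It is not part of SRILM: the names LatticeNode, parse_node_line and is_normalized are hypothetical, the sample line is made up, and the sketch assumes that alignment positions are integers and that posteriors are written as plain floating-point probabilities.

```python
from typing import List, NamedTuple, Tuple


class LatticeNode(NamedTuple):
    """One 'node n w a p n1 p1 n2 p2 ...' line (hypothetical container)."""
    index: int
    word: str
    align: int
    posterior: float
    transitions: List[Tuple[int, float]]  # (successor index, transition posterior)


def parse_node_line(line: str) -> LatticeNode:
    fields = line.split()
    if len(fields) < 5 or fields[0] != "node":
        raise ValueError("not a node line: %r" % line)
    rest = fields[5:]
    if len(rest) % 2 != 0:
        raise ValueError("transitions must come in (node, posterior) pairs")
    transitions = [(int(rest[i]), float(rest[i + 1]))
                   for i in range(0, len(rest), 2)]
    return LatticeNode(int(fields[1]), fields[2], int(fields[3]),
                       float(fields[4]), transitions)


def is_normalized(node: LatticeNode, tol: float = 1e-6) -> bool:
    """True if the outgoing transition posteriors sum to the node posterior."""
    total = sum(p for _, p in node.transitions)
    return abs(total - node.posterior) <= tol


# Made-up example line: node 3, word "hello", alignment position 1.
node = parse_node_line("node 3 hello 1 0.75 4 0.5 5 0.25")
print(node.word, node.transitions, is_normalized(node))
```

The check is only meaningful for nodes that have successors; a node with no outgoing transitions, such as the final node, would not satisfy it.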

Word meshes represent a more constrained lattice format in which word hypotheses are in a total order. A mesh contains a number of alignment positions, and a set of mutually exclusive word hypotheses in each position (the ``confusion sets''). The word mesh represents all sentence hypotheses that can be generated by freely combining word hypotheses at each position. The file format specifies the number of alignment positions N and the total posterior probability mass P contained in the lattice, followed by one or more confusion set specifications. For each alignment position a, the hypothesized words (w1, w2, ...) and their posterior probabilities (p1, p2, ...) are listed in alternation. The pseudo-word string *DELETE* represents an empty hypothesis.
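As an illustration of how confusion sets combine into sentence hypotheses, here is a small Python sketch that parses align lines and enumerates every word sequence the mesh represents, dropping *DELETE* entries. The function names and the three sample lines are invented for this example, and the sketch assumes integer alignment positions and plain floating-point posteriors.

```python
from itertools import product
from typing import Dict, Iterable, Iterator, List, Tuple

DELETE = "*DELETE*"  # pseudo-word for the empty hypothesis


def parse_align_line(line: str) -> Tuple[int, Dict[str, float]]:
    """Parse 'align a w1 p1 w2 p2 ...' into (position, {word: posterior})."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "align":
        raise ValueError("not an align line: %r" % line)
    rest = fields[2:]
    if len(rest) % 2 != 0:
        raise ValueError("words and posteriors must alternate")
    words = {rest[i]: float(rest[i + 1]) for i in range(0, len(rest), 2)}
    return int(fields[1]), words


def sentence_hypotheses(align_lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield every word sequence obtained by picking one word per position."""
    sets = dict(parse_align_line(line) for line in align_lines)
    ordered = [list(sets[pos]) for pos in sorted(sets)]
    for combo in product(*ordered):
        yield [w for w in combo if w != DELETE]


# Made-up mesh with three alignment positions.
lines = [
    "align 0 the 0.9 a 0.1",
    "align 1 *DELETE* 0.6 big 0.4",
    "align 2 cat 1.0",
]
hyps = list(sentence_hypotheses(lines))
for h in hyps:
    print(" ".join(h))
```

The number of hypotheses is the product of the confusion set sizes, so full enumeration is only practical for small meshes.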

Optionally, the word mesh format encodes additional information about the hypothesis alignment from which it resulted. The keyword reference specifies the correct word w that was aligned at position a. The keyword hyps is used to list the sentence hypotheses of which a certain word hypothesis was a part. The word hypothesis is identified by an alignment position a and the word string w, and is followed by the integer IDs (h1, h2, ...; typically the N-best ranks) of the associated sentence hypotheses.
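The reference and hyps lines can be read with the same whitespace-splitting approach. The Python sketch below is illustrative only: the function names and sample lines are hypothetical, and it assumes integer alignment positions. It builds a lookup from (position, word) to the IDs of the sentence hypotheses containing that word.

```python
from typing import Dict, List, Tuple


def parse_reference_line(line: str) -> Tuple[int, str]:
    """Parse 'reference a w' into (position, reference word)."""
    fields = line.split()
    if len(fields) != 3 or fields[0] != "reference":
        raise ValueError("not a reference line: %r" % line)
    return int(fields[1]), fields[2]


def parse_hyps_line(line: str) -> Tuple[int, str, List[int]]:
    """Parse 'hyps a w h1 h2 ...' into (position, word, hypothesis IDs)."""
    fields = line.split()
    if len(fields) < 3 or fields[0] != "hyps":
        raise ValueError("not a hyps line: %r" % line)
    return int(fields[1]), fields[2], [int(h) for h in fields[3:]]


# Made-up lines for alignment position 1.
ref_pos, ref_word = parse_reference_line("reference 1 big")

members: Dict[Tuple[int, str], List[int]] = {}
for line in ["hyps 1 big 2 5", "hyps 1 *DELETE* 1 3 4"]:
    pos, word, ids = parse_hyps_line(line)
    members[(pos, word)] = ids

# Sentence hypotheses that contain the reference word at its position.
print(members.get((ref_pos, ref_word), []))
```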

As another optional element, the word mesh can contain word-level acoustic and temporal information, following the keyword info, the alignment position a, and the word identity w. This information is derived by nbest-lattice(1) from word- and phone-level backtraces of N-best hypotheses (as represented in Decipher NBestList2.0 format). The details of this information are defined in the SRILM class NBestWordInfo and are subject to change, but currently include the following:

start: word start time (in seconds from the beginning of the waveform)
dur: word duration (in seconds)
ascore: acoustic model likelihood (log base 10)
gscore: grammar (LM and pronunciation) score (log base 10)
phones: sequence of phones in the word (separated by colons)
phonedurs: sequence of phone durations (in numbers of frames, separated by colons)

When word meshes are derived from HTK format lattices, the pronunciation (phones) field consists of the HTK phone alignment information, which encodes both phone sequence and durations; the phone duration (phonedurs) field is in turn used to encode the duration model scores, if present.

Note: The encoded information pertains to the word hypothesis with the highest posterior probability among all hypotheses of the same word aligned to a given word mesh position.
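The following Python sketch shows one way an info line might be split into its fields. It is only an illustration under several assumptions: the WordInfo container and parse_info_line are hypothetical names, the sample line is made up, all eight values are assumed to be present, and phone durations are assumed to be integer frame counts. Meshes derived from HTK lattices use the last two fields differently and would need other handling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordInfo:
    """Fields of one 'info' line (hypothetical container, not an SRILM class)."""
    position: int
    word: str
    start: float          # seconds from the beginning of the waveform
    dur: float            # seconds
    ascore: float         # acoustic score, log base 10
    gscore: float         # grammar score, log base 10
    phones: List[str]
    phonedurs: List[int]  # assumed to be integer frame counts

    @property
    def end(self) -> float:
        """Word end time in seconds."""
        return self.start + self.dur


def parse_info_line(line: str) -> WordInfo:
    fields = line.split()
    if len(fields) != 9 or fields[0] != "info":
        raise ValueError("unexpected info line: %r" % line)
    return WordInfo(
        position=int(fields[1]),
        word=fields[2],
        start=float(fields[3]),
        dur=float(fields[4]),
        ascore=float(fields[5]),
        gscore=float(fields[6]),
        phones=fields[7].split(":"),
        phonedurs=[int(d) for d in fields[8].split(":")],
    )


# Made-up example line.
info = parse_info_line("info 2 cat 0.5 0.25 -120.5 -3.2 k:ae:t 8:10:7")
print(info.word, info.end, list(zip(info.phones, info.phonedurs)))
```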

The time keyword is used for debugging purposes and encodes the estimated timestamp t of an alignment position a when the input contains backtrace information. It is ignored when reading in word meshes.

Both formats optionally encode the associated utterance IDs in the name field. Word lattices and meshes can be converted to PFSG format using the script wlat-to-pfsg.

SEE ALSO

nbest-lattice(1), lattice-tool(1), pfsg-scripts(1), pfsg-format(5), nbest-format(5).
L. Mangu, E. Brill, & A. Stolcke, ``Finding consensus in speech recognition: word error minimization and other applications of confusion networks,'' Computer Speech and Language 14(4), 373-400, 2000.

BUGS

Detailed alignment and acoustic information is so far only implemented for word meshes, although conceptually it would apply equally to word lattices.

AUTHOR

Andreas Stolcke <andreas.stolcke@microsoft.com>
Copyright 2001-2011 SRI International
Copyright 2011-2019 Microsoft Corp.