Greetings!!!
Thanks for the prompt reply. But the ideas you mentioned seem to be for
boundary marking when the whole sequence is correct. Our recognition
output is only 50% correct. That is, we have a sequence of syllables that
is only 50% correct, from which we need to extract the words. The n-best
results of the recognizer could be used to improve the performance. We can
build a lattice of syllable sequences in which each syllable has an n-best list.
Now, the task is to find the best word sequence from this n-best lattice.
Do you have any similar programs? Please do reply.
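
To make the task concrete, here is a minimal sketch of the kind of search
we have in mind. It is not an SRILM program: the slot layout, the decode()
function, the toy syllables, and the unigram word scores are all assumptions
made up for illustration. Each lattice slot holds the n-best syllables with
log scores, and a dynamic program picks the lexicon words that best cover
the slots.

```python
import math


def decode(slots, lexicon, word_logprob):
    """Find the best word sequence covering an n-best syllable lattice.

    slots        : list of dicts {syllable: log score}, one dict per position
    lexicon      : dict {word: tuple of syllables}
    word_logprob : dict {word: unigram log probability} (a stand-in for a real LM)
    Returns (score, words), or None if no sequence of lexicon words fits.
    """
    n = len(slots)
    # best[i] holds (score, words) for the best cover of slots[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for word, sylls in lexicon.items():
            start = end - len(sylls)
            if start < 0 or best[start] is None:
                continue
            # Every syllable of the word must be in the matching slot's n-best list.
            if any(s not in slots[start + k] for k, s in enumerate(sylls)):
                continue
            recognizer = sum(slots[start + k][s] for k, s in enumerate(sylls))
            score = best[start][0] + recognizer + word_logprob[word]
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[n]


# Toy data (hypothetical): three slots, each with a 2-best syllable list.
slots = [
    {"ga": math.log(0.6), "ka": math.log(0.4)},
    {"ma": math.log(0.9), "na": math.log(0.1)},
    {"la": math.log(0.7), "ra": math.log(0.3)},
]
lexicon = {
    "kamala": ("ka", "ma", "la"),
    "kama": ("ka", "ma"),
    "gama": ("ga", "ma"),
    "la": ("la",),
}
word_logprob = {
    "kamala": math.log(0.5),
    "kama": math.log(0.2),
    "gama": math.log(0.2),
    "la": math.log(0.1),
}

result = decode(slots, lexicon, word_logprob)
print(result[1] if result else "no word sequence found")
```

In this toy case the first-best syllable in slot one is "ga", yet the search
picks "kamala" through the second-best "ka", because the word scores
outweigh the recognizer's preference. A real system would need an n-gram LM
over words and some way to handle slots where no listed syllable is correct.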
Thanks in Advance.
Regards,
Lakshmi
On Fri, 29 Sep 2006, Andreas Stolcke wrote:
>
> In message <Pine.LNX.4.60.0609291425390.5866 at ADDRESS HIDDEN> you wrote:
>>
>> Greetings!!!
>>
>> We are developing a syllable-based, isolated-style continuous speech
>> recognizer for Indian languages. Currently, our recognizer output is just a
>> sequence of syllables. We want to extract the sequence of words from this
>> syllable sequence using statistical language models and a lexicon. I thought
>> one of the programs in this toolkit might be doing something similar
>> (sub-word sequence to word sequence conversion). But all the programs seem
>> to use word lattices.
>>
>> Is there any program in this toolkit that extracts the word sequence from
>> the sub-word sequence using an LM and a lexicon?
>
> Lakshmi,
>
> First, remember that when the documentation of a program says
> 'words', it doesn't mean you have to use words in the conventional sense.
> You can use any kind of token (phones, syllables, etc.) in your lattices,
> etc.
>
> The task you describe sounds like a boundary tagging problem, i.e., given
> a sequence of tokens, you want to label each transition between tokens as
> either a "boundary" or a "non-boundary". There are two tools in SRILM
> that can do this, using different kinds of models. One is
> "hidden-ngram", which performs boundary tagging explicitly.
> The other is "disambig", which tags the tokens themselves, not the boundaries
> between them. But by assigning tags that denote "first token in a unit",
> "token inside a unit", etc., you can perform boundary tagging implicitly.
> (The tokens in your case are the syllables, the units would be the words.)
> Both tools use ngram language models to disambiguate the input.
> The model can be trained from syllabified training data, in your case.
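
[Inline note] A minimal sketch of the implicit tagging idea described above,
assuming the training data is available as words already split into
syllables. The tag names "B" (first token in a unit) and "I" (token inside a
unit), the function names, and the toy syllables are illustrative choices,
not anything SRILM prescribes; the sketch only shows the data transformation,
not how either tool is invoked.

```python
def tag_syllables(words):
    """Turn syllabified words into (syllable, tag) pairs.

    words: list of words, each given as a list of syllables.
    "B" marks the first syllable of a word, "I" any later syllable.
    """
    tagged = []
    for sylls in words:
        for k, syll in enumerate(sylls):
            tagged.append((syll, "B" if k == 0 else "I"))
    return tagged


def group_by_tags(tagged):
    """Recover the units from a tagged syllable sequence."""
    units = []
    for syll, tag in tagged:
        # A "B" tag (or the very first token) opens a new unit.
        if tag == "B" or not units:
            units.append([syll])
        else:
            units[-1].append(syll)
    return units


# Toy training sentence (hypothetical syllables).
sentence = [["ka", "ma", "la"], ["va", "nam"], ["po"]]
tagged = tag_syllables(sentence)
print(tagged)
print(group_by_tags(tagged))
```

Training on sequences tagged this way lets a tagger predict "B"/"I" for new
syllable strings, and grouping by the predicted tags yields the word
boundaries.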
>
> I suggest you look up papers on "word segmentation", "sentence segmentation",
> "Mandarin tokenization", "chunk parsing" and "shallow parsing" to
> get a good idea of the existing models for this type of task,
> then study the manual pages for the programs.
>
> --Andreas
>
>
>>
>> Thanks in Advance.
>> Regards,
>> Lakshmi
>