<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    On 3/4/2012 4:56 PM, Meng Chen wrote:

    <blockquote

cite="mid:CA+bc0mqyDCf-__ALg0GR2t5Em+cvHgeu84yGiqO9RrDGg9xj=w@mail.gmail.com"

      type="cite">

      <div style="">Hello, I tried to make the language model from some

        non-native spontaneous speech transcription. However, there are

        lots of "strange words" in the corpus because the transcriber

        tried to transcribe as close as the real pronunciation.</div>

      <div style=""><br>

      </div>

      <div style="">For example, some transcriptions are as follows:</div>

      <div style=""><br>

      </div>

      <span style="">&lt;s&gt; she taught english there and she gave
        english lesson to a secondary school students in </span><b
        style="">boli bolivi  bolivia</b><span style=""> &lt;/s&gt;</span>
      <div style="">
        &lt;s&gt; <b>er</b> what's wrong <b>er </b>he asked she asked
        &lt;/s&gt;</div>
      <div style="">&lt;s&gt; her her mother would <b>em er</b> her she
        took her mother in her own house and the baby <b>em</b> <b>moven
          bester</b> &lt;/s&gt;</div>

    </blockquote>

    <br>

    First, such words are not strange at all; they occur even for native
    speakers when speaking spontaneously.<br>
    "er" and "em" are called "filled pauses", and "boli" etc. are called
    "word fragments".   Both belong to a more general class of
    spontaneous speech phenomena called "disfluencies".   For an
    overview see
    <a class="moz-txt-link-freetext" href="http://www.speech.sri.com/cgi-bin/run-distill?papers/icslp96-dfs-swb.ps.gz">http://www.speech.sri.com/cgi-bin/run-distill?papers/icslp96-dfs-swb.ps.gz</a>.<br>

    <blockquote

cite="mid:CA+bc0mqyDCf-__ALg0GR2t5Em+cvHgeu84yGiqO9RrDGg9xj=w@mail.gmail.com"

      type="cite">

      <div style=""><br>

      </div>

      <div style="">So I want to ask how should I process these "strange

        words" that don't exist such as boli, bolivi, er, em, moven,

        bester etc.</div>

      <div style="">If I replace them with the correct words, the

        language model will be unsuitable for the non-native spontaneous

        speech task.  </div>

      <div style="">If I keep them, their counts and probability are too

        small. And the dictionary is also hard to generate.</div>

      <div style=""><br>

      </div>

      <div style="">Are there any suggestions?</div>

    </blockquote>

    Filled pauses are usually modeled like any other words, though you
    might normalize their spellings.  There are usually just two forms,
    with and without a nasal (usually spelled "um" and "uh" respectively).
    You should map alternative spellings like "ah", "eh",  "er",
    etc. to the standard forms to avoid fragmenting your
    data.   Often people use a dedicated vowel phone for the pronunciations
    of these words because they are more variable in quality and
    duration than the standard schwa phone.<br>
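    <br>
    As a rough illustration of the spelling normalization, here is a
    minimal Python sketch.  The function name and the two spelling lists
    are assumptions made for this example, not something taken from your
    corpus; you would need to check them against your own transcripts,
    since spellings like "ah" or "eh" may also be used as genuine
    interjections.<br>
    <pre>
```python
# Hypothetical spelling lists: adjust them after inspecting your own corpus.
NASAL_SPELLINGS = {"um", "em", "erm", "uhm"}      # mapped to "um"
NON_NASAL_SPELLINGS = {"uh", "er", "ah", "eh"}    # mapped to "uh"


def normalize_filled_pauses(line):
    """Map variant filled-pause spellings to the two standard forms."""
    normalized = []
    for token in line.split():
        lowered = token.lower()
        if lowered in NASAL_SPELLINGS:
            normalized.append("um")
        elif lowered in NON_NASAL_SPELLINGS:
            normalized.append("uh")
        else:
            # Leave every other token untouched.
            normalized.append(token)
    return " ".join(normalized)


if __name__ == "__main__":
    print(normalize_filled_pauses("her mother would em er her she took"))
    # prints: her mother would um uh her she took
```
    </pre>
    The sketch splits on whitespace only, so sentence markers and
    punctuation attached to a token pass through unchanged.<br>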

    <br>

    Fragments, especially short ones, are hard to recognize because they
    are very confusable.   First, you should use a spelling convention
    that distinguishes them from full words, usually a final
    hyphen, e.g., "boli-". <br>
    For LM training purposes you might want to delete them entirely, and
    represent them with a garbage model in acoustic training to avoid
    contaminating the models for regular words.<br>
    At SRI we tried modeling the most frequent word fragments in the AM and
    LM, but even those are not recognized well (especially because they
    tend to have just one or two phones), and removing them from the LM
    was best for overall word recognition accuracy.<br>
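    <br>
    The fragment handling could be sketched as follows, assuming the
    transcripts have already been re-spelled so that every fragment ends
    in a hyphen.  The function names and the "[garbage]" label are
    hypothetical placeholders; use whatever token your acoustic training
    setup actually maps to its garbage model.<br>
    <pre>
```python
def is_fragment(token):
    """True for tokens marked as word fragments by a final hyphen, e.g. "boli-"."""
    return len(token) > 1 and token.endswith("-")


def strip_fragments(line):
    """LM training text: delete fragments entirely."""
    return " ".join(t for t in line.split() if not is_fragment(t))


def replace_fragments(line, garbage_label="[garbage]"):
    """Acoustic training text: replace each fragment with a garbage label.

    The default label is a made-up placeholder, not a standard token.
    """
    return " ".join(
        garbage_label if is_fragment(t) else t for t in line.split()
    )


if __name__ == "__main__":
    text = "school students in boli- bolivi- bolivia"
    print(strip_fragments(text))
    # prints: school students in bolivia
    print(replace_fragments(text))
    # prints: school students in [garbage] [garbage] bolivia
```
    </pre>
    Keeping the two outputs separate lets the LM see only regular words
    while the acoustic transcripts still account for the fragment audio.<br>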

    <br>

    Andreas<br>

    <br>

  </body>

</html>