<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">Sander,<br>

      <br>

      Thank you for your elaborate reply, but it doesn't really answer

      my question. I am not confused about the different sets of words.

      I know why they are there and what they are used for, but I'm

      wondering whether there is a standard term to denote each set

      individually. Let me rephrase my question with a very simple

      example:<br>

      <br>

      Given a single training sentence, "wrong is wrong" and a language

      model with cut-off 1, what are the terms to denote the following

      sets:<br>

      <ol>

        <li>{wrong, is}?<br>

        </li>

        <li>{wrong}?</li>

        <li>{is}?</li>

        <li>all other English words?<br>

        </li>

      </ol>

      I am especially interested in terms that differentiate between

      sets 3 and 4, if such terms exist.<br>

      <br>
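      To make the four sets concrete, here is a minimal Python sketch. It
      assumes that a cut-off of 1 drops every word seen at most once, and
      the names v_train, v_final, cut_off_words and oov_type are only
      illustrative, not standard terms:<br>
      <pre>
```python
from collections import Counter

# Assumption: "cut-off 1" means words seen at most once are dropped.
CUTOFF = 1

training_sentence = "wrong is wrong"
counts = Counter(training_sentence.split())

# Set 1: every word type observed in the training data (V_train).
v_train = set(counts)

# Set 2: word types that survive the cut-off (V_final).
v_final = {word for word, count in counts.items() if count > CUTOFF}

# Set 3: observed in training, but removed by the cut-off.
cut_off_words = v_train - v_final


def oov_type(word):
    """Say which of the sets above a test word falls into."""
    if word in v_final:
        return "in V_final (set 2)"
    if word in v_train:
        return "seen in training but cut off (set 3)"
    return "never seen in training (set 4)"


for word in ["wrong", "is", "right"]:
    print(word, "->", oov_type(word))
```
      </pre>
      Set 4 cannot be enumerated, so the sketch only recognizes it as
      "everything that is not in V_train".<br>
      <br>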

      Regards,<br>

      <br>

      Joris<br>

      <br>

      <br>

      On 07/03/13 22:05, Sander Maijers wrote:<br>

    </div>

    <blockquote cite="mid:51D48418.8090100@student.ru.nl" type="cite">On

      03-07-13 20:22, Joris Pelemans wrote:

      <br>

      <blockquote type="cite">Hello all,

        <br>

        <br>

        My question is perhaps a little bit off topic, but I'm hoping for

        your

        <br>

        cooperation, since it's LM related.

        <br>

        <br>

        Say we have a training corpus with lexicon V_train. Since some

        of the

        <br>

        words have near-zero counts, we choose to exclude them from our

        LM. This

        <br>

        gives us a new lexicon, let's call it V_final. However this also

        gives

        <br>

        us two types of OOV words: those not in V_train and those not in

        <br>

        V_final. I was wondering whether there are standard terms in the

        <br>

        literature for these two types of OOVs. I have read my share of

        papers,

        <br>

        but none of them seem to make this distinction.

        <br>

        <br>

        Kind regards,

        <br>

        <br>

        Joris

        <br>

        _______________________________________________

        <br>

        SRILM-User site list

        <br>

        <a class="moz-txt-link-abbreviated" href="mailto:SRILM-User@speech.sri.com">SRILM-User@speech.sri.com</a>

        <br>

        <a class="moz-txt-link-freetext" href="http://www.speech.sri.com/mailman/listinfo/srilm-user">http://www.speech.sri.com/mailman/listinfo/srilm-user</a>

        <br>

      </blockquote>

      <br>

      Hi Joris,

      <br>

      <br>

      In my view the vocabulary is a superset of the actual set of the

      wordforms for which all wordform sequences (the N-permutations of

      vocabulary words, with repetition) are modeled in the N-gram LM.

      <br>

      <br>

      What limits the hypothesized transcript produced by an ASR system

      is the intersection of the following two sets:

      <br>

      a. the wordforms in the pronunciation lexicon (the mapping between

      acoustic feature sequences and orthographic representations)

      <br>

      b. the target words of the wordform sequences in the LM (as

      opposed to history words)

      <br>
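      <br>
      A minimal Python sketch of that intersection, using made-up word
      sets (pronunciation_lexicon and lm_target_words are hypothetical
      names and contents, chosen only for illustration):<br>
      <pre>
```python
# Hypothetical word sets, chosen only for illustration.
pronunciation_lexicon = {"wrong", "is", "right"}
lm_target_words = {"wrong", "is", "maybe"}

# Only words present in both sets can appear in a hypothesized transcript.
recognizable = pronunciation_lexicon.intersection(lm_target_words)

# Words that have a pronunciation but are never predicted by the LM.
no_lm_target = pronunciation_lexicon.difference(lm_target_words)

# Words the LM can predict but that lack a pronunciation entry.
no_pronunciation = lm_target_words.difference(pronunciation_lexicon)

print("can be hypothesized:", sorted(recognizable))
print("pronounceable, not an LM target:", sorted(no_lm_target))
print("LM target, not pronounceable:", sorted(no_pronunciation))
```
      </pre>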

      <br>

      The vocabulary does not matter then: it is just an optional means to

      constrain the potential richness (given the written training data)

      of an N-gram LM that you are creating. You can use a vocabulary as

      a constraint ('-limit-vocab' in 'ngram-count'), and/or use it to

      facilitate a preprocessed form of training data by means of

      special tokens that aren't really words (such as "&lt;unk&gt;" or

      a 'proper name class' token).

      <br>

      <br>

      So, the vocabulary may contain superfluous words. Only after you

      realize that this is not an issue, you could think about it

      further and say that after you have created and pruned an LM, you

      can find out which words were actually redundant in your

      vocabulary given the same written training data you used to create

      that LM, and you could just as well drop those words

      from the vocabulary you already had before creating your LM. Maybe

      that reduces the size of your vocabulary as much as you hope. Will

      this be worthwhile? Not for the ASR task, you see.

      <br>

      <br>

      The term OOV comes in handy as shorthand to denote words that are

      in the written training data but not in the vocabulary. It is not

      precise; you could just as well use an element-out-of-set notation

      (short and clear) in reports. Maybe you have read the article:

      "Detection of OOV Words Using Generalized Word Models and a

      Semantic Class Language Model" by Schaaf, which was a top Google

      result for me. This author confuses the pronunciation lexicon with

      the vocabulary. While you can, confusingly, call a word that was

      not transcribed correctly because, for one, it was not modeled by

      the pronunciation lexicon 'OOV', I think it is not okay to confuse

      the concepts vocabulary and pronunciation lexicon as he does.

      <br>

      <br>

      I hope this clears up any confusion?

      <br>

      <br>

      <br>

    </blockquote>

    <br>

  </body>

</html>