<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">Sander,<br>
<br>
Thank you for your elaborate reply, but it doesn't really answer
my question. I am not confused about the different sets of words.
I know why they are there and what they are used for, but I'm
wondering whether there is a standard term to denote each set
individually. Let me rephrase my question with a very simple
example:<br>
<br>
Given a single training sentence, "wrong is wrong", and a language
model with cut-off 1, what are the terms to denote the following
sets:<br>
<ol>
<li>{wrong, is}?<br>
</li>
<li>{wrong}?</li>
<li>{is}?</li>
<li>all other English words?<br>
</li>
</ol>
I am especially interested in terms that differentiate between
sets 3 and 4, if such terms exist.<br>
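To make the example concrete, here is a small sketch (in Python, assuming
the convention that words occurring at most "cut-off" times are excluded)
of how the four sets fall out:<br>
<br>
```python
from collections import Counter

# Training data: the single sentence "wrong is wrong".
train_tokens = "wrong is wrong".split()
counts = Counter(train_tokens)          # wrong -> 2, is -> 1

cutoff = 1                              # assumed: drop words seen <= cutoff times
v_train = set(counts)                   # set 1: words in the training data
v_final = {w for w, c in counts.items() if c > cutoff}  # set 2: words kept in the LM
in_train_oov = v_train - v_final        # set 3: words dropped by the cutoff
# Set 4 (all other English words) is the complement of v_train in the
# full language, so it cannot be enumerated here.

print(v_train, v_final, in_train_oov)
```
<br>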
<br>
Regards,<br>
<br>
Joris<br>
<br>
<br>
On 07/03/13 22:05, Sander Maijers wrote:<br>
</div>
<blockquote cite="mid:51D48418.8090100@student.ru.nl" type="cite">On
03-07-13 20:22, Joris Pelemans wrote:
<br>
<blockquote type="cite">Hello all,
<br>
<br>
My question is perhaps a little off topic, but I'm hoping for
your
<br>
cooperation, since it's LM related.
<br>
<br>
Say we have a training corpus with lexicon V_train. Since some
of the
<br>
words have near-zero counts, we choose to exclude them from our
LM. This
<br>
gives us a new lexicon, let's call it V_final. However this also
gives
<br>
us two types of OOV words: those not in V_train and those not in
<br>
V_final. I was wondering whether there are standard terms in the
<br>
literature for these two types of OOVs. I have read my share of
papers,
<br>
but none of them seem to make this distinction.
<br>
<br>
Kind regards,
<br>
<br>
Joris
<br>
_______________________________________________
<br>
SRILM-User site list
<br>
<a class="moz-txt-link-abbreviated" href="mailto:SRILM-User@speech.sri.com">SRILM-User@speech.sri.com</a>
<br>
<a class="moz-txt-link-freetext" href="http://www.speech.sri.com/mailman/listinfo/srilm-user">http://www.speech.sri.com/mailman/listinfo/srilm-user</a>
<br>
</blockquote>
<br>
Hi Joris,
<br>
<br>
In my view, the vocabulary is a superset of the actual set of
wordforms for which all wordform sequences (the N-permutations of
vocabulary words, with repetition) are modeled in the N-gram LM.
<br>
<br>
What limits the hypothesized transcript produced by an ASR system
is the intersection of two sets:
<br>
a. the wordforms in the pronunciation lexicon (the mapping between
acoustic feature sequences and orthographic representations)
<br>
b. the target words of the wordform sequences in the LM (as
opposed to history words)
<br>
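As a toy illustration of that intersection (the word lists below are
invented for the example, not taken from any real system):<br>
<br>
```python
# Hypothetical example sets; neither list comes from a real recognizer.
pron_lexicon = {"wrong", "is", "right", "yes"}  # wordforms with pronunciations
lm_targets = {"wrong", "is", "no"}              # words the LM can predict

# Only words in both sets can appear in the hypothesized transcript.
recognizable = pron_lexicon & lm_targets
print(sorted(recognizable))  # ['is', 'wrong']
```
<br>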
<br>
The vocabulary does not matter then: it is just an optional means to
constrain the potential richness (given the written training data)
of an N-gram LM that you are creating. You can use a vocabulary as
a constraint ('-limit-vocab' in 'ngram-count'), and/or use it to
facilitate a preprocessed form of training data by means of
special tokens that aren't really words (such as "&lt;unk&gt;" or
a 'proper name class' token).
<br>
<br>
So the vocabulary may contain superfluous words, but that is not a
problem in itself. Once you have created and pruned an LM, you can
determine which vocabulary words turned out to be redundant given
the written training data you used to create that LM, and you could
just as well drop those words from the vocabulary you had before
creating the LM. Maybe that reduces the size of your vocabulary as
much as you hope. Will this be worthwhile? Not for the ASR task.
<br>
<br>
The term OOV comes in handy as shorthand for words that are in the
written training data but not in the vocabulary. It is not precise;
you could just as well use element-out-of-set notation (short and
clear) in reports. Maybe you have read the article "Detection of
OOV Words Using Generalized Word Models and a Semantic Class
Language Model" by Schaaf, which was a top Google result for me.
That author conflates the pronunciation lexicon with the
vocabulary. While you can, confusingly, call a word 'OOV' when it
was not transcribed correctly because, for one, it was not modeled
by the pronunciation lexicon, I think it is not okay to conflate
the concepts of vocabulary and pronunciation lexicon as he does.
<br>
<br>
I hope this clears up any confusion?
<br>
<br>
<br>
</blockquote>
<br>
</body>
</html>