
Re: srilm toolkit

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Fri, 21 Sep 2007 12:53:39 PDT

>
> On 20/09/2007, at 20:16, Andreas Stolcke wrote:
>
> > Raquel Justo wrote:
> >>
> >> I am working with class-based LMs and I propose using class n-gram
> >> LMs (where classes are made up of "multiword" strings or
> >> "subsequences of words") in two different ways:
> >> - In the first approach, a multiword string is treated as a new
> >> lexical unit, formed by joining its words into a single token
> >> (e.g. "san_francisco", P(C_CITY_NAME)*P("san_francisco"|C_CITY_NAME)).
> >> - In the second approach, the words that make up the multiword
> >> string are modeled separately and their conditional probabilities
> >> are estimated. Thus a class n-gram LM is built on the one hand,
> >> and a word n-gram LM within each class on the other
> >> (e.g. "san francisco", P(C_CITY_NAME)*P(san|C_CITY_NAME)
> >> *P(francisco|san, C_CITY_NAME)).
> > It looks to me like your second approach is equivalent to the
> > first, modulo smoothing effects achieved by the different backing
> > off distributions you might use in estimating the component
> > probabilities.
>
> I am not sure I have fully understood your point, but I think that
> with back-off smoothing the first approach differs from the second:
> in the first, different combinations of all the words belonging to a
> class are allowed, whereas in the second only the considered
> subsequences of words are allowed, because they are treated as
> unigrams inside each class. I think that even without smoothing the
> first approach can generalize better, because n-gram models
> themselves generalize over the training data.

You are right.  That's actually what I meant by "different backing off".
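
To make the difference concrete, here is a minimal sketch (not SRILM code)
comparing the two class-internal distributions under plain relative-frequency
estimates, with no smoothing. The class members, the function names and the
use of <s> / </s> boundary markers inside the class are all assumptions made
for illustration. The common factor P(C_CITY_NAME) is left out because it is
the same in both cases.

```python
from collections import Counter
from fractions import Fraction

def token_model(members):
    """Multiword-as-single-token: P(string | class) by relative frequency."""
    counts = Counter(members)
    total = sum(counts.values())
    return {m: Fraction(c, total) for m, c in counts.items()}

def bigram_model(members):
    """Word bigram model inside the class.

    The <s> and </s> boundary markers are an assumption of this sketch.
    """
    bigrams, contexts = Counter(), Counter()
    for m in members:
        words = ["<s>"] + m.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return {bg: Fraction(c, contexts[bg[0]]) for bg, c in bigrams.items()}

def bigram_prob(model, string):
    """Probability of a whole word string under the class-internal bigram model."""
    words = ["<s>"] + string.split() + ["</s>"]
    p = Fraction(1)
    for prev, cur in zip(words, words[1:]):
        p *= model.get((prev, cur), Fraction(0))
    return p

# Hypothetical training members of the class C_CITY_NAME.
members = ["west palm beach", "palm springs"]
tok = token_model(members)
big = bigram_model(members)

for s in ["west palm beach", "palm springs", "west palm springs", "palm beach"]:
    print(s, tok.get(s, Fraction(0)), bigram_prob(big, s))
```

In this toy setup the single-token model only ever produces the two strings
it has seen, while the word bigram model also gives probability mass to
recombinations such as "west palm springs". That is the sense in which the
two approaches back off and generalize differently.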

> >>
> >> I am attaching a paper published in the "IEEE workshop
> >> on machine learning and signal processing" that explains the
> >> two approaches in more detail.
> >>
> >> Does the -expand-classes or the -expand-exact option do something
> >> similar to the aforementioned approaches? Or does it convert the
> >> class n-gram LM into a word n-gram LM in which the words
> >> carry the information about their classes (e.g.
> >> P(san#C_CITY_NAME)*P(francisco#C_CITY_NAME|san#C_CITY_NAME))?
> > Here is a high-level description of what -expand-classes does:
> >
> > 1) generate a list of all word ngrams obtained by replacing the
> > class tokens in the given LM.
> > 2) for each word ngram thus obtained:
> >          a) compute the joint probability p of the entire word
> > ngram, according to the original class LM
>
> Would you mind telling me how you compute this probability when
> multiwords are involved?
> Do you treat the multiword as a single token, or do you estimate
> the conditional probabilities between the words that make up the
> multiword?

Are you talking about multiwords that are joined by underscores
(as handled by the -multiwords option)?  In that case there is no
special processing for them in ngram -expand-classes.  The class mechanism
treats multiwords as regular word tokens.

If you are asking about class expansions that contain multiple words
separated by spaces (e.g. CITY -> San Francisco), then the answer is that
the expansion algorithm deals with them just fine.  The algorithm I outlined
above handles this case quite naturally.

I forgot to mention one feature of the expansion algorithm:
If the same word ngram can be generated by expanding different class ngrams,
then the corresponding joint probabilities are added, as they should be.
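
The steps described above (replace class tokens by their expansions, compute
the joint probability of each resulting word ngram, and add up the
probabilities of identical word ngrams) can be sketched roughly as follows.
This is a toy illustration only, not the code in SRILM: the function name,
the data layout and all probabilities are hypothetical, the class ngram
joint probabilities are simply given as input, and any later steps of the
real algorithm are not covered.

```python
from collections import defaultdict
from itertools import product

def expand_class_ngrams(class_ngram_probs, expansions):
    """Map joint probabilities of class ngrams to joint probabilities of word ngrams.

    class_ngram_probs: {tuple of tokens: joint probability}
    expansions: {class token: [(tuple of words, P(words | class)), ...]}
    """
    word_ngram_probs = defaultdict(float)
    for ngram, p in class_ngram_probs.items():
        # For each token, list its (word sequence, probability) alternatives.
        # A plain word "expands" to itself with probability 1.
        options = [expansions.get(tok, [((tok,), 1.0)]) for tok in ngram]
        for choice in product(*options):
            # An expansion may contain several words (e.g. CITY -> san francisco).
            words = tuple(w for seq, _ in choice for w in seq)
            prob = p
            for _, q in choice:
                prob *= q
            # The same word ngram may come from different class ngrams:
            # the joint probabilities are added.
            word_ngram_probs[words] += prob
    return dict(word_ngram_probs)

# Hypothetical toy data: two classes that share one multiple-word expansion.
expansions = {
    "CITY": [(("san", "francisco"), 0.6), (("boston",), 0.4)],
    "PLACE": [(("san", "francisco"), 0.5), (("the", "bay"), 0.5)],
}
class_ngram_probs = {
    ("to", "CITY"): 0.2,
    ("to", "PLACE"): 0.1,
}

result = expand_class_ngrams(class_ngram_probs, expansions)
for words, p in sorted(result.items()):
    print(" ".join(words), round(p, 4))
```

Here "to san francisco" is reachable from both "to CITY" and "to PLACE", so
its joint probability is the sum of the two contributions. Note that word
ngrams obtained this way can be longer than the class ngrams they came from.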

Andreas
