<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">On 11/5/2012 10:46 PM, Meng Chen wrote:<br>

    </div>

    <blockquote

cite="mid:CA+bc0mro16TYpt3pUexNgWFP3ZwC9mSS-qx+mDBY+wrWdmpCsw@mail.gmail.com"

      type="cite"><span class="" style="font-size:14px"><font

          style="font-family:arial,sans-serif" face="arial, helvetica,

          sans-serif">Hi, I'm training LMs for Mandarin Chinese ASR task

          with two different vocabularies, vocab1(<span

            style="line-height:16px">100635 words</span>) and

          vocab2(102541 words). In order to compare the

          performance of two vocabularies, the training corpus is<span

            style="line-height:16px"> the same, the test corpus is the

            same, and t</span>he word segmentation method is also the

          same, which is<span style="line-height:16px"> Forward Maximum

            Match.</span> The only difference is the segmentation

          vocabulary and LM training vocabulary. I trained LM1 and LM2

          with vocab1 and vocab2, and evaluated them on the test set. <span

            style="line-height:16px">The result is as follows:</span></font>

        <div style="font-family:arial,sans-serif">

          <span style="line-height:16px"><font face="arial, helvetica,

              sans-serif"><br>

            </font></span></div>

        <div style="font-family:arial,sans-serif"><font face="arial,

            helvetica, sans-serif"><span style="line-height:16px">LM1:

              logprobs = </span>-84069.7, PPL = 416.452.</font></div>

        <div style="font-family:arial,sans-serif"><font face="arial,

            helvetica, sans-serif"><span style="line-height:16px">LM2:

              logprobs =<font color="#000000"> </font></span><font

              color="#000000"><span lang="EN-US">-82921.7, PPL = </span><span

                lang="EN-US">189.564.</span></font><span

              style="line-height:16px"><font color="#000000">  </font> </span></font></div>

        <div style="font-family:arial,sans-serif"><span

            style="line-height:16px"><font face="arial, helvetica,

              sans-serif"><br>

            </font></span></div>

        <div style="font-family:arial,sans-serif"><span

            style="line-height:16px"><font face="arial, helvetica,

              sans-serif">It seems LM2 is much better than LM1, either

              by logprobs or by PPL. However, when I decode

              with the corresponding acoustic model, the CER (Character

              Error Rate) of LM2 is higher than that of LM1. So I'm really

              confused. What's the relationship between the PPL and CER?

               How to compare LMs with different vocabularies? Can you

              give me some suggestions or references? I'm really

              confused.</font></span></div>

        <div style="font-family:arial,sans-serif"><span

            style="line-height:16px"><font face="arial, helvetica,

              sans-serif"><br>

            </font></span></div>

        <div style="font-family:arial,sans-serif"><span

            style="line-height:16px"><font face="arial, helvetica,

              sans-serif">ps: There is a mistake in last mail, so I sent

              it again. <br>

            </font></span></div>

      </span></blockquote>

    <br>

    <font face="arial, helvetica, sans-serif">It is hard or impossible

      to compare two LMs with different vocabularies even when word

      segmentation is not an issue.<br>

      But you are comparing two LMs using different segmentations

      (because the vocabularies differ), so the problem is even harder.<br>

      The fact that your log probs differ by only a small amount

      (relatively) but the perplexities by a lot means that somehow your

      segmentation (the number of tokens in particular) in the two

      systems must be quite different.  Is that the case?  Can you devise

      an experiment where the segmentations are kept as similar as

      possible?   For example, you could apply the same segmenter to

      both test cases, and then split OOV words into their

      single-character components where needed to apply the LM.<br>
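      <br>
      As a rough check on the token counts, the reported figures can be
      inverted. This is a minimal sketch, assuming perplexity is computed
      as 10^(-logprob/N) from base-10 log probabilities, where N is the
      number of scored tokens (words plus sentence-end tokens, minus
      OOVs). The function name is illustrative only.<br>
      <pre>
```python
import math

def implied_token_count(logprob10, ppl):
    """Solve PPL = 10 ** (-logprob / N) for N, the number of scored tokens.

    Assumes logprob10 is a base-10 total log probability (a negative number).
    """
    return -logprob10 / math.log10(ppl)

# Figures quoted in the original message.
n1 = implied_token_count(-84069.7, 416.452)   # LM1
n2 = implied_token_count(-82921.7, 189.564)   # LM2

print(f"LM1 scored tokens: {n1:.0f}")
print(f"LM2 scored tokens: {n2:.0f}")
print(f"ratio LM2/LM1:     {n2 / n1:.3f}")
```
      </pre>
      Under that assumption the two test sets come out at roughly 32,100
      and 36,400 scored tokens, so the second segmentation has about 13%
      more tokens. Since perplexity is normalized per token, the two PPL
      values are not on the same scale.<br>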

      <br>
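      The OOV-splitting step suggested above could be sketched as follows.
      This is only an illustration: the function name, the toy vocabulary
      and the example sentence are made up, and it assumes every single
      character is itself in the LM vocabulary (otherwise it would still
      be scored as an unknown word).<br>
      <pre>
```python
def split_oov(tokens, vocab):
    """Replace each out-of-vocabulary token by its single characters.

    tokens: list of words from one shared segmentation of the test set
    vocab:  set of words known to the LM being evaluated
    """
    out = []
    for tok in tokens:
        if tok in vocab:
            out.append(tok)
        else:
            # Iterating over a str yields its individual characters.
            out.extend(tok)
    return out

# Toy example (hypothetical vocabulary and sentence).
vocab = {"语音", "识别", "模", "型"}
tokens = ["语音", "识别", "模型"]
result = split_oov(tokens, vocab)
print(result)
```
      </pre>
      Applying the same segmenter to the test text and then running a step
      like this once per vocabulary keeps the two token sequences identical
      except where a word is missing from one of the LMs.<br>
      <br>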

      Anecdotally, PPL and WER are not always well correlated, though

      when comparing a large range of models the correlation is strong

      (if not perfect).   See

      <a class="moz-txt-link-freetext" href="http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=659013">http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=659013</a> .<br>

      <br>

      I do not recall any systematic studies of the effect of Mandarin

      word segmentation on CER but given the amount of work in this area

      in the last decade there must be some.   Maybe someone else has

      some pointers?<br>

      <br>

      Andreas<br>

      <br>

    </font><br>

  </body>

</html>