Re: make-ngram-pfsg: bad results with new gawk version

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Fri, 05 Mar 2004 07:52:32 PST

Thanks for tracking this down.  I'll add a note somewhere that one should
set LC_NUMERIC=C or LC_ALL=C for gawk scripts to do proper arithmetic.
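
A minimal sketch of that workaround, assuming a POSIX shell and any awk on
the PATH. The wrapper name scale_logs and the rint() stand-in are
hypothetical and not taken from make-ngram-pfsg; the logscale value and the
two log probabilities are the ones quoted further down in this thread.

```shell
# Hypothetical wrapper: run awk with the C locale forced for this one
# command only, so "." is always parsed as the decimal separator.
# LC_ALL=C overrides every other LC_* variable; LC_NUMERIC=C would be the
# narrower setting, but it only takes effect when LC_ALL is unset.
scale_logs() {
    LC_ALL=C awk -v logscale=23027 '
        # Stand-in for rint(): round to the nearest integer (an assumption,
        # not the definition used by the SRILM scripts).
        function rint(x) { return (x < 0) ? int(x - 0.5) : int(x + 0.5) }
        function scale_log(x) { return rint(x * logscale) }
        { print scale_log($1) }
    '
}

# Feed the two log probabilities reported in the thread, one per line.
printf '%s\n' -0.314718 -2.596963 | scale_logs
```

With the locale forced this way, the sketch prints -7247 and -59800, the
values reported for the old awk.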

--Andreas

In message <40487FE8.3020708 at ADDRESS HIDDEN> you wrote:
> Hi Andreas,
>
> Andreas Stolcke wrote:
> > This is quite odd.
>
> I think so, too :)
>
> > make-ngram-pfsg doesn't perform much arithmetic on the log probabilities
> > in the LM.  It only scales and rounds them.
> >
> > Can you apply the scale_log() function in make-ngram-pfsg to your LM
> > probabilities and backoff weights, and extract the cases where the output
> > differs?
>
> old awk:
> add_trans BO  -> </s> -0.314718
> scale_log(prob) = -7247
> add_trans <s> -> BO  -2.596963
> scale_log(prob) = -59800
>
> new awk:
> logscale = 23027
> add_trans BO  -> </s> -0.314718
> scale_log(prob) = 0
> add_trans <s> -> BO  -2.596963
> scale_log(prob) = -46054
>
> Note that I printed the logscale which seems to be correct.
> ...
> I think I found the problem:
>
> The float log-probs (x) seem to be truncated to integers before they are
> multiplied by the logscale (0 * 23027 = 0 and -2 * 23027 = -46054):
>
> function scale_log(x) {
> return rint(x * logscale);
> }
>
> This seems to be related to the locale settings
> http://mail.gnu.org/archive/html/bug-gnu-utils/2002-07/msg00196.html
>
> If I set LC_ALL="C" in my shell, it also works as expected. So the bad
> behaviour seems to occur with gawk 3.1.3 AND LC_ALL=""...
>
>
> Regards.
> Matthias
>
>
> > --Andreas
> >
> > In message <40475599.9070700 at ADDRESS HIDDEN> you wrote:
> >
> >>Hello again,
> >>
> >>forgot to say that I tested this with srilm 1.3.3 and 1.3.1.
> >>
> >>Matthias
> >>
> >>Matthias Thomae wrote:
> >>
> >>>Hello Andreas,
> >>>
> >>>make-ngram-pfsg gives me different results with different versions of
> >>>gawk. The header and the links are the same, but the weights differ
> >>>substantially.
> >>>
> >>>I see the old behaviour with gawk 3.1.0 (on debian) and 3.1.1 (on suse),
> >>>and the differing one with 3.1.3-1 and 3.1.3-2 (on debian). The newly
> >>>created PFSGs cause some degradation in ASR accuracy...
> >>>
> >>>Any clues?
> >>>
> >>>Regards.
> >>>Matthias
> >>
> >
> >
>