SRILM and LC_ALL

From: David Gelbart <gelbart at ADDRESS HIDDEN>
Date: Mon, 8 Oct 2007 18:32:33 -0700 (PDT)

On July 19 2007, Andreas Stolcke wrote:
> David Brodbeck wrote:
> > I'm trying to build SRILM 1.5.2 on Redhat Enterprise Linux Server 5.
> > The machine type is i686_m64.  Everything builds all right, but
> > the tests fail for make-ngram-pfsg, ngram-class, and
> > ngram-count-lm-limit-vocab.
> >
> > make-ngram-pfsg is the most obvious one, so I'll tackle that one
> > first.  I get the following in the stderr file:
> > gawk: /opt/srilm/bin/i686-m64/add-pauses-to-pfsg:22: fatal: Invalid
> > collation character: /[[:lower:]-ÿ]/
>
> > Has anyone else run into this?  I'm using GNU Awk 3.1.5, and the
> > locale is set to en_US.UTF-8.
>
> This is odd since we're also using gawk 3.1.5 and I cannot replicate
> the problem even when setting LANG to en_US.UTF-8. It seems that the
> interpretation of gawk regular expressions should not depend on the
> OS release version, but of course there may always be bugs.

Hi Andreas,

Are you sure you used gawk 3.1.5 when you tried to replicate this?
The reason I ask is that at ICSI, the SRILM tools seem to invoke gawk
3.1.3, not gawk 3.1.5:

$ head -1 `which add-pauses-to-pfsg`
#!/usr/bin/gawk -f
$ /usr/bin/gawk --version | head -1
GNU Awk 3.1.3
$ which gawk
/usr/local/bin/gawk
$ /usr/local/bin/gawk --version | head -1
GNU Awk 3.1.5
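
The check above can be wrapped in a small helper. This is only a sketch: the function name shebang_interpreter is made up for illustration and is not part of SRILM. It prints the interpreter path named on a script's "#!" line, so it can be compared with whatever gawk comes first on the PATH.

```shell
# Hypothetical helper (not part of SRILM): print the interpreter path
# named on the "#!" line of the script given as $1.
shebang_interpreter() {
    # Look at the first line only; strip "#!" plus optional whitespace,
    # then keep the first whitespace-delimited word (drops "-f" etc.).
    head -n 1 "$1" | sed -n 's/^#![[:space:]]*\([^[:space:]]*\).*/\1/p'
}
```

Comparing `shebang_interpreter "$(which add-pauses-to-pfsg)"` with `command -v gawk` should show whether the script runs under a different gawk than the one typed at the prompt, as is the case here.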

My default locale is en_US.  With this locale, I do not see the error
David Brodbeck did, even if I use gawk 3.1.5.  If I set
LANG=en_US.UTF-8 and use gawk 3.1.5, then I see the error:

$ /usr/local/bin/gawk  -f `which add-pauses-to-pfsg`
gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22:
fatal: Invalid collation character: /[[:lower:]-?]/

Setting LC_ALL=C as suggested in the SRILM INSTALL file does not solve
the problem:

tmp$ export LC_ALL=C
tmp$ /usr/local/bin/gawk  -f `which add-pauses-to-pfsg`
gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22:
fatal: Invalid collation character: /[[:lower:]-ÿ]/

The compute-oov-rate script gives a similar error.
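
Since the error is about collation, it may help to confirm which collation locale a process should see given the LANG and LC_ALL settings tried above. The sketch below follows the usual POSIX precedence (LC_ALL, then LC_COLLATE, then LANG); the function name effective_collate is hypothetical, and the final fallback to "C" is an assumption about the system default. It does not explain why gawk 3.1.5 still fails with LC_ALL=C.

```shell
# Hypothetical helper: report the locale name that should govern collation,
# using the POSIX precedence LC_ALL > LC_COLLATE > LANG.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"        # LC_ALL overrides everything else
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"    # category-specific setting
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"          # general default
    else
        printf '%s\n' "C"              # assumption: unset means C/POSIX
    fi
}
```

With LANG=en_US.UTF-8 and LC_ALL=C both set, this reports C, which is why the persisting error is surprising.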

David Brodbeck, if you're reading this, did setting LC_ALL=C solve
your problem with add-pauses-to-pfsg?  This was not clear to me from
reading your July 23 email to Andreas.

Thanks,
David