Re: ARPA format (sorting)

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Tue, 11 Mar 2003 14:33:44 PST

I'm not aware of any specific sorting requirements.  SRILM outputs the
N-grams in an order that optimizes memory caching behavior (essentially
by proximity in the underlying tree data structure), but of course it
can read N-grams in any order.

However, I have heard that some CMU software (like Sphinx) expects the
N-grams to be sorted lexicographically left-to-right.  The latest release
contains a script "sort-lm" that reorders the N-grams in a manner that
should be agreeable to the CMU software.  It is documented in the lm-scripts(1)
man page.
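
For illustration only, the kind of reordering described above can be
sketched in Python.  This is not the sort-lm script itself; it is a minimal
sketch that assumes "lexicographically left-to-right" means ordering the
entries of each N-gram section by their word sequence, compared by code
point, with the leftmost word most significant.  The function name
sort_arpa is hypothetical, and whether this ordering is exactly what the
CMU software expects is an assumption.

```python
import re

# Matches ARPA section headers such as "\1-grams:" or "\3-grams:".
SECTION = re.compile(r"^\\(\d+)-grams:\s*$")


def sort_arpa(text):
    """Return ARPA LM text with each N-gram section sorted by its words.

    Assumption: each entry is "logprob w1 ... wN [backoff]", separated by
    whitespace, and sorting compares word lists left to right.
    """
    out = []
    pending = []   # N-gram lines of the section currently being read
    order = 0      # N of the current section; 0 while in the header

    def flush():
        # Sort by the N words that follow the log probability field.
        pending.sort(key=lambda line: line.split()[1:1 + order])
        out.extend(pending)
        pending.clear()

    for line in text.splitlines():
        match = SECTION.match(line)
        if match:
            flush()
            order = int(match.group(1))
            out.append(line)
        elif order and line.strip() and not line.startswith("\\"):
            pending.append(line)
        else:
            # Header lines, blank lines, and "\end\" pass through unchanged.
            flush()
            out.append(line)
    flush()
    return "\n".join(out) + "\n"
```

Header lines, blank lines, and the section markers are left where they
are; only the entries inside each section move.  Probabilities and backoff
weights are untouched, so the model itself is unchanged.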

--Andreas

In message <20030311232159.A15739 at ADDRESS HIDDEN> you wrote:
> Hello Andreas,
>
> Is there any explicit sorting that LM's in ARPA format should have?
> Specifically, is there a standard sort order for the words of uni-, bi- and
> trigrams?
> (e.g. <unk> first, then diacritics, then alphabetically, then...).
> We've had some problems with arpa's written by SRILM that the CMU toolkit
> can't handle, and we suspect a problem in the sorting of n-grams.
>
> Regards,
> Paul
> --
> melis at ADDRESS HIDDEN
