[SRILM User List] SRI LM toolkit: ngram-count

Andreas Stolcke stolcke at speech.sri.com
Mon Feb 15 14:05:50 PST 2010


You are running out of memory, and the reasons could have to do with the
way your operating system is set up. SRILM itself has no inherent limit
on the amount of memory it can use, other than what is implied by the
width of your pointer type (32 or 64 bits).

Write a small test program to see how much memory you can malloc() on
your system, and if it doesn't behave as expected, look for an expert in
Windows/Cygwin.
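
For example, a minimal (untested) probe along these lines will show
where allocation fails; note that on systems that overcommit memory the
process may be killed by the OS rather than see malloc() return NULL:
----------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const size_t block = 64UL * 1024 * 1024;  /* allocate in 64 MB steps */
    size_t total = 0;
    char *p;

    while ((p = malloc(block)) != NULL) {
        memset(p, 1, block);   /* touch every page so the OS commits it */
        total += block;
        printf("allocated %zu MB so far\n", total >> 20);
    }
    printf("malloc() failed after %zu MB\n", total >> 20);
    return 0;
}
----------------------------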

Regardless of all this, you will not be able to convert the Google
N-gram collection into a backoff model, as I've explained before (see
http://www.speech.sri.com/pipermail/srilm-user/2009q2/000751.html ).

Andreas

On 2/9/2010 1:15 AM, 이일빈 wrote:
> Thank you for your prompt response.
> In fact, I was trying to interpolate two count files, one from the
> Google N-grams and one from a training corpus.
> However, I found that there is a FAQ section about the Google N-grams,
> so I am trying that approach now.
> I finished all the steps given in the FAQ, and now I want to convert
> the result into ARPA format.
> (As you know, the result of the process is just a count-LM parameter
> file.)
> So I tried the following command.
> ----------------------------
> ngram -debug 2 -order 3 -count-lm -lm google.countlm -vocab vocab.txt
> -vocab-aliases google.aliases -limit-vocab -write-lm google.lm
> ----------------------------
> But I got the following error message.
> ----------------------------
> assertion "body != 0" failed: file "../../include/LHash.cc", line 138
> 3 [sig] ngram 21852 winpids::enumNT: error 0xC0000005 reading system process information
> Aborted (core dumped)
> ----------------------------
> However, I monitored the committed memory size, and it reached only
> about 900 MB.
> So I'm wondering whether there is a memory usage limit in the toolkit.
> If you could help me with this problem, I would appreciate it.
> Alternatively, could you suggest a good way to convert the count-LM
> parameter file and count files into ARPA format?
> Thank you.
> Best regards,
> ILBIN
>
> -----Original Message-----
> *From:* "Andreas Stolcke" <stolcke at speech.sri.com>
> *Date:* 2010-02-09 2:59:45 AM
> *To:* 이일빈 <illee at etri.re.kr>
> *Cc:* "srilm-user" <srilm-user at speech.sri.com>
> *Subject:* Re: SRI LM toolkit: ngram-count
>
> On 2/7/2010 9:35 PM, 이일빈 wrote:
>> Dear Andreas Stolcke
>> Hello. I'm ILBIN LEE, and I develop a speech recognizer at ETRI, Korea.
>> While using the ngram-count command of the SRI LM toolkit, I
>> encountered the following error message.
>> $ ngram-count.exe -order 3 -sort -float-counts -gt2min 1 -gt3min 1
>> -vocab vocab.txt -read count.txt -lm lm.txt
>> error in discount estimator for order 1
>> The count file is an interpolation of two different count files, so
>> it contains many fractional counts.
>> If you could suggest some possible causes, it would help me a lot.
> You cannot use Good-Turing discounting with fractional counts. Try
> -wbdiscount, -cdiscount, or -addsmooth.
>
> The fact that you didn't get an error message also indicates that you
> weren't using -float-counts, which you must use when processing
> fractional counts.
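>
> For example, something along these lines should work (an untested
> sketch based on your original command, with Good-Turing replaced by
> Witten-Bell discounting):
> ----------------------------
> ngram-count -order 3 -float-counts -wbdiscount \
>     -vocab vocab.txt -read count.txt -lm lm.txt
> ----------------------------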
>
> Please also read the FAQ section on Smoothing issues before
> proceeding further.
>
> Andreas
>
>> Thank you.
>> Best regards,
>> ILBIN
>
