Hello,<br><br>First, to save you from having to read everything below: suppose I have used make-google-ngrams to store the N-gram counts of a small text corpus on disk in Google's format. How do I then convert those counts to ARPA format with SRILM?<br>
<br>I have read the Google Web N-gram section in the FAQ, read all the mailing-list emails containing the search term "google", and read all the relevant man pages, as well as looking at the relevant run-tests, without success. <br><br>
My goal is to build an ARPA-format language model from the N-gram counts in the Google Web N-gram corpus. I realize it is too large to load into memory, as discussed in the documentation, so, as one of the emails on the list suggested, I pruned out most of the junk and non-dictionary words, merged the different cases, and fixed the config files. I have now reduced the data quite significantly, but I am unable to figure out how to convert it to ARPA format. Below is what I tried:<br>
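For reference, my google.countlm description file follows the template I pieced together from the ngram(1) man page; roughly the following (the vocabulary size, count values, and weights below are placeholders, I have elided most of the mixture-weight lines, and the directory path is shortened):<br>

```
order 5
vocabsize 13000000
totalcount 1024908267229
countmodulus 40
mixweights 3
 0.5 0.5 0.5 0.5 0.5
 ...
google-counts /path/to/web1t/data
```

where the google-counts directory holds the counts in the Google layout produced by make-google-ngrams.<br>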
<br>1. ngram -order 5 -count-lm -lm google.countlm -write-lm arpaLM<br><br>This did not work: it simply wrote out a duplicate of google.countlm.<br><br>2. I noticed in the man pages that the -expand-classes option forces the output to be a single N-gram model in ARPA format, so I tried:<br>
ngram -order 5 -count-lm -lm google.countlm -expand-classes 5 -write-lm arpaLM<br>This produced nothing but the message:<br>HMM, NgramCountLM, AdaptiveMix, Decipher, tagged, factored, DF, hidden N-gram, hidden-S, class N-gram, skip N-gram and stop-word N-gram models are mutually exclusive<br>
<br>3. I thought that using -mix-lm might yield an ARPA model, since the man pages say this also happens with -mix-lm. I realized this was unlikely to work, since I would be interpolating the LM with itself, but tried it regardless:<br>
ngram -order 5 -count-lm -lm google.countlm -expand-classes 5 -mix-lm google.countlm -write-lm arpaLM<br>The output was again the same as google.countlm.<br><br>I tried other things, such as using ngram-count and running the lm-scripts, but no luck. One relevant post from the archives:<br>
<br><a href="http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/2007-April/8.html" target="_blank">http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/2007-April/8.html</a><br>The URL above mentions:<br>
<br>
<i>>> Could you give me an *example* about bulilding google 3-gram LM file<br>
>> ,please?<br>
>> <br>>Again, this will require using the option with some tricks<br>
>that are not documents<br>
>as yet. Please be patient (or read all the manual pages carefully to<br>
>figure it our yourself.)</i><br><br>Has any documentation been written on this since? Was the trick referred to there the use of -mix-lm or -expand-classes to force ARPA format? <br><br>I figure that, worst case, I can do the conversion manually, but I am sure there is something in SRILM that I am missing.<br>
<br>Thanks,<br>Elias<br>