[SRILM User List] Generating sentences from a Google N-Gram LM
stolcke at speech.sri.com
Mon Mar 1 22:15:51 PST 2010
On 2/18/2010 8:00 PM, Benjamin Lambert wrote:
> Hi there,
> I'm pretty new to the SRI LM toolkit. I couldn't figure out how to search the mailing list archives, so forgive me if this has come up before.
> I'd like to *generate* random sentences using a LM based on Google n-grams (GNG). I tried following the directions in the FAQ on using GNG. I don't have any particular corpus in mind so I used the vocab from the WSJ HUB4 dataset (it was all I had handy--it's also all CAPS, so I made it lowercase in case that would help). That vocab file is about 15k words.
> This is the command I'm using at the end to generate:
> ngram -memuse -debug 3 -order 3 -count-lm -lm google.countlm -gen 1 -vocab wsj-lc.vocab -limit-vocab -vocab-aliases google.alias
> My questions are:
> 1) Am I on the right track here?
> 2) After launching the 'ngram' binary, it prints numerous times:
> "gunzip: stdout: Broken pipe
> gunzip: stdout: Broken pipe
> gunzip: stdout: Broken pipe"
> Is that normal? It seems to finish anyway.
These messages probably result from the .gz counts files being closed
before the end of the file is reached.
This is okay, since SRILM doesn't have to read the entire counts file if
it can predict that the rest of it is outside the vocabulary given.
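
As a rough illustration of the early stop described above (not SRILM's
actual code), here is a minimal Python sketch. It assumes a counts file
sorted by word with tab-separated "word, count" lines; the function name
and the toy data are hypothetical. When the reader is a pipe from gunzip
rather than an in-memory stream, closing it early like this is what makes
gunzip report "Broken pipe".

```python
import gzip
import io


def read_counts_until_past_vocab(gz_bytes, vocab):
    """Collect counts for in-vocabulary words from a sorted gzip stream.

    Assumes lines are sorted by word, so reading can stop as soon as a
    word sorts after the last vocabulary entry.
    """
    last_word = max(vocab)  # nothing sorted after this can be in the vocabulary
    kept = {}
    lines_read = 0
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as stream:
        for line in stream:
            lines_read += 1
            word, count = line.rstrip("\n").split("\t")
            if word > last_word:
                break  # leaving the loop closes the stream before end of file
            if word in vocab:
                kept[word] = int(count)
    return kept, lines_read


# Hypothetical toy data, sorted by word.
rows = [("apple", 5), ("banana", 3), ("cherry", 7), ("mango", 2), ("zebra", 9)]
data = gzip.compress("".join(f"{w}\t{c}\n" for w, c in rows).encode("utf-8"))
kept, lines_read = read_counts_until_past_vocab(data, {"apple", "banana"})
print(kept, lines_read)
```

Only three of the five lines are read here; the remaining lines are
never decompressed.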
> 3) It takes a very long time to generate a single sentence. Is that expected? (I imagine, yes, because of the file-format). Would it be faster if it weren't unzipping the data?
No, it has nothing to do with how the data is stored. Generation is
slow for large vocabularies, and/or if the probability computation is slow.
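
To see why vocabulary size and probability cost both matter, consider a
minimal Python sketch of drawing one word from a conditional
distribution. This is an assumption about the general technique
(inverse-CDF sampling over the vocabulary), not a copy of what ngram
does internally; sample_next_word, prob and the toy vocabulary are
hypothetical names.

```python
def sample_next_word(vocab, prob, u):
    """Pick a word by walking the vocabulary until the cumulative
    probability exceeds u, a uniform draw in [0, 1).

    prob(word) is evaluated once per word visited, so one draw can cost
    up to len(vocab) probability computations.
    """
    cumulative = 0.0
    for word in vocab:
        cumulative += prob(word)
        if u < cumulative:
            return word
    return vocab[-1]  # guard against floating-point rounding


# Hypothetical toy model: uniform over four words, counting evaluations.
toy_vocab = ["a", "b", "c", "d"]
calls = {"n": 0}


def uniform_prob(word):
    calls["n"] += 1
    return 1.0 / len(toy_vocab)


picked = sample_next_word(toy_vocab, uniform_prob, 0.99)
print(picked, calls["n"])
```

With a 15k-word vocabulary and a count-based LM whose probabilities are
expensive to evaluate, each generated word can require thousands of such
evaluations under this scheme.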
> 4) When I finally do get a sentence generated, it's *very* long and has *many* <unk>'s. Like, from the command above, I get a sentence with 50,004 words. Actually, all sentences generated seem to be 50004 words... This doesn't look quite right to me. Any idea what's happening here? Maybe it's not handling the begin and end of sentence markers properly... I'll paste the beginning of the 50k word sentence, generated by the command above (that is, with order 3), below.
Sorry, I have no idea what's going on here. It seems that the </s>
symbol is not predicted with large enough probability. The length
you're seeing is due to the constant
const unsigned int maxWordsPerLine = 50000;
which puts a limit on the length of the sentences generated.
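
The interaction between the end-of-sentence symbol and that cap can be
sketched in Python as follows. This is a simplified illustration under
the assumption that generation repeatedly draws a next word until </s>
appears or the cap is hit; generate_sentence and the two toy next-word
functions are hypothetical, and the sketch does not try to account for
the exact 50,004 figure reported above.

```python
MAX_WORDS_PER_LINE = 50000  # mirrors the constant quoted above


def generate_sentence(next_word, max_words=MAX_WORDS_PER_LINE):
    """Draw words until </s> is produced or max_words is reached."""
    words = []
    history = ["<s>"]
    while len(words) < max_words:
        word = next_word(history)
        if word == "</s>":
            break  # normal termination
        words.append(word)
        history.append(word)
    return words


def never_ends(history):
    # Toy model that never predicts </s>.
    return "<unk>"


def ends_after_three(history):
    # Toy model that predicts </s> once three words follow <s>.
    return "</s>" if len(history) > 3 else "word"


runaway = generate_sentence(never_ends)
normal = generate_sentence(ends_after_three)
print(len(runaway), len(normal))
```

If </s> is never drawn, every sentence runs to the cap, which matches
the symptom of all generated sentences having the same length.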
> Thank you,