[SRILM User List] counts in ngram-count output
    Andreas Stolcke 
    stolcke at icsi.berkeley.edu
       
    Fri Jul 20 08:55:42 PDT 2012
    
    
  
On 7/19/2012 6:47 PM, shinichiro.hamada wrote:
> Hi, I have a question about whether my ngram-count output is correct.
>
> I generated a fractional word-count file with my own program and ran
> ngram-count with wb (Witten-Bell) discounting. The header of the output
> is shown below:
>
> --------------------------
> [4gram wb float-count]
> ngram-count -read countfile_float -float-counts -order 4 -lm outfile \
>   -wbdiscount -wbdiscount1 -wbdiscount2 -wbdiscount3
>
> ngram 1=780387
> ngram 2=20321
> ngram 3=2692
> ngram 4=2622
> ..
> --------------------------
>
> I thought higher-order models always have more n-grams than lower-order
> ones, but that is not the case in the result above. Does this indicate
> that my word-count file has a bug?
This is probably because the default minimum counts are higher for 
trigrams and 4-grams than for bigrams.
For bigrams the default is 1, whereas for trigrams and higher it is 2, so 
trigrams and 4-grams with a count below 2 are left out of the model.  You 
should see the expected behavior if you add
-gt3min 1 -gt4min 1
to the options.  (As explained in the man page, -gtXmin options apply to 
all discounting methods, not just GT.)
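
To check whether the cutoffs account for the numbers, one can tally how many 
entries of each order in the count file meet the minimum count.  The sketch 
below is only a rough approximation of what ngram-count does: the function 
name surviving_ngrams is hypothetical, it assumes each line of the count file 
is "w1 ... wN count" separated by whitespace, and it ignores any n-grams the 
real tool may add or drop for other reasons.

```shell
# Hypothetical helper, not part of SRILM.
# $1 = count file, $2 = minimum count applied to orders 3 and up.
# Orders 1 and 2 use a minimum of 1, mirroring the defaults described above.
surviving_ngrams() {
    awk -v min_hi="$2" '
        NF >= 2 {
            order = NF - 1                    # last field is the count
            min = (order >= 3) ? min_hi + 0 : 1
            total[order]++
            if ($NF + 0 >= min) kept[order]++
            if (order > max) max = order
        }
        END {
            for (o = 1; o <= max; o++)
                if (o in total)
                    printf "order %d: %d of %d kept\n", o, kept[o] + 0, total[o]
        }' "$1"
}

# Example (assumes the count file from the question is in the current directory):
#   surviving_ngrams countfile_float 2    # default cutoffs
#   surviving_ngrams countfile_float 1    # as with -gt3min 1 -gt4min 1
```

If the "kept" figures for orders 3 and 4 rise sharply when the second argument 
goes from 2 to 1, the cutoffs are the likely explanation rather than a bug in 
the count file.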
Andreas
    
    