<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Dear SRILM users,<div><br><div>I would like your opinion on using the SRILM package for text categorisation. My goal is to compare the classification performance of the SRILM package with that of other well-known techniques (SVMs, decision trees, ...) that are known to give good results, on some standard data sets (newsgroups, Reuters, ...) as well as other data sets.</div><div>The only problem I'm facing is that the SRILM package is very large, and I would be embarrassed if a wrong configuration of the package on my part distorted the results. So I am submitting the methodology I plan to use, in order to get your advice, suggestions and corrections.</div><div><br></div><div>Each data set (pre-processed with stop-word removal and stemming) has a number of categories. Each document belongs to exactly one category (multi-class, single-label). For each category I build a trainingFile containing all the documents of that category. 
Then, for each category, I build a model file using the following command:</div><div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><span style="color: #2a00ff"><span class="Apple-tab-span" style="white-space:pre">        </span><font class="Apple-style-span" face="Helvetica"><font class="Apple-style-span" color="#0000FF">ngram-count&nbsp; -text trainingFile</font></font></span><span style=""><font class="Apple-style-span" face="Helvetica"><font class="Apple-style-span" color="#0000FF">&nbsp;-lm modelFile</font></font></span></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 12px;">I'm using 10-fold cross-validation to guard against over-fitting, so each trainingFile contains 90% of the documents.&nbsp;</span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 12px;">The resulting model is tested on the remaining 10% with the following command:<span class="Apple-tab-span" style="white-space:pre">        </span></span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 12px;"><span class="Apple-tab-span" style="white-space:pre">                <font class="Apple-style-span" size="3"><span class="Apple-style-span" style="font-size: 13px;"><font class="Apple-style-span" color="#000000">        </font></span></font></span><span style=""><font class="Apple-style-span" color="#0000FF"><font 
class="Apple-style-span" size="4"><span class="Apple-style-span" style="font-size: 14px;">ngram -lm modelFile</span></font></font></span><span style=""><font class="Apple-style-span" color="#0000FF"><font class="Apple-style-span" size="4"><span class="Apple-style-span" style="font-size: 14px;">&nbsp;-ppl testFile</span></font></font></span><span style=""><font class="Apple-style-span" color="#0000FF"><font class="Apple-style-span" size="4"><span class="Apple-style-span" style="font-size: 14px;">&nbsp;-debug 0</span></font></font></span></span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 13px;"><br class="webkit-block-placeholder"></span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 13px;">The output gives me the perplexity as well as the logprob. I take the logprob to be the log-likelihood of the data, i.e. logprob = log P(document | category).&nbsp;</span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 13px;">(Is it OK to use the logprob directly, or should I use the perplexity? Since each category has its own vocabulary, could OOVs influence the categorisation? 
)</span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 13px;">For the categorisation I'm using Bayes' rule: P(category | document) = P(document | category) * P(category) / P(document).&nbsp;</span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 13px;"><br class="webkit-block-placeholder"></span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 13px;">Since P(document) is the same for all categories, I compute the posterior probability, up to that constant factor, simply as P(document | category) * P(category). 
I'm estimating the prior P(category) as the fraction of all documents that belong to that category.</span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 13px;"><br class="webkit-block-placeholder"></span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 13px;">Finally, I assign each document to the category with the maximum posterior probability P(category | document).</span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 13px;"><br class="webkit-block-placeholder"></span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 13px;">In your opinion, is this simple test good enough for assessing the classification performance of the SRILM package, or is it necessary to use other commands to take other features (such as OOVs, ...) into account?&nbsp;</span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 13px;"><br class="webkit-block-placeholder"></span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal 
normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 13px;">Thank you for your help. I hope this question will also be useful to other users.</span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 13px;"><br class="webkit-block-placeholder"></span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 13px;">@min.</span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 12px;">&nbsp;</span></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><br></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" color="#2A00FF"><br class="webkit-block-placeholder"></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 14px/normal Monaco; "><font class="Apple-style-span" color="#2A00FF"><br class="webkit-block-placeholder"></font></div></div></div></body></html>