Georgi,

You can get the conditional probabilities for arbitrary sets of
N-grams using

    ngram -counts FILE
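
For example (a minimal sketch; the file name trigrams.counts and the
exact output are illustrative), list your N-grams in the count-file
format that ngram-count produces, one N-gram per line followed by a
count (the count value itself does not matter for the probabilities),
and run:

    printf '%s\n' 'wordt_2 wordt_1 wordt 1' 'wordt_2 wordt_1 <s> 1' > trigrams.counts
    ngram -lm $LM_URI -order $order -counts trigrams.counts -debug 2

This loads the LM only once, and with -debug 2 (as in your -ppl runs
below) it should print the conditional probability of the last word of
each N-gram given the preceding words.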

Andreas

On 2/1/2012 11:37 AM, Dzhambazov, Georgi wrote:

Dear Mr. Stolcke,

I am trying to do sentence boundary segmentation. I have an n-gram
language model, which I train with the SRILM toolkit. Thanks for the
nice tool!

I have the following problem.

I am implementing the forward-backward algorithm myself, so I need to
combine the n-grams of your "hidden event model" with the prosodic
model. Therefore, I need the probabilities of the individual n-grams
(in my case 3-grams).

For example, for the word sequence

    wordt_2 wordt_1 wordt wordt+1 wordt+2

I need

    P(<s>, wordt | wordt_2 wordt_1)
    P(wordt | wordt_2 wordt_1)
    P(wordt+1 | wordt_1 wordt)
    ...

that is, all possible combinations with and without <s> before each
word.

To get one of these, I use the following SRILM commands:

    # create text for the case: wordt_2 wordt_1 <s> wordt
    echo "$wordt_2 $wordt_1
    $wordt" > testtext2
    ngram -lm $LM_URI -order $order -ppl testtext2 -debug 2 -unk > /tmp/output

and then read the corresponding line that I need from the output
(e.g. line 3).

OUTPUT:

    wordt_2 wordt_1
    p( <unk> | <s> ) = [2gram] 0.00235274 [ -2.62843 ]
    p( <unk> | <unk> ...) = [2gram] 0.00343115 [ -2.46456 ]
    p( </s> | <unk> ...) = [2gram] 0.0937662 [ -1.02795 ]
    1 sentences, 2 words, 0 OOVs
    0 zeroprobs, logprob= -6.12094 ppl= 109.727 ppl1= 1149.4

    wordt
    p( <unk> | <s> ) = [2gram] 0.00235274 [ -2.62843 ]
    p( </s> | <unk> ...) = [2gram] 0.10582 [ -0.975432 ]
    1 sentences, 1 words, 0 OOVs
    0 zeroprobs, logprob= -3.60386 ppl= 63.3766 ppl1= 4016.59

    file testtext2: 2 sentences, 3 words, 0 OOVs
    0 zeroprobs, logprob= -9.7248 ppl= 88.0967 ppl1= 1744.21
--------------------------------

The problem is that for each trigram I call ngram again, and it
reloads the LM (> 1 GB) each time, which makes this very slow.
Is there a faster solution? I also do not need the perplexity output.

I know about the segmentation tool
http://www.speech.sri.com/projects/srilm/manpages/segment.1.html
but it gives results for the whole sequence, which is not my goal.

Kind regards,
Georgi Dzhambazov,

Student Assistant,
NetMedia
________________________________________
From: Andreas Stolcke [stolcke@icsi.berkeley.edu]
Sent: Thursday, 13 October 2011 05:50
To: Dzhambazov, Georgi
Cc: eee@speech.sri.com
Subject: Re: Question about sentence boundary detection paper

Dzhambazov, Georgi wrote:
> Dear A. Stolcke,
> Dear E. Shriberg,
>
> I am interested in your approach to sentence boundary detection.
> I would be very happy if you could find some time to clarify some of
> the steps of your approach for me. I plan to implement them.
>
> Question 1)
> In the paper (1), in paragraph 2.2.1, you say that states are "the
> end-of-sentence status of each word plus any preceding words".
> So, for example, at position 4 of the example sentence the state is
> (<ns> + quick brown fox), and at position 6 the state is
> (<s> + brown fox flies).
> This means a huge state space. Is this right?
>
>  1   2     3    4    5    6   7    8    9   10
> The quick brown fox flies <s> The rabbit is white.

The state space is potentially huge, but just like in standard N-gram
LMs you only consider the histories (= states) actually occurring in
the training data, and handle any new histories through backoff.
Furthermore, the state space is constrained to those states that match
the N-grams in the word sequence. So for every word position you have
to consider only two states (<s> and no-<s>).

>
> Question 2)
> Transition probabilities are N-gram probabilities. You give an
> example with bigram probabilities in the next line. However, you also
> say that you are using a 4-gram LM. So the correct example should be:
> the probability at position 6 is Pr(<s> | brown fox flies)
> and at position 4 is Pr(<ns> | quick brown fox).
> Is this right?

Correct.

>
> Question 3)
> For recognition, you say that the forward-backward algorithm is used
> to determine the maximal P(T_i | W), where T_i corresponds to <s> or
> <ns> at position i. However, the transition probabilities include
> information about states like (<ns> + quick brown fox).
> How do you apply the transition probabilities in this model? Does it
> relate to the formula in section 4 of (2)?
> I think this formula can work for the forward-backward algorithm,
> although section 4 states that it is used for Viterbi.

For finding the most probable T_i you in fact use the Viterbi
algorithm.

The formulas in section 4 just give one step in the forward
computation that would be used in the Viterbi algorithm.
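
To make the recursion concrete, here is a minimal two-state Viterbi
sketch (plain Python, not SRILM code; lm_logprob and prosody_logprob
are hypothetical stand-ins for the hidden-event N-gram model and the
prosodic model):

    TAGS = ("<s>", "<ns>")  # boundary / no boundary before a word

    def viterbi(n_words, lm_logprob, prosody_logprob):
        """Most probable tag sequence T_1..T_n for an n-word sequence.

        lm_logprob(i, prev_tag, tag): log P from the hidden-event N-gram
        model for placing `tag` before word i (prev_tag is None at i=0).
        prosody_logprob(i, tag): log score from the prosodic model.
        """
        # delta[t]: best log score of any tag sequence ending in t
        delta = {t: lm_logprob(0, None, t) + prosody_logprob(0, t)
                 for t in TAGS}
        backptr = []
        for i in range(1, n_words):
            new_delta, bp = {}, {}
            for t in TAGS:
                best = max(TAGS, key=lambda p: delta[p] + lm_logprob(i, p, t))
                bp[t] = best
                new_delta[t] = (delta[best] + lm_logprob(i, best, t)
                                + prosody_logprob(i, t))
            delta = new_delta
            backptr.append(bp)
        # backtrace from the best final state
        tag = max(TAGS, key=delta.get)
        tags = [tag]
        for bp in reversed(backptr):
            tag = bp[tag]
            tags.append(tag)
        tags.reverse()
        return tags

Replacing the max over predecessors with a sum (plus a matching
backward pass) turns the same recursion into the forward-backward
computation of the posteriors P(T_i | W).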

Please note that this is all implemented in the "segment" tool that
comes with SRILM. See
http://www.speech.sri.com/projects/srilm/manpages/segment.1.html and
http://www.speech.sri.com/projects/srilm/ for more information on
SRILM.

Andreas

>
> References:
>
> 1) Shriberg et al., 2000. Prosody-based automatic segmentation of
> speech into sentences and topics.
> 2) Stolcke and Shriberg, 1996. Automatic linguistic segmentation of
> conversational speech.
>
> Thank you!
>
> Kind regards,
> Georgi Dzhambazov,
>
> Student Assistant,
> NetMedia