Georgi,

You can get the conditional probabilities for arbitrary sets of
N-grams using

    ngram -counts FILE
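
For example (a minimal sketch; the file name trigrams.counts and the
exact output are illustrative), list your N-grams in the count-file
format that ngram-count produces, one N-gram per line followed by a
count (the count value itself does not matter for the probabilities),
and run:

    printf '%s\n' 'wordt_2 wordt_1 wordt 1' 'wordt_2 wordt_1 <s> 1' > trigrams.counts
    ngram -lm $LM_URI -order $order -counts trigrams.counts -debug 2

This loads the LM only once, and with -debug 2 (as in your -ppl runs
below) it should print the conditional probability of the last word of
each N-gram given the preceding words.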

Andreas

On 2/1/2012 11:37 AM, Dzhambazov, Georgi wrote:

Dear Mr. Stolcke,

I am trying to do sentence boundary segmentation. I have an n-gram
language model, which I train with the SRILM toolkit. Thanks for the
nice tool!

I have the following problem.

I am implementing the forward-backward algorithm myself, so I need to
combine the n-grams of your "hidden event model" with the prosodic
model. Therefore, I need the probabilities of the individual n-grams
(in my case 3-grams).

For example, for the word sequence

    wordt_2 wordt_1 wordt wordt+1 wordt+2

I need

    P(<s>, wordt | wordt_2 wordt_1)
    P(wordt | wordt_2 wordt_1)
    P(wordt+1 | wordt_1 wordt)
    ...

that is, all possible combinations with and without <s> before each
word.

To get one of these, I use the following SRILM commands:

    # create text for the case: wordt_2 wordt_1 <s> wordt
    echo "$wordt_2 $wordt_1
    $wordt" > testtext2
    ngram -lm $LM_URI -order $order -ppl testtext2 -debug 2 -unk > /tmp/output

and then read the corresponding line that I need from the output
(e.g. line 3).

OUTPUT:

    wordt_2 wordt_1
    p( <unk> | <s> ) = [2gram] 0.00235274 [ -2.62843 ]
    p( <unk> | <unk> ...) = [2gram] 0.00343115 [ -2.46456 ]
    p( </s> | <unk> ...) = [2gram] 0.0937662 [ -1.02795 ]
    1 sentences, 2 words, 0 OOVs
    0 zeroprobs, logprob= -6.12094 ppl= 109.727 ppl1= 1149.4

    wordt
    p( <unk> | <s> ) = [2gram] 0.00235274 [ -2.62843 ]
    p( </s> | <unk> ...) = [2gram] 0.10582 [ -0.975432 ]
    1 sentences, 1 words, 0 OOVs
    0 zeroprobs, logprob= -3.60386 ppl= 63.3766 ppl1= 4016.59

    file testtext2: 2 sentences, 3 words, 0 OOVs
    0 zeroprobs, logprob= -9.7248 ppl= 88.0967 ppl1= 1744.21
--------------------------------

The problem is that for each trigram I call ngram again, and it
reloads the LM (> 1 GB) each time, which makes this very slow.
Is there a faster solution? I also do not need the perplexity output.

I know about the segmentation tool
http://www.speech.sri.com/projects/srilm/manpages/segment.1.html
but it gives results for the whole sequence, which is not my goal.

Kind regards,
Georgi Dzhambazov,

Student Assistant,
NetMedia
________________________________________
From: Andreas Stolcke [stolcke@icsi.berkeley.edu]
Sent: Thursday, 13 October 2011 05:50
To: Dzhambazov, Georgi
Cc: eee@speech.sri.com
Subject: Re: Question about sentence boundary detection paper

Dzhambazov, Georgi wrote:
> Dear A. Stolcke,
> Dear E. Shriberg,
>
> I am interested in your approach to sentence boundary detection.
> I would be very happy if you could find some time to clarify some of
> the steps of your approach for me. I plan to implement them.
>
> Question 1)
> In the paper (1), in paragraph 2.2.1, you say that states are "the
> end-of-sentence status of each word plus any preceding words".
> So, for example, at position 4 of the example sentence the state is
> (<ns> + quick brown fox), and at position 6 the state is
> (<s> + brown fox flies).
> This means a huge state space. Is this right?
>
>  1   2     3    4    5    6   7    8    9   10
> The quick brown fox flies <s> The rabbit is white.

The state space is potentially huge, but just like in standard N-gram
LMs you only consider the histories (= states) actually occurring in
the training data, and handle any new histories through backoff.
Furthermore, the state space is constrained to those states that match
the N-grams in the word sequence. So for every word position you have
to consider only two states (<s> and no-<s>).

>
> Question 2)
> Transition probabilities are N-gram probabilities. You give an
> example with bigram probabilities in the next line. However, you also
> say that you are using a 4-gram LM. So the correct example should be:
> the probability at position 6 is Pr(<s> | brown fox flies)
> and at position 4 is Pr(<ns> | quick brown fox).
> Is this right?

Correct.

>
> Question 3)
> For recognition, you say that the forward-backward algorithm is used
> to determine the maximal P(T_i | W), where T_i corresponds to <s> or
> <ns> at position i. However, the transition probabilities include
> information about states like (<ns> + quick brown fox).
> How do you apply the transition probabilities in this model? Does it
> relate to the formula in section 4 of (2)?
> I think this formula can work for the forward-backward algorithm,
> although section 4 states that it is used for Viterbi.

For finding the most probable T_i you in fact use the Viterbi
algorithm.

The formulas in section 4 just give one step in the forward
computation that would be used in the Viterbi algorithm.
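
To make the recursion concrete, here is a minimal two-state Viterbi
sketch (plain Python, not SRILM code; lm_logprob and prosody_logprob
are hypothetical stand-ins for the hidden-event N-gram model and the
prosodic model):

    TAGS = ("<s>", "<ns>")  # boundary / no boundary before a word

    def viterbi(n_words, lm_logprob, prosody_logprob):
        """Most probable tag sequence T_1..T_n for an n-word sequence.

        lm_logprob(i, prev_tag, tag): log P from the hidden-event N-gram
        model for placing `tag` before word i (prev_tag is None at i=0).
        prosody_logprob(i, tag): log score from the prosodic model.
        """
        # delta[t]: best log score of any tag sequence ending in t
        delta = {t: lm_logprob(0, None, t) + prosody_logprob(0, t)
                 for t in TAGS}
        backptr = []
        for i in range(1, n_words):
            new_delta, bp = {}, {}
            for t in TAGS:
                best = max(TAGS, key=lambda p: delta[p] + lm_logprob(i, p, t))
                bp[t] = best
                new_delta[t] = (delta[best] + lm_logprob(i, best, t)
                                + prosody_logprob(i, t))
            delta = new_delta
            backptr.append(bp)
        # backtrace from the best final state
        tag = max(TAGS, key=delta.get)
        tags = [tag]
        for bp in reversed(backptr):
            tag = bp[tag]
            tags.append(tag)
        tags.reverse()
        return tags

Replacing the max over predecessors with a sum (plus a matching
backward pass) turns the same recursion into the forward-backward
computation of the posteriors P(T_i | W).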

Please note that this is all implemented in the "segment" tool that
comes with SRILM. See
http://www.speech.sri.com/projects/srilm/manpages/segment.1.html and
http://www.speech.sri.com/projects/srilm/ for more information on
SRILM.

Andreas

>
> References:
>
> 1) Shriberg et al., 2000. Prosody-based automatic segmentation of
> speech into sentences and topics.
> 2) Stolcke and Shriberg, 1996. Automatic linguistic segmentation of
> conversational speech.
>
> Thank you!
>
> Kind regards,
> Georgi Dzhambazov,
>
> Student Assistant,
> NetMedia