From stolcke at speech.sri.com Sat Jul 8 20:50:35 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 08 Jul 2006 20:50:35 -0700
Subject: [SRILM]: -debug 2 info
In-Reply-To: <20060531104152.63656.qmail@web86903.mail.ukl.yahoo.com>
References: <20060531104152.63656.qmail@web86903.mail.ukl.yahoo.com>
Message-ID: <44B07D0B.80204@speech.sri.com>

ilya oparin wrote:
> Hi!
>
> When I calculate the perplexity of my POS-based class model (a word can
> belong to many classes; I create the class-definition file myself on the
> basis of POS-tagged data), with "-debug 2" I get output I cannot
> fully understand. For testing purposes I measure ppl on the same data
> I trained the class model on (i.e., there should not be any OOVs). However,
> in the debug output, for every N-gram there is a string of the format
> P(w| w...) = [OOV][n-gram][n-gram]...[OOV][n-gram][n-gram]...
> As far as I understand, the [n-gram]s refer to different combinations of
> assigning words to classes. But why do those [OOV]s appear (and
> they appear at equal intervals between strings of [n-gram]s for each
> word)?

The stuff in brackets refers to ngram lookups for various class
memberships. The first bracket refers to the ngram lookup where no class
membership is involved, i.e., the word itself is used in the last ngram
position (remember that in SRILM, class-based LMs may contain both word
and class ngrams in the same model). So OOV here just means that no
ngram containing the word directly is found.

--Andreas

>
> I have only one guess: since [OOV]s are only missing for the last
> (| ...) n-gram, those [OOV]s may correspond to a check of whether a word
> is present in the implicit stop-word vocabulary or something...
>
> It would be great if anybody could comment on that.
>
>
> best regards,
> Ilya
>

From stolcke at speech.sri.com Sat Jul 8 22:48:17 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 08 Jul 2006 22:48:17 -0700
Subject: Lattice-Tool: problems with pruning
In-Reply-To: <447C63A9.4010602@itc.it>
References: <447C63A9.4010602@itc.it>
Message-ID: <44B098A1.6080106@speech.sri.com>

Nicola Bertoldi wrote:
> While pruning a lattice wrt posterior probs
> with this command:
>
> lattice-tool -in-lattice lattice -read-htk -out-lattice - -write-htk
> -posterior-prune 1.0e-1
>
> I got this error
>
> Lattice::computeForwardBackward: warning: called with unreachable nodes
>
>
> If I decrease the pruning threshold this error disappears.
>
> Who can help me?

If you prune too much, you remove so many nodes that some paths
to/from the initial/final nodes are cut off, and some nodes become
unreachable.

--Andreas

>
> best regards
> Nicola
>
>

From marthayifiru at yahoo.com Tue Jul 25 07:36:49 2006
From: marthayifiru at yahoo.com (Martha Yifiru)
Date: Tue, 25 Jul 2006 07:36:49 -0700 (PDT)
Subject: Problem with SRILM toolkit installation
Message-ID: <20060725143649.63432.qmail@web36802.mail.mud.yahoo.com>

I am a new SRILM user. I tried to install the toolkit
on a Unix platform with the help of a system
administrator.

But when I test the compilation, the files in the output
and reference directories do not match. Even the
numbers of files are not equal.

Should I reinstall the toolkit, repeating all the steps?

Waiting for you, I remain.
Martha.

_______________________________________
Address:

Martha Yifiru Tachbelie
Sedanstrasse 24
20146 Hamburg
Germany
Tel. +49 40 52721540

From anand at speech.sri.com Tue Jul 25 08:36:08 2006
From: anand at speech.sri.com (Anand Venkataraman)
Date: Tue, 25 Jul 2006 08:36:08 -0700
Subject: Problem with SRILM toolkit installation
In-Reply-To: <20060725143649.63432.qmail@web36802.mail.mud.yahoo.com>
References: <20060725143649.63432.qmail@web36802.mail.mud.yahoo.com>
Message-ID: <44C63A68.10400@speech.sri.com>

Hi Martha

Which platform and which architecture did you compile your toolkit for?
If the compile procedure worked without errors, you should now have
all the executables you need. It would help if you could list the
specific files that you think are missing or superfluous.

&

Martha Yifiru wrote:
> I am a new SRILM user. I tried to install the toolkit
> on a Unix platform with the help of a system
> administrator.
>
> But when I test the compilation, the files in the output
> and reference directories do not match. Even the
> numbers of files are not equal.
>
> Should I reinstall the toolkit, repeating all the steps?
>
> Waiting for you, I remain.
> Martha.
>
>
>
> _______________________________________
> Address:
>
> Martha Yifiru Tachbelie
> Sedanstrasse 24
> 20146 Hamburg
> Germany
> Tel. +49 40 52721540
>

From bertoldi at itc.it Fri Jul 28 16:00:40 2006
From: bertoldi at itc.it (Nicola Bertoldi)
Date: Sat, 29 Jul 2006 01:00:40 +0200
Subject: Lattice WER and oracle
Message-ID: <9749FF5BC5DC1C48A2D7EEB08AB1CF611CCF0C@ntmail.pc.itc.it>

dear Andreas,
I would like to ask whether there is a parameter of lattice-tool (1.50beta)
to get the path which gives the minimum WER (and not only this score).

If not, do you think it would be difficult to add this feature?

Nicola

From stolcke at speech.sri.com Fri Jul 28 16:21:12 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 28 Jul 2006 16:21:12 PDT
Subject: Lattice WER and oracle
In-Reply-To: Your message of Sat, 29 Jul 2006 01:00:40 +0200.
 <9749FF5BC5DC1C48A2D7EEB08AB1CF611CCF0C@ntmail.pc.itc.it>
Message-ID: <200607282321.QAA26098@tonga>

In message <9749FF5BC5DC1C48A2D7EEB08AB1CF611CCF0C at ntmail.pc.itc.it>you wrote:
> dear Andreas,
> I would like to ask whether there is a parameter of lattice-tool (1.50beta)
> to get the path which gives the minimum WER (and not only this score).
>
> If not, do you think it would be difficult to add this feature?

If you just need the word sequence on the path with the lowest word
error, that's straightforward to output. The node/link sequence is not
straightforward because the internal representation of the lattice is not
1-to-1 with what you have in the file.

--Andreas

From stolcke at speech.sri.com Wed Aug 2 10:34:31 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 02 Aug 2006 10:34:31 -0700
Subject: Hint for compiling SRILM with cygwin
Message-ID: <44D0E227.1060608@speech.sri.com>

I have not had this problem myself, but I'm passing it on just in case.

--Andreas

-------------- next part --------------
An embedded message was scrubbed...
From: Ronny Melz
Subject: srilm 1.5.0 did not compile / Unknown
Date: Wed, 02 Aug 2006 17:55:47 +0200
Size: 2986
URL:

From cuong at idiap.ch Thu Aug 3 05:06:37 2006
From: cuong at idiap.ch (Cuong Huy To)
Date: Thu, 03 Aug 2006 14:06:37 +0200
Subject: Ask about the practical usage of SRILM for Machine Translation
Message-ID: <44D1E6CD.8070606@idiap.ch>

Hi everyone

This question is for SRILM - 1.4.1

I am working on Statistical Machine Translation. Basically, the problem is
to find the best sentence e (English) given the input sentence f (foreign):

e = argmax p(e|f) = argmax p(f|e).p(e).
Here p(f|e) is the translation model (including the lexicon and
alignment models).

What I am concerned about is p(e), the language model.

My corpus is EuroParl (European Parliament Sessions). For now I'm working
with 512,000 sentences, 10,228,002 words, which yield 54182 unigrams,
1044600 bigrams, 765141 trigrams .....

My questions are:

1. Which combination of the options currently available with
ngram-count should I use?

2. How many words per parameter should I use?
(Joshua Goodman, in his tutorial
research.microsoft.com/~joshuago/lm-tutorial-v7-handouts.ps,
recommends that the ratio Number of words/Number of parameters be
greater than 100 or 1000.)

3. Normally, an option -X stands for the corresponding options for each
n-gram order (e.g. -interpolate is like -interpolate1 -interpolate2
..... -interpolateN), but why doesn't that work for -kndiscount?

So far, given this training text of 512,000 sentences and a test set
of 2000 sentences, 57951 words, among the LMs with order=7 here is the
best combination I have:

-order 7 -kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4
-kndiscount5 -kndiscount6 -kndiscount7 -interpolate

(Related to the question about -kndiscount: if I use -kndiscount only,
I get the message: "warning: discount coeff 1 is out of range:
5.96382e-17")

Thanks for reading this long email, and thanks to all who might want to
answer this.

Bests
Cuong,

From cuong at idiap.ch Thu Aug 3 05:09:49 2006
From: cuong at idiap.ch (Cuong Huy To)
Date: Thu, 03 Aug 2006 14:09:49 +0200
Subject: [Fwd: Ask about the practical usage of SRILM for Machine Translation]
Message-ID: <44D1E78D.8070503@idiap.ch>

Hi all

I forgot to mention the result for my best options so far:

So far, given this training text of 512,000 sentences and a test set
of 2000 sentences, 57951 words, among the LMs with order=7 here is the
best combination I have:

-order 7 -kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4
-kndiscount5 -kndiscount6 -kndiscount7 -interpolate

logprob=-107526, ppl = 63.4007, ppl1=73.214

Thanks
Cuong

-------------- next part --------------
An embedded message was scrubbed...
From: Cuong Huy To
Subject: Ask about the practical usage of SRILM for Machine Translation
Date: Thu, 03 Aug 2006 14:06:37 +0200
Size: 2104
URL:

From stolcke at speech.sri.com Thu Aug 3 10:34:23 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 03 Aug 2006 10:34:23 PDT
Subject: Ask about the practical usage of SRILM for Machine Translation
In-Reply-To: Your message of Thu, 03 Aug 2006 18:18:48 +0200.
 <44D221E8.4050406@idiap.ch>
Message-ID: <200608031734.k73HYNN18542@huge>

In message <44D221E8.4050406 at idiap.ch>you wrote:
> Andreas Stolcke wrote:
> > In message <44D1E6CD.8070606 at idiap.ch>you wrote:
> >
> >> Hi everyone
> >>
> >> This question is for SRILM - 1.4.1
> >>
> >
> > Before anything else, please get the latest version (1.5.0) and see if
> > it solves your problems.
> >
> > --Andreas
> >
> >
> Thank you Andreas,
> My 3rd question will be checked once I run on 1.5.0, but these 2
> questions are version-independent:
>
> 1. Which is the state-of-the-art combination of the options
> currently available with ngram-count that I should use?

-kndiscount -interpolate

> 2. How many words per parameter should I use?
> (Joshua Goodman, in his
> tutorial research.microsoft.com/~joshuago/lm-tutorial-v7-handouts.ps,
> recommends that the ratio Number of words/Number of parameters be
> greater than 100 or 1000.)

I'm not sure I agree with Josh's rule if it means reducing the size of
the model simply based on the total number of ngrams in it.
By reducing the number of parameters (pruning ngrams from the model,
or having a higher minimum count) you are not improving the estimates
of the parameters that remain. So this is different from other types of
models where there is a set of parameters that is shared among all the data.

If you can afford it you should use all the ngrams in your data in your
model. When in doubt, try different settings on held-out data and
"cross-validate" your choices.

If you are using class-based models then you do share parameters between
different ngrams, and then a rule of the sort Josh suggested makes sense.

--Andreas

From rmadsen at byu.net Thu Aug 3 14:02:46 2006
From: rmadsen at byu.net (Rebecca Madsen)
Date: Thu, 3 Aug 2006 15:02:46 -0600
Subject: error in discount estimator for order 3
Message-ID:

Is there a reason why duplicating my data would give me the following
error:

using ModKneserNey for 3-grams
Kneser-Ney smoothing 3-grams
n1 = 0
n2 = 94762
n3 = 0
n4 = 37773
one of required modified KneserNey count-of-counts is zero
error in discount estimator for order 3

I can build a language model using the following command line with the
normal data, but concatenating two copies of the data together gives me
the discount estimator error.

$ /home/tools/srilm/bin/i686/ngram-count -text my_data_doubled.txt -interpolate -kndiscount1 -kndiscount2 -kndiscount3 -lm my_data_doubled.lm

Thanks for your help,
Rebecca

From stolcke at speech.sri.com Thu Aug 3 23:37:19 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 03 Aug 2006 23:37:19 -0700
Subject: error in discount estimator for order 3
In-Reply-To:
References:
Message-ID: <44D2EB1F.40600@speech.sri.com>

Rebecca Madsen wrote:
> Is there a reason why duplicating my data would give me the following
> error:
>
> using ModKneserNey for 3-grams
> Kneser-Ney smoothing 3-grams
> n1 = 0
> n2 = 94762
> n3 = 0
> n4 = 37773
> one of required modified KneserNey count-of-counts is zero
> error in discount estimator for order 3

If you look at the formulae for KN discounting you see that they lead to
undefined values when n1 = 0. The same is true of GT discounting.
These discounting methods assume that the ngram distribution is
"natural", not manipulated as in your case.

>
> I can build a language model using the following command line with the
> normal data, but concatenating two copies of the data together gives
> me the discount estimator error.

That's completely expected (see above). What are you trying to
accomplish by duplicating your data? Obviously you are not adding any
information by doing so.

--Andreas

From stolcke at speech.sri.com Fri Aug 4 10:13:35 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 04 Aug 2006 10:13:35 PDT
Subject: sorting of n-grams
In-Reply-To: Your message of Fri, 04 Aug 2006 17:51:03 +0200.
 <9749FF5BC5DC1C48A2D7EEB08AB1CF61AEA8AE@ntmail.pc.itc.it>
Message-ID: <200608041713.KAA01281@tonga>

In message <9749FF5BC5DC1C48A2D7EEB08AB1CF61AEA8AE at ntmail.pc.itc.it>you wrote:
> Hi Andreas,
>
> I'm just wondering if there is some special reason
> why the ngrams of an LM are not printed according to
> the ordering given by the 1-grams. In particular,
> the order is not respected from the 3-grams upward.
> The first 3-grams that are printed do not begin
> with the top words in the 1-gram list.

The N-grams are output in the order that corresponds to the internal
data structure. Of course no particular order is required for the
external representation, but this order also happens to be the most
efficient (in terms of hardware caching) when the model is read back in.

If you want to sort the ngrams in an LM file the way some other software
(like Sphinx) seems to require, use the sort-lm script
(see man lm-scripts).

--Andreas

From marthayifiru at yahoo.com Wed Sep 13 03:04:51 2006
From: marthayifiru at yahoo.com (Martha Yifiru)
Date: Wed, 13 Sep 2006 03:04:51 -0700 (PDT)
Subject: class-based language model
In-Reply-To: <200608041713.KAA01281@tonga>
Message-ID: <20060913100451.99858.qmail@web36813.mail.mud.yahoo.com>

Hi all,

Is there a tutorial or introduction on how to develop a class-based
language model (where the classes are induced automatically by
ngram-class)?

I thought that ngram-class is used to induce classes automatically and
that language model training and evaluation then follow using
ngram-count and ngram, respectively. Thus I used the following command:

ngram-class -debug 2 -text textfile.txt -class-counts classcount_file -classes input_to_ngram

But the result is not what I expected. It even gave me perplexity
values.

Would you please give me some ideas on how to develop a class-based
language model?

Waiting for you, I remain.
Martha.

_______________________________________
Address:

Martha Yifiru Tachbelie
Sedanstrasse 24
20146 Hamburg
Germany
Tel. +49 40 52721540

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ioparin at yahoo.co.uk Wed Sep 13 04:49:02 2006
From: ioparin at yahoo.co.uk (ilya oparin)
Date: Wed, 13 Sep 2006 12:49:02 +0100 (BST)
Subject: class-based language model
In-Reply-To: <20060913100451.99858.qmail@web36813.mail.mud.yahoo.com>
Message-ID: <20060913114902.87347.qmail@web25412.mail.ukl.yahoo.com>

Hi,

I don't understand exactly what you dislike in the debugging info that
you get with the "-debug 2" option, but I would add the "-numclasses"
option. If it is zero, then class merging is suppressed altogether, as
stated in the ngram-class manual page. Maybe then you will get the
output you expect.

Martha Yifiru wrote:

Hi all,

Is there a tutorial or introduction on how to develop a class-based
language model (where the classes are induced automatically by
ngram-class)?

I thought that ngram-class is used to induce classes automatically and
that language model training and evaluation then follow using
ngram-count and ngram, respectively. Thus I used the following command:

ngram-class -debug 2 -text textfile.txt -class-counts classcount_file -classes input_to_ngram

But the result is not what I expected. It even gave me perplexity
values.

Would you please give me some ideas on how to develop a class-based
language model?

Waiting for you, I remain.
Martha.

_______________________________________
Address:

Martha Yifiru Tachbelie
Sedanstrasse 24
20146 Hamburg
Germany
Tel. +49 40 52721540

best regards,
Ilya

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From marthayifiru at yahoo.com Tue Sep 19 09:00:16 2006
From: marthayifiru at yahoo.com (Martha Yifiru)
Date: Tue, 19 Sep 2006 09:00:16 -0700 (PDT)
Subject: Warning message
Message-ID: <20060919160016.52311.qmail@web36806.mail.mud.yahoo.com>

I am developing language models of different orders (2 to 5) with
Good-Turing discounting and Katz backoff for smoothing. In all cases, I
get the following warning message:

discount coeff 1 is out of range : 6.2135e-17

I could not work out the reason for the warning message. I developed
language models 5 days ago using the same data and smoothing techniques,
but this warning message was not there.

Could you please tell me the reason behind it? Does it affect the
quality of my language models?

Waiting for you, I remain.
Martha

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From stolcke at speech.sri.com Wed Sep 20 13:06:25 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 20 Sep 2006 13:06:25 PDT
Subject: lattice-tool question/reference
In-Reply-To: Your message of Wed, 20 Sep 2006 11:53:29 -0400.
 <176e9d420609200853r49e22cavf0fc1e49333c08ca@mail.gmail.com>
Message-ID: <200609202006.k8KK6Pg10171@huge>

>
> Andreas,
> I'm trying to understand what exactly it means to "compute posterior
> expected n-gram counts" using lattice-tool with the -write-ngrams option.
>
> Would you kindly point me to a reference where I can read/learn about what
> this flag is doing?

posterior_expected_n-gram_count(X) =
    sum over all paths P through lattice {
        posterior_probability(P) * number_of_occurrences_of(X in P)
    }

where

posterior_probability(Q) =
    exp(sum_of_all_scores_on(Q)) /
    sum over all paths P { exp(sum_of_all_scores_on(P)) }

exp(.) is the exponential (anti-log) function, assuming your scores are
logarithmic.

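[Archive editor's note: the definition above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions,
not how lattice-tool is implemented: the lattice is assumed to be given
as an explicit list of paths, each with the sum of its log scores, which
is only feasible for toy examples (a real tool would presumably use a
forward-backward pass instead of enumerating paths). The function name
and the toy data are made up for this note.

```python
import math
from collections import Counter


def posterior_expected_ngram_counts(paths, order):
    """Posterior expected n-gram counts from an explicit list of paths.

    paths: list of (words, log_score) pairs, one per lattice path, where
           log_score is the sum of all (log) scores on that path.
    order: n-gram length.
    """
    # Normalizer: log of the sum over all paths of exp(score),
    # computed with the usual max-shift for numerical stability.
    max_score = max(score for _, score in paths)
    log_z = max_score + math.log(
        sum(math.exp(score - max_score) for _, score in paths)
    )

    expected = Counter()
    for words, score in paths:
        posterior = math.exp(score - log_z)
        # Each occurrence of an n-gram on this path adds the path posterior.
        for i in range(len(words) - order + 1):
            expected[tuple(words[i:i + order])] += posterior
    return expected


# Toy lattice with three paths (hypothetical data).
paths = [
    (["a", "b", "c"], math.log(0.6)),
    (["a", "b", "b"], math.log(0.3)),
    (["a", "c"], math.log(0.1)),
]
counts = posterior_expected_ngram_counts(paths, order=2)
for ngram, value in sorted(counts.items()):
    print(" ".join(ngram), round(value, 3))
```

In the toy data the bigram "a b" occurs once on each of the first two
paths, so its expected count is the sum of their posteriors.]
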
It's a generalized form of counting ngram frequencies in lattices, where
the ngrams are weighted by the probabilities of the paths they occur on.

--Andreas

From stolcke at speech.sri.com Wed Sep 20 21:49:43 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Wed, 20 Sep 2006 21:49:43 PDT
Subject: Warning message
In-Reply-To: Your message of Tue, 19 Sep 2006 09:00:16 -0700.
 <20060919160016.52311.qmail@web36806.mail.mud.yahoo.com>
Message-ID: <200609210449.VAA23987@tonga>

In message <20060919160016.52311.qmail at web36806.mail.mud.yahoo.com>you wrote:
> I am developing language models of different orders (2 to 5) with
> Good-Turing discounting and Katz backoff for smoothing. In all cases, I
> get the following warning message:
> discount coeff 1 is out of range : 6.2135e-17
>
> I could not work out the reason for the warning message. I developed
> language models 5 days ago using the same data and smoothing techniques,
> but this warning message was not there.

Something must have changed. What was it? Has the software been
updated?

>
> Could you please tell me the reason behind it? Does it affect the
> quality of my language models?

The warning is issued because discount coefficients (the factors by which
the maximum likelihood estimates are reduced) should be between 0 and 1.
The value you are getting is effectively zero. It indicates an anomaly
(non-smoothness) in the count-of-counts of your data.

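[Archive editor's note: a small sketch of how a dip in the
count-of-counts can push a discount coefficient out of range. It uses
the textbook Katz/Good-Turing discount formula; whether SRILM computes
the coefficient in exactly this way, and which range check it applies,
are assumptions here. The function names and the count-of-counts values
are invented for illustration.

```python
def katz_good_turing_discounts(count_of_counts, gtmax):
    """Textbook Katz discounts d_r for r = 1..gtmax.

    count_of_counts[r] is the number of distinct ngrams seen exactly
    r times. With r* = (r + 1) * n[r + 1] / n[r] and
    c = (gtmax + 1) * n[gtmax + 1] / n[1], the discount is
    d_r = (r* / r - c) / (1 - c).
    """
    n = count_of_counts
    n1 = n.get(1, 0)
    if n1 == 0:
        raise ValueError("n1 is zero: discounts are undefined")
    common = (gtmax + 1) * n.get(gtmax + 1, 0) / n1
    if common == 1.0:
        raise ValueError("degenerate count-of-counts: division by zero")

    discounts = {}
    for r in range(1, gtmax + 1):
        nr = n.get(r, 0)
        if nr == 0:
            raise ValueError(f"n{r} is zero: discounts are undefined")
        r_star = (r + 1) * n.get(r + 1, 0) / nr
        discounts[r] = (r_star / r - common) / (1.0 - common)
    return discounts


def out_of_range(discounts):
    """Counts r whose discount is not in (0, 1] (assumed validity range)."""
    return [r for r, d in sorted(discounts.items()) if not 0.0 < d <= 1.0]


# Smoothly decaying count-of-counts: every discount is usable.
smooth = katz_good_turing_discounts({1: 1000, 2: 400, 3: 200, 4: 120}, gtmax=3)

# Same data with an unnaturally low n2: the discount for r = 1 collapses.
anomalous = katz_good_turing_discounts({1: 1000, 2: 240, 3: 200, 4: 120}, gtmax=3)

print(out_of_range(smooth), out_of_range(anomalous))
```

In the second data set the low n2 drives the coefficient for r = 1 to
zero, which is the kind of non-smoothness the warning points at.]
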
--Andreas

From lakshmi at lantana.tenet.res.in Fri Sep 29 02:14:33 2006
From: lakshmi at lantana.tenet.res.in (Lakshmi A)
Date: Fri, 29 Sep 2006 14:44:33 +0530 (IST)
Subject: query regarding usage of SRILM toolkit
Message-ID:

Greetings!!!

We are developing a syllable-based isolated-style continuous speech
recognizer for Indian languages. Currently, our recognizer output is just
a sequence of syllables. We want to extract the sequence of words from
this syllable sequence using statistical language models and a lexicon.
I thought maybe one of the programs in this toolkit does something
similar (sub-word sequence to word sequence conversion). But all the
programs seem to use word lattices.

Is there any program in this toolkit that extracts the word sequence from
the sub-word sequence using an LM and a lexicon?

Thanks in Advance.
Regards,
Lakshmi

From stolcke at speech.sri.com Fri Sep 29 09:04:06 2006
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Fri, 29 Sep 2006 09:04:06 PDT
Subject: query regarding usage of SRILM toolkit
In-Reply-To: Your message of Fri, 29 Sep 2006 14:44:33 +0530.
Message-ID: <200609291604.JAA05487@tonga>

In message you wrote:
>
> Greetings!!!
>
> We are developing a syllable-based isolated-style continuous speech
> recognizer for Indian languages. Currently, our recognizer output is just
> a sequence of syllables. We want to extract the sequence of words from
> this syllable sequence using statistical language models and a lexicon.
> I thought maybe one of the programs in this toolkit does something
> similar (sub-word sequence to word sequence conversion). But all the
> programs seem to use word lattices.
>
> Is there any program in this toolkit that extracts the word sequence from
> the sub-word sequence using an LM and a lexicon?

Lakshmi,

first you have to remember that when the documentation of a program says
'words' it doesn't mean you have to use words in the conventional sense.
You can use any kind of token (phones, syllables, etc.) in your lattices
etc.

The task you describe sounds like a boundary tagging problem, i.e.,
given a sequence of tokens, you want to label each transition between
tokens as either a "boundary" or a "non-boundary".
There are two tools in SRILM that can do this, using different kinds of
models. One is "hidden-ngram", which performs boundary tagging
explicitly. The other is "disambig", which tags the tokens themselves,
not the boundaries between them. But by assigning tags that denote
"first token in a unit", "token inside a unit", etc., you can perform
boundary tagging implicitly. (The tokens in your case are the syllables;
the units would be the words.)

Both tools use ngram language models to disambiguate the input. The
model can be trained from syllabified training data, in your case.

I suggest you look up papers on "word segmentation", "sentence
segmentation", "Mandarin tokenization", "chunk parsing" and "shallow
parsing" to get a good idea of the existing models for this type of
task, then study the manual pages for the programs.

--Andreas

>
> Thanks in Advance.
> Regards,
> Lakshmi
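
[Archive editor's note: the implicit boundary tagging described in the
reply above can be illustrated with a small Python sketch. It only shows
the data representation, i.e. turning word-segmented syllable data into
syllable/tag pairs and turning a tagged syllable sequence back into
words. It does not call hidden-ngram or disambig, and it makes no claim
about their input formats. The tag names "B" (first syllable of a word)
and "I" (syllable inside a word), the function names, and the example
syllables are all assumptions made for this note.

```python
def words_to_tagged_syllables(words):
    """Turn a list of words (each a list of syllables) into (syllable, tag) pairs.

    The first syllable of each word gets tag "B", every later one gets "I".
    """
    tagged = []
    for syllables in words:
        for position, syllable in enumerate(syllables):
            tagged.append((syllable, "B" if position == 0 else "I"))
    return tagged


def tagged_syllables_to_words(tagged):
    """Rebuild words from (syllable, tag) pairs: every "B" starts a new word."""
    words = []
    for syllable, tag in tagged:
        if tag == "B" or not words:
            # A sequence that starts with "I" is treated as starting a word.
            words.append([syllable])
        else:
            words[-1].append(syllable)
    return words


# Hypothetical syllabified training sentence: three words, six syllables.
training_words = [["lan", "guage"], ["mo", "del", "ing"], ["tool"]]

tagged = words_to_tagged_syllables(training_words)
print(tagged)
print(tagged_syllables_to_words(tagged))
```

Training data prepared this way gives a tagger the syllable/tag pairs to
learn from, and the second function recovers the word sequence once tags
have been assigned to recognizer output.]
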