From stolcke at speech.sri.com  Tue Apr  1 11:17:30 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Tue, 01 Apr 2003 11:17:30 PST
Subject: Question about Lattice Tool
In-Reply-To: Your message of Mon, 31 Mar 2003 18:50:31 +0200.
             <3E8871D7.89295212@itc.it>
Message-ID: <200304011917.LAA13991@huge>

In message <3E8871D7.89295212 at itc.it> you wrote:
> Dear Dr. Andreas Stolcke and users of SRILM,
>
> I have implemented our Lattice Tool based mainly on the SRILM toolkit (Lattice
> Tool); our lattice definition has time information.
> My problem now is how to measure the word error rate of the generated lattice
> file against the corresponding real utterance. That means I need to compute
> the lower bound on the word error rate.
> Could you please point me to some previous work on this problem?
>
> In the SRILM Lattice Tool there is a function, namely:
>     unsigned latticeWER(const VocabIndex *words,
>                         unsigned &sub, unsigned &ins, unsigned &del)
>     { SubVocab ignoreWords(vocab);
>       return latticeWER(words, sub, ins, del, ignoreWords);
>     };
> which seems to solve my problem. Unfortunately, I could not understand how
> this function works.
> I am looking forward to hearing from you.
> Thank you in advance.
>
> Vu Hai Quan.

Vu,

The lattice word error rate is computed by dynamic programming, very similar
to the standard word error computation on strings. You keep a table that
holds the minimum error from the beginning of the string and lattice to each
pair of string position and lattice node.

For lack of time I cannot describe the algorithm in more detail, but maybe
someone on the list can help with that. It's really quite straightforward
if you are familiar with string alignment. If you're not, then do a
Google search on "string alignment algorithm" and you will find a dozen or
so references that should help you get going.

Incidentally, even if your lattices have time information it should be
irrelevant to the lattice error computation,
so you could convert them to PFSG format and still use the SRILM tool.

--Andreas

PS. Please only send email related to the mailing list itself to
majordomo at speech.sri.com. Once you've signed up, send all email to
srilm-user at speech.sri.com.

From stolcke at speech.sri.com  Sun Apr  6 14:25:19 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sun, 06 Apr 2003 14:25:19 PDT
Subject: can SRILM be used for machine translation?
Message-ID: <200304062125.OAA21692@huge>

Dear Chintan,

I have no experience with MT applications using SRILM or any other
LM toolkit. I am forwarding your email to the srilm-user list
in hopes that someone there can help. However, I wouldn't expect anyone
to have the time to give you detailed guidance -- maybe some useful pointers.

By the way, you cannot send email to srilm-user at speech.sri.com without
subscribing to the list first. Send a message to majordomo at speech.sri.com
with "help" in the message body for more information.

--Andreas

------- Forwarded Message

From: "chintan shah"
To: srilm-user at speech.sri.com
Subject: can SRILM be used for machine translation?
Date: Sun, 06 Apr 2003 08:45:15 +0000

Respected Sir,

My name is Chintan. I am a final-year undergraduate student and would like
to know whether the SRI language modeling toolkit can be used for machine
translation, and how to manage a corpus for that.
We do not have any budget, so if you could provide us with resources it
would be a great pleasure.

I do have the textbook by Jurafsky and Martin. So sir, if you could guide
us about which chapters to read first, it would be a great favour.

Yours Thankfully,
Chintan.

------- End of Forwarded Message

From melis at cs.utwente.nl  Thu Apr 10 01:56:31 2003
From: melis at cs.utwente.nl (Paul Melis)
Date: Thu, 10 Apr 2003 10:56:31 +0200
Subject: Vocabularies when interpolation
Message-ID: <20030410105631.B11073@luistervink.cs.utwente.nl>

Hello Andreas,

When performing interpolation with

  ngram -lm .. -mix-lm .. -lambda ...

the vocabularies of the LM's being mixed get merged if I understand it
correctly (from doing some test runs). Is there a way to force the resulting
output LM to have a predefined vocabulary (e.g. the vocab of one of the LM's
being mixed)?

Regards,
Paul

-- 
melis at cs.utwente.nl

From stolcke at speech.sri.com  Thu Apr 10 09:15:07 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Thu, 10 Apr 2003 09:15:07 PDT
Subject: Vocabularies when interpolation
In-Reply-To: Your message of Thu, 10 Apr 2003 10:56:31 +0200.
             <20030410105631.B11073@luistervink.cs.utwente.nl>
Message-ID: <200304101615.JAA27187@huge>

In message <20030410105631.B11073 at luistervink.cs.utwente.nl> you wrote:
> Hello Andreas,
>
> When performing interpolation with
>
>   ngram -lm .. -mix-lm .. -lambda ...
>
> the vocabularies of the LM's being mixed get merged if I understand it
> correctly (from doing some test runs). Is there a way to force the resulting
> output LM to have a predefined vocabulary (e.g. the vocab of one of the LM's
> being mixed)?

No, but you can limit the LM vocabulary either before or after the merging.
The proper way to do this is to specify the same vocabulary when building
the various LM components. If that is not possible (e.g., you got the LMs
from someone else) you can modify the LM vocabulary post-training using the
"change-lm-vocab" script. Check the "lm-scripts" man page.

--Andreas

From ejoy at peoplemail.com.cn  Thu Apr 17 06:23:32 2003
From: ejoy at peoplemail.com.cn (Zhang Le)
Date: Thu, 17 Apr 2003 21:23:32 +0800
Subject: srilm works on FreeBSD
Message-ID: <20030417132332.GA406@>

Hi all,

I just managed to get srilm 1.3.3 to work on FreeBSD.
I changed the following lines in bin/machine-type to detect FreeBSD:

     set MACHINE_TYPE = cygwin
+else if (`uname -s` =~ FreeBSD*) then
+    set MACHINE_TYPE = freebsd
 else if (`uname -s` == Darwin) then
     set MACHINE_TYPE = macosx

I also added a common/Makefile.machine.freebsd, modified from the cygwin
configuration file (see attachment).

"gmake World" now works fine under FreeBSD.

Here is uname -a:
FreeBSD 4.8-RELEASE FreeBSD 4.8-RELEASE #0: Sat Apr 12 22:18:07 CST 2003 zl@:/usr/src/sys/compile/MYKERNEL i386

I also tested it on FreeBSD 5.0-RELEASE.

-- 
Sincerely yours,
Zhang Le
-------------- next part --------------
#
#    File:   Makefile.i686
#    Author: The SRI DECIPHER (TM) System
#    Date:   Fri Feb 19 22:45:31 PST 1999
#
#    Description:
#        Machine dependent compilation options and variable definitions
#        for CYGWIN/i686 platform
#
#    Copyright (c) 1999-2002 SRI International.  All Rights Reserved.
#
#    $Header: /home/srilm/devel/common/RCS/Makefile.machine.cygwin,v 1.4 2003/02/27 18:25:11 stolcke Exp $
#

   # Use the GNU C compiler.
   GCC_FLAGS = -Wreturn-type -Wimplicit
   CC = gcc $(GCC_FLAGS)
   CXX = g++ -Wno-deprecated $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES

   # Optional compilation flags.
   OPTIMIZE_FLAGS = -g -O2
   DEBUG_FLAGS = -g -DDEBUG
   PROFILE_FLAGS = -g -pg -O2

   # Optional linking flags.
   EXPORT_LDFLAGS = -s

   # Shared compilation flags.
   CFLAGS = $(ADDITIONAL_CFLAGS) $(INCLUDES)
   CXXFLAGS = $(ADDITIONAL_CXXFLAGS) $(INCLUDES)

   # Shared linking flags.
   LDFLAGS = $(ADDITIONAL_LDFLAGS) -L$(SRILM_LIBDIR)

   # Other useful compilation flags.
   ADDITIONAL_CFLAGS =
   ADDITIONAL_CXXFLAGS =

   # Other useful include directories.
   ADDITIONAL_INCLUDES =

   # Other useful linking flags.
   ADDITIONAL_LDFLAGS =

   # Other useful libraries.
   ADDITIONAL_LIBRARIES = -lm

   # run-time linker path flag
   RLD_FLAG = -R

   # Tcl support (part of cygwin)
   TCL_INCLUDE = -I/usr/local/include/tcl8.3
   TCL_LIBRARY = -L/usr/local/lib -ltcl83

   # No ranlib
   RANLIB = :

   # Generate dependencies from source files.
   GEN_DEP = $(CC) $(CFLAGS) -MM

   GEN_DEP.cc = $(CXX) $(CXXFLAGS) -MM

   # Run lint.
   LINT = lint
   LINT_FLAGS = -DDEBUG $(CFLAGS)

   # Location of gawk binary
   GAWK = /usr/bin/gawk

From julyjune03 at yahoo.com  Fri Apr 18 21:54:23 2003
From: julyjune03 at yahoo.com (June July)
Date: Fri, 18 Apr 2003 21:54:23 -0700 (PDT)
Subject: help with ngram-count
Message-ID: <20030419045423.33794.qmail@web41604.mail.yahoo.com>

I encountered the following problem reported from ngram-count:

  BOW denominator for context "D SMALL" is 0 <= 0,numerator is 0.0909091

The switches I invoked are:

  zcat EN.count.1.gz EN.count.2.gz EN.count.3.gz | perl -pe 's///g' | \
    ./bin/ngram-count -memuse -read - -vocab ML.vocab -order 3 \
      -cdiscount3 0 -cdiscount2 0 -cdiscount1 0 -unk -lm - | \
    ./bin/add-dummy-bows - | perl -pe 's///g' | gzip >! EN.arpabo.3.gz

Could someone help me to get rid of that warning message?

Thanks,

June

From stolcke at speech.sri.com  Sat Apr 19 10:13:13 2003
From: stolcke at speech.sri.com (Andreas Stolcke)
Date: Sat, 19 Apr 2003 10:13:13 PDT
Subject: help with ngram-count
In-Reply-To: Your message of Fri, 18 Apr 2003 21:54:23 -0700.
             <20030419045423.33794.qmail@web41604.mail.yahoo.com>
Message-ID: <200304191713.KAA24463@huge>

For N-gram backoff you distribute the probability mass left over by the
N-grams of order k in proportion to the probabilities given by the N-grams
of order k-1. What the error message is saying is that the (k-1)-grams don't
assign any probability to the words that don't already have k-grams.
This can happen especially when you disable smoothing, as you did.

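To make the explanation above concrete, here is a minimal Python sketch of the backoff weight (BOW) for a single context. It is not SRILM's code: the function name `backoff_weight`, the `eps` tolerance and all probabilities are made up for illustration. The numbers in the first call are chosen only so that the numerator matches the 0.0909091 in the warning.

```python
def backoff_weight(higher, lower, eps=1e-9):
    """Return (numerator, denominator, bow) for one context.

    higher: P(w | full context) for the words that have an explicit k-gram.
    lower:  P(w | shortened context) taken from the (k-1)-gram model.
    bow is None when the denominator is numerically zero, i.e. the
    lower-order model has no mass left for words without a k-gram.
    """
    # Mass not used up by the explicit k-grams in this context.
    numerator = 1.0 - sum(higher.values())
    # Lower-order mass of the words that are NOT covered by a k-gram.
    denominator = 1.0 - sum(lower[w] for w in higher)
    if denominator <= eps:
        return numerator, denominator, None
    return numerator, denominator, numerator / denominator


# Hypothetical problem case: the context was seen 11 times, but only the
# k-gram for "A" (count 10) was kept and no discounting was applied, while
# the lower-order model puts all of its mass on "A".
print(backoff_weight({"A": 10 / 11}, {"A": 1.0}))

# Hypothetical healthy case: the lower-order model still has mass for "c".
print(backoff_weight({"a": 0.5, "b": 0.2}, {"a": 0.4, "b": 0.1, "c": 0.5}))
```

In the first call there is leftover mass to hand out (numerator above zero) but nothing to hand it out in proportion to (denominator zero), which is the situation the warning reports.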
The problem should go away if you include all trigrams from your training
data. The default minimum count for trigrams is 2, so you need to use

  -gt3min 1

in addition to the options you have.

--Andreas

In message <20030419045423.33794.qmail at web41604.mail.yahoo.com> you wrote:
>
> I encountered the following problem reported from ngram-count: BOW denominator
> for context "D SMALL" is 0 <= 0,numerator is 0.0909091 The switches I invoked
> is: zcat EN.count.1.gz EN.count.2.gz EN.count.3.gz | perl -pe 's// k>/g' |
> ./bin/ngram-count -memuse -read - -vocab ML.vocab -order 3 -cdiscount3 0
> -cdiscount2 0 -cdiscount1 0 -unk -lm - | ./bin/add-dummy-bows - | perl
> -pe 's///g' | gzip >! EN.arpabo.3.gz Could someone help me to get
> rid of that warning msg? Thanks, June
>

From dpico at dsic.upv.es  Mon Jun 16 09:14:42 2003
From: dpico at dsic.upv.es (=?ISO-8859-1?Q?David_Pic=F3_Vila?=)
Date: Mon, 16 Jun 2003 18:14:42 +0200
Subject: Problems compiling srilm
Message-ID: <3EEDECF2.7060906@dsic.upv.es>

Hello!

I hope this is the right forum to ask this and that I am not disturbing anyone.

I am trying to install SRILM on a SuSE 8.2 Linux platform and I cannot
get ngram and ngram-count compiled! Apparently, all the versions of
compilers, etc. are correct, and also the system variables, etc., but gcc
seems to complain about some incompatible compiler options. Has anyone on
the list already had this problem and knows how to solve it?

Thank you very much in advance for your help!

David

-- 
David Picó Vila
Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València
València, Spain
Email: dpico at dsic.upv.es
Tel: +34 963877007 ext.
73528

From julyjune03 at yahoo.com  Mon Jun 23 10:45:44 2003
From: julyjune03 at yahoo.com (June July)
Date: Mon, 23 Jun 2003 10:45:44 -0700 (PDT)
Subject: class based SRI LM
Message-ID: <20030623174544.79857.qmail@web41601.mail.yahoo.com>

Hi,

I tried to build class-based LMs in the following way:

step-1: ngram-class -text test.in -numclasses 100 -class-counts text.cnt -classes text.cls -save 100

step-2: ngram-count -read text.cnt -memuse -kndiscount -kndiscount1 -kndiscount2 -lm text.srilm.gz

I found that the class count output "text.cnt" from step-1 contains only
bigram counts. Thus the final class LM text.srilm.gz is also a bigram one.

Could anyone tell me if I am using the toolkit correctly? How do I build a
trigram class-based LM? Also, are there any published papers or documents
that I can look up for detailed information?

Many thanks,

-June

From yangl at ecn.purdue.edu  Mon Jun 23 13:59:05 2003
From: yangl at ecn.purdue.edu (Yang Liu)
Date: Mon, 23 Jun 2003 15:59:05 -0500 (EST)
Subject: class based SRI LM
Message-ID: <200306232059.h5NKx5u2018424@sohmm.ecn.purdue.edu>

Hi June,

After you get the automatically induced classes (the class definitions in
file text.cls), you can map all the words in your training set to classes
using:

replace-words-with-classes classes=text.cls training_set > training_set_classes

Then you can build a class-based LM of any order from that.

Hope this helps.

-- Yang

>Hi,
>
> I tried to build class based LMs in the following way:
>
> step-1: ngram-class -text test.in -numclasses 100 -class-counts text.cnt -classes text.cls -save 100
>
> step-2: ngram-count -read text.cnt -memuse -kndiscount -kndiscount1 -kndiscount2 -lm text.srilm.gz
>
> I found that the class count output "text.cnt" from step-1 is only bigram-counts. Thus the final class-LM text.srilm.gz is also a bigram one.
>
> Could anyone tell me if I am using the toolkit correctly? How to build a trigram class-based LM? Also are there any published paper/document that I can look up for detail information?
>
> Many thanks,
>
>-June

From julyjune03 at yahoo.com  Mon Jun 23 14:02:22 2003
From: julyjune03 at yahoo.com (June July)
Date: Mon, 23 Jun 2003 14:02:22 -0700 (PDT)
Subject: class based SRI LM
In-Reply-To: <200306232059.h5NKx5u2018424@sohmm.ecn.purdue.edu>
Message-ID: <20030623210222.31446.qmail@web41609.mail.yahoo.com>

Thanks a lot!

Yang Liu wrote:
> Hi June,
>
> After you get the automatically induced classes (the class definitions in
> file text.cls), you can map all the words in your training set to classes
> using:
>
> replace-words-with-classes classes=text.cls training_set > training_set_classes
>
> Then you can build a class-based LM of any order from that.
>
> Hope this helps.
>
> -- Yang
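
The word-to-class mapping step from Yang's reply can be pictured with a small Python sketch. It is only an illustration of the idea, not the replace-words-with-classes script itself. It assumes each line of the class file has the layout "CLASS [probability] WORD" with a single member word; the helper names and the sample data are hypothetical.

```python
from collections import Counter


def load_classes(lines):
    """Map each member word to its class name.

    Assumed line layout (an assumption, see above): CLASS [probability] WORD
    """
    word_to_class = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            word_to_class[fields[-1]] = fields[0]
    return word_to_class


def to_classes(sentence, word_to_class):
    """Replace every word that belongs to a class by its class name."""
    return [word_to_class.get(w, w) for w in sentence.split()]


def ngram_counts(tokens, order):
    """Count all n-grams of the given order in one token sequence."""
    return Counter(tuple(tokens[i:i + order])
                   for i in range(len(tokens) - order + 1))


# Made-up class definitions and training sentence.
classes = load_classes(["CLASS-1 0.5 cat", "CLASS-1 0.5 dog", "CLASS-2 1 runs"])
tokens = to_classes("the cat runs", classes)
print(tokens)
print(ngram_counts(tokens, 3))
```

Once the training text consists of class labels, counts of any order (trigrams here) can be collected from it, which is why the mapped text is not limited to the bigram counts written by ngram-class.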