From rmr4848 at gmail.com Tue Jan 3 15:12:11 2012
From: rmr4848 at gmail.com (Ryan Roth)
Date: Tue, 3 Jan 2012 18:12:11 -0500
Subject: [SRILM User List] Odd behavior in disambig and OOV words
Message-ID:

Hello:

For some time now I've been using disambig to perform diacritic disambiguation of Arabic. I create an open-vocabulary LM of diacritized forms from a training corpus, and for the input I use a morphological analysis tool to create, for each input word, a list of possible diacritized forms to use as the V2 mapping for the input form (V1). disambig is then used to select one of the diacritized forms using the LM.

This works well, but recently I noticed a strange behavior. I have a small input file (A) of about 200 lines of text. I run it through the above process, and I get a mapped output file as expected. Then I take input file A and replace two words in the last line with different words (creating input file B). I run B through the same process as A (this results in a very slightly different map file -- but only for the two words that were replaced).

The odd behavior is that, when I compare the output mappings of A and B, not only is the last line different, but over 70 other words in the file (in different sentences) also have different V2 mappings. Doing some checking, I discovered (not too surprisingly) that all the affected words are ones that were not present in the LM, so the effect is related to how disambig handles OOV words. Similar differences occur if I compare the mapped output of two files concatenated together to the concatenation of the two files' mapped outputs (that is, [A+B].out =/= [A.out] + [B.out]).

I need to find a way to make sure disambig handles these words consistently, so that changes in one part of a file do not affect the results in a different part. I'm hoping that there is some option setting in disambig or ngram-count that I've overlooked that will correct the problem, but I currently don't see one.

For reference, I create my LM using the options:

	ngram-count -text training-input-file -lm model-name.lm -order 5 -unk

and I run disambig using the options:

	disambig -keep-unk -text test-file.in -map test-file.map -order 5 -lm model-name.lm > test-file.out

My test-file.map is created without conditional probabilities, and the list of V2 forms is always alphabetized to ensure a consistent ordering. The morphological analyzer which generates the V2 forms is always consistent, and its output does not depend on word context.

Any advice or direction would be appreciated.

Thanks,

Ryan Roth
CCLS
Columbia University
From reza.haffari at gmail.com Wed Jan 4 02:16:03 2012
From: reza.haffari at gmail.com (gholamreza haffari)
Date: Wed, 4 Jan 2012 21:16:03 +1100
Subject: [SRILM User List] Chiang's python wrapper
In-Reply-To:
References:
Message-ID:

Hi there,

I get an error when I try to compile the following python wrapper (by David Chiang):
http://www.isi.edu/~chiang/software/psrilm.tgz

The error is as follows:

/usr/bin/ld: /cs/grad1/ghaffar1/software/srilm/lib/i686-m64/liboolm.a(Vocab.o): relocation R_X86_64_32 against `Vocab::compare(unsigned int, unsigned int)' can not be used when making a shared object; recompile with -fPIC
/cs/grad1/ghaffar1/software/srilm/lib/i686-m64/liboolm.a: could not read symbols: Bad value
collect2: ld returned 1 exit status
error: command 'g++' failed with exit status 1
make: *** [all] Error 1

The srilm version that I use is "1.6.0" and my machine type is "i686-m64".

I appreciate your help.
cheers,
-Reza

From stolcke at icsi.berkeley.edu Wed Jan 4 09:36:27 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 04 Jan 2012 09:36:27 -0800
Subject: [SRILM User List] Chiang's python wrapper
In-Reply-To:
References:
Message-ID: <4F048E1B.8010703@icsi.berkeley.edu>

gholamreza haffari wrote:
> I get an error when I try to compile the following python wrapper (by
> David Chiang): http://www.isi.edu/~chiang/software/psrilm.tgz
> [...]
> relocation R_X86_64_32 against `Vocab::compare(unsigned int, unsigned
> int)' can not be used when making a shared object; recompile with -fPIC

To build SRILM for use in shared libraries, invoke the build with

	make MAKE_PIC=X (other arguments)

or put

	ADDITIONAL_CFLAGS += -fPIC
	ADDITIONAL_CXXFLAGS += -fPIC

in the machine-specific makefile common/Makefile.site.$(MACHINE_TYPE).
(-fPIC is for gcc-based compilers; other compilers have different options to accomplish the same thing.)

Andreas

From stolcke at icsi.berkeley.edu Fri Jan 6 15:12:47 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 06 Jan 2012 15:12:47 -0800
Subject: [SRILM User List] Odd behavior in disambig and OOV words
In-Reply-To:
References: <4F0491D8.7000402@icsi.berkeley.edu> <4F0497E8.7050101@icsi.berkeley.edu>
Message-ID: <4F077FEF.8080906@icsi.berkeley.edu>

The attached patch seems to fix the problem. The problem stems from the pseudo-random ordering of hypotheses that have identical scores in Viterbi/N-best decoding. The patch introduces an additional sorting criterion to make the ordering deterministic.

If you add the option -nbest 10 you can see the alternatives that get the same score, and there are often many.

Andreas

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: srilm.patch.txt
URL:

From nishthajaiswal at cdacnoida.in Sun Jan 15 22:53:33 2012
From: nishthajaiswal at cdacnoida.in (nishthajaiswal at cdacnoida.in)
Date: Mon, 16 Jan 2012 12:23:33 +0530 (IST)
Subject: [SRILM User List] [Fwd: SRILM install problem]
Message-ID: <12892.10.0.0.4.1326696813.squirrel@mail.cdacnoida.in>

------------------------------- Original Message -------------------------------
Subject: SRILM install problem
From: nishthajaiswal at cdacnoida.in
Date: Mon, January 16, 2012 12:19 pm
To: srilm-user at speech.sri.com
--------------------------------------------------------------------------------

Hi,

I am unable to install SRILM on my Fedora 8 system. The following error appears when running make World:

mkdir include lib bin
mkdir: cannot create directory `include': File exists
mkdir: cannot create directory `lib': File exists
mkdir: cannot create directory `bin': File exists
make: [dirs] Error 1 (ignored)
make init
make[1]: Entering directory `/root/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
	(cd $subdir/src; make SRILM=/root/srilm MACHINE_TYPE=i686 OPTION= MAKE_PIC= init) || exit 1; \
done
make[2]: Entering directory `/root/srilm/misc/src'
cd ..; /root/srilm/sbin/make-standard-directories
/bin/sh: /root/srilm/sbin/make-standard-directories: /bin/csh: bad interpreter: No such file or directory
make[2]: [init] Error 126 (ignored)
make ../obj/i686/STAMP ../bin/i686/STAMP
make[3]: Entering directory `/root/srilm/misc/src'
make[3]: `../obj/i686/STAMP' is up to date.
mkdir ../bin/i686/
mkdir: cannot create directory `../bin/i686/': No such file or directory
make[3]: [../bin/i686/STAMP] Error 1 (ignored)
touch ../bin/i686/STAMP
touch: cannot touch `../bin/i686/STAMP': No such file or directory
make[3]: *** [../bin/i686/STAMP] Error 1
make[3]: Leaving directory `/root/srilm/misc/src'
make[2]: *** [init] Error 2
make[2]: Leaving directory `/root/srilm/misc/src'
make[1]: *** [init] Error 1
make[1]: Leaving directory `/root/srilm'
make: *** [World] Error 2

The output of uname -a is:

Linux nishthajaiswal 2.6.21-2950.fc8xen #1 SMP Tue Oct 23 12:24:34 EDT 2007 i686 i686 i386 GNU/Linux

The output of gcc -v is:

Using built-in specs.
Target: i386-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --with-cpu=generic --host=i386-redhat-linux
Thread model: posix
gcc version 4.1.2 20070925 (Red Hat 4.1.2-33)

Please reply ...

Regards,
Nishtha

From stolcke at icsi.berkeley.edu Mon Jan 16 10:11:01 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 16 Jan 2012 10:11:01 -0800
Subject: [SRILM User List] [Fwd: SRILM install problem]
In-Reply-To: <12892.10.0.0.4.1326696813.squirrel@mail.cdacnoida.in>
References: <12892.10.0.0.4.1326696813.squirrel@mail.cdacnoida.in>
Message-ID: <4F146835.4050306@icsi.berkeley.edu>

On 1/15/2012 10:53 PM, nishthajaiswal at cdacnoida.in wrote:
> I am unable to install SRILM on my Fedora 8 system. The following error
> appears when running make World:
> [...]
> cd ..; /root/srilm/sbin/make-standard-directories
> /bin/sh: /root/srilm/sbin/make-standard-directories: /bin/csh: bad interpreter: No such file or directory

From this last message it seems you don't have the C-shell installed. I believe csh (or its variant tcsh) is optional in some Linux distributions.

Note that the most recent beta version of SRILM no longer requires csh, so another solution is to get that.

Andreas

From ryan at hlt.utdallas.edu Mon Jan 16 11:59:31 2012
From: ryan at hlt.utdallas.edu (Ryan Zeigler)
Date: Mon, 16 Jan 2012 13:59:31 -0600
Subject: [SRILM User List] Difficulty Building SRILM with mingw-w64
Message-ID: <4F1481A3.7080205@hlt.utdallas.edu>

Hello SRILM mailing list,

I am having difficulty building SRILM 1.6 using the mingw-w64 toolchain on a Windows 7 x64 machine. To attempt this, I modified the win32 machine-type makefile, replacing the bare g++/gcc invocations with the target-prefixed names installed by mingw-w64 and removing the -mno-cygwin flag. The relevant lines are:

CC_FLAGS = -DNEED_RAND48 -Wall -Wno-unused-variable -Wno-uninitialized
CC = x86_64-w64-mingw32-gcc $(GCC_FLAGS) -Wimplicit-int
CXX = x86_64-w64-mingw32-g++ $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES

When I subsequently attempt to compile, I receive the following errors from matherr.c:

x86_64-w64-mingw32-gcc -DNEED_RAND48 -Wall -Wno-unused-variable -Wno-uninitialized -Wimplicit-int -I.
-I../../include -c -g -O2 -DUSE_SARRAY -DUSE_SARRAY_TRIE -DUSE_SARRAY_MAP2 -o ../obj/x86_64-w64-mingw32_c/matherr.o matherr.c
matherr.c:19:16: warning: 'struct exception' declared inside parameter list
matherr.c:19:16: warning: its scope is only this definition or declaration, which is probably not what you want
matherr.c:19:1: error: conflicting types for '_matherr'
/usr/x86_64-w64-mingw32/sys-root/mingw/include/math.h:179:23: note: previous declaration of '_matherr' was here
matherr.c: In function '_matherr':
matherr.c:22:10: error: dereferencing pointer to incomplete type
matherr.c:22:36: error: dereferencing pointer to incomplete type
matherr.c:30:1: warning: control reaches end of non-void function
/cygdrive/c/srilm/common/Makefile.common.targets:85: recipe for target `../obj/x86_64-w64-mingw32_c/matherr.o' failed

For reference, the declaration of _matherr given there is

	_CRTIMP int __cdecl _matherr (struct _exception *);

I would appreciate any help in resolving this issue.

Regards,
Ryan Zeigler

From stolcke at icsi.berkeley.edu Mon Jan 16 19:17:58 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 16 Jan 2012 19:17:58 -0800
Subject: [SRILM User List] Difficulty Building SRILM with mingw-w64
In-Reply-To: <4F1481A3.7080205@hlt.utdallas.edu>
References: <4F1481A3.7080205@hlt.utdallas.edu>
Message-ID: <4F14E866.5090304@icsi.berkeley.edu>

On 1/16/2012 11:59 AM, Ryan Zeigler wrote:
> I am having difficulty building SRILM 1.6 using the mingw-w64
> toolchain on a Windows 7 x64 machine.
> [...]
> When I subsequently attempt to compile, I receive the following errors
> from matherr.c

This particular error is due to a glitch in the ifdefs. Replace

	#if defined(__MINGW32_VERSION) || defined(_MSC_VER)

with

	#if defined(WIN32) || defined(_MSC_VER)

I took a stab at a mingw-w64 build recently, and there are link-time errors even after everything compiles fine. Let me know how far you get with this!

Andreas
From dmytro.prylipko at ovgu.de Sun Jan 22 11:19:36 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Sun, 22 Jan 2012 20:19:36 +0100
Subject: [SRILM User List] Using hidden events
Message-ID:

Hi,

I would like to use models with a hidden vocabulary for filled pauses, but I am not sure what the right way is to train and test such models. I have train and test data containing filled pauses between words, as well as 'clean' datasets where the FPs are removed. The filled pauses are going to be modeled as '-observed -omit' or '-observed'.

The questions are:
- Should I train the model on the data containing the FPs or on the clean data?
- Which vocabulary should I use during training and test: with FP or without, given that the FP word is included in the hidden vocabulary?

I am also trying to estimate the local perplexity of the words following filled pauses. I extracted these words together with their contexts into separate sentences, e.g.:

	eine woche <FP> was
	aus <FP> vom sonnabend

and applied the trained LM to them. Total perplexity is calculated as 10^( - totalLogProb / N ), where totalLogProb is the sum of the log probabilities of the words predicted after <FP>.

The same value is then calculated on these chunks with <FP> removed from the context:

	eine woche was
	aus vom sonnabend.

Is this right?

Which setup should I use in order to calculate the local perplexity, when I want to model FPs as hidden events with the '-observed -omit' options?

Thanks in advance.

Yours,
Dmytro.

From stolcke at icsi.berkeley.edu Sun Jan 22 19:35:37 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sun, 22 Jan 2012 19:35:37 -0800
Subject: [SRILM User List] Using hidden events
In-Reply-To: Your message of Sun, 22 Jan 2012 20:19:36 +0100.
Message-ID: <201201230335.q0N3ZbJs014374@fruitcake.ICSI.Berkeley.EDU>

In message you wrote:
> I would like to use models with a hidden vocabulary for filled pauses,
> but I am not sure what the right way is to train and test such models.
> [...]
> The filled pauses are going to be modeled as '-observed -omit' or '-observed'.

As stated in the ngram(1) man page, filled pauses should normally be modeled as -hidden-vocab tokens with -observed -omit.

> The questions are:
> - Should I train the model on the data containing the FPs or on the
>   clean data?

You need to have the FPs in the training data, since (1) they are observed and (2) even hidden events need to be made "unhidden" for training purposes.

There is no ready-made training procedure for hidden-event LMs. You yourself have to extract the n-grams that correspond to the events and histories implied by the LM. For example, if "UH" is a filled pause and the training data has

	a b UH c d

and you want to train a 3gram LM, you need to generate the ngrams

	UH	1
	b UH	1
	a b UH	1
	c	1
	b c	1
	a b c	1
	d	1
	c d	1
	b c d	1

and feed that to ngram-count -read plus any of the standard training options.

> - Which vocabulary should I use during training and test: with FP or
>   without, given that the FP word is included in the hidden vocabulary?

With FP in training (since there is no "hidden" vocabulary in training, see above).

In testing it doesn't matter, since all the tokens specified by -hidden-vocab are implicitly added to the overall LM vocabulary.
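Going back to the training procedure above, here is a sketch of what such an extraction script could look like (an illustration only, not part of SRILM; it assumes the filled-pause token is literally "UH", a 3gram model, and it leaves out sentence-boundary ngrams for clarity):

	import sys

	FP = "UH"     # assumed filled-pause token
	ORDER = 3     # assumed n-gram order

	for line in sys.stdin:
	    history = []                     # context with filled pauses omitted
	    for w in line.split():
	        # emit the 1-gram through ORDER-gram predicting w
	        for n in range(1, min(ORDER, len(history) + 1) + 1):
	            ctx = history[len(history) - (n - 1):]
	            print(" ".join(ctx + [w]) + "\t1")
	        if w != FP:                  # -omit: skip FPs when extending the history
	            history.append(w)

Run on "a b UH c d", this prints the nine count-1 ngrams listed above (plus the ordinary ngrams for "a" and "b"); ngram-count -read takes care of merging and summing duplicate entries, so the output can be fed in directly.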
> I am also trying to estimate the local perplexity of the words following
> filled pauses. I extracted these words together with their contexts into
> separate sentences, e.g.:
> eine woche <FP> was
> aus <FP> vom sonnabend

You want to use ngram -debug 2 -ppl and extract the probabilities from the output.

Andreas

> and applied the trained LM to them. Total perplexity is calculated as
> 10^( - totalLogProb / N ), where totalLogProb is the sum of the log
> probabilities of the words predicted after <FP>.
> [...]

--Andreas

From dmytro.prylipko at ovgu.de Mon Jan 23 03:25:12 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Mon, 23 Jan 2012 12:25:12 +0100
Subject: [SRILM User List] Using hidden events
In-Reply-To: <201201230335.q0N3ZbJs014374@fruitcake.ICSI.Berkeley.EDU>
References: <201201230335.q0N3ZbJs014374@fruitcake.ICSI.Berkeley.EDU>
Message-ID:

On Mon, Jan 23, 2012 at 4:35 AM, Andreas Stolcke wrote:
> There is no ready-made training procedure for hidden-event LMs.
> You yourself have to extract the n-grams that correspond to the events
> and histories implied by the LM.
> [...]
> and feed that to ngram-count -read plus any of the standard training
> options.

Wow, sounds tricky. I guess this procedure is required for those disfluencies which are omitted from the context, i.e. marked with the -omit option in the hidden vocabulary, but which need to be predicted themselves. For other kinds, such as insertions, deletions and repairs, the LM can be trained just with ngram-count, right?

> [...]
From dmytro.prylipko at ovgu.de Mon Jan 23 06:23:28 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Mon, 23 Jan 2012 15:23:28 +0100
Subject: [SRILM User List] Using hidden events
In-Reply-To: <201201230335.q0N3ZbJs014374@fruitcake.ICSI.Berkeley.EDU>
References: <201201230335.q0N3ZbJs014374@fruitcake.ICSI.Berkeley.EDU>
Message-ID:

Dear Andreas,

I am conducting experiments on filled pauses, and some of the results are puzzling to me.

I estimated the perplexity of words following filled pauses in two ways: (1) taking FPs into account (the FP is modeled as a regular word, not a hidden event) and (2) after removing them from both the train and test data. I count only the log probabilities of the words placed after FPs (obtained with ngram -debug 2 -ppl), not those of the FPs themselves. The first approach yields lower perplexity, which is expected.

But when using -hidden-vocab I get some strange results which are not clear to me. For example, I would assume that using a language model trained on 'clean' data (i.e. without FPs) together with 'FP -observed -omit' on test data containing pauses should lead to the same result as the word-only model (approach (2)), since we predict only words and the context is freed from disfluencies. However, this assumption is not supported by the experiments. Using the 'clean' model with the hidden vocabulary on test data containing pauses gives much higher perplexity (364 -> 400). I found that the word probability after an FP in this case is always modeled with unigrams. I conclude that FPs are not omitted from the context despite the hidden-event instruction. This is supported by the fact that the result is the same whether I use '-observed -omit', just '-observed', or just '-omit'.

Also, I thought that using a model which treats filled pauses as regular words, together with a hidden vocabulary containing 'FP -observed', should not change the result either, since pauses are not omitted from the context in this case. This is not true as well: I get a value of 295 without the hidden vocabulary and 291 with it. Finally, I found that the perplexity values do not change whether I use 'FP -observed' or just 'FP -omit' in the hidden vocabulary, which looks very strange.

I would greatly appreciate it if you could clarify these questions.

Yours,
Dmytro.
From stolcke at icsi.berkeley.edu Mon Jan 23 09:44:54 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 23 Jan 2012 09:44:54 -0800
Subject: [SRILM User List] Using hidden events
In-Reply-To:
References: <201201230335.q0N3ZbJs014374@fruitcake.ICSI.Berkeley.EDU>
Message-ID: <4F1D9C96.6030408@icsi.berkeley.edu>

On 1/23/2012 3:25 AM, Dmytro Prylipko wrote:
> Wow, sounds tricky. I guess this procedure is required for those
> disfluencies which are omitted from the context, i.e. marked with the
> -omit option in the hidden vocabulary, but which need to be predicted
> themselves. For other kinds, such as insertions, deletions and
> repairs, the LM can be trained just with ngram-count, right?

Well, you need to train a single model for all types of tokens. So it is easiest to write a perl script (for example) that extracts the counts for all ngrams.

Note that you can write the script so that it processes one sentence at a time and outputs just a bunch of ngrams with count 1. ngram-count -read will take care of merging and summing the counts.

Andreas

From dmytro.prylipko at ovgu.de Sun Jan 29 07:45:54 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Sun, 29 Jan 2012 16:45:54 +0100
Subject: [SRILM User List] Does '-omit' work?
Message-ID:

Dear Andreas,

I found that using the -omit and -observed options does not influence the calculation of perplexity. I trained a skip-LM for filled pauses as you advised me (I generated n-grams where the FPs were skipped in the context). But when I apply it to the test data, it does not matter which combination of options I use in the hidden vocabulary:

	<FP> -omit -observed
	<FP> -omit
	<FP> -observed

or just

	<FP>

For each case I get the same perplexity value. However, it differs when the hidden vocabulary is empty or contains another token, so I can conclude that it works.

Could you tell me if I am doing everything right? Why do the options not work?

Sincerely yours,
Dmytro Prylipko.

From martin.ostrovsky at gmail.com Mon Jan 30 13:26:15 2012
From: martin.ostrovsky at gmail.com (Martin Ostrovsky)
Date: Mon, 30 Jan 2012 16:26:15 -0500
Subject: [SRILM User List] Full list of spanish POS tags
Message-ID: <9D31A18D-4864-49A5-B6DA-B821958DE2C6@gmail.com>

Hello,

I've run the SVMTagger against some Spanish text using the Spanish model provided on the SVMTool site, and am looking for a canonical list of definitions for each POS tag. Any suggestions?

From stolcke at icsi.berkeley.edu Tue Jan 31 01:12:18 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 31 Jan 2012 01:12:18 -0800
Subject: [SRILM User List] Does '-omit' work?
In-Reply-To:
References:
Message-ID: <4F27B072.5030901@icsi.berkeley.edu>

On 1/29/2012 7:45 AM, Dmytro Prylipko wrote:
> I found that using the -omit and -observed options does not influence
> the calculation of perplexity.
> [...]
> it does not matter which combination of options I use in the hidden
> vocabulary:
> <FP> -omit -observed
> <FP> -omit
> <FP> -observed
> or just
> <FP>

This is a bug, more in the documentation than in the code. The hidden-event "options" (-omit, -observed, etc.) are only processed when they appear in the -lm file, following the ngram parameters. When processing the -hidden-vocab file, on the other hand, only the names of the hidden events are recorded (as with -vocab).

This should be fixed. But for now, simply append your hidden-event file to the contents of the -lm file.

Sorry for the confusion in the man page. It kind of says this, but in a very confusing way, and I agree that the -hidden-vocab file should also interpret the full hidden-event specifications.

Andreas

From shinichiro.hamada at gmail.com Wed Feb 1 07:01:52 2012
From: shinichiro.hamada at gmail.com (shinichiro.hamada)
Date: Thu, 2 Feb 2012 00:01:52 +0900
Subject: [SRILM User List] LM whose counts are multiplied
Message-ID:

Hello, all.

I want to make a language model from data that has fractional counts. But not all smoothing methods can handle them, so I will try multiplying each count by 10 and rounding to an integer.

I did a preliminary experiment.

Files:
* count file with integer counts: a.count
* the same file with counts multiplied by 10: b.count

Commands:

	ngram-count -read a.count -order 3 -lm a.lm -wbdiscount -wbdiscount1 -wbdiscount2 -wbdiscount3 -interpolate
	ngram-count -read b.count -order 3 -lm b.lm -wbdiscount -wbdiscount1 -wbdiscount2 -wbdiscount3 -interpolate

I expected the same language model to be generated, but they differ. Why? Their headers follow.

------------------------
[a.lm]

\data\
ngram 1=1055
ngram 2=2240
ngram 3=87

\1-grams:
..
------------------------
[b.lm]

\data\
ngram 1=1055
ngram 2=2240
ngram 3=2548

\1-grams:
..

From stolcke at icsi.berkeley.edu Wed Feb 1 12:44:37 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 01 Feb 2012 12:44:37 -0800
Subject: [SRILM User List] Question about SRILM and sentence boundary detection
In-Reply-To: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de>
References: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de>
Message-ID: <4F29A435.5050905@icsi.berkeley.edu>

Georgi,

You can get the conditional probabilities for arbitrary sets of ngrams using

	ngram -counts FILE

Andreas

On 2/1/2012 11:37 AM, Dzhambazov, Georgi wrote:
> Dear Mr. Stolcke,
>
> I am trying to do sentence boundary segmentation. I have an n-gram
> language model, and for modeling I use the SRILM toolkit. Thanks
> for the nice tool!
>
> I have the following problem. I implement the forward-backward
> algorithm on my own, so I need to combine the n-grams of your
> "hidden event model" with the prosodic model. Therefore, I need the
> probabilities of the individual n-grams (in my case 3-grams).
>
> For example, for the word sequence
>
> word_{t-2} word_{t-1} word_t word_{t+1} word_{t+2}
>
> I need
> P(<s>, word_t | word_{t-2} word_{t-1})
> P(word_t | word_{t-2} word_{t-1})
> P(word_{t+1} | word_{t-1} word_t)
> ... and so on:
> all possible combinations with and without <s> before each word.
>
> What I do to get one of these is to use the following SRILM commands:
>
> # create text for the case word_{t-2} word_{t-1} word_t
> echo "$wordt_2 $wordt_1
> $wordt" > testtext2;
>
> ngram -lm $LM_URI -order $order -ppl testtext2 -debug 2 -unk > /tmp/output;
>
> and then read the corresponding line that I need from the output
> (e.g. line 3):
>
> OUTPUT:
> word_{t-2} word_{t-1}
> p( word_{t-2} | <s> ) = [2gram] 0.00235274 [ -2.62843 ]
> p( word_{t-1} | ...) = [2gram] 0.00343115 [ -2.46456 ]
> p( </s> | ...) = [2gram] 0.0937662 [ -1.02795 ]
> 1 sentences, 2 words, 0 OOVs
> 0 zeroprobs, logprob= -6.12094 ppl= 109.727 ppl1= 1149.4
>
> word_t
> p( word_t | <s> ) = [2gram] 0.00235274 [ -2.62843 ]
> p( </s> | ...) = [2gram] 0.10582 [ -0.975432 ]
> 1 sentences, 1 words, 0 OOVs
> 0 zeroprobs, logprob= -3.60386 ppl= 63.3766 ppl1= 4016.59
>
> file testtext2: 2 sentences, 3 words, 0 OOVs
> 0 zeroprobs, logprob= -9.7248 ppl= 88.0967 ppl1= 1744.21
> --------------------------------
>
> The problem is that for each trigram I invoke ngram again, and it
> reloads the LM (> 1GB), which makes this very slow.
> Is there a faster solution? I do not need the perplexity values.
>
> I know about the segmentation tool
> http://www.speech.sri.com/projects/srilm/manpages/segment.1.html
> but it gives results for the whole sequence, which is not my goal.
>
> Kind regards,
> Georgi Dzhambazov,
>
> Student Assistant,
> NetMedia
> ________________________________________
> From: Andreas Stolcke [stolcke at icsi.berkeley.edu]
> Sent: Thursday, 13 October 2011 05:50
> To: Dzhambazov, Georgi
> Cc: eee at speech.sri.com
> Subject: Re: Question about sentence boundary detection paper
>
> Dzhambazov, Georgi wrote:
>> Dear A. Stolcke,
>> Dear E. Shriberg,
>>
>> I am interested in your approach to sentence boundary detection.
>> I would be very happy if you could find some time to clarify some of
>> the steps of your approach for me. I plan to implement them.
>>
>> Question 1)
>> In the paper (1), at paragraph 2.2.1, you say that states are "the end
>> of sentence status of each word plus any preceding words".
>> So for example at position 4 of the example sentence, the state is
>> (<s> + quick brown fox). At position 6 the state is (<s> + brown fox
>> flies). This means a huge state space. Is this right?
>>
>> 1   2     3     4   5     6   7      8  9     10
>> The quick brown fox flies The rabbit is white.
>
> The state space is potentially huge, but just like in standard N-gram
> LMs you only consider the histories (= states) actually occurring in the
> training data, and handle any new histories through backoff.
> Furthermore, the state space is constrained to those states that match
> the ngrams in the word sequence. So for every word position you have to
> consider only two states (<s> and no-<s>).
>
>> Question 2)
>> Transition probabilities are N-gram probabilities. You give an
>> example with bigram probabilities in the next line.
>> However, you say as well that you are using a 4-gram LM. So the correct
>> example should be:
>> a probability at position 6 is Pr(<s> | brown fox flies)
>> and at position 4 it is Pr(<s> | quick brown fox).
>> Is this right?
> correct.
>
>> Question 3)
>> Then for recognition you say that the forward-backward algorithm is
>> used to determine the maximal P(T_i | W),
>> where T_i corresponds to <s> or no-<s> at position i. However, the
>> transition probabilities include information about states like
>> (<s> + quick brown fox).
>> How do you apply the transition probabilities in this model? Does it
>> relate to the formula in section 4 of (2)?
>> I think this formula can work for the forward-backward algorithm,
>> although it is stated in section 4 that it is used for Viterbi.
> For finding the most probable T_i you use in fact the Viterbi algorithm.
>
> The formulas in section 4 just give one step in the forward computation
> that would be used in the Viterbi algorithm.
>
> Please note that this is all implemented in the "segment" tool that
> comes with SRILM.
> See http://www.speech.sri.com/projects/srilm/manpages/segment.1.html and
> http://www.speech.sri.com/projects/srilm/ for more information on SRILM.
>
> Andreas
>
>> References:
>>
>> 1) Shriberg et al. 2000 - Prosody-based automatic segmentation of
>> speech into sentences and topics
>> 2) Stolcke and Shriberg - 1996 - Automatic linguistic segmentation of
>> conversational speech
>>
>> Thank you!
>>
>> Kind regards,
>> Georgi Dzhambazov,
>>
>> Student Assistant,
>> NetMedia

From stolcke at icsi.berkeley.edu Wed Feb 1 12:51:40 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 01 Feb 2012 12:51:40 -0800
Subject: [SRILM User List] LM whose counts are multiplied
In-Reply-To:
References:
Message-ID: <4F29A5DC.8010002@icsi.berkeley.edu>

On 2/1/2012 7:01 AM, shinichiro.hamada wrote:
> I want to make a language model from data that has fractional counts.
> But not all smoothing methods can handle them, so I will try multiplying
> each count by 10 and rounding to an integer.
> [...]
> I expected the same language model to be generated, but they differ. Why?

First off, the WB discounting method does support fractional counts, so you can just feed your counts to ngram-count -float-counts ... with no need to scale and truncate the counts to integers.

The reason you are seeing different LM outputs for different count multipliers is that smoothing is sensitive to the absolute occurrence counts of ngrams, not just their relative frequencies. This has to be so if you're trying to estimate the probabilities of unseen ngrams. If you've seen only 10 cases of "a b" and never saw "a b x", you should be less surprised to see your first "a b x" than if you had seen 1000 instances of "a b" (and still none of "a b x").
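To see this numerically: under Witten-Bell smoothing the probability mass reserved for unseen words after a given context is T/(N+T), where N is how often the context was seen and T is the number of distinct words that followed it. Multiplying all counts by 10 scales N but leaves T unchanged. A minimal illustration (the numbers here are invented):

	# Witten-Bell: mass reserved for unseen successors of a context
	# N = total context count, T = number of distinct observed successors
	for scale in (1, 10):
	    N, T = 10 * scale, 3        # e.g. "a b" seen 10 (or 100) times, 3 successors
	    print(scale, T / (N + T))   # 1 -> 0.2308...   10 -> 0.0291...

The scaled counts also clear ngram-count's default minimum-count cutoffs (trigrams below a count of 2 are dropped by default), which is presumably why b.lm retains 2548 trigrams where a.lm keeps only 87.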
Andreas

From shinichiro.hamada at gmail.com Thu Feb 2 06:10:02 2012
From: shinichiro.hamada at gmail.com (shinichiro.hamada)
Date: Thu, 2 Feb 2012 23:10:02 +0900
Subject: [SRILM User List] LM whose counts are multiplied
In-Reply-To: <4F29A5DC.8010002@icsi.berkeley.edu>
References: <4F29A5DC.8010002@icsi.berkeley.edu>
Message-ID: <9FE92B5EFFF24E018A7A5E2D4F3E31AE@f91>

Dear Mr. Stolcke,

Thank you for your clear explanation. I understood it completely! I'll try the WB discounting method with float counts.

Shinichiro Hamada

> -----Original Message-----
> From: Andreas Stolcke [mailto:stolcke at icsi.berkeley.edu]
> Sent: Thursday, February 02, 2012 5:52 AM
> To: shinichiro.hamada
> Cc: srilm-user at speech.sri.com
> Subject: Re: [SRILM User List] LM whose counts are multiplied
> [...]

From af4ex.radio at yahoo.com Thu Feb 2 07:24:16 2012
From: af4ex.radio at yahoo.com (John Day)
Date: Thu, 2 Feb 2012 07:24:16 -0800 (PST)
Subject: [SRILM User List] Using srilm as Memory Jogger
Message-ID: <1328196256.12588.YahooMailNeo@web114411.mail.gq1.yahoo.com>

Hi Andreas,

Can you (or the group) tell me if srilm could be used to query language models in such a way as to 'narrow down' the search for a "partially known word" when the context of its usage is known? By "partially known" I mean that hints such as word prefixes or endings are known. The "context of usage" is equivalent, I think, to stating that the likelihood of the hidden word is increased if it is preceded or followed by a given set of words associated with some topic.

So I would like to 'leverage' srilm and language model queries by using topic models to suggest some words associated with a certain topic.

For example: find the most likely words that begin with "st", given a "context set" (suggested by some 'sociology' topic model) containing the words "neighborhood, behavior, customs, environment".

Does that make sense? Do you think srilm could be used to execute a query like that?

Thanks,
John Day
Palm Bay, Florida

From amber.wilcox.ohearn at gmail.com Thu Feb 2 08:29:07 2012
From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn)
Date: Thu, 2 Feb 2012 09:29:07 -0700
Subject: [SRILM User List] Question about SRILM and sentence boundary detection
In-Reply-To: <4F29A435.5050905@icsi.berkeley.edu>
References: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de> <4F29A435.5050905@icsi.berkeley.edu>
Message-ID:

(Sorry Andreas, I meant to reply to the list):

Georgi,

I'm not sure if SRILM has something that does that -- i.e. holds the whole LM in RAM and waits for queries. You might need something like that, as opposed to using a whole file, if you want just the probabilities of the last word with respect to the previous ones, and you want to compare different last words depending on the results of previous calculations, for example.

I have a little C/Python tool I wrote for exactly this purpose. It's at https://github.com/lamber/BackOffTrigramModel

It's very specific to my work at the time. So, for example, it works only for trigrams exactly, and it assumes you are using <unk>. It performs all the back-off calculations for unseen trigrams. But it looks like you have the same use case, so it might be useful for you.

It's not much documented, but the unit tests show how it works.

Amber
--
http://scholar.google.com/citations?user=15gGywMAAAAJ

On Wed, Feb 1, 2012 at 1:44 PM, Andreas Stolcke wrote:
> Georgi,
>
> You can get the conditional probabilities for arbitrary sets of ngrams using
>
>     ngram -counts FILE
>
> Andreas
> [...]
From stolcke at icsi.berkeley.edu Thu Feb 2 16:53:07 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 02 Feb 2012 16:53:07 -0800
Subject: [SRILM User List] Question about SRILM and sentence boundary detection
In-Reply-To:
References: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de> <4F29A435.5050905@icsi.berkeley.edu>
Message-ID: <4F2B2FF3.2070602@icsi.berkeley.edu>

On 2/2/2012 8:29 AM, L. Amber Wilcox-O'Hearn wrote:
> I'm not sure if SRILM has something that does that -- i.e. holds the
> whole LM in RAM and waits for queries.
> [...]

Two SRILM solutions:

1. Start

	ngram -lm LM -escape "===" -counts -

(reading from stdin) and put an escape line (in this case, a line starting with "===") after every ngram in the input (make sure the ngram words are followed by a count "1"). This will cause ngram to dump out the conditional prob for each ngram right away (instead of waiting for end-of-file).

2. Directly access the network LM server protocol implemented by ngram -server-port. Start the server with

	% ngram -lm LM -server-port 8888

then write ngrams to that TCP port and read back the log probs:

	% telnet localhost 8888
	my first word		<< input
	-4.6499			>> output

Of course you would do the equivalent of telnet in perl, python, C, or some other language to make use of the probabilities.
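For illustration, a minimal sketch of such a client (an assumption-laden sketch, not a supported SRILM program: it presumes a server started exactly as above on localhost:8888, the plain line-per-ngram exchange shown in the telnet session, and made-up test ngrams):

	import socket

	# connect to a running server: ngram -lm LM -server-port 8888
	with socket.create_connection(("localhost", 8888)) as conn:
	    f = conn.makefile("rw")              # the exchange is line-based
	    for ngram in ("my first word", "my second word"):
	        f.write(ngram + "\n")            # send one ngram per line
	        f.flush()
	        logprob = float(f.readline())    # read back log10 p(last word | preceding)
	        print(ngram, "->", logprob)

Because the LM stays loaded in the server process, this avoids reloading the model for every query -- the problem described earlier in this thread.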
Andreas
From stolcke at icsi.berkeley.edu Fri Feb 3 11:03:45 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 03 Feb 2012 11:03:45 -0800 Subject: [SRILM User List] Using srilm as Memory Jogger In-Reply-To: <1328196256.12588.YahooMailNeo@web114411.mail.gq1.yahoo.com> References: <1328196256.12588.YahooMailNeo@web114411.mail.gq1.yahoo.com> Message-ID: <4F2C2F91.1020409@icsi.berkeley.edu> Sorry, such functionality is not built into SRILM, and would have to be built on top of it by querying various probability models that incorporate the co-occurrence of words. Personally I don't have experience with this type of application, but someone else on the list might. Andreas On 2/2/2012 7:24 AM, John Day wrote: > Hi Andreas, > Can you (or the group) tell me if srilm could be used to query > language models in such a way as to 'narrow down' the search for a > "partially known word" where the context of its usage is known. By > "partially known" I mean hints such as word prefixes or endings are > known. The "context of usage" is equivalent, I think, to stating that > the likelihood of the hidden word is increased if it is preceded or > followed by a given set of words associated with some topic. > > So I would like to 'leverage' srilm and language model queries by > using topic models to suggest some words associated with a certain topic. > > For example, find the most likely words that begin with "st", given a > "context set" (suggested by some 'sociology' topic model) containing > the words "neighborhood, behavior, customs, environment". > > Does that make sense? Do you think srilm could be used to execute a > query like that? > > Thanks, > John Day > Palm Bay, Florida > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From zeinab.vakil at gmail.com Sun Feb 5 20:14:01 2012 From: zeinab.vakil at gmail.com (zeinab vakil) Date: Mon, 6 Feb 2012 07:44:01 +0330 Subject: [SRILM User List] Predicting specified words Message-ID: Dear All, hi, I am newly getting to know SRILM and have a question. Is it possible to use SRILM to predict a word that starts with a certain character?
For example, the sentence is "i go to h...", and we want the word w that has the highest probability P(w | "i go to"), or even P(w | "to"), and starts with 'h'. Please guide me. Best Regards, zeinab vakil. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Wed Feb 8 21:44:29 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 08 Feb 2012 21:44:29 -0800 Subject: [SRILM User List] Predicting specified words In-Reply-To: References: Message-ID: <4F335D3D.4070702@icsi.berkeley.edu> On 2/5/2012 8:14 PM, zeinab vakil wrote: > Dear All, > hi, > I am newly getting to know SRILM and have a question. > Is it possible to use SRILM to predict a word that starts with a certain > character? For example, the sentence is "i go to h...", and we want the > word w that has the highest probability P(w | "i go to"), or even P(w | "to"), > and starts with 'h'. > Please guide me. > Best Regards, > zeinab vakil. Boy, there seems to be a lot of interest lately in this sort of prediction problem (see previous posts on this list). No, there is no ready-made solution for this in SRILM. I would probably try to build a mixed word/letter ngram LM, estimating probabilities p(next-letter | word-2, word-3, letter-1, letter-2, letter-3, ...). Andreas
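Short of building such a mixed model, the prefix constraint can also be handled outside SRILM by scoring a candidate list against the LM, which is essentially what Roman suggests below. A hedged Python sketch, where logprob is any scoring function (for instance the hypothetical server client shown earlier in this archive) and vocab_words is a word list assumed to have been extracted from the LM or the training data:

    def best_with_prefix(context, prefix, vocab_words, logprob):
        # Rank vocabulary words starting with `prefix` by their
        # conditional log10 probability after `context`.
        candidates = [w for w in vocab_words if w.startswith(prefix)]
        if not candidates:
            return None
        return max(candidates, key=lambda w: logprob(context + [w]))

    # e.g. best_with_prefix(["i", "go", "to"], "h", vocab_words, ngram_logprob)

This enumerates every matching word, so it is only practical for moderate vocabularies or with caching.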
From tonyr at cantabresearch.com Thu Feb 9 01:09:59 2012 From: tonyr at cantabresearch.com (Tony Robinson) Date: Thu, 09 Feb 2012 09:09:59 +0000 Subject: [SRILM User List] Predicting specified words In-Reply-To: <4F335D3D.4070702@icsi.berkeley.edu> References: <4F335D3D.4070702@icsi.berkeley.edu> Message-ID: <4F338D67.3090402@cantabResearch.com> On 02/09/2012 05:44 AM, Andreas Stolcke wrote: > On 2/5/2012 8:14 PM, zeinab vakil wrote: >> Dear All, >> hi, >> I am newly getting to know SRILM and have a question. >> Is it possible to use SRILM to predict a word that starts with a certain >> character? For example, the sentence is "i go to h...", and we want the >> word w that has the highest probability P(w | "i go to"), or even >> P(w | "to"), and starts with 'h'. >> Please guide me. >> Best Regards, >> zeinab vakil. > Boy, there seems to be a lot of interest lately in this sort of > prediction problem (see previous posts on this list). If you haven't already seen Dasher then you might like to look it up at http://www.inference.phy.cam.ac.uk/dasher/Publications.html. Tony -- Dr A J Robinson, Founder and Director of Cantab Research Limited St Johns Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK Company reg no 05697423 (England and Wales), VAT reg no 925606030 From kutlak.roman at gmail.com Thu Feb 9 01:34:08 2012 From: kutlak.roman at gmail.com (Kutlak Roman) Date: Thu, 9 Feb 2012 09:34:08 +0000 Subject: [SRILM User List] Predicting specified words In-Reply-To: <4F335D3D.4070702@icsi.berkeley.edu> References: <4F335D3D.4070702@icsi.berkeley.edu> Message-ID: Hi guys, I am not an expert on language modelling, but here is a thought: the library contains the classes LM and Vocab, where Vocab is the vocabulary used with the current language model. Maybe you could iterate through the words in the vocabulary, pick the ones that start with the letter you have, and ask the LM to tell you which word gives you the highest probability given the context. Roman On 9 Feb 2012, at 05:44, Andreas Stolcke wrote: > On 2/5/2012 8:14 PM, zeinab vakil wrote: >> Dear All, >> hi, >> I am newly getting to know SRILM and have a question. >> Is it possible to use SRILM to predict a word that starts with a certain >> character? For example, the sentence is "i go to h...", and we want the >> word w that has the highest probability P(w | "i go to"), or even >> P(w | "to"), and starts with 'h'. >> Please guide me. >> Best Regards, >> zeinab vakil. > Boy, there seems to be a lot of interest lately in this sort of prediction problem (see previous posts on this list). > > No, there is no ready-made solution for this in SRILM. I would probably try to build a mixed word/letter ngram LM, estimating probabilities > p(next-letter | word-2, word-3, letter-1, letter-2, letter-3, ...). > > Andreas > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From aktumuluru at cse.ust.hk Sat Feb 11 02:31:37 2012 From: aktumuluru at cse.ust.hk (Anand Karthik) Date: Sat, 11 Feb 2012 18:31:37 +0800 Subject: [SRILM User List] Please help : Problem with installation of SRILM 1.4.6 on ubuntu 10.04 amd-64 bit machine In-Reply-To: References: Message-ID: Hello, I'm trying to install SRILM 1.4.6 on Ubuntu 10.04, a 64-bit AMD machine. I have turned TCL off. I have read the user archive and couldn't find a solution to the problem. Please help me with the same. I'm using the following command: make MACHINE_TYPE=i686-m64 SRILM=$PWD CC=/usr/bin/gcc CXX=/usr/bin/g++ NO_TCL=X TCL_INCLUDE= TCL_LIBRARY= 2>&1 > make.log.txt uname -a Linux ubuntu 2.6.32-38-generic #83-Ubuntu SMP Wed Jan 4 11:12:07 UTC 2012 x86_64 GNU/Linux gcc version : Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.4.3-4ubuntu5' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 --program-suffix=-4.4 --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-plugin --enable-objc-gc --disable-werror --with-arch-32=i486 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu Thread model: posix gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) and I get the following error on the console (I have attached the makefile log). The ngram, ngram-count, ngram-merge, ngram-class, disambig, anti-ngram, nbest-lattice, nbest-mix, nbest-optimize, nbest-pron-score, segment, segment-nbest, hidden-ngram, multi-ngram, fngram-count, fngram, lattice-tool, etc. binaries are not being created, and they have a problem like this g++ command : ****************************************************************************************************************** /usr/bin/g++ -I.
-I/home/ak/Downloads/srilm/include -u matherr -L/home/ak/Downloads/srilm/lib/i686-m64 -g -O3 -o ../bin/i686-m64/ngram ../obj/i686-m64/ngram.o ../obj/i686-m64/liboolm.a -lm -ldl /home/ak/Downloads/srilm/lib/i686-m64/libflm.a /home/ak/Downloads/srilm/lib/i686-m64/libdstruct.a /home/ak/Downloads/srilm/lib/i686-m64/libmisc.a -lm 2>&1 | c++filt ../obj/i686-m64/liboolm.a(SimpleClassNgram.o): In function `global constructors keyed to ctsBuffer': /home/ak/Downloads/srilm/include/Debug.h:54: multiple definition of `ctsBuffer' ../obj/i686-m64/liboolm.a(ClassNgram.o):/home/ak/Downloads/srilm/include/Debug.h:54: first defined here ../obj/i686-m64/liboolm.a(Vocab.o): In function `LHash::remove(unsigned int, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:473: undefined reference to `LHash::removedData' ../obj/i686-m64/liboolm.a(Vocab.o): In function `LHash::remove(char const*, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash::removedData' ../obj/i686-m64/liboolm.a(Vocab.o): In function `Map_noKey': /usr/include/bits/string3.h:52: undefined reference to `LHash::removedData' ../obj/i686-m64/liboolm.a(Vocab.o): In function `LHash::remove(char const*, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:473: undefined reference to `LHash::removedData' ../obj/i686-m64/liboolm.a(SubVocab.o): In function `SubVocab::addWord(unsigned int)': /home/ak/Downloads/srilm/lm/src/SubVocab.cc:80: undefined reference to `LHash::getInternalKey(char const*, bool&) const' ../obj/i686-m64/liboolm.a(MultiwordVocab.o): In function `LHash::remove(unsigned int, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:473: undefined reference to `LHash::removedData' ../obj/i686-m64/liboolm.a(LM.o): In function `~VocabIter': /home/ak/Downloads/srilm/lm/src/Vocab.h:258: undefined reference to `LHashIter::~LHashIter()' /home/ak/Downloads/srilm/lm/src/Vocab.h:258: undefined reference to `LHashIter::~LHashIter()' /home/ak/Downloads/srilm/lm/src/Vocab.h:258: undefined reference to `LHashIter::~LHashIter()' /home/ak/Downloads/srilm/lm/src/Vocab.h:258: undefined reference to `LHashIter::~LHashIter()' ../obj/i686-m64/liboolm.a(LM.o): In function `LM::pplCountsFile(File&, unsigned int, TextStats&, char const*, bool)': /home/ak/Downloads/srilm/lm/src/LM.cc:569: undefined reference to `NgramCounts::parseNgram(char*, char const**, unsigned int, unsigned int&)' ../obj/i686-m64/liboolm.a(LM.o): In function `NgramStats': /home/ak/Downloads/srilm/lm/src/NgramStats.h:150: undefined reference to `NgramCounts::NgramCounts(Vocab&, unsigned int)' ../obj/i686-m64/liboolm.a(LM.o): In function `NgramCountsIter': /home/ak/Downloads/srilm/lm/src/NgramStats.h:115: undefined reference to `TrieIter2::TrieIter2(Trie const&, unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(LM.o): In function `NgramCounts::write(File&)': /home/ak/Downloads/srilm/lm/src/NgramStats.h:70: undefined reference to `NgramCounts::write(File&, unsigned int, bool)' ../obj/i686-m64/liboolm.a(LM.o): In function `NgramCounts::read(File&)': 
/home/ak/Downloads/srilm/lm/src/NgramStats.h:67: undefined reference to `NgramCounts::read(File&, unsigned int)' ../obj/i686-m64/liboolm.a(LM.o):(.rodata._ZTV10NgramStats[vtable for NgramStats]+0x68): undefined reference to `NgramCounts::memStats(MemStats&)' ../obj/i686-m64/liboolm.a(LM.o):(.rodata._ZTV10NgramStats[vtable for NgramStats]+0x70): undefined reference to `NgramCounts::countSentence(char const* const*, unsigned int)' ../obj/i686-m64/liboolm.a(LM.o):(.rodata._ZTV10NgramStats[vtable for NgramStats]+0x78): undefined reference to `NgramCounts::countSentence(unsigned int const*, unsigned int)' ../obj/i686-m64/liboolm.a(LM.o):(.rodata._ZTV11NgramCountsIjE[vtable for NgramCounts]+0x68): undefined reference to `NgramCounts::memStats(MemStats&)' ../obj/i686-m64/liboolm.a(LM.o):(.rodata._ZTV11NgramCountsIjE[vtable for NgramCounts]+0x70): undefined reference to `NgramCounts::countSentence(char const* const*, unsigned int)' ../obj/i686-m64/liboolm.a(LM.o):(.rodata._ZTV11NgramCountsIjE[vtable for NgramCounts]+0x78): undefined reference to `NgramCounts::countSentence(unsigned int const*, unsigned int)' ../obj/i686-m64/liboolm.a(NgramLM.o): In function `LHash::remove(unsigned int, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:473: undefined reference to `LHash::removedData' ../obj/i686-m64/liboolm.a(NgramLM.o): In function `LHash >::remove(unsigned int, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to `LHash >::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash >::removedData' /home/ak/Downloads/srilm/include/LHash.cc:473: undefined reference to `LHash >::removedData' ../obj/i686-m64/liboolm.a(Discount.o): In function `~VocabIter': /home/ak/Downloads/srilm/lm/src/Vocab.h:258: undefined reference to `LHashIter::~LHashIter()' /home/ak/Downloads/srilm/lm/src/Vocab.h:258: undefined reference to `LHashIter::~LHashIter()' ../obj/i686-m64/liboolm.a(Discount.o): In function `NgramCountsIter': /home/ak/Downloads/srilm/lm/src/NgramStats.h:115: undefined reference to `TrieIter2::TrieIter2(Trie const&, unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(Discount.o): In function `Vocab::isMetaTag(unsigned int)': /home/ak/Downloads/srilm/lm/src/Vocab.h:177: undefined reference to `LHash::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(Discount.o): In function `Vocab::typeOfMetaTag(unsigned int)': /home/ak/Downloads/srilm/lm/src/Vocab.h:179: undefined reference to `LHash::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(Discount.o): In function `NgramsIter': /home/ak/Downloads/srilm/lm/src/NgramStats.h:115: undefined reference to `TrieIter2::TrieIter2(Trie const&, unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(Discount.o): In function `Vocab::isMetaTag(unsigned int)': /home/ak/Downloads/srilm/lm/src/Vocab.h:177: undefined reference to `LHash::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(Discount.o): In function `Vocab::typeOfMetaTag(unsigned int)': /home/ak/Downloads/srilm/lm/src/Vocab.h:179: undefined reference to `LHash::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(Discount.o): In function `NgramCountsIter': /home/ak/Downloads/srilm/lm/src/NgramStats.h:115: undefined reference to `TrieIter2::TrieIter2(Trie const&, 
unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' /home/ak/Downloads/srilm/lm/src/NgramStats.h:115: undefined reference to `TrieIter2::TrieIter2(Trie const&, unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(Discount.o): In function `Trie::find(unsigned int const*, bool&) const': /home/ak/Downloads/srilm/include/Trie.h:124: undefined reference to `Trie::findTrie(unsigned int const*, bool&) const' ../obj/i686-m64/liboolm.a(Discount.o): In function `NgramsIter': /home/ak/Downloads/srilm/lm/src/NgramStats.h:115: undefined reference to `TrieIter2::TrieIter2(Trie const&, unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(Discount.o): In function `Vocab::isMetaTag(unsigned int)': /home/ak/Downloads/srilm/lm/src/Vocab.h:177: undefined reference to `LHash::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(Discount.o): In function `Vocab::typeOfMetaTag(unsigned int)': /home/ak/Downloads/srilm/lm/src/Vocab.h:179: undefined reference to `LHash::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(ClassNgram.o): In function `Map2::clear()': ClassNgram.cc:(.text._ZN4Map2IjPKjdE5clearEv[Map2::clear()]+0xbc): undefined reference to `LHash >::removedData' ClassNgram.cc:(.text._ZN4Map2IjPKjdE5clearEv[Map2::clear()]+0xda): undefined reference to `LHash >::removedData' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `SkipNgram::estimateMstep(NgramStats&, NgramCounts&, LHash&, Discount**)': /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:344: undefined reference to `LHashIter::LHashIter(LHash const&, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `NgramCounts::findCount(unsigned int const*, unsigned int)': /home/ak/Downloads/srilm/lm/src/NgramStats.h:47: undefined reference to `Trie::findTrie(unsigned int const*, bool&) const' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `Trie::findTrie(unsigned int, bool&) const': /home/ak/Downloads/srilm/include/Trie.h:145: undefined reference to `LHash >::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `SkipNgram::wordProb(unsigned int, unsigned int const*)': /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:68: undefined reference to `LHash::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `SkipNgram::write(File&)': /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:141: undefined reference to `LHashIter::LHashIter(LHash const&, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `SkipNgram': /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:26: undefined reference to `LHash::LHash(unsigned int)' /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:26: undefined reference to `LHash::LHash(unsigned int)' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `SkipNgram::estimateEstepNgram(unsigned int*, unsigned int, NgramStats&, NgramCounts&, LHash&)': /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:221: undefined reference to `LHash::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `NgramCounts::findCount(unsigned int const*, unsigned int)': /home/ak/Downloads/srilm/lm/src/NgramStats.h:47: undefined reference to `Trie::findTrie(unsigned int const*, bool&) const' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `Trie::findTrie(unsigned int, bool&) const': /home/ak/Downloads/srilm/include/Trie.h:145: undefined reference to `LHash >::find(unsigned int, bool&) const' 
../obj/i686-m64/liboolm.a(SkipNgram.o): In function `NgramCountsIter': /home/ak/Downloads/srilm/lm/src/NgramStats.h:115: undefined reference to `TrieIter2::TrieIter2(Trie const&, unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' /home/ak/Downloads/srilm/lm/src/NgramStats.h:122: undefined reference to `TrieIter2::TrieIter2(Trie const&, unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `~NgramCounts': /home/ak/Downloads/srilm/lm/src/NgramStats.h:37: undefined reference to `Trie::~Trie()' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `SkipNgram::estimate(NgramStats&, Discount**)': /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:178: undefined reference to `NgramCounts::NgramCounts(Vocab&, unsigned int)' /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:179: undefined reference to `LHash::LHash(unsigned int)' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `~NgramCounts': /home/ak/Downloads/srilm/lm/src/NgramStats.h:37: undefined reference to `Trie::~Trie()' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `~VocabIter': /home/ak/Downloads/srilm/lm/src/Vocab.h:258: undefined reference to `LHashIter::~LHashIter()' /home/ak/Downloads/srilm/lm/src/Vocab.h:258: undefined reference to `LHashIter::~LHashIter()' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `SkipNgram::memStats(MemStats&)': /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:34: undefined reference to `LHash::memStats(MemStats&) const' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `~NgramCounts': /home/ak/Downloads/srilm/lm/src/NgramStats.h:37: undefined reference to `Trie::~Trie()' /home/ak/Downloads/srilm/lm/src/NgramStats.h:37: undefined reference to `Trie::~Trie()' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `NgramCounts::write(File&)': /home/ak/Downloads/srilm/lm/src/NgramStats.h:70: undefined reference to `NgramCounts::write(File&, unsigned int, bool)' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `NgramCounts::read(File&)': /home/ak/Downloads/srilm/lm/src/NgramStats.h:67: undefined reference to `NgramCounts::read(File&, unsigned int)' ../obj/i686-m64/liboolm.a(SkipNgram.o):(.rodata._ZTV11NgramCountsIdE[vtable for NgramCounts]+0x68): undefined reference to `NgramCounts::memStats(MemStats&)' ../obj/i686-m64/liboolm.a(SkipNgram.o):(.rodata._ZTV11NgramCountsIdE[vtable for NgramCounts]+0x70): undefined reference to `NgramCounts::countSentence(char const* const*, double)' ../obj/i686-m64/liboolm.a(SkipNgram.o):(.rodata._ZTV11NgramCountsIdE[vtable for NgramCounts]+0x78): undefined reference to `NgramCounts::countSentence(unsigned int const*, double)' ../obj/i686-m64/liboolm.a(TaggedNgram.o): In function `NgramBOsIter': /home/ak/Downloads/srilm/lm/src/Ngram.h:139: undefined reference to `TrieIter2::TrieIter2(Trie const&, unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(TaggedNgram.o): In function `NgramProbsIter': /home/ak/Downloads/srilm/lm/src/Ngram.h:157: undefined reference to `LHashIter::LHashIter(LHash const&, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(TaggedNgram.o): In function `~NgramProbsIter': /home/ak/Downloads/srilm/lm/src/Ngram.h:153: undefined reference to `LHashIter::~LHashIter()' /home/ak/Downloads/srilm/lm/src/Ngram.h:153: undefined reference to `LHashIter::~LHashIter()' ../obj/i686-m64/liboolm.a(WordMesh.o): In function `LHash::remove(unsigned int, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to 
`LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash::removedData' ../obj/i686-m64/liboolm.a(VocabMultiMap.o): In function `LHash::remove(unsigned int const*, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash::removedData' ../obj/i686-m64/liboolm.a(VocabMultiMap.o): In function `Map_noKey': /usr/include/bits/string3.h:52: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o): In function `ProductNgram::read(File&, bool)': /home/ak/Downloads/srilm/flm/src/ProductNgram.cc:54: undefined reference to `FNgramSpecs::FNgramSpecs(File&, FactoredVocab&, unsigned int)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o): In function `FNgramStats': /home/ak/Downloads/srilm/flm/src/FNgramStats.h:148: undefined reference to `FNgramCounts::FNgramCounts(FactoredVocab&, FNgramSpecs&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o): In function `ProductNgram::read(File&, bool)': /home/ak/Downloads/srilm/flm/src/ProductNgram.cc:69: undefined reference to `FNgramCounts::read()' /home/ak/Downloads/srilm/flm/src/ProductNgram.cc:74: undefined reference to `FNgramCounts::estimateDiscounts()' /home/ak/Downloads/srilm/flm/src/ProductNgram.cc:75: undefined reference to `FNgramCounts::computeCardinalityFunctions()' /home/ak/Downloads/srilm/flm/src/ProductNgram.cc:76: undefined reference to `FNgramCounts::sumCounts()' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o): In function `FNgramCounts::read(File&)': /home/ak/Downloads/srilm/flm/src/FNgramStats.h:83: undefined reference to `FNgramCounts::read()' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o): In function `FNgramCounts::write(File&)': /home/ak/Downloads/srilm/flm/src/FNgramStats.h:99: undefined reference to `FNgramCounts::write(bool)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o):(.rodata._ZTV11FNgramStats[vtable for FNgramStats]+0x50): undefined reference to `FNgramCounts::countFile(File&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o):(.rodata._ZTV11FNgramStats[vtable for FNgramStats]+0x68): undefined reference to `FNgramCounts::memStats(MemStats&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o):(.rodata._ZTV11FNgramStats[vtable for FNgramStats]+0x78): undefined reference to `FNgramCounts::countSentence(unsigned int, unsigned int, WidMatrix&, unsigned int)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o):(.rodata._ZTV11FNgramStats[vtable for FNgramStats]+0x80): undefined reference to `FNgramCounts::countSentence(char const* const*, unsigned int)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o):(.rodata._ZTV12FNgramCountsIjE[vtable for FNgramCounts]+0x50): undefined reference to `FNgramCounts::countFile(File&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o):(.rodata._ZTV12FNgramCountsIjE[vtable for FNgramCounts]+0x68): undefined reference to `FNgramCounts::memStats(MemStats&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o):(.rodata._ZTV12FNgramCountsIjE[vtable for FNgramCounts]+0x78): undefined reference to `FNgramCounts::countSentence(unsigned int, unsigned int, WidMatrix&, unsigned int)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o):(.rodata._ZTV12FNgramCountsIjE[vtable for FNgramCounts]+0x80): undefined reference to `FNgramCounts::countSentence(char 
const* const*, unsigned int)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FactoredVocab.o): In function `FactoredVocab::getIndex(char const*, unsigned int)': /home/ak/Downloads/srilm/flm/src/FactoredVocab.cc:279: undefined reference to `FNgramSpecs::getTag(char const*)' /home/ak/Downloads/srilm/flm/src/FactoredVocab.cc:282: undefined reference to `FNgramSpecs::wordTag()' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FactoredVocab.o): In function `FactoredVocab::addWord(char const*)': /home/ak/Downloads/srilm/flm/src/FactoredVocab.cc:193: undefined reference to `FNgramSpecs::getTag(char const*)' /home/ak/Downloads/srilm/flm/src/FactoredVocab.cc:196: undefined reference to `FNgramSpecs::wordTag()' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FactoredVocab.o): In function `FactoredVocab::addWord2(char const*, bool&)': /home/ak/Downloads/srilm/flm/src/FactoredVocab.cc:228: undefined reference to `FNgramSpecs::getTag(char const*)' /home/ak/Downloads/srilm/flm/src/FactoredVocab.cc:231: undefined reference to `FNgramSpecs::wordTag()' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::recomputeBOWs()': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:2162: undefined reference to `FNgramSpecs::FNgramSpec::LevelIter::next(unsigned int&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::bgChildProbBO(unsigned int, unsigned int const*, unsigned int, unsigned int, unsigned int)': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:685: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::BGChildIterCnstr(unsigned int, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:686: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:706: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::BGChildIterCnstr(unsigned int, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:707: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:726: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::BGChildIterCnstr(unsigned int, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:727: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:744: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::BGChildIterCnstr(unsigned int, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:746: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::boNode(unsigned int, unsigned int const*, unsigned int, unsigned int, unsigned int)': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:544: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::BGChildIterCnstr(unsigned int, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:554: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:567: undefined reference to `FNgramSpecs::FNgramSpec::ParentSubset::backoffValueRSubCtxW(unsigned int, unsigned int const*, unsigned int, BackoffNodeStrategy, FNgram&, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:554: undefined reference to 
`FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:567: undefined reference to `FNgramSpecs::FNgramSpec::ParentSubset::backoffValueRSubCtxW(unsigned int, unsigned int const*, unsigned int, BackoffNodeStrategy, FNgram&, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:601: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:608: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::BGChildIterCnstr(unsigned int, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:617: undefined reference to `FNgramSpecs::FNgramSpec::ParentSubset::backoffValueRSubCtxW(unsigned int, unsigned int const*, unsigned int, BackoffNodeStrategy, FNgram&, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:610: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:617: undefined reference to `FNgramSpecs::FNgramSpec::ParentSubset::backoffValueRSubCtxW(unsigned int, unsigned int const*, unsigned int, BackoffNodeStrategy, FNgram&, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:610: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:627: undefined reference to `FNgramSpecs::FNgramSpec::BGGrandChildIter::BGGrandChildIter(unsigned int, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:636: undefined reference to `FNgramSpecs::FNgramSpec::ParentSubset::backoffValueRSubCtxW(unsigned int, unsigned int const*, unsigned int, BackoffNodeStrategy, FNgram&, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:629: undefined reference to `FNgramSpecs::FNgramSpec::BGGrandChildIter::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:636: undefined reference to `FNgramSpecs::FNgramSpec::ParentSubset::backoffValueRSubCtxW(unsigned int, unsigned int const*, unsigned int, BackoffNodeStrategy, FNgram&, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:629: undefined reference to `FNgramSpecs::FNgramSpec::BGGrandChildIter::next(unsigned int&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `LHash >::remove(unsigned int, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to `LHash >::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash >::removedData' /home/ak/Downloads/srilm/include/LHash.cc:473: undefined reference to `LHash >::removedData' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `LHash::remove(unsigned int, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:473: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::wordProbSum()': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:2930: undefined reference to `FNgramSpecs::FNgramSpec::LevelIter::next(unsigned int&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::rescoreFile(File&, double, double, LM&, double, double, char const*)': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:2805: undefined reference to 
`FNgramSpecs::loadWordFactors(char const* const*, WordMatrix&, unsigned int)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::pplFile(File&, TextStats&, char const*)': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:2684: undefined reference to `FNgramSpecs::loadWordFactors(char const* const*, WordMatrix&, unsigned int)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::computeBOWs(unsigned int, unsigned int)': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:2028: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::BGChildIterCnstr(unsigned int, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:2030: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::write(unsigned int, File&)': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:1256: undefined reference to `FNgramSpecs::FNgramSpec::LevelIter::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:1265: undefined reference to `FNgramSpecs::FNgramSpec::LevelIter::next(unsigned int&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::estimate(unsigned int)': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:1433: undefined reference to `FNgramSpecs::FNgramSpec::LevelIter::next(unsigned int&)' collect2: ld returned 1 exit status /home/ak/Downloads/srilm/sbin/decipher-install 0555 ../bin/i686-m64/ngram /home/ak/Downloads/srilm/bin/i686-m64 ERROR: File to be installed (../bin/i686-m64/ngram) does not exist. ERROR: File to be installed (../bin/i686-m64/ngram) is not a plain file. WARNING: creating directory /home/ak/Downloads/srilm/bin/i686-m64 Usage: decipher-install ... mode: file permission mode, in octal file1 ... fileN: files to be installed directory: where the files should be installed files = ../bin/i686-m64/ngram directory = /home/ak/Downloads/srilm/bin/i686-m64 mode = 0555 ***************************************************************************************************** Thanks a lot in advance. Sincere Regards, Anand Karthik -------------- next part -------------- An HTML attachment was scrubbed... URL: From amber.wilcox.ohearn at gmail.com Sat Feb 11 10:10:28 2012 From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn) Date: Sat, 11 Feb 2012 11:10:28 -0700 Subject: [SRILM User List] Question about SRILM and sentence boundary detection In-Reply-To: <4F2B2FF3.2070602@icsi.berkeley.edu> References: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de> <4F29A435.5050905@icsi.berkeley.edu> <4F2B2FF3.2070602@icsi.berkeley.edu> Message-ID: On Thu, Feb 2, 2012 at 5:53 PM, Andreas Stolcke wrote: > On 2/2/2012 8:29 AM, L. Amber Wilcox-O'Hearn wrote: >> >> I'm not sure if SRILM has something that does that -- i.e. holds the >> whole LM in RAM and waits for queries. ?You might need something like >> that as opposed to using a whole file, if you want just the >> probabilities of the last word with respect to the previous, and you >> want to compare different last words depending on results of previous >> calculations, for example. > > Two SRILM solutions: > > 1- Start ngram -lm LM -escape "===" -counts - (read from stdin) and put an > escape line (in this case, starting with "===") after every ngram in the > input (make sure the ngram words are followed my a count "1"). 
> This will cause ngram to dump out the conditional prob for the ngram right > away (instead of waiting for end-of-file). > > 2. Directly access the network LM server protocol implemented by ngram > -server-port. > Start the server with > % ngram -lm LM -server-port 8888 > then write ngrams to that TCP port and read back the log probs: > > % telnet localhost 8888 > my first word << input > -4.6499 >> output > > Of course you would do the equivalent of telnet in perl, python, C, or some > other language to make use of the probabilities. Thank you, Andreas. I wasn't aware of these capabilities. The server-port worked exactly as expected. That is, if I give it w1 w2 w3, it returns p(w3|w1w2). Combined with the caching, it looks very promising for my applications. The other solution using -counts (or actually -ppl for my case) also worked, but of course if I give it w1 w2 w3, it returns the probability of that whole string, i.e. p(w1) * p(w2|w1) * p(w3|w1w2), which would be redundant for my purposes. I ran > cat input_text | ngram -lm my_lm -escape "===" -ppl - -unk -no-sos -no-eos where input_text looked like: w1 w2 w3 === w1 w2 w3' Still, I'm glad it was brought up, because SRILM has so much functionality that I had overlooked something directly useful to me. Amber -- http://scholar.google.com/citations?user=15gGywMAAAAJ From alexx.tudor at gmail.com Sat Feb 11 15:20:37 2012 From: alexx.tudor at gmail.com (alex tudor) Date: Sun, 12 Feb 2012 01:20:37 +0200 Subject: [SRILM User List] SRILM install: LM.cc error Message-ID: Hello everyone, I compiled SRILM with Cygwin under Windows XP. First I had: -bash: LANG=${locale -uU}: bad substitution Afterwards all worked fine until I compiled make World and I had this error: LM.cc: In member function 'virtual unsigned int LM::probServer(unsigned int, unsigned int)': LM.cc:893:38: error: call of overloaded 'waitpid(int, NULL, int)' is ambiguous /usr/include/sys/wait.h:38:7: note: candidates are: pid_t waitpid(pid_t, int*, int) /usr/include/sys/wait.h:84:14: note: pid_t waitpid(pid_t, wait*, int) /cygdrive/c/srilm13/common/Makefile.common.targets:93: recipe for target '../obj/cygwin/LM.o' failed What can I do? Thanks in advance! Cheers, Alex -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Sat Feb 11 19:53:52 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sat, 11 Feb 2012 19:53:52 -0800 Subject: [SRILM User List] SRILM install: LM.cc error In-Reply-To: References: Message-ID: <4F3737D0.2060200@icsi.berkeley.edu> On 2/11/2012 3:20 PM, alex tudor wrote: > Hello everyone, > > I compiled SRILM with Cygwin under Windows XP. First I had: > > -bash: LANG=${locale -uU}: bad substitution > > Afterwards all worked fine until I compiled make World and I had > this error: > > LM.cc: In member function 'virtual unsigned int > LM::probServer(unsigned int, unsigned int)': > LM.cc:893:38: error: call of overloaded 'waitpid(int, NULL, int)' is > ambiguous > /usr/include/sys/wait.h:38:7: note: candidates are: pid_t > waitpid(pid_t, int*, int) > /usr/include/sys/wait.h:84:14: note: pid_t waitpid(pid_t, wait*, int) > /cygdrive/c/srilm13/common/Makefile.common.targets:93: recipe for > target '../obj/cygwin/LM.o' failed > > What can I do? Try replacing the line while (waitpid(-1, NULL, WNOHANG) > 0) { with while (waitpid(-1, (int *)NULL, WNOHANG) > 0) { Let me know if that works. Andreas -------------- next part -------------- An HTML attachment was scrubbed...
URL: From stolcke at icsi.berkeley.edu Sat Feb 11 20:03:36 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sat, 11 Feb 2012 20:03:36 -0800 Subject: [SRILM User List] Please help : Problem with installation of SRILM 1.4.6 on ubuntu 10.04 amd-64 bit machine In-Reply-To: References: Message-ID: <4F373A18.9020405@icsi.berkeley.edu> On 2/11/2012 2:31 AM, Anand Karthik wrote: > Hello, > I'm trying to install SRILM 1.4.6 on Ubuntu 10.04, a 64-bit AMD > machine. I have turned TCL off. > I have read the user archive and couldn't find a solution to the > problem. Please help me with the same. > > I'm using the following command: > make MACHINE_TYPE=i686-m64 SRILM=$PWD CC=/usr/bin/gcc CXX=/usr/bin/g++ > NO_TCL=X TCL_INCLUDE= TCL_LIBRARY= 2>&1 > make.log.txt > > uname -a > Linux ubuntu 2.6.32-38-generic #83-Ubuntu SMP Wed Jan 4 11:12:07 UTC > 2012 x86_64 GNU/Linux > > gcc version : > > Target: x86_64-linux-gnu > Configured with: ../src/configure -v --with-pkgversion='Ubuntu > 4.4.3-4ubuntu5' I cannot reproduce this error, even with the same gcc version on Ubuntu. The first thing to try when you have problems is always to get the latest version of SRILM. The current release is 1.6.0; you are using a version that is quite old. Andreas From prochva1 at fel.cvut.cz Sun Feb 12 01:37:10 2012 From: prochva1 at fel.cvut.cz (prochva1 at fel.cvut.cz) Date: Sun, 12 Feb 2012 10:37:10 +0100 Subject: [SRILM User List] SRILM install: LM.cc error In-Reply-To: <4F3737D0.2060200@icsi.berkeley.edu> References: <4F3737D0.2060200@icsi.berkeley.edu> Message-ID: <20120212103710.Horde.9IvYUuIAEqdPN4hG8jCRf1A@wimap.feld.cvut.cz> Quoting Andreas Stolcke: > On 2/11/2012 3:20 PM, alex tudor wrote: >> Hello everyone, >> >> I compiled SRILM with Cygwin under Windows XP. First I had: >> >> -bash: LANG=${locale -uU}: bad substitution >> >> Afterwards all worked fine until I compiled make World and I had >> this error: >> >> LM.cc: In member function 'virtual unsigned int >> LM::probServer(unsigned int, unsigned int)': >> LM.cc:893:38: error: call of overloaded 'waitpid(int, NULL, int)' >> is ambiguous >> /usr/include/sys/wait.h:38:7: note: candidates are: pid_t >> waitpid(pid_t, int*, int) >> /usr/include/sys/wait.h:84:14: note: pid_t waitpid(pid_t, wait*, int) >> /cygdrive/c/srilm13/common/Makefile.common.targets:93: recipe for >> target '../obj/cygwin/LM.o' failed >> >> What can I do? > Try replacing the line > > while (waitpid(-1, NULL, WNOHANG) > 0) { > > with > while (waitpid(-1, (int *)NULL, WNOHANG) > 0) { > > Let me know if that works. > > Andreas Hello, AFAICS both are base-files package/cygwin core problems (regressions from previous versions); both are already reported, and the second one seems to be fixed in cygwin CVS/snapshots ( http://cygwin.com/snapshots/ ). >> -bash: LANG=${locale -uU}: bad substitution http://cygwin.com/ml/cygwin/2012-02/msg00335.html waitpid overload problem http://cygwin.com/ml/cygwin/2012-02/msg00184.html http://cygwin.com/ml/cygwin-patches/2012-q1/msg00016.html Vaclav From zeinab.vakil at gmail.com Sun Feb 12 04:35:15 2012 From: zeinab.vakil at gmail.com (zeinab vakil) Date: Sun, 12 Feb 2012 16:05:15 +0330 Subject: [SRILM User List] Predicting specified words In-Reply-To: <4F335D3D.4070702@icsi.berkeley.edu> References: <4F335D3D.4070702@icsi.berkeley.edu> Message-ID: On 2/9/12, Andreas Stolcke wrote: > On 2/5/2012 8:14 PM, zeinab vakil wrote: >> Dear All, >> hi, >> I am newly getting to know SRILM and have a question.
>> Is it possible to use SRILM to predict a word that starts with a certain >> character? For example, the sentence is "i go to h...", and we want the >> word w that has the highest probability P(w | "i go to"), or even >> P(w | "to"), and starts with 'h'. >> Please guide me. >> Best Regards, >> zeinab vakil. > Boy, there seems to be a lot of interest lately in this sort of > prediction problem (see previous posts on this list). > > No, there is no ready-made solution for this in SRILM. I would probably > try to build a mixed word/letter ngram LM, estimating probabilities > p(next-letter | word-2, word-3, letter-1, letter-2, letter-3, ...). > > Andreas > Thanks for all the guidance. How can I query SRILM to obtain the probability P(word-1|word-2)? I want to use SRILM as a server, send my queries to it, and receive the probability of the requested bi-gram or n-gram. Is this possible? Please guide me. best regards, zeinab From alexx.tudor at gmail.com Sun Feb 12 05:27:33 2012 From: alexx.tudor at gmail.com (alex tudor) Date: Sun, 12 Feb 2012 15:27:33 +0200 Subject: [SRILM User List] Fwd: SRILM install: LM.cc error In-Reply-To: References: <4F3737D0.2060200@icsi.berkeley.edu> <20120212103710.Horde.9IvYUuIAEqdPN4hG8jCRf1A@wimap.feld.cvut.cz> Message-ID: ---------- Forwarded message ---------- From: alex tudor Date: Sun, Feb 12, 2012 at 3:23 PM Subject: Re: [SRILM User List] SRILM install: LM.cc error To: prochva1 at fel.cvut.cz Andreas, it works! Thank you! Vaclav, I read it, but that package fix isn't in the cygwin install yet. I'll try to download it separately. Now I have another problem: make[2]: Entering directory `/cygdrive/c/srilm13/dstruct/src' gcc -Wall -Wno-unused-variable -Wno-uninitialized -Wimplicit-int -I. -I../../include -c -g -O2 -o ../obj/cygwin/maxalloc.o maxalloc.c g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -L../../lib/cygwin -g -O2 -o ../bin/cygwin/maxalloc.exe ../obj/cygwin/maxalloc.o ../obj/cygwin/libdstruct.a -lm ../../lib/cygwin/libmisc.a -ltcl84 -lm /usr/lib/gcc/i686-pc-cygwin/4.5.3/../../../../i686-pc-cygwin/bin/ld: cannot find -ltcl84 collect2: ld returned 1 exit status /cygdrive/c/srilm13/common/Makefile.common.targets:108: recipe for target `../bin/cygwin/maxalloc.exe' failed I suppose I need tcl-tk 8.4, but cygwin only has 8.5.11. Any ideas? Alex On Sun, Feb 12, 2012 at 5:53 AM, Andreas Stolcke wrote: > > Try replacing the line > > while (waitpid(-1, NULL, WNOHANG) > 0) { > > with > while (waitpid(-1, (int *)NULL, WNOHANG) > 0) { > > Let me know if that works. > > Andreas > On Sun, Feb 12, 2012 at 11:37 AM, wrote: > > Hello, > > AFAICS both are base-files package/cygwin core problems (regressions from > previous versions); both are already reported, and the second one seems to be > fixed in cygwin CVS/snapshots ( http://cygwin.com/snapshots/ ). > >> -bash: LANG=${locale -uU}: bad substitution > > http://cygwin.com/ml/cygwin/2012-02/msg00335.html > > waitpid overload problem > > http://cygwin.com/ml/cygwin/2012-02/msg00184.html > http://cygwin.com/ml/cygwin-patches/2012-q1/msg00016.html > > Vaclav > > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed...
URL: From stolcke at icsi.berkeley.edu Sun Feb 12 17:24:43 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sun, 12 Feb 2012 17:24:43 -0800 Subject: [SRILM User List] Fwd: SRILM install: LM.cc error In-Reply-To: References: <4F3737D0.2060200@icsi.berkeley.edu> <20120212103710.Horde.9IvYUuIAEqdPN4hG8jCRf1A@wimap.feld.cvut.cz> Message-ID: <4F38665B.9070808@icsi.berkeley.edu> On 2/12/2012 5:27 AM, alex tudor wrote: > > ---------- Forwarded message ---------- > From: alex tudor > Date: Sun, Feb 12, 2012 at 3:23 PM > Subject: Re: [SRILM User List] SRILM install: LM.cc error > To: prochva1 at fel.cvut.cz > > Andreas, it works! Thank you! > Vaclav, I read it, but that package fix isn't in the cygwin install > yet. I'll try to download it separately. > Now I have another problem: > > make[2]: Entering directory `/cygdrive/c/srilm13/dstruct/src' > gcc -Wall -Wno-unused-variable -Wno-uninitialized -Wimplicit-int > -I. -I../../include -c -g -O2 -o ../obj/cygwin/maxalloc.o maxalloc.c > g++ -Wall -Wno-unused-variable -Wno-uninitialized > -DINSTANTIATE_TEMPLATES -I. -I../../include -L../../lib/cygwin > -g -O2 -o ../bin/cygwin/maxalloc.exe ../obj/cygwin/maxalloc.o > ../obj/cygwin/libdstruct.a -lm ../../lib/cygwin/libmisc.a -ltcl84 -lm > /usr/lib/gcc/i686-pc-cygwin/4.5.3/../../../../i686-pc-cygwin/bin/ld: > cannot find -ltcl84 > collect2: ld returned 1 exit status > /cygdrive/c/srilm13/common/Makefile.common.targets:108: recipe for > target `../bin/cygwin/maxalloc.exe' failed > > I suppose I need tcl-tk 8.4, but cygwin only has 8.5.11. Any ideas? You should be able to build with any recent Tcl version, possibly adjusting the name of the library. In the worst case just disable Tcl support as described in the FAQ. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Sun Feb 12 17:37:52 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sun, 12 Feb 2012 17:37:52 -0800 Subject: [SRILM User List] Question about SRILM and sentence boundary detection In-Reply-To: References: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de> <4F29A435.5050905@icsi.berkeley.edu> <4F2B2FF3.2070602@icsi.berkeley.edu> Message-ID: <4F386970.1040200@icsi.berkeley.edu> From: L. Amber Wilcox-O'Hearn > > Thank you, Andreas. I wasn't aware of these capabilities. > > The server-port worked exactly as expected. That is, if I give it w1 > w2 w3, it returns p(w3|w1w2). Combined with the caching, it looks > very promising for my applications. > > The other solution using -counts (or actually -ppl for my case) also > worked, but of course if I give it w1 w2 w3, it returns the > probability of that whole string, i.e. p(w1) * p(w2|w1) * p(w3|w1w2), > which would be redundant for my purposes. That's not correct. ngram -counts will output CONDITIONAL ngram probabilities. -counts countsfile: Perform a computation similar to -ppl, but based only on the N-gram counts found in countsfile. Probabilities are computed for the last word of each N-gram, using the other words as contexts, and scaling by the associated N-gram count. Summary statistics are output at the end, as well as before each escaped input line. So it should do exactly what you need. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL:
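For concreteness, the countsfile input implied above is one N-gram per line followed by its count; combined with -escape, a batch of conditional-probability queries can look like this (a hypothetical input with placeholder words, each count set to 1 as suggested earlier in this thread):

    w1 w2 w3 1
    ===
    w1 w2 w3' 1
    ===

Each block then yields the conditional probability of the last word given the preceding ones, scaled by the trailing count.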
From stolcke at icsi.berkeley.edu Mon Feb 13 07:38:26 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Mon, 13 Feb 2012 07:38:26 -0800 Subject: [SRILM User List] Predicting specified words In-Reply-To: References: <4F335D3D.4070702@icsi.berkeley.edu> Message-ID: <4F392E72.7030703@icsi.berkeley.edu> On 2/12/2012 4:35 AM, zeinab vakil wrote: > > Thanks for all the guidance. > How can I query SRILM to obtain the probability P(word-1|word-2)? > I want to use SRILM as a server, send my queries to it, and receive > the probability of the requested bi-gram or n-gram. Is this possible? > Please guide me. > best regards, > zeinab If you want to invoke SRILM via the C++ API, use the wordProb() function. The other options are writing/reading to/from ngram via a pipe, or using the ngram client/server protocol. See this recent thread for details: http://www.speech.sri.com/pipermail/srilm-user/2012q1/001148.html . Andreas
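A hedged Python sketch of the pipe option just mentioned, for one conditional bigram probability at a time. The LM file name is a placeholder, and the output parsing assumes that -counts input with -debug 2 prints per-ngram "p( ... )" lines the same way -ppl does; verify against your ngram version before relying on it:

    import subprocess

    def bigram_logprob(w2, w1, lm="model-name.lm"):
        # Feed one bigram with count 1 on stdin; -counts treats the last
        # word as the predicted one, so this asks for log10 P(w1 | w2).
        result = subprocess.run(
            ["ngram", "-lm", lm, "-order", "2", "-counts", "-", "-debug", "2"],
            input=w2 + " " + w1 + " 1\n",
            capture_output=True, text=True, check=True)
        for line in result.stdout.splitlines():
            if line.lstrip().startswith("p("):
                # e.g. p( w1 | w2 ) = [2gram] 0.10582 [ -0.975432 ]
                return float(line.split("[")[-1].strip(" ]"))
        raise RuntimeError("no probability line in ngram output")

Starting one process per query reloads the LM each time; for bulk queries, batch many lines per invocation (see the escape-line recipe in this thread) or keep the server running.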
From amber.wilcox.ohearn at gmail.com Tue Feb 14 04:54:31 2012 From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn) Date: Tue, 14 Feb 2012 05:54:31 -0700 Subject: [SRILM User List] Question about SRILM and sentence boundary detection In-Reply-To: <4F386970.1040200@icsi.berkeley.edu> References: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de> <4F29A435.5050905@icsi.berkeley.edu> <4F2B2FF3.2070602@icsi.berkeley.edu> <4F386970.1040200@icsi.berkeley.edu> Message-ID: On Sun, Feb 12, 2012 at 6:37 PM, Andreas Stolcke wrote: > From: L. Amber Wilcox-O'Hearn > > Thank you, Andreas. I wasn't aware of these capabilities. > > The server-port worked exactly as expected. That is, if I give it w1 > w2 w3, it returns p(w3|w1w2). Combined with the caching, it looks > very promising for my applications. > > The other solution using -counts (or actually -ppl for my case) also > worked, but of course if I give it w1 w2 w3, it returns the > probability of that whole string, i.e. p(w1) * p(w2|w1) * p(w3|w1w2), > which would be redundant for my purposes. > > That's not correct. ngram -counts will output CONDITIONAL ngram > probabilities. > -counts countsfile: Perform a computation similar to -ppl, but based only on > the N-gram counts found in countsfile. Probabilities are computed for the > last word of each N-gram, using the other words as contexts, and scaling by > the associated N-gram count. Summary statistics are output at the end, as > well as before each escaped input line. So it should do exactly what you > need. I see. I misunderstood the difference between -ppl and -counts. I did try this, and the summary statistics at the end gave the correct sum, but there weren't any statistics output before the escaped lines: > cat testcounts | ngram -lm LM -escape "===" -counts - -unk === === === file -: 0 sentences, 4 words, 0 OOVs 0 zeroprobs, logprob= -9.87606 ppl= 294.452 ppl1= 294.452 Did I miss something? Amber -- http://scholar.google.com/citations?user=15gGywMAAAAJ From stolcke at icsi.berkeley.edu Tue Feb 14 08:41:01 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 14 Feb 2012 08:41:01 -0800 Subject: [SRILM User List] Question about SRILM and sentence boundary detection In-Reply-To: References: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de> <4F29A435.5050905@icsi.berkeley.edu> <4F2B2FF3.2070602@icsi.berkeley.edu> <4F386970.1040200@icsi.berkeley.edu> Message-ID: <4F3A8E9D.4080509@icsi.berkeley.edu> On 2/14/2012 4:54 AM, L. Amber Wilcox-O'Hearn wrote: > I see. I misunderstood the difference between -ppl and -counts. > > I did try this, and the summary statistics at the end gave the correct > sum, but there weren't any statistics output before the escaped lines: >> cat testcounts | ngram -lm LM -escape "===" -counts - -unk > === > === > === > file -: 0 sentences, 4 words, 0 OOVs > 0 zeroprobs, logprob= -9.87606 ppl= 294.452 ppl1= 294.452 > > Did I miss something? This is poorly documented. The escape lines trigger output of "sentence level" statistics. At the end, you get the "file level" statistics. However, to be compatible with -ppl, sentence-level stats are only output with -debug 1 or higher. So your example will work as long as you also add -debug 1. Andreas From amber.wilcox.ohearn at gmail.com Tue Feb 14 11:20:13 2012 From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn) Date: Tue, 14 Feb 2012 12:20:13 -0700 Subject: [SRILM User List] Question about SRILM and sentence boundary detection In-Reply-To: <4F3A8E9D.4080509@icsi.berkeley.edu> References: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de> <4F29A435.5050905@icsi.berkeley.edu> <4F2B2FF3.2070602@icsi.berkeley.edu> <4F386970.1040200@icsi.berkeley.edu> <4F3A8E9D.4080509@icsi.berkeley.edu> Message-ID: On Tue, Feb 14, 2012 at 9:41 AM, Andreas Stolcke wrote: > On 2/14/2012 4:54 AM, L. Amber Wilcox-O'Hearn wrote: >> I see. I misunderstood the difference between -ppl and -counts. >> >> I did try this, and the summary statistics at the end gave the correct >> sum, but there weren't any statistics output before the escaped lines: >>> cat testcounts | ngram -lm LM -escape "===" -counts - -unk >> === >> === >> === >> file -: 0 sentences, 4 words, 0 OOVs >> 0 zeroprobs, logprob= -9.87606 ppl= 294.452 ppl1= 294.452 >> >> Did I miss something? > This is poorly documented. The escape lines trigger output of "sentence > level" statistics. At the end, you get the "file level" statistics. > However, to be compatible with -ppl, sentence-level stats are only output > with -debug 1 or higher. So your example will work as long as you also add > -debug 1. Ah, perfect. Thank you very much! -Amber
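Putting the resolved recipe together, a hypothetical Python batch query using escape lines, -counts, and -debug 1. The LM path is a placeholder, and the parsing assumes each escaped block emits one "... logprob= ..." statistics line, followed by a final file-level summary, as in the output above:

    import subprocess

    def batch_logprobs(ngrams, lm="model-name.lm"):
        # ngrams: list of word tuples; returns log10 P(last | rest) per tuple.
        payload = "".join(" ".join(ng) + " 1\n===\n" for ng in ngrams)
        out = subprocess.run(
            ["ngram", "-lm", lm, "-escape", "===", "-counts", "-", "-debug", "1"],
            input=payload, capture_output=True, text=True, check=True).stdout
        scores = [float(line.split("logprob=")[1].split()[0])
                  for line in out.splitlines() if "logprob=" in line]
        return scores[:len(ngrams)]  # the last logprob line is the file summary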
Andreas

From nobyte at sina.com  Wed Feb 22 00:21:46 2012
From: nobyte at sina.com (huajian xue)
Date: Wed, 22 Feb 2012 16:21:46 +0800
Subject: [SRILM User List] (no subject)
Message-ID: <78cc40$1ic2as5@irxd5-187.sinamail.sina.com.cn>

Hello,

Can the currently released SRILM toolkit be used to build a discriminative
language model?

Thanks,
Xue

From amber.wilcox.ohearn at gmail.com  Fri Feb 24 12:43:52 2012
From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn)
Date: Fri, 24 Feb 2012 13:43:52 -0700
Subject: [SRILM User List] Limiting the vocabulary size of an n-gram model
Message-ID: 

Greetings.

I am constructing a large trigram model using a pre-specified vocabulary
size.  What I have done in the past is to first get the unigram counts, and
then sort the top N most frequent words into my vocabulary file, which I
then pass to ngram-count for computing the trigram counts, which I then
pass again to ngram-count to construct the LM.

However, I seem to remember having read that the count-of-counts estimates
will be better if I compute the trigram counts first, and only limit the
vocabulary in the final step.  Is that correct?  Are there any other
shortcuts for this?

Thank you,
Amber

-- 
http://scholar.google.com/citations?user=15gGywMAAAAJ

From stolcke at icsi.berkeley.edu  Fri Feb 24 13:16:55 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 24 Feb 2012 13:16:55 -0800
Subject: [SRILM User List] Limiting the vocabulary size of an n-gram model
In-Reply-To: 
References: 
Message-ID: <4F47FE47.6080807@icsi.berkeley.edu>

On 2/24/2012 12:43 PM, L. Amber Wilcox-O'Hearn wrote:
> However, I seem to remember having read that the count-of-counts estimates
> will be better if I compute the trigram counts first, and only limit the
> vocabulary in the final step.  Is that correct?  Are there any other
> shortcuts for this?

This is correct.  The make-big-lm script (a wrapper around ngram-count)
will extract the discounting statistics from the full vocabulary and then
apply them to the LM estimation with a limited vocabulary.  Check the
training-scripts(1) man page.
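For example (a sketch -- the file names are placeholders for your own):

  make-big-lm -name biglm -read full-vocab-counts.gz -order 3 \
      -kndiscount -interpolate -vocab top100k.vocab -lm limited.lm

The discounting parameters are estimated from the counts before the -vocab
restriction is applied.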
Andreas

From ariya at jhu.edu  Fri Feb 24 14:35:10 2012
From: ariya at jhu.edu (Ariya Rastrow)
Date: Fri, 24 Feb 2012 17:35:10 -0500
Subject: [SRILM User List] NgramCountLM Bug?
Message-ID: 

Hi,

I had a question about NgramCountLM (the Jelinek-Mercer interpolation
method).  It seems to me there is a bug in the way the \lambda parameters
are being estimated in the code.  The problem is that the expectations for
the \lambda's (using EM) are being collected by iterating through the
N-grams of the held-out text.  However, the count of each N-gram is not
being taken into account (even though, for calculating the log-probability
of the held-out data, the wordProb is multiplied by the count of the
N-gram) during the call to LM::countsProb(...) by NgramCountLM::estimate().
In other words, the statistics for the \lambda's are being collected as if
each event were a singleton in the held-out data.  The fix to this would be
to pass *count from LM::countsProb(...) to NgramCountLM::wordProbTrain(...)
such that the posteriors of the \lambda's get multiplied by that count.

Thanks,
Ariya

From stolcke at icsi.berkeley.edu  Fri Feb 24 19:42:40 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 24 Feb 2012 19:42:40 -0800
Subject: [SRILM User List] NgramCountLM Bug?
In-Reply-To: 
References: 
Message-ID: <4F4858B0.1030207@icsi.berkeley.edu>

On 2/24/2012 2:35 PM, Ariya Rastrow wrote:
> In other words, the statistics for the \lambda's are being collected as if
> each event were a singleton in the held-out data.  The fix to this would be
> to pass *count from LM::countsProb(...) to NgramCountLM::wordProbTrain(...)
> such that the posteriors of the \lambda's get multiplied by that count.

Good catch!  That is indeed a bug.  Attached is a patch that should do the
right thing.
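Schematically, the change is what you describe (the identifier names here
are approximate -- a paraphrase of the fix, not the literal patch):

  /* NgramCountLM::wordProbTrain(), now receiving the held-out ngram
     count: weight the EM statistics by it */
  lambdaCounts[i] += count * posterior;    /* was:  += posterior */

so that an ngram occurring N times in the held-out data contributes N
times to the \lambda posteriors, just as it does to the log-likelihood.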
Andreas

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ngramcountlm.patch
URL: 

From amber.wilcox.ohearn at gmail.com  Sat Feb 25 08:57:35 2012
From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn)
Date: Sat, 25 Feb 2012 09:57:35 -0700
Subject: [SRILM User List] make-big-lm merge-batch-counts mv error?
Message-ID: 

Just a quick follow-up:  I'm now trying to put this all together, but I'm
getting the following error:

[amber]$ make-big-lm -debug 1 -kndiscount3 -unk -name test_lm \
    -read counts_3/merge-iter7-1.ngrams.gz -vocab test.vocab
+ make-kn-counts no_max_order=1 max_per_file=10000000 order=3
  kndiscount1=0 kndiscount2=0 kndiscount3=1 kndiscount4=0 kndiscount5=0
  kndiscount6=0 kndiscount7=0 kndiscount8=0 kndiscount9=0
  output=test_lm.kndir/kncounts
+ merge-batch-counts test_lm.kndir
final counts in
mv: missing destination file operand after `test_lm.kncounts.gz'
Try `mv --help' for more information.

Any ideas about what I'm missing?

Thanks again,
Amber

From amber.wilcox.ohearn at gmail.com  Sun Feb 26 16:24:12 2012
From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn)
Date: Sun, 26 Feb 2012 17:24:12 -0700
Subject: [SRILM User List] make-big-lm merge-batch-counts mv error?
In-Reply-To: 
References: 
Message-ID: 

I finally figured out my error here.  I had passed make-big-lm an order-3
counts file, not an order *up to and including* 3 counts file.  In
response, make-kn-counts silently generated no output, and then there was
no file to mv.

Amber

From stolcke at icsi.berkeley.edu  Sun Feb 26 16:38:44 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sun, 26 Feb 2012 16:38:44 -0800
Subject: [SRILM User List] make-big-lm merge-batch-counts mv error?
In-Reply-To: 
References: 
Message-ID: <4F4AD094.2050808@icsi.berkeley.edu>

On 2/26/2012 4:24 PM, L. Amber Wilcox-O'Hearn wrote:
> I finally figured out my error here.  I had passed make-big-lm an order-3
> counts file, not an order *up to and including* 3 counts file.  In
> response, make-kn-counts silently generated no output, and then there was
> no file to mv.

Good.  Also, you want to use -kndiscount, not -kndiscount3.  With the
latter, you would only apply KN discounting to trigrams, but that doesn't
really make sense, since KN discounting relies on modifying the lower-order
ngram distributions.

Andreas

From dmytro.prylipko at ovgu.de  Mon Feb 27 05:45:12 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Mon, 27 Feb 2012 14:45:12 +0100
Subject: [SRILM User List] Observed omit event
Message-ID: 

Hi,

I would like to clarify how to properly evaluate a language model with an
observed hidden event (), omitted from context.

I have manually created the counts file, where this event had been skipped
from the context, and have built an LM from that.  Also, I have added this
line to the end of the LM file:

  -observed -omit

My question is whether it is necessary to specify a hidden vocabulary with
the -hidden-vocab option.  Which command line is correct:

ngram -lm 3-gram.omit.lm -ppl test.txt -order 3 -vocab wlist -hidden-vocab df.defs

or just

ngram -lm 3-gram.omit.lm -ppl test.txt -order 3 -vocab wlist

Thanks.

Yours,
Dmytro Prylipko.

From amber.wilcox.ohearn at gmail.com  Mon Feb 27 17:10:42 2012
From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn)
Date: Mon, 27 Feb 2012 18:10:42 -0700
Subject: [SRILM User List] make-big-lm merge-batch-counts mv error?
In-Reply-To: <4F4AD094.2050808@icsi.berkeley.edu>
References: <4F4AD094.2050808@icsi.berkeley.edu>
Message-ID: 

On Sun, Feb 26, 2012 at 5:38 PM, Andreas Stolcke wrote:
> Good.  Also, you want to use -kndiscount, not -kndiscount3.  With the
> latter, you would only apply KN discounting to trigrams, but that doesn't
> really make sense, since KN discounting relies on modifying the lower-order
> ngram distributions.

Oh, wow.  Thanks for pointing that out!

Amber

From chenmengdx at gmail.com  Sun Mar 4 16:56:01 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Mon, 5 Mar 2012 08:56:01 +0800
Subject: [SRILM User List] How to process the disfluency words when building LM
Message-ID: 

Hello, I tried to make a language model from some non-native spontaneous
speech transcriptions.  However, there are lots of "strange words" in the
corpus, because the transcriber tried to transcribe as close as possible to
the real pronunciation.
For example, some transcriptions are as follows:

she taught english there and she gave english lesson to a secondary school students in *boli bolivi bolivia*
*er* what's wrong *er* he asked she asked
her her mother would *em er* her she took her mother in her own house and the baby *em* *moven bester*

So I want to ask how I should process these "strange words" that don't
exist, such as boli, bolivi, er, em, moven, bester, etc.  If I replace them
with the correct words, the language model will be unsuitable for the
non-native spontaneous speech task.  If I keep them, their counts and
probabilities are too small, and the dictionary is also hard to generate.

Are there any suggestions?

Thanks!

From reham.imamu at gmail.com  Mon Mar 5 07:17:15 2012
From: reham.imamu at gmail.com (Reham Al-Majed)
Date: Mon, 5 Mar 2012 18:17:15 +0300
Subject: [SRILM User List] disambig with Class-based n gram
In-Reply-To: 
References: 
Message-ID: 

Hello,

I've built a class-based n-gram by:

1- defining my classes
2- using replace-words-with-classes
3- using ngram-count to estimate the LM

I want to use this class-based n-gram model with the disambig tool.  The
options (-factored and -count-lm) interpret the LMs as factored and
count-based LMs ... what about class-based?  How do I tell disambig to
interpret the LM as class-based?

I'm trying to use my class-based LM as an ordinary n-gram model, however
the output for a sample test seems strange ... words in the test sample
are always disambiguated using the last word in the mapping file!

Actually, I want the words to be disambiguated using the LM probabilities
only, without considering the probabilities in the mapping file.  I use
the options -lmw 1 and -mapw 0 but the output is still the same ...

In short, my questions are:

1- Is it possible to use a class-based n-gram with the disambig tool?  Or
should I build my own disambiguator using the output of the ngram tool?

2- How do I make the disambig tool use the probabilities of the LM ONLY?

Your help is really greatly appreciated ...

Thanks in Advance,
Reham

From stolcke at icsi.berkeley.edu  Mon Mar 5 10:09:32 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 05 Mar 2012 10:09:32 -0800
Subject: [SRILM User List] disambig with Class-based n gram
In-Reply-To: 
References: 
Message-ID: <4F55015C.60301@icsi.berkeley.edu>

On 3/5/2012 7:17 AM, Reham Al-Majed wrote:
> 1- Is it possible to use a class-based n-gram with the disambig tool?  Or
> should I build my own disambiguator using the output of the ngram tool?

Unfortunately, disambig currently does not support the use of class-based
ngram LMs (what is implemented by ngram -classes).  Two workarounds are

1) if feasible, expand the class-ngram LM into a word-ngram LM (using
ngram -expand-classes -- a sketch follows below).

2) rewrite the class-ngram as a factored LM.  This will require some
investment into understanding the much more general FLM mechanism.
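For 1), the command would look something like this (an untested sketch --
the file names are placeholders):

  ngram -vocab wlist -classes your.classes -lm class.lm -order 3 \
      -expand-classes 3 -write-lm word.lm

The expanded word.lm is then a plain word-ngram model that disambig can
read with its -lm option.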
> 2- How do I make the disambig tool use the probabilities of the LM ONLY?

disambig -mapw 0 will do that.

Andreas

From stolcke at icsi.berkeley.edu  Tue Mar 6 12:34:20 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 06 Mar 2012 12:34:20 -0800
Subject: [SRILM User List] Observed omit event
In-Reply-To: <4F4BB9E4.7070403@icsi.berkeley.edu>
References: <4F4BB9E4.7070403@icsi.berkeley.edu>
Message-ID: <4F5674CC.1060300@icsi.berkeley.edu>

The attached source patch will fix the behavior of ngram -hidden-vocab so
that the vocab file can contain event property specifications as described
in the man page.  Previously, only the names of the hidden event words
were read from that file, and all were treated as default hidden events.

The patch also fixes a couple of unrelated bugs in HiddenNgram.cc.
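With the patch, the properties can go directly into the -hidden-vocab
file.  For example, a df.defs along these lines (the event token <df> is a
placeholder -- use whatever event name your LM actually contains):

  <df> -observed -omit

used with your original command:

  ngram -lm 3-gram.omit.lm -ppl test.txt -order 3 -vocab wlist -hidden-vocab df.defs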
Andreas

On 2/27/2012 9:14 AM, Andreas Stolcke wrote:
> On 2/27/2012 5:45 AM, Dmytro Prylipko wrote:
>> My question is whether it is necessary to specify a hidden vocabulary
>> with the -hidden-vocab option.  Which command line is correct:
>>
>> ngram -lm 3-gram.omit.lm -ppl test.txt -order 3 -vocab wlist -hidden-vocab df.defs
>>
>> or just
>>
>> ngram -lm 3-gram.omit.lm -ppl test.txt -order 3 -vocab wlist
>
> If you append the hidden vocab definitions to the LM file, you only need
> to tell ngram that it IS a hidden-event LM that you're reading.  You can
> achieve that by adding -hidden-vocab /dev/null .
>
> Andreas

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: hidden-ngram.patch
URL: 

From reham.imamu at gmail.com  Tue Mar 6 13:11:06 2012
From: reham.imamu at gmail.com (Reham Al-Majed)
Date: Wed, 7 Mar 2012 00:11:06 +0300
Subject: [SRILM User List] disambig with FLM
Message-ID: 

Thanks a lot for your reply,

I'm trying to build an FLM with the following FLM specification file:

## normal trigram LM
1

W : 2 W(-1) W(-2) FLMCount.count FLMLM.lm 3
	W1,W2 W2 wbdiscount interpolate
	W1 W1 wbdiscount interpolate
	0 0 wbdiscount

I generate my FLM model using the following command:

fngram-count -factor-file FLMDes -debug 2 -text TrainFLM -lm FLMLM.lm
    -write-counts FLMcount.count -no-virtual-begin-sentence -nonull

It runs without errors.  I then measure the ppl of the generated FLM with
the following command:

fngram -factor-file FLMDes -debug 2 -ppl FLMTest -nonull

Unfortunately, when I tried to test the main step I got an error :(
I searched the mailing list archive but I didn't find a similar problem.

The command I used to test disambig with my FLM model was:

disambig -text FLMTest -map 3.map -factored -lm FLMLM.lm

The output of this command was:

No known factors found in Aa
No known factors found in AA
No known factors found in aa
No known factors found in Bb
No known factors found in bb
No known factors found in BB
No known factors found in CC
No known factors found in cc
No known factors found in Cc
FLMLM.lm: line 2: Error: couldn't form int for number of factored LMs in
when reading FLM spec file

I don't know what is meant by "No known factors found in ......", and I
wonder about the error message "couldn't form int for number of factored
LMs when reading FLM spec file" .... As you can see above in my FLM
specification file, I did specify the number of FLM specifications!

Some notes that may help you solve my problem:

-- I built my model to test disambig with FLM before using it in my
project, so it was built with training data of only 28 sentences,
138 words.

-- The mapping file (named 3.map) used to test disambig was:

W-aa	Aa 0.5 AA 0.4 aa 0.1
W-bb	Bb 0.6 bb 0.1 BB 0.3
W-cc	CC 0.7 cc 0.1 Cc 0.2

-- The FLMTest file contains only one sentence:

W-aa W-bb W-cc

Am I doing something wrong?

Your help and support is really greatly appreciated ..  I have a
graduation project that needs a disambiguator for a highly inflected
language, and I'm worried that I won't be able to use your disambig
program with an FLM model :(

Best Regards,
Reham

On 5 March 2012 21:09, Andreas Stolcke wrote:
> Unfortunately, disambig currently does not support the use of class-based
> ngram LMs (what is implemented by ngram -classes).  Two workarounds are
> 1) if feasible, expand the class-ngram LM into a word-ngram LM (using
> ngram -expand-classes).
> 2) rewrite the class-ngram as a factored LM.  This will require some
> investment into understanding the much more general FLM mechanism.
>
> Andreas
From vinay.amnsit at gmail.com  Thu Mar 8 22:39:22 2012
From: vinay.amnsit at gmail.com (Vinay Shashidhar)
Date: Fri, 9 Mar 2012 12:09:22 +0530
Subject: [SRILM User List] Posterior Probability : HTK
Message-ID: 

Hi Guys,

I have read a lot of papers describing posterior probabilities as more
robust, speaker-independent features, but how does one calculate them?

I am using HTK and am doing forced alignment.  All I get is the likelihood
scores.

Thanks.  Looking forward to your help!

regards
Vinay

From stolcke at icsi.berkeley.edu  Fri Mar 9 16:02:07 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 09 Mar 2012 16:02:07 -0800
Subject: [SRILM User List] Posterior Probability : HTK
In-Reply-To: 
References: 
Message-ID: <4F5A99FF.4040105@icsi.berkeley.edu>

The first step is to compute posterior probabilities for arcs and nodes in
your lattice, using the forward-backward algorithm.  The posterior
probability is the sum of the scores of all paths going through an
arc/node, normalized by the sum over all paths through the lattice.  This
is implemented by the lattice-tool -write-posteriors option (the output
format is different from HTK format, though).  It is important to scale
the combined acoustic/language model scores; check the -posterior-scale
option.

Often one wants posterior probabilities at the word level, combining all
word hypotheses that occur at the same "position" in the lattice.  For
this you can build a word confusion network, or "word mesh".  This is done
by the lattice-tool -write-mesh option.

For an introduction to these concepts you might want to check the article
http://www.speech.sri.com/cgi-bin/run-distill?ftp:papers/CSL2000-consensus.ps.gz,
but note that the confusion network algorithm in SRILM is not the same as
described in there.
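A sketch of the command-line usage (untested; the lattice name and scale
value are placeholders -- see the lattice-tool(1) man page for the exact
argument conventions):

  lattice-tool -in-lattice lat.slf -read-htk -posterior-scale 15 \
      -write-posteriors lat.post

or, for word-level posteriors via a confusion network:

  lattice-tool -in-lattice lat.slf -read-htk -posterior-scale 15 \
      -write-mesh lat.mesh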
Andreas

From stolcke at icsi.berkeley.edu  Fri Mar 9 16:28:17 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 09 Mar 2012 16:28:17 -0800
Subject: [SRILM User List] How to process the disfluency words when building LM
In-Reply-To: 
References: 
Message-ID: <4F5AA021.4080408@icsi.berkeley.edu>

On 3/4/2012 4:56 PM, Meng Chen wrote:
> Hello, I tried to make a language model from some non-native spontaneous
> speech transcriptions.  However, there are lots of "strange words" in the
> corpus, because the transcriber tried to transcribe as close as possible
> to the real pronunciation.

First, such words are not strange at all, and occur even for native
speakers when speaking spontaneously.  "er" and "em" are called "filled
pauses", and "boli" etc. "word fragments".  Both are associated with a
more general class of spontaneous speech phenomena called "disfluencies".
For an overview see
http://www.speech.sri.com/cgi-bin/run-distill?papers/icslp96-dfs-swb.ps.gz .

> So I want to ask how I should process these "strange words" that don't
> exist, such as boli, bolivi, er, em, moven, bester, etc.

Filled pauses are usually modeled as any other words, though you might
normalize their spellings.  There are usually just two forms, with and
without nasal (usually spelled "um" and "uh", respectively).  You should
normalize alternative spellings like "ah", "eh", "er", etc. and map them
to the standard form to avoid fragmenting your data.  Often people use a
dedicated vowel phone for pronunciations of these words, because they are
more variable in quality and duration than the standard schwa phone.

Fragments, especially short ones, are hard to recognize because they are
very confusable.  First, you should use a spelling convention that
distinguishes them from full words, usually with a final hyphen, e.g.,
"boli-".  For LM training purposes you might want to delete them entirely,
and represent them with a garbage model in acoustic training, to avoid
contaminating the models for regular words.  At SRI we tried modeling the
most frequent word fragments in the AM and LM, but even those (especially
because they tend to have just one or two phones) are not recognized well,
and removing them from the LM was best for overall word recognition
accuracy.

Andreas

From rico.sennrich at gmx.ch  Mon Mar 12 06:10:00 2012
From: rico.sennrich at gmx.ch (Rico Sennrich)
Date: Mon, 12 Mar 2012 14:10:00 +0100
Subject: [SRILM User List] nan in language model
Message-ID: <1331557800.12711.25.camel@rico-work>

Hi list,

Occasionally, I get 'nan' as a probability or backoff weight in LMs
trained with SRILM.  This is not expected in an ARPA file, and it
eventually leads to crashes / undefined behaviour in other programs that
use the model.

Here are some statistics:

\data\
ngram 1=2054819
ngram 2=40441708
ngram 3=187680929
ngram 4=382878635
ngram 5=519867931

probability nan:
1 0
2 0
3 0
4 0
5 1233183

backoff nan:
1 0
2 0
3 0
4 415865
5 0

Here are the training parameters:

make-batch-counts file-list.txt 10 cat /wrk/smt/tmp -order 5

make-big-lm -kndiscount -interpolate -order 5 -read \
tmp/file-list.txt-1.ngrams.gz -unk -lm hugelm.gz

This happened with SRILM 1.5.9 and 1.6.0-beta, and stderr didn't show any
errors/warnings.

best wishes,
Rico

From stolcke at icsi.berkeley.edu  Mon Mar 12 09:33:15 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 12 Mar 2012 09:33:15 -0700
Subject: [SRILM User List] nan in language model
In-Reply-To: <1331557800.12711.25.camel@rico-work>
References: <1331557800.12711.25.camel@rico-work>
Message-ID: <4F5E254B.9040103@icsi.berkeley.edu>

On 3/12/2012 6:10 AM, Rico Sennrich wrote:
> Occasionally, I get 'nan' as a probability or backoff weight in LMs
> trained with SRILM.  This is not expected in an ARPA file, and it
> eventually leads to crashes / undefined behaviour in other programs that
> use the model.

It's certainly not supposed to happen.  In your case it looks like 5-grams
end up with nan probabilities, which would then lead to BOWs also being
computed as NaNs.  I have never seen this, actually.
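(To pull out the offending entries for inspection, something like this
one-liner should work -- a sketch, adjust the file name:

  gzip -dcf hugelm.gz | awk '$1 == "nan" || $NF == "nan"' | head

since in the ARPA format the probability is the first field on each ngram
line, and the backoff weight, when present, is the last.)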
It would help to try a few things:

- see if it only happens with -kndiscount.
- try to elicit the problem with a smaller amount of input data (e.g.,
  including only the ngrams that have the NaNs in the probabilities).
- see if those ngram counts have any special properties.

Andreas

From john at dowding.net  Tue Mar 13 19:34:29 2012
From: john at dowding.net (John Dowding)
Date: Tue, 13 Mar 2012 19:34:29 -0700
Subject: [SRILM User List] distance between two language models
Message-ID: <01b701cd018a$fca63cc0$f5f2b640$@dowding.net>

Hi,

I have an application where I need to create LMs for a large number of
categories of text (thousands).  I'd like to be able to combine the LMs in
cases where two (or more) categories are sufficiently similar.

Does SRILM provide a way to compute the distance between two LMs?  Is
there another approach I should consider?

Thanks
John

From amber.wilcox.ohearn at gmail.com  Wed Mar 14 06:45:17 2012
From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn)
Date: Wed, 14 Mar 2012 07:45:17 -0600
Subject: [SRILM User List] distance between two language models
In-Reply-To: <01b701cd018a$fca63cc0$f5f2b640$@dowding.net>
References: <01b701cd018a$fca63cc0$f5f2b640$@dowding.net>
Message-ID: 

On Tue, Mar 13, 2012 at 8:34 PM, John Dowding wrote:
> Does SRILM provide a way to compute the distance between two LMs?  Is
> there another approach I should consider?

I would use KL divergence, or a related measure.

-- 
http://scholar.google.com/citations?user=15gGywMAAAAJ

From stolcke at icsi.berkeley.edu  Wed Mar 14 09:34:23 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 14 Mar 2012 09:34:23 -0700
Subject: [SRILM User List] distance between two language models
In-Reply-To: 
References: <01b701cd018a$fca63cc0$f5f2b640$@dowding.net>
Message-ID: <4F60C88F.6020003@icsi.berkeley.edu>

On 3/14/2012 6:45 AM, L. Amber Wilcox-O'Hearn wrote:
> On Tue, Mar 13, 2012 at 8:34 PM, John Dowding wrote:
>> Does SRILM provide a way to compute the distance between two LMs?  Is
>> there another approach I should consider?
> I would use KL divergence, or a related measure.

Exactly, but computing the KL divergence between two ngram models exactly
would require some work.  You'd have to iterate over all ngrams occurring
in either model (including those handled by backoff) and sum up
p1(w,h) log p2(w|h).

Of course, an empirical estimate of the KL divergence is easy: to estimate
the cross-entropy, you just run ngram -ppl on a sample of the source for
model 2, computing probabilities using model 1.
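Concretely (a sketch, with sample2.txt standing in for held-out text from
the source of model 2):

  ngram -lm model1.lm -ppl sample2.txt
  ngram -lm model2.lm -ppl sample2.txt

The difference between the two per-word log probabilities is then an
empirical estimate of the extra cost of using model 1 in place of model 2
on that source -- i.e., of the KL divergence, up to the approximation that
model 2 matches the source.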
Andreas

From songbaoqiang at gmail.com  Tue Mar 27 03:10:56 2012
From: songbaoqiang at gmail.com (vincent sung)
Date: Tue, 27 Mar 2012 18:10:56 +0800
Subject: [SRILM User List] From China
Message-ID: 

I need help from SRI International.  I want to use SRI technology to build
a project for Chinese speakers to learn English.  Can anyone tell me whom
I should contact?  Thanks for your patience.  I'm in Beijing.

I have a Business Plan, and if anyone is interested I will share my BP
with you.

My Email: songbaoqiang at gmail.com
Skype: songbaoqiang

From stolcke at icsi.berkeley.edu  Tue Mar 27 07:29:19 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 27 Mar 2012 07:29:19 -0700
Subject: [SRILM User List] From China
In-Reply-To: 
References: 
Message-ID: <4F71CEBF.9060308@icsi.berkeley.edu>

This list is not appropriate for this type of inquiry.  You probably want
to try http://www.eduspeak.com/utils/contact.php .

Andreas

From chenmengdx at gmail.com  Sat Mar 31 20:00:35 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Sun, 1 Apr 2012 11:00:35 +0800
Subject: [SRILM User List] Question of replace-words-with-classes
Message-ID: 

Hi, I ran into a problem when training a class-based language model with
the replace-words-with-classes command.  My commands are as follows:

- ngram-class -vocab wlist -text training_set -numclasses 200 -incremental
  -classes output.classes

- replace-words-with-classes classes=output.classes training_set >
  training_set_classes

After these two steps, I found that there are both words and classes in
training_set_classes.  These words are OOVs with respect to wlist, and I
don't need them at all.  Shouldn't these words belong in CLASS-00001?

So I want to ask how to handle this situation.  Does SRILM provide a
script to map these OOVs to CLASS-00001, or do I need to write one myself?

Thanks!

Meng Chen