From rmr4848 at gmail.com Tue Jan 3 15:12:11 2012
From: rmr4848 at gmail.com (Ryan Roth)
Date: Tue, 3 Jan 2012 18:12:11 -0500
Subject: [SRILM User List] Odd behavior in disambig and OOV words
Message-ID:

Hello:

For some time now I've been using disambig to perform diacritic disambiguation of Arabic. I create an open-vocabulary LM of diacritized forms from a training corpus, and for the input I use a morphological analysis tool to create, for each input word, a list of possible diacritized forms to use as the V2 mapping for the input form (V1). disambig is then used to select one of the diacritized forms using the LM.

This works well, but recently I noticed a strange behavior. I have a small input file (A) of about 200 lines of text. I run it through the above process, and I get a mapped output file as expected. Then I take input file A and replace two words in the last line with different words (creating input file B). I run B through the same process as A (this results in a very slightly different map file -- but only for the two words that were replaced).

The odd behavior is that, when I compare the output mappings of A and B, not only is the last line different, but over 70 other words in the file (in different sentences) also have different V2 mappings. Doing some checking, I discovered (not too surprisingly) that all the affected words are ones that were not present in the LM, so the effect is related to how disambig handles OOV words. Similar differences occur if I compare the mapped output of two files concatenated together to the concatenation of the two files' mapped outputs (that is, [A+B].out =/= [A.out] + [B.out]).

I need to find a way to make sure disambig handles these words consistently, so that changes in one part of a file do not affect the results in a different part. I'm hoping that there is some option setting in disambig or ngram-count that I've overlooked that will correct the problem, but I currently don't see one.

For reference, I create my LM using the options:

	ngram-count -text training-input-file -lm model-name.lm -order 5 -unk

and I run disambig using the options:

	disambig -keep-unk -text test-file.in -map test-file.map -order 5 -lm model-name.lm > test-file.out

My test-file.map is created without conditional probabilities, and the list of V2 forms is always alphabetized to ensure a consistent ordering. The morphological analyzer which generates the V2 forms is always consistent, and its output does not depend on word context.

Any advice or direction would be appreciated.

Thanks,

Ryan Roth
CCLS
Columbia University
From reza.haffari at gmail.com Wed Jan 4 02:16:03 2012
From: reza.haffari at gmail.com (gholamreza haffari)
Date: Wed, 4 Jan 2012 21:16:03 +1100
Subject: [SRILM User List] Chiang's python wrapper
In-Reply-To:
References:
Message-ID:

Hi there,

I get an error when I try to compile the following python wrapper (by David Chiang):
http://www.isi.edu/~chiang/software/psrilm.tgz

The error is as follows:

/usr/bin/ld: /cs/grad1/ghaffar1/software/srilm/lib/i686-m64/liboolm.a(Vocab.o): relocation R_X86_64_32 against `Vocab::compare(unsigned int, unsigned int)' can not be used when making a shared object; recompile with -fPIC
/cs/grad1/ghaffar1/software/srilm/lib/i686-m64/liboolm.a: could not read symbols: Bad value
collect2: ld returned 1 exit status
error: command 'g++' failed with exit status 1
make: *** [all] Error 1

The srilm version that I use is "1.6.0" and my machine type is "i686-m64".

I appreciate your help.
cheers,
-Reza

From stolcke at icsi.berkeley.edu Wed Jan 4 09:36:27 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 04 Jan 2012 09:36:27 -0800
Subject: [SRILM User List] Chiang's python wrapper
In-Reply-To:
References:
Message-ID: <4F048E1B.8010703@icsi.berkeley.edu>

gholamreza haffari wrote:
> I get an error when I try to compile the following python wrapper (by
> David Chiang): http://www.isi.edu/~chiang/software/psrilm.tgz
> [...]
> relocation R_X86_64_32 against `Vocab::compare(unsigned int, unsigned
> int)' can not be used when making a shared object; recompile with -fPIC

To build SRILM for use in shared libraries, invoke the build with

	make MAKE_PIC=X (other arguments)

or put

	ADDITIONAL_CFLAGS += -fPIC
	ADDITIONAL_CXXFLAGS += -fPIC

in the machine-specific makefile common/Makefile.site.$(MACHINE_TYPE).
(-fPIC is for gcc-based compilers; other compilers have different options to accomplish the same thing.)

Andreas

From stolcke at icsi.berkeley.edu Fri Jan 6 15:12:47 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 06 Jan 2012 15:12:47 -0800
Subject: [SRILM User List] Odd behavior in disambig and OOV words
In-Reply-To:
References: <4F0491D8.7000402@icsi.berkeley.edu> <4F0497E8.7050101@icsi.berkeley.edu>
Message-ID: <4F077FEF.8080906@icsi.berkeley.edu>

The attached patch seems to fix the problem. The problem stems from the pseudo-random ordering of hypotheses that have identical scores in Viterbi/N-best decoding. The patch introduces an additional sorting criterion to make the ordering deterministic.

If you add the option -nbest 10 you can see the alternatives that get the same score, and there are often many.

Andreas

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: srilm.patch.txt
URL:

From nishthajaiswal at cdacnoida.in Sun Jan 15 22:53:33 2012
From: nishthajaiswal at cdacnoida.in (nishthajaiswal at cdacnoida.in)
Date: Mon, 16 Jan 2012 12:23:33 +0530 (IST)
Subject: [SRILM User List] [Fwd: SRILM install problem]
Message-ID: <12892.10.0.0.4.1326696813.squirrel@mail.cdacnoida.in>

------------------------------- Original Message -------------------------------
Subject: SRILM install problem
From: nishthajaiswal at cdacnoida.in
Date: Mon, January 16, 2012 12:19 pm
To: srilm-user at speech.sri.com
--------------------------------------------------------------------------------

Hi,

I am unable to install SRILM on my Fedora 8 system. The following error appears when running make World:

mkdir include lib bin
mkdir: cannot create directory `include': File exists
mkdir: cannot create directory `lib': File exists
mkdir: cannot create directory `bin': File exists
make: [dirs] Error 1 (ignored)
make init
make[1]: Entering directory `/root/srilm'
for subdir in misc dstruct lm flm lattice utils; do \
	(cd $subdir/src; make SRILM=/root/srilm MACHINE_TYPE=i686 OPTION= MAKE_PIC= init) || exit 1; \
done
make[2]: Entering directory `/root/srilm/misc/src'
cd ..; /root/srilm/sbin/make-standard-directories
/bin/sh: /root/srilm/sbin/make-standard-directories: /bin/csh: bad interpreter: No such file or directory
make[2]: [init] Error 126 (ignored)
make ../obj/i686/STAMP ../bin/i686/STAMP
make[3]: Entering directory `/root/srilm/misc/src'
make[3]: `../obj/i686/STAMP' is up to date.
mkdir ../bin/i686/
mkdir: cannot create directory `../bin/i686/': No such file or directory
make[3]: [../bin/i686/STAMP] Error 1 (ignored)
touch ../bin/i686/STAMP
touch: cannot touch `../bin/i686/STAMP': No such file or directory
make[3]: *** [../bin/i686/STAMP] Error 1
make[3]: Leaving directory `/root/srilm/misc/src'
make[2]: *** [init] Error 2
make[2]: Leaving directory `/root/srilm/misc/src'
make[1]: *** [init] Error 1
make[1]: Leaving directory `/root/srilm'
make: *** [World] Error 2

The output of uname -a is:

Linux nishthajaiswal 2.6.21-2950.fc8xen #1 SMP Tue Oct 23 12:24:34 EDT 2007 i686 i686 i386 GNU/Linux

The output of gcc -v is:

Using built-in specs.
Target: i386-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --with-cpu=generic --host=i386-redhat-linux
Thread model: posix
gcc version 4.1.2 20070925 (Red Hat 4.1.2-33)

Please reply ...

Regards,
Nishtha

From stolcke at icsi.berkeley.edu Mon Jan 16 10:11:01 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 16 Jan 2012 10:11:01 -0800
Subject: [SRILM User List] [Fwd: SRILM install problem]
In-Reply-To: <12892.10.0.0.4.1326696813.squirrel@mail.cdacnoida.in>
References: <12892.10.0.0.4.1326696813.squirrel@mail.cdacnoida.in>
Message-ID: <4F146835.4050306@icsi.berkeley.edu>

On 1/15/2012 10:53 PM, nishthajaiswal at cdacnoida.in wrote:
> I am unable to install SRILM on my Fedora 8 system. The following error
> appears when running make World:
> [...]
> cd ..; /root/srilm/sbin/make-standard-directories
> /bin/sh: /root/srilm/sbin/make-standard-directories: /bin/csh: bad interpreter: No such file or directory

From this last message it seems you don't have the C-shell installed. I believe csh (or its variant tcsh) is optional in some Linux distributions.

Note that the most recent beta version of SRILM no longer requires csh, so another solution is to get that.

Andreas

From ryan at hlt.utdallas.edu Mon Jan 16 11:59:31 2012
From: ryan at hlt.utdallas.edu (Ryan Zeigler)
Date: Mon, 16 Jan 2012 13:59:31 -0600
Subject: [SRILM User List] Difficulty Building SRILM with mingw-w64
Message-ID: <4F1481A3.7080205@hlt.utdallas.edu>

Hello SRILM mailing list,

I am having difficulty building SRILM 1.6 using the mingw-w64 toolchain on a Windows 7 x64 machine. To attempt this, I modified the win32 machine-type makefile, replacing the bare g++/gcc invocations with the target-prefixed names installed by mingw-w64 and removing the -mno-cygwin flag. The relevant lines are:

CC_FLAGS = -DNEED_RAND48 -Wall -Wno-unused-variable -Wno-uninitialized
CC = x86_64-w64-mingw32-gcc $(GCC_FLAGS) -Wimplicit-int
CXX = x86_64-w64-mingw32-g++ $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES

When I subsequently attempt to compile, I receive the following errors from matherr.c:

x86_64-w64-mingw32-gcc -DNEED_RAND48 -Wall -Wno-unused-variable -Wno-uninitialized -Wimplicit-int -I.
-I../../include -c -g -O2 -DUSE_SARRAY -DUSE_SARRAY_TRIE -DUSE_SARRAY_MAP2 -o ../obj/x86_64-w64-mingw32_c/matherr.o matherr.c
matherr.c:19:16: warning: 'struct exception' declared inside parameter list
matherr.c:19:16: warning: its scope is only this definition or declaration, which is probably not what you want
matherr.c:19:1: error: conflicting types for '_matherr'
/usr/x86_64-w64-mingw32/sys-root/mingw/include/math.h:179:23: note: previous declaration of '_matherr' was here
matherr.c: In function '_matherr':
matherr.c:22:10: error: dereferencing pointer to incomplete type
matherr.c:22:36: error: dereferencing pointer to incomplete type
matherr.c:30:1: warning: control reaches end of non-void function
/cygdrive/c/srilm/common/Makefile.common.targets:85: recipe for target `../obj/x86_64-w64-mingw32_c/matherr.o' failed

For reference, the declaration of _matherr given there is

	_CRTIMP int __cdecl _matherr (struct _exception *);

I would appreciate any help in resolving this issue.

Regards,
Ryan Zeigler

From stolcke at icsi.berkeley.edu Mon Jan 16 19:17:58 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 16 Jan 2012 19:17:58 -0800
Subject: [SRILM User List] Difficulty Building SRILM with mingw-w64
In-Reply-To: <4F1481A3.7080205@hlt.utdallas.edu>
References: <4F1481A3.7080205@hlt.utdallas.edu>
Message-ID: <4F14E866.5090304@icsi.berkeley.edu>

On 1/16/2012 11:59 AM, Ryan Zeigler wrote:
> I am having difficulty building SRILM 1.6 using the mingw-w64
> toolchain on a Windows 7 x64 machine.
> [...]
> When I subsequently attempt to compile, I receive the following errors
> from matherr.c

This particular error is due to a glitch in the ifdefs. Replace

	#if defined(__MINGW32_VERSION) || defined(_MSC_VER)

with

	#if defined(WIN32) || defined(_MSC_VER)

I took a stab at a mingw-w64 build recently, and there are link-time errors even after everything compiles fine. Let me know how far you get with this!

Andreas
From dmytro.prylipko at ovgu.de Sun Jan 22 11:19:36 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Sun, 22 Jan 2012 20:19:36 +0100
Subject: [SRILM User List] Using hidden events
Message-ID:

Hi,

I would like to use models with a hidden vocabulary for filled pauses, but I am not sure what the right way is to train and test such models. I have train and test data containing filled pauses between words, as well as 'clean' datasets where the FPs are removed. The filled pauses are going to be modeled as '-observed -omit' or '-observed'.

The questions are:
- Should I train the model on the data containing the FPs or on the clean data?
- Which vocabulary should I use during training and test: with FP or without, given that the FP word is included in the hidden vocabulary?

I am also trying to estimate the local perplexity of the words following filled pauses. I extracted these words together with their contexts into separate sentences, e.g.:

	eine woche <FP> was
	aus <FP> vom sonnabend

and applied the trained LM to them. Total perplexity is calculated as 10^( - totalLogProb / N ), where totalLogProb is the sum of the log probabilities of the words predicted after <FP>.

The same value is then calculated on these chunks with <FP> removed from the context:

	eine woche was
	aus vom sonnabend.

Is this right?

Which setup should I use in order to calculate the local perplexity, when I want to model FPs as hidden events with the '-observed -omit' options?

Thanks in advance.

Yours,
Dmytro.

From stolcke at icsi.berkeley.edu Sun Jan 22 19:35:37 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sun, 22 Jan 2012 19:35:37 -0800
Subject: [SRILM User List] Using hidden events
In-Reply-To: Your message of Sun, 22 Jan 2012 20:19:36 +0100.
Message-ID: <201201230335.q0N3ZbJs014374@fruitcake.ICSI.Berkeley.EDU>

In message you wrote:
> I would like to use models with a hidden vocabulary for filled pauses,
> but I am not sure what the right way is to train and test such models.
> [...]
> The filled pauses are going to be modeled as '-observed -omit' or '-observed'.

As stated in the ngram(1) man page, filled pauses should normally be modeled as -hidden-vocab tokens with -observed -omit.

> The questions are:
> - Should I train the model on the data containing the FPs or on the
>   clean data?

You need to have the FPs in the training data, since (1) they are observed and (2) even hidden events need to be made "unhidden" for training purposes.

There is no ready-made training procedure for hidden-event LMs. You yourself have to extract the n-grams that correspond to the events and histories implied by the LM. For example, if "UH" is a filled pause and the training data has

	a b UH c d

and you want to train a 3gram LM, you need to generate the ngrams

	UH	1
	b UH	1
	a b UH	1
	c	1
	b c	1
	a b c	1
	d	1
	c d	1
	b c d	1

and feed that to ngram-count -read plus any of the standard training options.

> - Which vocabulary should I use during training and test: with FP or
>   without, given that the FP word is included in the hidden vocabulary?

With FP in training (since there is no "hidden" vocabulary in training, see above).

In testing it doesn't matter, since all the tokens specified by -hidden-vocab are implicitly added to the overall LM vocabulary.
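Going back to the training procedure above, here is a sketch of what such an extraction script could look like (an illustration only, not part of SRILM; it assumes the filled-pause token is literally "UH", a 3gram model, and it leaves out sentence-boundary ngrams for clarity):

	import sys

	FP = "UH"     # assumed filled-pause token
	ORDER = 3     # assumed n-gram order

	for line in sys.stdin:
	    history = []                     # context with filled pauses omitted
	    for w in line.split():
	        # emit the 1-gram through ORDER-gram predicting w
	        for n in range(1, min(ORDER, len(history) + 1) + 1):
	            ctx = history[len(history) - (n - 1):]
	            print(" ".join(ctx + [w]) + "\t1")
	        if w != FP:                  # -omit: skip FPs when extending the history
	            history.append(w)

Run on "a b UH c d", this prints the nine count-1 ngrams listed above (plus the ordinary ngrams for "a" and "b"); ngram-count -read takes care of merging and summing duplicate entries, so the output can be fed in directly.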
> I am also trying to estimate the local perplexity of the words following
> filled pauses. I extracted these words together with their contexts into
> separate sentences, e.g.:
> eine woche <FP> was
> aus <FP> vom sonnabend

You want to use ngram -debug 2 -ppl and extract the probabilities from the output.

Andreas

> and applied the trained LM to them. Total perplexity is calculated as
> 10^( - totalLogProb / N ), where totalLogProb is the sum of the log
> probabilities of the words predicted after <FP>.
> [...]

--Andreas

From dmytro.prylipko at ovgu.de Mon Jan 23 03:25:12 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Mon, 23 Jan 2012 12:25:12 +0100
Subject: [SRILM User List] Using hidden events
In-Reply-To: <201201230335.q0N3ZbJs014374@fruitcake.ICSI.Berkeley.EDU>
References: <201201230335.q0N3ZbJs014374@fruitcake.ICSI.Berkeley.EDU>
Message-ID:

On Mon, Jan 23, 2012 at 4:35 AM, Andreas Stolcke wrote:
> There is no ready-made training procedure for hidden-event LMs.
> You yourself have to extract the n-grams that correspond to the events
> and histories implied by the LM.
> [...]
> and feed that to ngram-count -read plus any of the standard training
> options.

Wow, sounds tricky. I guess this procedure is required for those disfluencies which are omitted from the context, i.e. marked with the -omit option in the hidden vocabulary, but which need to be predicted themselves. For other kinds, such as insertions, deletions and repairs, the LM can be trained just with ngram-count, right?

> [...]
From dmytro.prylipko at ovgu.de Mon Jan 23 06:23:28 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Mon, 23 Jan 2012 15:23:28 +0100
Subject: [SRILM User List] Using hidden events
In-Reply-To: <201201230335.q0N3ZbJs014374@fruitcake.ICSI.Berkeley.EDU>
References: <201201230335.q0N3ZbJs014374@fruitcake.ICSI.Berkeley.EDU>
Message-ID:

Dear Andreas,

I am conducting experiments on filled pauses, and some of the results are puzzling to me.

I estimated the perplexity of words following filled pauses in two ways: (1) taking FPs into account (the FP is modeled as a regular word, not a hidden event) and (2) after removing them from both the train and test data. I count only the log probabilities of the words placed after FPs (obtained with ngram -debug 2 -ppl), not those of the FPs themselves. The first approach yields lower perplexity, which is expected.

But when using -hidden-vocab I get some strange results which are not clear to me. For example, I would assume that using a language model trained on 'clean' data (i.e. without FPs) together with 'FP -observed -omit' on test data containing pauses should lead to the same result as the word-only model (approach (2)), since we predict only words and the context is freed from disfluencies. However, this assumption is not supported by the experiments. Using the 'clean' model with the hidden vocabulary on test data containing pauses gives much higher perplexity (364 -> 400). I found that the word probability after an FP in this case is always modeled with unigrams. I conclude that FPs are not omitted from the context despite the hidden-event instruction. This is supported by the fact that the result is the same whether I use '-observed -omit', just '-observed', or just '-omit'.

Also, I thought that using a model which treats filled pauses as regular words, together with a hidden vocabulary containing 'FP -observed', should not change the result either, since pauses are not omitted from the context in this case. This is not true as well: I get a value of 295 without the hidden vocabulary and 291 with it. Finally, I found that the perplexity values do not change whether I use 'FP -observed' or just 'FP -omit' in the hidden vocabulary, which looks very strange.

I would greatly appreciate it if you could clarify these questions.

Yours,
Dmytro.
From stolcke at icsi.berkeley.edu Mon Jan 23 09:44:54 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 23 Jan 2012 09:44:54 -0800
Subject: [SRILM User List] Using hidden events
In-Reply-To:
References: <201201230335.q0N3ZbJs014374@fruitcake.ICSI.Berkeley.EDU>
Message-ID: <4F1D9C96.6030408@icsi.berkeley.edu>

On 1/23/2012 3:25 AM, Dmytro Prylipko wrote:
> Wow, sounds tricky. I guess this procedure is required for those
> disfluencies which are omitted from the context, i.e. marked with the
> -omit option in the hidden vocabulary, but which need to be predicted
> themselves. For other kinds, such as insertions, deletions and
> repairs, the LM can be trained just with ngram-count, right?

Well, you need to train a single model for all types of tokens. So it is easiest to write a perl script (for example) that extracts the counts for all ngrams.

Note that you can write the script so that it processes one sentence at a time and outputs just a bunch of ngrams with count 1. ngram-count -read will take care of merging and summing the counts.

Andreas

From dmytro.prylipko at ovgu.de Sun Jan 29 07:45:54 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Sun, 29 Jan 2012 16:45:54 +0100
Subject: [SRILM User List] Does '-omit' work?
Message-ID:

Dear Andreas,

I found that using the -omit and -observed options does not influence the calculation of perplexity. I trained a skip-LM for filled pauses as you advised me (I generated n-grams where the FPs were skipped in the context). But when I apply it to the test data, it does not matter which combination of options I use in the hidden vocabulary:

	<FP> -omit -observed
	<FP> -omit
	<FP> -observed

or just

	<FP>

For each case I get the same perplexity value. However, it differs when the hidden vocabulary is empty or contains another token, so I can conclude that it works.

Could you tell me if I am doing everything right? Why do the options not work?

Sincerely yours,
Dmytro Prylipko.

From martin.ostrovsky at gmail.com Mon Jan 30 13:26:15 2012
From: martin.ostrovsky at gmail.com (Martin Ostrovsky)
Date: Mon, 30 Jan 2012 16:26:15 -0500
Subject: [SRILM User List] Full list of spanish POS tags
Message-ID: <9D31A18D-4864-49A5-B6DA-B821958DE2C6@gmail.com>

Hello,

I've run the SVMTagger against some Spanish text using the Spanish model provided on the SVMTool site, and am looking for a canonical list of definitions for each POS tag. Any suggestions?

From stolcke at icsi.berkeley.edu Tue Jan 31 01:12:18 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 31 Jan 2012 01:12:18 -0800
Subject: [SRILM User List] Does '-omit' work?
In-Reply-To:
References:
Message-ID: <4F27B072.5030901@icsi.berkeley.edu>

On 1/29/2012 7:45 AM, Dmytro Prylipko wrote:
> I found that using the -omit and -observed options does not influence
> the calculation of perplexity.
> [...]
> it does not matter which combination of options I use in the hidden
> vocabulary:
> <FP> -omit -observed
> <FP> -omit
> <FP> -observed
> or just
> <FP>

This is a bug, more in the documentation than in the code. The hidden-event "options" (-omit, -observed, etc.) are only processed when they appear in the -lm file, following the ngram parameters. When processing the -hidden-vocab file, on the other hand, only the names of the hidden events are recorded (as with -vocab).

This should be fixed. But for now, simply append your hidden-event file to the contents of the -lm file.

Sorry for the confusion in the man page. It kind of says this, but in a very confusing way, and I agree that the -hidden-vocab file should also interpret the full hidden-event specifications.

Andreas

From shinichiro.hamada at gmail.com Wed Feb 1 07:01:52 2012
From: shinichiro.hamada at gmail.com (shinichiro.hamada)
Date: Thu, 2 Feb 2012 00:01:52 +0900
Subject: [SRILM User List] LM whose counts are multiplied
Message-ID:

Hello, all.

I want to make a language model from data that has fractional counts. But not all smoothing methods can handle them, so I will try multiplying each count by 10 and rounding to an integer.

I did a preliminary experiment.

Files:
* count file with integer counts: a.count
* the same file with counts multiplied by 10: b.count

Commands:

	ngram-count -read a.count -order 3 -lm a.lm -wbdiscount -wbdiscount1 -wbdiscount2 -wbdiscount3 -interpolate
	ngram-count -read b.count -order 3 -lm b.lm -wbdiscount -wbdiscount1 -wbdiscount2 -wbdiscount3 -interpolate

I expected the same language model to be generated, but they differ. Why? Their headers follow.

------------------------
[a.lm]

\data\
ngram 1=1055
ngram 2=2240
ngram 3=87

\1-grams:
..
------------------------
[b.lm]

\data\
ngram 1=1055
ngram 2=2240
ngram 3=2548

\1-grams:
..

From stolcke at icsi.berkeley.edu Wed Feb 1 12:44:37 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 01 Feb 2012 12:44:37 -0800
Subject: [SRILM User List] Question about SRILM and sentence boundary detection
In-Reply-To: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de>
References: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de>
Message-ID: <4F29A435.5050905@icsi.berkeley.edu>

Georgi,

You can get the conditional probabilities for arbitrary sets of ngrams using

	ngram -counts FILE

Andreas

On 2/1/2012 11:37 AM, Dzhambazov, Georgi wrote:
> Dear Mr. Stolcke,
>
> I am trying to do sentence boundary segmentation. I have an n-gram
> language model, and for modeling I use the SRILM toolkit. Thanks
> for the nice tool!
>
> I have the following problem. I implement the forward-backward
> algorithm on my own, so I need to combine the n-grams of your
> "hidden event model" with the prosodic model. Therefore, I need the
> probabilities of the individual n-grams (in my case 3-grams).
>
> For example, for the word sequence
>
> word_{t-2} word_{t-1} word_t word_{t+1} word_{t+2}
>
> I need
> P(<s>, word_t | word_{t-2} word_{t-1})
> P(word_t | word_{t-2} word_{t-1})
> P(word_{t+1} | word_{t-1} word_t)
> ... and so on:
> all possible combinations with and without <s> before each word.
>
> What I do to get one of these is to use the following SRILM commands:
>
> # create text for the case word_{t-2} word_{t-1} word_t
> echo "$wordt_2 $wordt_1
> $wordt" > testtext2;
>
> ngram -lm $LM_URI -order $order -ppl testtext2 -debug 2 -unk > /tmp/output;
>
> and then read the corresponding line that I need from the output
> (e.g. line 3):
>
> OUTPUT:
> word_{t-2} word_{t-1}
> p( word_{t-2} | <s> ) = [2gram] 0.00235274 [ -2.62843 ]
> p( word_{t-1} | ...) = [2gram] 0.00343115 [ -2.46456 ]
> p( </s> | ...) = [2gram] 0.0937662 [ -1.02795 ]
> 1 sentences, 2 words, 0 OOVs
> 0 zeroprobs, logprob= -6.12094 ppl= 109.727 ppl1= 1149.4
>
> word_t
> p( word_t | <s> ) = [2gram] 0.00235274 [ -2.62843 ]
> p( </s> | ...) = [2gram] 0.10582 [ -0.975432 ]
> 1 sentences, 1 words, 0 OOVs
> 0 zeroprobs, logprob= -3.60386 ppl= 63.3766 ppl1= 4016.59
>
> file testtext2: 2 sentences, 3 words, 0 OOVs
> 0 zeroprobs, logprob= -9.7248 ppl= 88.0967 ppl1= 1744.21
> --------------------------------
>
> The problem is that for each trigram I invoke ngram again, and it
> reloads the LM (> 1GB), which makes this very slow.
> Is there a faster solution? I do not need the perplexity values.
>
> I know about the segmentation tool
> http://www.speech.sri.com/projects/srilm/manpages/segment.1.html
> but it gives results for the whole sequence, which is not my goal.
>
> Kind regards,
> Georgi Dzhambazov,
>
> Student Assistant,
> NetMedia
> ________________________________________
> From: Andreas Stolcke [stolcke at icsi.berkeley.edu]
> Sent: Thursday, 13 October 2011 05:50
> To: Dzhambazov, Georgi
> Cc: eee at speech.sri.com
> Subject: Re: Question about sentence boundary detection paper
>
> Dzhambazov, Georgi wrote:
>> Dear A. Stolcke,
>> Dear E. Shriberg,
>>
>> I am interested in your approach to sentence boundary detection.
>> I would be very happy if you could find some time to clarify some of
>> the steps of your approach for me. I plan to implement them.
>>
>> Question 1)
>> In the paper (1), at paragraph 2.2.1, you say that states are "the end
>> of sentence status of each word plus any preceding words".
>> So for example at position 4 of the example sentence, the state is
>> (<s> + quick brown fox). At position 6 the state is (<s> + brown fox
>> flies). This means a huge state space. Is this right?
>>
>> 1   2     3     4   5     6   7      8  9     10
>> The quick brown fox flies The rabbit is white.
>
> The state space is potentially huge, but just like in standard N-gram
> LMs you only consider the histories (= states) actually occurring in the
> training data, and handle any new histories through backoff.
> Furthermore, the state space is constrained to those states that match
> the ngrams in the word sequence. So for every word position you have to
> consider only two states (<s> and no-<s>).
>
>> Question 2)
>> Transition probabilities are N-gram probabilities. You give an
>> example with bigram probabilities in the next line.
>> However, you say as well that you are using a 4-gram LM. So the correct
>> example should be:
>> a probability at position 6 is Pr(<s> | brown fox flies)
>> and at position 4 it is Pr(<s> | quick brown fox).
>> Is this right?
> correct.
>
>> Question 3)
>> Then for recognition you say that the forward-backward algorithm is
>> used to determine the maximal P(T_i | W),
>> where T_i corresponds to <s> or no-<s> at position i. However, the
>> transition probabilities include information about states like
>> (<s> + quick brown fox).
>> How do you apply the transition probabilities in this model? Does it
>> relate to the formula in section 4 of (2)?
>> I think this formula can work for the forward-backward algorithm,
>> although it is stated in section 4 that it is used for Viterbi.
> For finding the most probable T_i you use in fact the Viterbi algorithm.
>
> The formulas in section 4 just give one step in the forward computation
> that would be used in the Viterbi algorithm.
>
> Please note that this is all implemented in the "segment" tool that
> comes with SRILM.
> See http://www.speech.sri.com/projects/srilm/manpages/segment.1.html and
> http://www.speech.sri.com/projects/srilm/ for more information on SRILM.
>
> Andreas
>
>> References:
>>
>> 1) Shriberg et al. 2000 - Prosody-based automatic segmentation of
>> speech into sentences and topics
>> 2) Stolcke and Shriberg - 1996 - Automatic linguistic segmentation of
>> conversational speech
>>
>> Thank you!
>>
>> Kind regards,
>> Georgi Dzhambazov,
>>
>> Student Assistant,
>> NetMedia

From stolcke at icsi.berkeley.edu Wed Feb 1 12:51:40 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 01 Feb 2012 12:51:40 -0800
Subject: [SRILM User List] LM whose counts are multiplied
In-Reply-To:
References:
Message-ID: <4F29A5DC.8010002@icsi.berkeley.edu>

On 2/1/2012 7:01 AM, shinichiro.hamada wrote:
> I want to make a language model from data that has fractional counts.
> But not all smoothing methods can handle them, so I will try multiplying
> each count by 10 and rounding to an integer.
> [...]
> I expected the same language model to be generated, but they differ. Why?

First off, the WB discounting method does support fractional counts, so you can just feed your counts to ngram-count -float-counts ... with no need to scale and truncate the counts to integers.

The reason you are seeing different LM outputs for different count multipliers is that smoothing is sensitive to the absolute occurrence counts of ngrams, not just their relative frequencies. This has to be so if you're trying to estimate the probabilities of unseen ngrams. If you've seen only 10 cases of "a b" and never saw "a b x", you should be less surprised to see your first "a b x" than if you had seen 1000 instances of "a b" (and still none of "a b x").
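To see this numerically: under Witten-Bell smoothing the probability mass reserved for unseen words after a given context is T/(N+T), where N is how often the context was seen and T is the number of distinct words that followed it. Multiplying all counts by 10 scales N but leaves T unchanged. A minimal illustration (the numbers here are invented):

	# Witten-Bell: mass reserved for unseen successors of a context
	# N = total context count, T = number of distinct observed successors
	for scale in (1, 10):
	    N, T = 10 * scale, 3        # e.g. "a b" seen 10 (or 100) times, 3 successors
	    print(scale, T / (N + T))   # 1 -> 0.2308...   10 -> 0.0291...

The scaled counts also clear ngram-count's default minimum-count cutoffs (trigrams below a count of 2 are dropped by default), which is presumably why b.lm retains 2548 trigrams where a.lm keeps only 87.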
Andreas

From shinichiro.hamada at gmail.com Thu Feb 2 06:10:02 2012
From: shinichiro.hamada at gmail.com (shinichiro.hamada)
Date: Thu, 2 Feb 2012 23:10:02 +0900
Subject: [SRILM User List] LM whose counts are multiplied
In-Reply-To: <4F29A5DC.8010002@icsi.berkeley.edu>
References: <4F29A5DC.8010002@icsi.berkeley.edu>
Message-ID: <9FE92B5EFFF24E018A7A5E2D4F3E31AE@f91>

Dear Mr. Stolcke,

Thank you for your clear explanation. I understood it completely! I'll try the WB discounting method with float counts.

Shinichiro Hamada

> -----Original Message-----
> From: Andreas Stolcke [mailto:stolcke at icsi.berkeley.edu]
> Sent: Thursday, February 02, 2012 5:52 AM
> To: shinichiro.hamada
> Cc: srilm-user at speech.sri.com
> Subject: Re: [SRILM User List] LM whose counts are multiplied
> [...]

From af4ex.radio at yahoo.com Thu Feb 2 07:24:16 2012
From: af4ex.radio at yahoo.com (John Day)
Date: Thu, 2 Feb 2012 07:24:16 -0800 (PST)
Subject: [SRILM User List] Using srilm as Memory Jogger
Message-ID: <1328196256.12588.YahooMailNeo@web114411.mail.gq1.yahoo.com>

Hi Andreas,

Can you (or the group) tell me if srilm could be used to query language models in such a way as to 'narrow down' the search for a "partially known word" when the context of its usage is known? By "partially known" I mean that hints such as word prefixes or endings are known. The "context of usage" is equivalent, I think, to stating that the likelihood of the hidden word is increased if it is preceded or followed by a given set of words associated with some topic.

So I would like to 'leverage' srilm and language model queries by using topic models to suggest some words associated with a certain topic.

For example: find the most likely words that begin with "st", given a "context set" (suggested by some 'sociology' topic model) containing the words "neighborhood, behavior, customs, environment".

Does that make sense? Do you think srilm could be used to execute a query like that?

Thanks,
John Day
Palm Bay, Florida

From amber.wilcox.ohearn at gmail.com Thu Feb 2 08:29:07 2012
From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn)
Date: Thu, 2 Feb 2012 09:29:07 -0700
Subject: [SRILM User List] Question about SRILM and sentence boundary detection
In-Reply-To: <4F29A435.5050905@icsi.berkeley.edu>
References: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de> <4F29A435.5050905@icsi.berkeley.edu>
Message-ID:

(Sorry Andreas, I meant to reply to the list):

Georgi,

I'm not sure if SRILM has something that does that -- i.e. holds the whole LM in RAM and waits for queries. You might need something like that, as opposed to using a whole file, if you want just the probabilities of the last word with respect to the previous ones, and you want to compare different last words depending on the results of previous calculations, for example.

I have a little C/Python tool I wrote for exactly this purpose. It's at https://github.com/lamber/BackOffTrigramModel

It's very specific to my work at the time. So, for example, it works only for trigrams exactly, and it assumes you are using <unk>. It performs all the back-off calculations for unseen trigrams. But it looks like you have the same use case, so it might be useful for you.

It's not much documented, but the unit tests show how it works.

Amber
--
http://scholar.google.com/citations?user=15gGywMAAAAJ

On Wed, Feb 1, 2012 at 1:44 PM, Andreas Stolcke wrote:
> Georgi,
>
> You can get the conditional probabilities for arbitrary sets of ngrams using
>
>     ngram -counts FILE
>
> Andreas
> [...]
From stolcke at icsi.berkeley.edu Thu Feb 2 16:53:07 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Thu, 02 Feb 2012 16:53:07 -0800
Subject: [SRILM User List] Question about SRILM and sentence boundary detection
In-Reply-To:
References: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de> <4F29A435.5050905@icsi.berkeley.edu>
Message-ID: <4F2B2FF3.2070602@icsi.berkeley.edu>

On 2/2/2012 8:29 AM, L. Amber Wilcox-O'Hearn wrote:
> I'm not sure if SRILM has something that does that -- i.e. holds the
> whole LM in RAM and waits for queries.
> [...]

Two SRILM solutions:

1. Start

	ngram -lm LM -escape "===" -counts -

(reading from stdin) and put an escape line (in this case, a line starting with "===") after every ngram in the input (make sure the ngram words are followed by a count "1"). This will cause ngram to dump out the conditional prob for each ngram right away (instead of waiting for end-of-file).

2. Directly access the network LM server protocol implemented by ngram -server-port. Start the server with

	% ngram -lm LM -server-port 8888

then write ngrams to that TCP port and read back the log probs:

	% telnet localhost 8888
	my first word		<< input
	-4.6499			>> output

Of course you would do the equivalent of telnet in perl, python, C, or some other language to make use of the probabilities.
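For illustration, a minimal sketch of such a client (an assumption-laden sketch, not a supported SRILM program: it presumes a server started exactly as above on localhost:8888, the plain line-per-ngram exchange shown in the telnet session, and made-up test ngrams):

	import socket

	# connect to a running server: ngram -lm LM -server-port 8888
	with socket.create_connection(("localhost", 8888)) as conn:
	    f = conn.makefile("rw")              # the exchange is line-based
	    for ngram in ("my first word", "my second word"):
	        f.write(ngram + "\n")            # send one ngram per line
	        f.flush()
	        logprob = float(f.readline())    # read back log10 p(last word | preceding)
	        print(ngram, "->", logprob)

Because the LM stays loaded in the server process, this avoids reloading the model for every query -- the problem described earlier in this thread.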
Andreas
From stolcke at icsi.berkeley.edu Fri Feb 3 11:03:45 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Fri, 03 Feb 2012 11:03:45 -0800 Subject: [SRILM User List] Using srilm as Memory Jogger In-Reply-To: <1328196256.12588.YahooMailNeo@web114411.mail.gq1.yahoo.com> References: <1328196256.12588.YahooMailNeo@web114411.mail.gq1.yahoo.com> Message-ID: <4F2C2F91.1020409@icsi.berkeley.edu> Sorry, such functionality is not built into SRILM, and would have to be built on top of it by querying various probability models that incorporate the co-occurrence of words. Personally I don't have experience with this type of application, but someone else on the list might. Andreas On 2/2/2012 7:24 AM, John Day wrote: > Hi Andreas, > Can you (or the group) tell me if srilm could be used to query > language models in such a way as to 'narrow down' the search for a > "partially known word" where the context of its usage is known. By > "partially known" I mean hints such as word prefixes or endings are > known. The "context of usage" is equivalent, I think, to stating that > the likelihood of the hidden word is increased if it is preceded or > followed by a given set of words associated with some topic. > > So I would like to 'leverage' srilm and language model queries by > using topic models to suggest some words associated with a certain topic. > > For example, find the most likely words that begin with "st", given a > "context set" (suggested by some 'sociology' topic model) containing > the words "neighborhood, behavior, customs, environment". > > Does that make sense? Do you think srilm could be used to execute a > query like that? > > Thanks, > John Day > Palm Bay, Florida > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed... URL: From zeinab.vakil at gmail.com Sun Feb 5 20:14:01 2012 From: zeinab.vakil at gmail.com (zeinab vakil) Date: Mon, 6 Feb 2012 07:44:01 +0330 Subject: [SRILM User List] Predicting specified words Message-ID: Dear All, hi, I am newly getting to know SRILM and have a question. Is it possible to use SRILM to predict a word that starts with a certain character?
For example, the sentence is "i go to h...", and we want the word w that has the highest probability P(w | "i go to"), or even P(w | "to"), and starts with 'h'. Please guide me. Best Regards, zeinab vakil. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Wed Feb 8 21:44:29 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Wed, 08 Feb 2012 21:44:29 -0800 Subject: [SRILM User List] Predicting specified words In-Reply-To: References: Message-ID: <4F335D3D.4070702@icsi.berkeley.edu> On 2/5/2012 8:14 PM, zeinab vakil wrote: > Dear All, > hi, > I am newly getting to know SRILM and have a question. > Is it possible to use SRILM to predict a word that starts with a certain > character? For example, the sentence is "i go to h...", and we want the > word w that has the highest probability P(w | "i go to"), or even P(w | "to"), > and starts with 'h'. > Please guide me. > Best Regards, > zeinab vakil. Boy, there seems to be a lot of interest lately in this sort of prediction problem (see previous posts on this list). No, there is no ready-made solution for this in SRILM. I would probably try to build a mixed word/letter ngram LM, estimating probabilities p(next-letter | word-2, word-3, letter-1, letter-2, letter-3, ...). Andreas
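Short of building such a mixed model, the prefix constraint can also be handled outside SRILM by scoring a candidate list against the LM, which is essentially what Roman suggests below. A hedged Python sketch, where logprob is any scoring function (for instance the hypothetical server client shown earlier in this archive) and vocab_words is a word list assumed to have been extracted from the LM or the training data:

    def best_with_prefix(context, prefix, vocab_words, logprob):
        # Rank vocabulary words starting with `prefix` by their
        # conditional log10 probability after `context`.
        candidates = [w for w in vocab_words if w.startswith(prefix)]
        if not candidates:
            return None
        return max(candidates, key=lambda w: logprob(context + [w]))

    # e.g. best_with_prefix(["i", "go", "to"], "h", vocab_words, ngram_logprob)

This enumerates every matching word, so it is only practical for moderate vocabularies or with caching.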
From tonyr at cantabresearch.com Thu Feb 9 01:09:59 2012 From: tonyr at cantabresearch.com (Tony Robinson) Date: Thu, 09 Feb 2012 09:09:59 +0000 Subject: [SRILM User List] Predicting specified words In-Reply-To: <4F335D3D.4070702@icsi.berkeley.edu> References: <4F335D3D.4070702@icsi.berkeley.edu> Message-ID: <4F338D67.3090402@cantabResearch.com> On 02/09/2012 05:44 AM, Andreas Stolcke wrote: > On 2/5/2012 8:14 PM, zeinab vakil wrote: >> Dear All, >> hi, >> I am newly getting to know SRILM and have a question. >> Is it possible to use SRILM to predict a word that starts with a certain >> character? For example, the sentence is "i go to h...", and we want the >> word w that has the highest probability P(w | "i go to"), or even >> P(w | "to"), and starts with 'h'. >> Please guide me. >> Best Regards, >> zeinab vakil. > Boy, there seems to be a lot of interest lately in this sort of > prediction problem (see previous posts on this list). If you haven't already seen Dasher then you might like to look it up at http://www.inference.phy.cam.ac.uk/dasher/Publications.html. Tony -- Dr A J Robinson, Founder and Director of Cantab Research Limited St Johns Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK Company reg no 05697423 (England and Wales), VAT reg no 925606030 From kutlak.roman at gmail.com Thu Feb 9 01:34:08 2012 From: kutlak.roman at gmail.com (Kutlak Roman) Date: Thu, 9 Feb 2012 09:34:08 +0000 Subject: [SRILM User List] Predicting specified words In-Reply-To: <4F335D3D.4070702@icsi.berkeley.edu> References: <4F335D3D.4070702@icsi.berkeley.edu> Message-ID: Hi guys, I am not an expert on language modelling, but here is a thought: the library contains the classes LM and Vocab, where Vocab is the vocabulary used with the current language model. Maybe you could iterate through the words in the vocabulary, pick the ones that start with the letter you have, and ask the LM to tell you which word gives you the highest probability given the context. Roman On 9 Feb 2012, at 05:44, Andreas Stolcke wrote: > On 2/5/2012 8:14 PM, zeinab vakil wrote: >> Dear All, >> hi, >> I am newly getting to know SRILM and have a question. >> Is it possible to use SRILM to predict a word that starts with a certain >> character? For example, the sentence is "i go to h...", and we want the >> word w that has the highest probability P(w | "i go to"), or even >> P(w | "to"), and starts with 'h'. >> Please guide me. >> Best Regards, >> zeinab vakil. > Boy, there seems to be a lot of interest lately in this sort of prediction problem (see previous posts on this list). > > No, there is no ready-made solution for this in SRILM. I would probably try to build a mixed word/letter ngram LM, estimating probabilities > p(next-letter | word-2, word-3, letter-1, letter-2, letter-3, ...). > > Andreas > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user From aktumuluru at cse.ust.hk Sat Feb 11 02:31:37 2012 From: aktumuluru at cse.ust.hk (Anand Karthik) Date: Sat, 11 Feb 2012 18:31:37 +0800 Subject: [SRILM User List] Please help : Problem with installation of SRILM 1.4.6 on ubuntu 10.04 amd-64 bit machine In-Reply-To: References: Message-ID: Hello, I'm trying to install SRILM 1.4.6 on Ubuntu 10.04, a 64-bit AMD machine. I have turned TCL off. I have read the user archive and couldn't find a solution to the problem. Please help me with the same. I'm using the following command: make MACHINE_TYPE=i686-m64 SRILM=$PWD CC=/usr/bin/gcc CXX=/usr/bin/g++ NO_TCL=X TCL_INCLUDE= TCL_LIBRARY= 2>&1 > make.log.txt uname -a Linux ubuntu 2.6.32-38-generic #83-Ubuntu SMP Wed Jan 4 11:12:07 UTC 2012 x86_64 GNU/Linux gcc version : Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.4.3-4ubuntu5' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 --program-suffix=-4.4 --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-plugin --enable-objc-gc --disable-werror --with-arch-32=i486 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu Thread model: posix gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) and I get the following error on the console (I have attached the makefile log). The ngram, ngram-count, ngram-merge, ngram-class, disambig, anti-ngram, nbest-lattice, nbest-mix, nbest-optimize, nbest-pron-score, segment, segment-nbest, hidden-ngram, multi-ngram, fngram-count, fngram, lattice-tool, etc. binaries are not being created, and they have a problem like this g++ command : ****************************************************************************************************************** /usr/bin/g++ -I.
-I/home/ak/Downloads/srilm/include -u matherr -L/home/ak/Downloads/srilm/lib/i686-m64 -g -O3 -o ../bin/i686-m64/ngram ../obj/i686-m64/ngram.o ../obj/i686-m64/liboolm.a -lm -ldl /home/ak/Downloads/srilm/lib/i686-m64/libflm.a /home/ak/Downloads/srilm/lib/i686-m64/libdstruct.a /home/ak/Downloads/srilm/lib/i686-m64/libmisc.a -lm 2>&1 | c++filt ../obj/i686-m64/liboolm.a(SimpleClassNgram.o): In function `global constructors keyed to ctsBuffer': /home/ak/Downloads/srilm/include/Debug.h:54: multiple definition of `ctsBuffer' ../obj/i686-m64/liboolm.a(ClassNgram.o):/home/ak/Downloads/srilm/include/Debug.h:54: first defined here ../obj/i686-m64/liboolm.a(Vocab.o): In function `LHash::remove(unsigned int, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:473: undefined reference to `LHash::removedData' ../obj/i686-m64/liboolm.a(Vocab.o): In function `LHash::remove(char const*, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash::removedData' ../obj/i686-m64/liboolm.a(Vocab.o): In function `Map_noKey': /usr/include/bits/string3.h:52: undefined reference to `LHash::removedData' ../obj/i686-m64/liboolm.a(Vocab.o): In function `LHash::remove(char const*, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:473: undefined reference to `LHash::removedData' ../obj/i686-m64/liboolm.a(SubVocab.o): In function `SubVocab::addWord(unsigned int)': /home/ak/Downloads/srilm/lm/src/SubVocab.cc:80: undefined reference to `LHash::getInternalKey(char const*, bool&) const' ../obj/i686-m64/liboolm.a(MultiwordVocab.o): In function `LHash::remove(unsigned int, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:473: undefined reference to `LHash::removedData' ../obj/i686-m64/liboolm.a(LM.o): In function `~VocabIter': /home/ak/Downloads/srilm/lm/src/Vocab.h:258: undefined reference to `LHashIter::~LHashIter()' /home/ak/Downloads/srilm/lm/src/Vocab.h:258: undefined reference to `LHashIter::~LHashIter()' /home/ak/Downloads/srilm/lm/src/Vocab.h:258: undefined reference to `LHashIter::~LHashIter()' /home/ak/Downloads/srilm/lm/src/Vocab.h:258: undefined reference to `LHashIter::~LHashIter()' ../obj/i686-m64/liboolm.a(LM.o): In function `LM::pplCountsFile(File&, unsigned int, TextStats&, char const*, bool)': /home/ak/Downloads/srilm/lm/src/LM.cc:569: undefined reference to `NgramCounts::parseNgram(char*, char const**, unsigned int, unsigned int&)' ../obj/i686-m64/liboolm.a(LM.o): In function `NgramStats': /home/ak/Downloads/srilm/lm/src/NgramStats.h:150: undefined reference to `NgramCounts::NgramCounts(Vocab&, unsigned int)' ../obj/i686-m64/liboolm.a(LM.o): In function `NgramCountsIter': /home/ak/Downloads/srilm/lm/src/NgramStats.h:115: undefined reference to `TrieIter2::TrieIter2(Trie const&, unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(LM.o): In function `NgramCounts::write(File&)': /home/ak/Downloads/srilm/lm/src/NgramStats.h:70: undefined reference to `NgramCounts::write(File&, unsigned int, bool)' ../obj/i686-m64/liboolm.a(LM.o): In function `NgramCounts::read(File&)': 
/home/ak/Downloads/srilm/lm/src/NgramStats.h:67: undefined reference to `NgramCounts::read(File&, unsigned int)' ../obj/i686-m64/liboolm.a(LM.o):(.rodata._ZTV10NgramStats[vtable for NgramStats]+0x68): undefined reference to `NgramCounts::memStats(MemStats&)' ../obj/i686-m64/liboolm.a(LM.o):(.rodata._ZTV10NgramStats[vtable for NgramStats]+0x70): undefined reference to `NgramCounts::countSentence(char const* const*, unsigned int)' ../obj/i686-m64/liboolm.a(LM.o):(.rodata._ZTV10NgramStats[vtable for NgramStats]+0x78): undefined reference to `NgramCounts::countSentence(unsigned int const*, unsigned int)' ../obj/i686-m64/liboolm.a(LM.o):(.rodata._ZTV11NgramCountsIjE[vtable for NgramCounts]+0x68): undefined reference to `NgramCounts::memStats(MemStats&)' ../obj/i686-m64/liboolm.a(LM.o):(.rodata._ZTV11NgramCountsIjE[vtable for NgramCounts]+0x70): undefined reference to `NgramCounts::countSentence(char const* const*, unsigned int)' ../obj/i686-m64/liboolm.a(LM.o):(.rodata._ZTV11NgramCountsIjE[vtable for NgramCounts]+0x78): undefined reference to `NgramCounts::countSentence(unsigned int const*, unsigned int)' ../obj/i686-m64/liboolm.a(NgramLM.o): In function `LHash::remove(unsigned int, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:473: undefined reference to `LHash::removedData' ../obj/i686-m64/liboolm.a(NgramLM.o): In function `LHash >::remove(unsigned int, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to `LHash >::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash >::removedData' /home/ak/Downloads/srilm/include/LHash.cc:473: undefined reference to `LHash >::removedData' ../obj/i686-m64/liboolm.a(Discount.o): In function `~VocabIter': /home/ak/Downloads/srilm/lm/src/Vocab.h:258: undefined reference to `LHashIter::~LHashIter()' /home/ak/Downloads/srilm/lm/src/Vocab.h:258: undefined reference to `LHashIter::~LHashIter()' ../obj/i686-m64/liboolm.a(Discount.o): In function `NgramCountsIter': /home/ak/Downloads/srilm/lm/src/NgramStats.h:115: undefined reference to `TrieIter2::TrieIter2(Trie const&, unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(Discount.o): In function `Vocab::isMetaTag(unsigned int)': /home/ak/Downloads/srilm/lm/src/Vocab.h:177: undefined reference to `LHash::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(Discount.o): In function `Vocab::typeOfMetaTag(unsigned int)': /home/ak/Downloads/srilm/lm/src/Vocab.h:179: undefined reference to `LHash::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(Discount.o): In function `NgramsIter': /home/ak/Downloads/srilm/lm/src/NgramStats.h:115: undefined reference to `TrieIter2::TrieIter2(Trie const&, unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(Discount.o): In function `Vocab::isMetaTag(unsigned int)': /home/ak/Downloads/srilm/lm/src/Vocab.h:177: undefined reference to `LHash::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(Discount.o): In function `Vocab::typeOfMetaTag(unsigned int)': /home/ak/Downloads/srilm/lm/src/Vocab.h:179: undefined reference to `LHash::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(Discount.o): In function `NgramCountsIter': /home/ak/Downloads/srilm/lm/src/NgramStats.h:115: undefined reference to `TrieIter2::TrieIter2(Trie const&, 
unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' /home/ak/Downloads/srilm/lm/src/NgramStats.h:115: undefined reference to `TrieIter2::TrieIter2(Trie const&, unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(Discount.o): In function `Trie::find(unsigned int const*, bool&) const': /home/ak/Downloads/srilm/include/Trie.h:124: undefined reference to `Trie::findTrie(unsigned int const*, bool&) const' ../obj/i686-m64/liboolm.a(Discount.o): In function `NgramsIter': /home/ak/Downloads/srilm/lm/src/NgramStats.h:115: undefined reference to `TrieIter2::TrieIter2(Trie const&, unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(Discount.o): In function `Vocab::isMetaTag(unsigned int)': /home/ak/Downloads/srilm/lm/src/Vocab.h:177: undefined reference to `LHash::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(Discount.o): In function `Vocab::typeOfMetaTag(unsigned int)': /home/ak/Downloads/srilm/lm/src/Vocab.h:179: undefined reference to `LHash::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(ClassNgram.o): In function `Map2::clear()': ClassNgram.cc:(.text._ZN4Map2IjPKjdE5clearEv[Map2::clear()]+0xbc): undefined reference to `LHash >::removedData' ClassNgram.cc:(.text._ZN4Map2IjPKjdE5clearEv[Map2::clear()]+0xda): undefined reference to `LHash >::removedData' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `SkipNgram::estimateMstep(NgramStats&, NgramCounts&, LHash&, Discount**)': /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:344: undefined reference to `LHashIter::LHashIter(LHash const&, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `NgramCounts::findCount(unsigned int const*, unsigned int)': /home/ak/Downloads/srilm/lm/src/NgramStats.h:47: undefined reference to `Trie::findTrie(unsigned int const*, bool&) const' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `Trie::findTrie(unsigned int, bool&) const': /home/ak/Downloads/srilm/include/Trie.h:145: undefined reference to `LHash >::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `SkipNgram::wordProb(unsigned int, unsigned int const*)': /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:68: undefined reference to `LHash::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `SkipNgram::write(File&)': /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:141: undefined reference to `LHashIter::LHashIter(LHash const&, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `SkipNgram': /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:26: undefined reference to `LHash::LHash(unsigned int)' /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:26: undefined reference to `LHash::LHash(unsigned int)' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `SkipNgram::estimateEstepNgram(unsigned int*, unsigned int, NgramStats&, NgramCounts&, LHash&)': /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:221: undefined reference to `LHash::find(unsigned int, bool&) const' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `NgramCounts::findCount(unsigned int const*, unsigned int)': /home/ak/Downloads/srilm/lm/src/NgramStats.h:47: undefined reference to `Trie::findTrie(unsigned int const*, bool&) const' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `Trie::findTrie(unsigned int, bool&) const': /home/ak/Downloads/srilm/include/Trie.h:145: undefined reference to `LHash >::find(unsigned int, bool&) const' 
../obj/i686-m64/liboolm.a(SkipNgram.o): In function `NgramCountsIter': /home/ak/Downloads/srilm/lm/src/NgramStats.h:115: undefined reference to `TrieIter2::TrieIter2(Trie const&, unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' /home/ak/Downloads/srilm/lm/src/NgramStats.h:122: undefined reference to `TrieIter2::TrieIter2(Trie const&, unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `~NgramCounts': /home/ak/Downloads/srilm/lm/src/NgramStats.h:37: undefined reference to `Trie::~Trie()' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `SkipNgram::estimate(NgramStats&, Discount**)': /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:178: undefined reference to `NgramCounts::NgramCounts(Vocab&, unsigned int)' /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:179: undefined reference to `LHash::LHash(unsigned int)' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `~NgramCounts': /home/ak/Downloads/srilm/lm/src/NgramStats.h:37: undefined reference to `Trie::~Trie()' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `~VocabIter': /home/ak/Downloads/srilm/lm/src/Vocab.h:258: undefined reference to `LHashIter::~LHashIter()' /home/ak/Downloads/srilm/lm/src/Vocab.h:258: undefined reference to `LHashIter::~LHashIter()' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `SkipNgram::memStats(MemStats&)': /home/ak/Downloads/srilm/lm/src/SkipNgram.cc:34: undefined reference to `LHash::memStats(MemStats&) const' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `~NgramCounts': /home/ak/Downloads/srilm/lm/src/NgramStats.h:37: undefined reference to `Trie::~Trie()' /home/ak/Downloads/srilm/lm/src/NgramStats.h:37: undefined reference to `Trie::~Trie()' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `NgramCounts::write(File&)': /home/ak/Downloads/srilm/lm/src/NgramStats.h:70: undefined reference to `NgramCounts::write(File&, unsigned int, bool)' ../obj/i686-m64/liboolm.a(SkipNgram.o): In function `NgramCounts::read(File&)': /home/ak/Downloads/srilm/lm/src/NgramStats.h:67: undefined reference to `NgramCounts::read(File&, unsigned int)' ../obj/i686-m64/liboolm.a(SkipNgram.o):(.rodata._ZTV11NgramCountsIdE[vtable for NgramCounts]+0x68): undefined reference to `NgramCounts::memStats(MemStats&)' ../obj/i686-m64/liboolm.a(SkipNgram.o):(.rodata._ZTV11NgramCountsIdE[vtable for NgramCounts]+0x70): undefined reference to `NgramCounts::countSentence(char const* const*, double)' ../obj/i686-m64/liboolm.a(SkipNgram.o):(.rodata._ZTV11NgramCountsIdE[vtable for NgramCounts]+0x78): undefined reference to `NgramCounts::countSentence(unsigned int const*, double)' ../obj/i686-m64/liboolm.a(TaggedNgram.o): In function `NgramBOsIter': /home/ak/Downloads/srilm/lm/src/Ngram.h:139: undefined reference to `TrieIter2::TrieIter2(Trie const&, unsigned int*, unsigned int, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(TaggedNgram.o): In function `NgramProbsIter': /home/ak/Downloads/srilm/lm/src/Ngram.h:157: undefined reference to `LHashIter::LHashIter(LHash const&, int (*)(unsigned int, unsigned int))' ../obj/i686-m64/liboolm.a(TaggedNgram.o): In function `~NgramProbsIter': /home/ak/Downloads/srilm/lm/src/Ngram.h:153: undefined reference to `LHashIter::~LHashIter()' /home/ak/Downloads/srilm/lm/src/Ngram.h:153: undefined reference to `LHashIter::~LHashIter()' ../obj/i686-m64/liboolm.a(WordMesh.o): In function `LHash::remove(unsigned int, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to 
`LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash::removedData' ../obj/i686-m64/liboolm.a(VocabMultiMap.o): In function `LHash::remove(unsigned int const*, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash::removedData' ../obj/i686-m64/liboolm.a(VocabMultiMap.o): In function `Map_noKey': /usr/include/bits/string3.h:52: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o): In function `ProductNgram::read(File&, bool)': /home/ak/Downloads/srilm/flm/src/ProductNgram.cc:54: undefined reference to `FNgramSpecs::FNgramSpecs(File&, FactoredVocab&, unsigned int)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o): In function `FNgramStats': /home/ak/Downloads/srilm/flm/src/FNgramStats.h:148: undefined reference to `FNgramCounts::FNgramCounts(FactoredVocab&, FNgramSpecs&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o): In function `ProductNgram::read(File&, bool)': /home/ak/Downloads/srilm/flm/src/ProductNgram.cc:69: undefined reference to `FNgramCounts::read()' /home/ak/Downloads/srilm/flm/src/ProductNgram.cc:74: undefined reference to `FNgramCounts::estimateDiscounts()' /home/ak/Downloads/srilm/flm/src/ProductNgram.cc:75: undefined reference to `FNgramCounts::computeCardinalityFunctions()' /home/ak/Downloads/srilm/flm/src/ProductNgram.cc:76: undefined reference to `FNgramCounts::sumCounts()' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o): In function `FNgramCounts::read(File&)': /home/ak/Downloads/srilm/flm/src/FNgramStats.h:83: undefined reference to `FNgramCounts::read()' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o): In function `FNgramCounts::write(File&)': /home/ak/Downloads/srilm/flm/src/FNgramStats.h:99: undefined reference to `FNgramCounts::write(bool)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o):(.rodata._ZTV11FNgramStats[vtable for FNgramStats]+0x50): undefined reference to `FNgramCounts::countFile(File&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o):(.rodata._ZTV11FNgramStats[vtable for FNgramStats]+0x68): undefined reference to `FNgramCounts::memStats(MemStats&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o):(.rodata._ZTV11FNgramStats[vtable for FNgramStats]+0x78): undefined reference to `FNgramCounts::countSentence(unsigned int, unsigned int, WidMatrix&, unsigned int)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o):(.rodata._ZTV11FNgramStats[vtable for FNgramStats]+0x80): undefined reference to `FNgramCounts::countSentence(char const* const*, unsigned int)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o):(.rodata._ZTV12FNgramCountsIjE[vtable for FNgramCounts]+0x50): undefined reference to `FNgramCounts::countFile(File&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o):(.rodata._ZTV12FNgramCountsIjE[vtable for FNgramCounts]+0x68): undefined reference to `FNgramCounts::memStats(MemStats&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o):(.rodata._ZTV12FNgramCountsIjE[vtable for FNgramCounts]+0x78): undefined reference to `FNgramCounts::countSentence(unsigned int, unsigned int, WidMatrix&, unsigned int)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(ProductNgram.o):(.rodata._ZTV12FNgramCountsIjE[vtable for FNgramCounts]+0x80): undefined reference to `FNgramCounts::countSentence(char 
const* const*, unsigned int)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FactoredVocab.o): In function `FactoredVocab::getIndex(char const*, unsigned int)': /home/ak/Downloads/srilm/flm/src/FactoredVocab.cc:279: undefined reference to `FNgramSpecs::getTag(char const*)' /home/ak/Downloads/srilm/flm/src/FactoredVocab.cc:282: undefined reference to `FNgramSpecs::wordTag()' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FactoredVocab.o): In function `FactoredVocab::addWord(char const*)': /home/ak/Downloads/srilm/flm/src/FactoredVocab.cc:193: undefined reference to `FNgramSpecs::getTag(char const*)' /home/ak/Downloads/srilm/flm/src/FactoredVocab.cc:196: undefined reference to `FNgramSpecs::wordTag()' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FactoredVocab.o): In function `FactoredVocab::addWord2(char const*, bool&)': /home/ak/Downloads/srilm/flm/src/FactoredVocab.cc:228: undefined reference to `FNgramSpecs::getTag(char const*)' /home/ak/Downloads/srilm/flm/src/FactoredVocab.cc:231: undefined reference to `FNgramSpecs::wordTag()' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::recomputeBOWs()': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:2162: undefined reference to `FNgramSpecs::FNgramSpec::LevelIter::next(unsigned int&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::bgChildProbBO(unsigned int, unsigned int const*, unsigned int, unsigned int, unsigned int)': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:685: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::BGChildIterCnstr(unsigned int, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:686: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:706: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::BGChildIterCnstr(unsigned int, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:707: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:726: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::BGChildIterCnstr(unsigned int, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:727: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:744: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::BGChildIterCnstr(unsigned int, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:746: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::boNode(unsigned int, unsigned int const*, unsigned int, unsigned int, unsigned int)': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:544: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::BGChildIterCnstr(unsigned int, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:554: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:567: undefined reference to `FNgramSpecs::FNgramSpec::ParentSubset::backoffValueRSubCtxW(unsigned int, unsigned int const*, unsigned int, BackoffNodeStrategy, FNgram&, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:554: undefined reference to 
`FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:567: undefined reference to `FNgramSpecs::FNgramSpec::ParentSubset::backoffValueRSubCtxW(unsigned int, unsigned int const*, unsigned int, BackoffNodeStrategy, FNgram&, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:601: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:608: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::BGChildIterCnstr(unsigned int, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:617: undefined reference to `FNgramSpecs::FNgramSpec::ParentSubset::backoffValueRSubCtxW(unsigned int, unsigned int const*, unsigned int, BackoffNodeStrategy, FNgram&, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:610: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:617: undefined reference to `FNgramSpecs::FNgramSpec::ParentSubset::backoffValueRSubCtxW(unsigned int, unsigned int const*, unsigned int, BackoffNodeStrategy, FNgram&, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:610: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:627: undefined reference to `FNgramSpecs::FNgramSpec::BGGrandChildIter::BGGrandChildIter(unsigned int, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:636: undefined reference to `FNgramSpecs::FNgramSpec::ParentSubset::backoffValueRSubCtxW(unsigned int, unsigned int const*, unsigned int, BackoffNodeStrategy, FNgram&, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:629: undefined reference to `FNgramSpecs::FNgramSpec::BGGrandChildIter::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:636: undefined reference to `FNgramSpecs::FNgramSpec::ParentSubset::backoffValueRSubCtxW(unsigned int, unsigned int const*, unsigned int, BackoffNodeStrategy, FNgram&, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:629: undefined reference to `FNgramSpecs::FNgramSpec::BGGrandChildIter::next(unsigned int&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `LHash >::remove(unsigned int, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to `LHash >::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash >::removedData' /home/ak/Downloads/srilm/include/LHash.cc:473: undefined reference to `LHash >::removedData' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `LHash::remove(unsigned int, bool&)': /home/ak/Downloads/srilm/include/LHash.cc:416: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:417: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/include/LHash.cc:473: undefined reference to `LHash::removedData' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::wordProbSum()': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:2930: undefined reference to `FNgramSpecs::FNgramSpec::LevelIter::next(unsigned int&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::rescoreFile(File&, double, double, LM&, double, double, char const*)': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:2805: undefined reference to 
`FNgramSpecs::loadWordFactors(char const* const*, WordMatrix&, unsigned int)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::pplFile(File&, TextStats&, char const*)': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:2684: undefined reference to `FNgramSpecs::loadWordFactors(char const* const*, WordMatrix&, unsigned int)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::computeBOWs(unsigned int, unsigned int)': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:2028: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::BGChildIterCnstr(unsigned int, unsigned int, unsigned int)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:2030: undefined reference to `FNgramSpecs::FNgramSpec::BGChildIterCnstr::next(unsigned int&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::write(unsigned int, File&)': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:1256: undefined reference to `FNgramSpecs::FNgramSpec::LevelIter::next(unsigned int&)' /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:1265: undefined reference to `FNgramSpecs::FNgramSpec::LevelIter::next(unsigned int&)' /home/ak/Downloads/srilm/lib/i686-m64/libflm.a(FNgramLM.o): In function `FNgram::estimate(unsigned int)': /home/ak/Downloads/srilm/flm/src/FNgramLM.cc:1433: undefined reference to `FNgramSpecs::FNgramSpec::LevelIter::next(unsigned int&)' collect2: ld returned 1 exit status /home/ak/Downloads/srilm/sbin/decipher-install 0555 ../bin/i686-m64/ngram /home/ak/Downloads/srilm/bin/i686-m64 ERROR: File to be installed (../bin/i686-m64/ngram) does not exist. ERROR: File to be installed (../bin/i686-m64/ngram) is not a plain file. WARNING: creating directory /home/ak/Downloads/srilm/bin/i686-m64 Usage: decipher-install ... mode: file permission mode, in octal file1 ... fileN: files to be installed directory: where the files should be installed files = ../bin/i686-m64/ngram directory = /home/ak/Downloads/srilm/bin/i686-m64 mode = 0555 ***************************************************************************************************** Thanks a lot in advance. Sincere Regards, Anand Karthik -------------- next part -------------- An HTML attachment was scrubbed... URL: From amber.wilcox.ohearn at gmail.com Sat Feb 11 10:10:28 2012 From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn) Date: Sat, 11 Feb 2012 11:10:28 -0700 Subject: [SRILM User List] Question about SRILM and sentence boundary detection In-Reply-To: <4F2B2FF3.2070602@icsi.berkeley.edu> References: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de> <4F29A435.5050905@icsi.berkeley.edu> <4F2B2FF3.2070602@icsi.berkeley.edu> Message-ID: On Thu, Feb 2, 2012 at 5:53 PM, Andreas Stolcke wrote: > On 2/2/2012 8:29 AM, L. Amber Wilcox-O'Hearn wrote: >> >> I'm not sure if SRILM has something that does that -- i.e. holds the >> whole LM in RAM and waits for queries. ?You might need something like >> that as opposed to using a whole file, if you want just the >> probabilities of the last word with respect to the previous, and you >> want to compare different last words depending on results of previous >> calculations, for example. > > Two SRILM solutions: > > 1- Start ngram -lm LM -escape "===" -counts - (read from stdin) and put an > escape line (in this case, starting with "===") after every ngram in the > input (make sure the ngram words are followed my a count "1"). 
> This will cause ngram to dump out the conditional prob for the ngram right > away (instead of waiting for end-of-file). > > 2. Directly access the network LM server protocol implemented by ngram > -server-port. > Start the server with > % ngram -lm LM -server-port 8888 > then write ngrams to that TCP port and read back the log probs: > > % telnet localhost 8888 > my first word << input > -4.6499 >> output > > Of course you would do the equivalent of telnet in perl, python, C, or some > other language to make use of the probabilities. Thank you, Andreas. I wasn't aware of these capabilities. The server-port worked exactly as expected. That is, if I give it w1 w2 w3, it returns p(w3|w1w2). Combined with the caching, it looks very promising for my applications. The other solution using -counts (or actually -ppl for my case) also worked, but of course if I give it w1 w2 w3, it returns the probability of that whole string, i.e. p(w1) * p(w2|w1) * p(w3|w1w2), which would be redundant for my purposes. I ran > cat input_text | ngram -lm my_lm -escape "===" -ppl - -unk -no-sos -no-eos where input_text looked like: w1 w2 w3 === w1 w2 w3' Still, I'm glad it was brought up, because SRILM has so much functionality that I had overlooked something directly useful to me. Amber -- http://scholar.google.com/citations?user=15gGywMAAAAJ From alexx.tudor at gmail.com Sat Feb 11 15:20:37 2012 From: alexx.tudor at gmail.com (alex tudor) Date: Sun, 12 Feb 2012 01:20:37 +0200 Subject: [SRILM User List] SRILM install: LM.cc error Message-ID: Hello everyone, I compiled SRILM with Cygwin under Windows XP. First I had: -bash: LANG=${locale -uU}: bad substitution Afterwards all worked fine until I compiled make World and I had this error: LM.cc: In member function 'virtual unsigned int LM::probServer(unsigned int, unsigned int)': LM.cc:893:38: error: call of overloaded 'waitpid(int, NULL, int)' is ambiguous /usr/include/sys/wait.h:38:7: note: candidates are: pid_t waitpid(pid_t, int*, int) /usr/include/sys/wait.h:84:14: note: pid_t waitpid(pid_t, wait*, int) /cygdrive/c/srilm13/common/Makefile.common.targets:93: recipe for target '../obj/cygwin/LM.o' failed What can I do? Thanks in advance! Cheers, Alex -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Sat Feb 11 19:53:52 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sat, 11 Feb 2012 19:53:52 -0800 Subject: [SRILM User List] SRILM install: LM.cc error In-Reply-To: References: Message-ID: <4F3737D0.2060200@icsi.berkeley.edu> On 2/11/2012 3:20 PM, alex tudor wrote: > Hello everyone, > > I compiled SRILM with Cygwin under Windows XP. First I had: > > -bash: LANG=${locale -uU}: bad substitution > > Afterwards all worked fine until I compiled make World and I had > this error: > > LM.cc: In member function 'virtual unsigned int > LM::probServer(unsigned int, unsigned int)': > LM.cc:893:38: error: call of overloaded 'waitpid(int, NULL, int)' is > ambiguous > /usr/include/sys/wait.h:38:7: note: candidates are: pid_t > waitpid(pid_t, int*, int) > /usr/include/sys/wait.h:84:14: note: pid_t waitpid(pid_t, wait*, int) > /cygdrive/c/srilm13/common/Makefile.common.targets:93: recipe for > target '../obj/cygwin/LM.o' failed > > What can I do? Try replacing the line while (waitpid(-1, NULL, WNOHANG) > 0) { with while (waitpid(-1, (int *)NULL, WNOHANG) > 0) { Let me know if that works. Andreas -------------- next part -------------- An HTML attachment was scrubbed...
URL: From stolcke at icsi.berkeley.edu Sat Feb 11 20:03:36 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sat, 11 Feb 2012 20:03:36 -0800 Subject: [SRILM User List] Please help : Problem with installation of SRILM 1.4.6 on ubuntu 10.04 amd-64 bit machine In-Reply-To: References: Message-ID: <4F373A18.9020405@icsi.berkeley.edu> On 2/11/2012 2:31 AM, Anand Karthik wrote: > Hello, > I'm trying to install SRILM 1.4.6 on Ubuntu 10.04, a 64-bit AMD > machine. I have turned TCL off. > I have read the user archive and couldn't find a solution to the > problem. Please help me with the same. > > I'm using the following command: > make MACHINE_TYPE=i686-m64 SRILM=$PWD CC=/usr/bin/gcc CXX=/usr/bin/g++ > NO_TCL=X TCL_INCLUDE= TCL_LIBRARY= 2>&1 > make.log.txt > > uname -a > Linux ubuntu 2.6.32-38-generic #83-Ubuntu SMP Wed Jan 4 11:12:07 UTC > 2012 x86_64 GNU/Linux > > gcc version : > > Target: x86_64-linux-gnu > Configured with: ../src/configure -v --with-pkgversion='Ubuntu > 4.4.3-4ubuntu5' I cannot reproduce this error, even with the same gcc version on Ubuntu. The first thing to try when you have problems is always to get the latest version of SRILM. The current release is 1.6.0; you are using a version that is quite old. Andreas From prochva1 at fel.cvut.cz Sun Feb 12 01:37:10 2012 From: prochva1 at fel.cvut.cz (prochva1 at fel.cvut.cz) Date: Sun, 12 Feb 2012 10:37:10 +0100 Subject: [SRILM User List] SRILM install: LM.cc error In-Reply-To: <4F3737D0.2060200@icsi.berkeley.edu> References: <4F3737D0.2060200@icsi.berkeley.edu> Message-ID: <20120212103710.Horde.9IvYUuIAEqdPN4hG8jCRf1A@wimap.feld.cvut.cz> Quoting Andreas Stolcke: > On 2/11/2012 3:20 PM, alex tudor wrote: >> Hello everyone, >> >> I compiled SRILM with Cygwin under Windows XP. First I had: >> >> -bash: LANG=${locale -uU}: bad substitution >> >> Afterwards all worked fine until I compiled make World and I had >> this error: >> >> LM.cc: In member function 'virtual unsigned int >> LM::probServer(unsigned int, unsigned int)': >> LM.cc:893:38: error: call of overloaded 'waitpid(int, NULL, int)' >> is ambiguous >> /usr/include/sys/wait.h:38:7: note: candidates are: pid_t >> waitpid(pid_t, int*, int) >> /usr/include/sys/wait.h:84:14: note: pid_t waitpid(pid_t, wait*, int) >> /cygdrive/c/srilm13/common/Makefile.common.targets:93: recipe for >> target '../obj/cygwin/LM.o' failed >> >> What can I do? > Try replacing the line > > while (waitpid(-1, NULL, WNOHANG) > 0) { > > with > while (waitpid(-1, (int *)NULL, WNOHANG) > 0) { > > Let me know if that works. > > Andreas Hello, AFAICS both are base-files package/cygwin core problems (regressions from previous versions); both are already reported, and the second one seems to be fixed in cygwin CVS/snapshots ( http://cygwin.com/snapshots/ ). >> -bash: LANG=${locale -uU}: bad substitution http://cygwin.com/ml/cygwin/2012-02/msg00335.html waitpid overload problem http://cygwin.com/ml/cygwin/2012-02/msg00184.html http://cygwin.com/ml/cygwin-patches/2012-q1/msg00016.html Vaclav From zeinab.vakil at gmail.com Sun Feb 12 04:35:15 2012 From: zeinab.vakil at gmail.com (zeinab vakil) Date: Sun, 12 Feb 2012 16:05:15 +0330 Subject: [SRILM User List] Predicting specified words In-Reply-To: <4F335D3D.4070702@icsi.berkeley.edu> References: <4F335D3D.4070702@icsi.berkeley.edu> Message-ID: On 2/9/12, Andreas Stolcke wrote: > On 2/5/2012 8:14 PM, zeinab vakil wrote: >> Dear All, >> hi, >> I am newly getting to know SRILM and have a question.
>> Is it possible to use SRILM to predict a word that starts with a certain >> character? For example, the sentence is "i go to h...", and we want the >> word w that has the highest probability P(w | "i go to"), or even >> P(w | "to"), and starts with 'h'. >> Please guide me. >> Best Regards, >> zeinab vakil. > Boy, there seems to be a lot of interest lately in this sort of > prediction problem (see previous posts on this list). > > No, there is no ready-made solution for this in SRILM. I would probably > try to build a mixed word/letter ngram LM, estimating probabilities > p(next-letter | word-2, word-3, letter-1, letter-2, letter-3, ...). > > Andreas > Thanks for all the guidance. How can I query SRILM to obtain the probability P(word-1|word-2)? I want to use SRILM as a server, send my queries to it, and receive the probability of the requested bi-gram or n-gram. Is this possible? Please guide me. best regards, zeinab From alexx.tudor at gmail.com Sun Feb 12 05:27:33 2012 From: alexx.tudor at gmail.com (alex tudor) Date: Sun, 12 Feb 2012 15:27:33 +0200 Subject: [SRILM User List] Fwd: SRILM install: LM.cc error In-Reply-To: References: <4F3737D0.2060200@icsi.berkeley.edu> <20120212103710.Horde.9IvYUuIAEqdPN4hG8jCRf1A@wimap.feld.cvut.cz> Message-ID: ---------- Forwarded message ---------- From: alex tudor Date: Sun, Feb 12, 2012 at 3:23 PM Subject: Re: [SRILM User List] SRILM install: LM.cc error To: prochva1 at fel.cvut.cz Andreas, it works! Thank you! Vaclav, I read it, but that package fix isn't in the cygwin install yet. I'll try to download it separately. Now I have another problem: make[2]: Entering directory `/cygdrive/c/srilm13/dstruct/src' gcc -Wall -Wno-unused-variable -Wno-uninitialized -Wimplicit-int -I. -I../../include -c -g -O2 -o ../obj/cygwin/maxalloc.o maxalloc.c g++ -Wall -Wno-unused-variable -Wno-uninitialized -DINSTANTIATE_TEMPLATES -I. -I../../include -L../../lib/cygwin -g -O2 -o ../bin/cygwin/maxalloc.exe ../obj/cygwin/maxalloc.o ../obj/cygwin/libdstruct.a -lm ../../lib/cygwin/libmisc.a -ltcl84 -lm /usr/lib/gcc/i686-pc-cygwin/4.5.3/../../../../i686-pc-cygwin/bin/ld: cannot find -ltcl84 collect2: ld returned 1 exit status /cygdrive/c/srilm13/common/Makefile.common.targets:108: recipe for target `../bin/cygwin/maxalloc.exe' failed I suppose I need tcl-tk 8.4, but cygwin only has 8.5.11. Any ideas? Alex On Sun, Feb 12, 2012 at 5:53 AM, Andreas Stolcke wrote: > > Try replacing the line > > while (waitpid(-1, NULL, WNOHANG) > 0) { > > with > while (waitpid(-1, (int *)NULL, WNOHANG) > 0) { > > Let me know if that works. > > Andreas > On Sun, Feb 12, 2012 at 11:37 AM, wrote: > > Hello, > > AFAICS both are base-files package/cygwin core problems (regressions from > previous versions); both are already reported, and the second one seems to be > fixed in cygwin CVS/snapshots ( http://cygwin.com/snapshots/ ). > >> -bash: LANG=${locale -uU}: bad substitution > > http://cygwin.com/ml/cygwin/2012-02/msg00335.html > > waitpid overload problem > > http://cygwin.com/ml/cygwin/2012-02/msg00184.html > http://cygwin.com/ml/cygwin-patches/2012-q1/msg00016.html > > Vaclav > > > > _______________________________________________ > SRILM-User site list > SRILM-User at speech.sri.com > http://www.speech.sri.com/mailman/listinfo/srilm-user -------------- next part -------------- An HTML attachment was scrubbed...
URL: From stolcke at icsi.berkeley.edu Sun Feb 12 17:24:43 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sun, 12 Feb 2012 17:24:43 -0800 Subject: [SRILM User List] Fwd: SRILM install: LM.cc error In-Reply-To: References: <4F3737D0.2060200@icsi.berkeley.edu> <20120212103710.Horde.9IvYUuIAEqdPN4hG8jCRf1A@wimap.feld.cvut.cz> Message-ID: <4F38665B.9070808@icsi.berkeley.edu> On 2/12/2012 5:27 AM, alex tudor wrote: > > ---------- Forwarded message ---------- > From: alex tudor > Date: Sun, Feb 12, 2012 at 3:23 PM > Subject: Re: [SRILM User List] SRILM install: LM.cc error > To: prochva1 at fel.cvut.cz > > Andreas, it works! Thank you! > Vaclav, I read it, but that package fix isn't in the cygwin install > yet. I'll try to download it separately. > Now I have another problem: > > make[2]: Entering directory `/cygdrive/c/srilm13/dstruct/src' > gcc -Wall -Wno-unused-variable -Wno-uninitialized -Wimplicit-int > -I. -I../../include -c -g -O2 -o ../obj/cygwin/maxalloc.o maxalloc.c > g++ -Wall -Wno-unused-variable -Wno-uninitialized > -DINSTANTIATE_TEMPLATES -I. -I../../include -L../../lib/cygwin > -g -O2 -o ../bin/cygwin/maxalloc.exe ../obj/cygwin/maxalloc.o > ../obj/cygwin/libdstruct.a -lm ../../lib/cygwin/libmisc.a -ltcl84 -lm > /usr/lib/gcc/i686-pc-cygwin/4.5.3/../../../../i686-pc-cygwin/bin/ld: > cannot find -ltcl84 > collect2: ld returned 1 exit status > /cygdrive/c/srilm13/common/Makefile.common.targets:108: recipe for > target `../bin/cygwin/maxalloc.exe' failed > > I suppose I need tcl-tk 8.4, but cygwin only has 8.5.11. Any ideas? You should be able to build with any recent Tcl version, possibly adjusting the name of the library. In the worst case just disable Tcl support as described in the FAQ. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL: From stolcke at icsi.berkeley.edu Sun Feb 12 17:37:52 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Sun, 12 Feb 2012 17:37:52 -0800 Subject: [SRILM User List] Question about SRILM and sentence boundary detection In-Reply-To: References: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de> <4F29A435.5050905@icsi.berkeley.edu> <4F2B2FF3.2070602@icsi.berkeley.edu> Message-ID: <4F386970.1040200@icsi.berkeley.edu> From: L. Amber Wilcox-O'Hearn > > Thank you, Andreas. I wasn't aware of these capabilities. > > The server-port worked exactly as expected. That is, if I give it w1 > w2 w3, it returns p(w3|w1w2). Combined with the caching, it looks > very promising for my applications. > > The other solution using -counts (or actually -ppl for my case) also > worked, but of course if I give it w1 w2 w3, it returns the > probability of that whole string, i.e. p(w1) * p(w2|w1) * p(w3|w1w2), > which would be redundant for my purposes. That's not correct. ngram -counts will output CONDITIONAL ngram probabilities. -counts countsfile: Perform a computation similar to -ppl, but based only on the N-gram counts found in countsfile. Probabilities are computed for the last word of each N-gram, using the other words as contexts, and scaling by the associated N-gram count. Summary statistics are output at the end, as well as before each escaped input line. So it should do exactly what you need. Andreas -------------- next part -------------- An HTML attachment was scrubbed... URL:
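For concreteness, the countsfile input implied above is one N-gram per line followed by its count; combined with -escape, a batch of conditional-probability queries can look like this (a hypothetical input with placeholder words, each count set to 1 as suggested earlier in this thread):

    w1 w2 w3 1
    ===
    w1 w2 w3' 1
    ===

Each block then yields the conditional probability of the last word given the preceding ones, scaled by the trailing count.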
From stolcke at icsi.berkeley.edu Mon Feb 13 07:38:26 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Mon, 13 Feb 2012 07:38:26 -0800 Subject: [SRILM User List] Predicting specified words In-Reply-To: References: <4F335D3D.4070702@icsi.berkeley.edu> Message-ID: <4F392E72.7030703@icsi.berkeley.edu> On 2/12/2012 4:35 AM, zeinab vakil wrote: > > Thanks for all the guidance. > How can I query SRILM to obtain the probability P(word-1|word-2)? > I want to use SRILM as a server, send my queries to it, and receive > the probability of the requested bi-gram or n-gram. Is this possible? > Please guide me. > best regards, > zeinab If you want to invoke SRILM via the C++ API, use the wordProb() function. The other options are writing/reading to/from ngram via a pipe, or using the ngram client/server protocol. See this recent thread for details: http://www.speech.sri.com/pipermail/srilm-user/2012q1/001148.html . Andreas
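A hedged Python sketch of the pipe option just mentioned, for one conditional bigram probability at a time. The LM file name is a placeholder, and the output parsing assumes that -counts input with -debug 2 prints per-ngram "p( ... )" lines the same way -ppl does; verify against your ngram version before relying on it:

    import subprocess

    def bigram_logprob(w2, w1, lm="model-name.lm"):
        # Feed one bigram with count 1 on stdin; -counts treats the last
        # word as the predicted one, so this asks for log10 P(w1 | w2).
        result = subprocess.run(
            ["ngram", "-lm", lm, "-order", "2", "-counts", "-", "-debug", "2"],
            input=w2 + " " + w1 + " 1\n",
            capture_output=True, text=True, check=True)
        for line in result.stdout.splitlines():
            if line.lstrip().startswith("p("):
                # e.g. p( w1 | w2 ) = [2gram] 0.10582 [ -0.975432 ]
                return float(line.split("[")[-1].strip(" ]"))
        raise RuntimeError("no probability line in ngram output")

Starting one process per query reloads the LM each time; for bulk queries, batch many lines per invocation (see the escape-line recipe in this thread) or keep the server running.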
From amber.wilcox.ohearn at gmail.com Tue Feb 14 04:54:31 2012 From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn) Date: Tue, 14 Feb 2012 05:54:31 -0700 Subject: [SRILM User List] Question about SRILM and sentence boundary detection In-Reply-To: <4F386970.1040200@icsi.berkeley.edu> References: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de> <4F29A435.5050905@icsi.berkeley.edu> <4F2B2FF3.2070602@icsi.berkeley.edu> <4F386970.1040200@icsi.berkeley.edu> Message-ID: On Sun, Feb 12, 2012 at 6:37 PM, Andreas Stolcke wrote: > From: L. Amber Wilcox-O'Hearn > > Thank you, Andreas. I wasn't aware of these capabilities. > > The server-port worked exactly as expected. That is, if I give it w1 > w2 w3, it returns p(w3|w1w2). Combined with the caching, it looks > very promising for my applications. > > The other solution using -counts (or actually -ppl for my case) also > worked, but of course if I give it w1 w2 w3, it returns the > probability of that whole string, i.e. p(w1) * p(w2|w1) * p(w3|w1w2), > which would be redundant for my purposes. > > That's not correct. ngram -counts will output CONDITIONAL ngram > probabilities. > -counts countsfile: Perform a computation similar to -ppl, but based only on > the N-gram counts found in countsfile. Probabilities are computed for the > last word of each N-gram, using the other words as contexts, and scaling by > the associated N-gram count. Summary statistics are output at the end, as > well as before each escaped input line. So it should do exactly what you > need. I see. I misunderstood the difference between -ppl and -counts. I did try this, and the summary statistics at the end gave the correct sum, but there weren't any statistics output before the escaped lines: > cat testcounts | ngram -lm LM -escape "===" -counts - -unk === === === file -: 0 sentences, 4 words, 0 OOVs 0 zeroprobs, logprob= -9.87606 ppl= 294.452 ppl1= 294.452 Did I miss something? Amber -- http://scholar.google.com/citations?user=15gGywMAAAAJ From stolcke at icsi.berkeley.edu Tue Feb 14 08:41:01 2012 From: stolcke at icsi.berkeley.edu (Andreas Stolcke) Date: Tue, 14 Feb 2012 08:41:01 -0800 Subject: [SRILM User List] Question about SRILM and sentence boundary detection In-Reply-To: References: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de> <4F29A435.5050905@icsi.berkeley.edu> <4F2B2FF3.2070602@icsi.berkeley.edu> <4F386970.1040200@icsi.berkeley.edu> Message-ID: <4F3A8E9D.4080509@icsi.berkeley.edu> On 2/14/2012 4:54 AM, L. Amber Wilcox-O'Hearn wrote: > I see. I misunderstood the difference between -ppl and -counts. > > I did try this, and the summary statistics at the end gave the correct > sum, but there weren't any statistics output before the escaped lines: >> cat testcounts | ngram -lm LM -escape "===" -counts - -unk > === > === > === > file -: 0 sentences, 4 words, 0 OOVs > 0 zeroprobs, logprob= -9.87606 ppl= 294.452 ppl1= 294.452 > > Did I miss something? This is poorly documented. The escape lines trigger output of "sentence level" statistics. At the end, you get the "file level" statistics. However, to be compatible with -ppl, sentence-level stats are only output with -debug 1 or higher. So your example will work as long as you also add -debug 1. Andreas From amber.wilcox.ohearn at gmail.com Tue Feb 14 11:20:13 2012 From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn) Date: Tue, 14 Feb 2012 12:20:13 -0700 Subject: [SRILM User List] Question about SRILM and sentence boundary detection In-Reply-To: <4F3A8E9D.4080509@icsi.berkeley.edu> References: <03BD4FDE7D43CB45BAB27FF8B224A043141FFE5F@exchange-ms2.iais.fraunhofer.de> <4F29A435.5050905@icsi.berkeley.edu> <4F2B2FF3.2070602@icsi.berkeley.edu> <4F386970.1040200@icsi.berkeley.edu> <4F3A8E9D.4080509@icsi.berkeley.edu> Message-ID: On Tue, Feb 14, 2012 at 9:41 AM, Andreas Stolcke wrote: > On 2/14/2012 4:54 AM, L. Amber Wilcox-O'Hearn wrote: >> I see. I misunderstood the difference between -ppl and -counts. >> >> I did try this, and the summary statistics at the end gave the correct >> sum, but there weren't any statistics output before the escaped lines: >>> cat testcounts | ngram -lm LM -escape "===" -counts - -unk >> === >> === >> === >> file -: 0 sentences, 4 words, 0 OOVs >> 0 zeroprobs, logprob= -9.87606 ppl= 294.452 ppl1= 294.452 >> >> Did I miss something? > This is poorly documented. The escape lines trigger output of "sentence > level" statistics. At the end, you get the "file level" statistics. > However, to be compatible with -ppl, sentence-level stats are only output > with -debug 1 or higher. So your example will work as long as you also add > -debug 1. Ah, perfect. Thank you very much! -Amber
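Putting the resolved recipe together, a hypothetical Python batch query using escape lines, -counts, and -debug 1. The LM path is a placeholder, and the parsing assumes each escaped block emits one "... logprob= ..." statistics line, followed by a final file-level summary, as in the output above:

    import subprocess

    def batch_logprobs(ngrams, lm="model-name.lm"):
        # ngrams: list of word tuples; returns log10 P(last | rest) per tuple.
        payload = "".join(" ".join(ng) + " 1\n===\n" for ng in ngrams)
        out = subprocess.run(
            ["ngram", "-lm", lm, "-escape", "===", "-counts", "-", "-debug", "1"],
            input=payload, capture_output=True, text=True, check=True).stdout
        scores = [float(line.split("logprob=")[1].split()[0])
                  for line in out.splitlines() if "logprob=" in line]
        return scores[:len(ngrams)]  # the last logprob line is the file summary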
Andreas

From nobyte at sina.com  Wed Feb 22 00:21:46 2012
From: nobyte at sina.com (huajian xue)
Date: Wed, 22 Feb 2012 16:21:46 +0800
Subject: [SRILM User List] (no subject)
Message-ID: <78cc40$1ic2as5@irxd5-187.sinamail.sina.com.cn>

Hello,

Can the currently released SRILM toolkit be used to build a discriminative
language model?

Thanks,
Xue

From amber.wilcox.ohearn at gmail.com  Fri Feb 24 12:43:52 2012
From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn)
Date: Fri, 24 Feb 2012 13:43:52 -0700
Subject: [SRILM User List] Limiting the vocabulary size of an n-gram model
Message-ID: 

Greetings.

I am constructing a large trigram model using a pre-specified vocabulary
size.  What I have done in the past is to first get the unigram counts, and
then sort the top N most frequent words into my vocabulary file, which I
then pass to ngram-count for computing the trigram counts, which I then
pass again to ngram-count to construct the LM.

However, I seem to remember having read that the count-of-counts estimates
will be better if I compute the trigram counts first, and only limit the
vocabulary in the final step.  Is that correct?  Are there any other
shortcuts for this?

Thank you,
Amber

-- 
http://scholar.google.com/citations?user=15gGywMAAAAJ

From stolcke at icsi.berkeley.edu  Fri Feb 24 13:16:55 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 24 Feb 2012 13:16:55 -0800
Subject: [SRILM User List] Limiting the vocabulary size of an n-gram model
In-Reply-To: 
References: 
Message-ID: <4F47FE47.6080807@icsi.berkeley.edu>

On 2/24/2012 12:43 PM, L. Amber Wilcox-O'Hearn wrote:
> However, I seem to remember having read that the count-of-counts estimates
> will be better if I compute the trigram counts first, and only limit the
> vocabulary in the final step.  Is that correct?  Are there any other
> shortcuts for this?

This is correct.  The make-big-lm script (a wrapper around ngram-count)
will extract the discounting statistics from the full vocabulary and then
apply them to the LM estimation with a limited vocabulary.  Check the
training-scripts(1) man page.
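For example (a sketch -- the file names are placeholders for your own):

  make-big-lm -name biglm -read full-vocab-counts.gz -order 3 \
      -kndiscount -interpolate -vocab top100k.vocab -lm limited.lm

The discounting parameters are estimated from the counts before the -vocab
restriction is applied.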
Andreas

From ariya at jhu.edu  Fri Feb 24 14:35:10 2012
From: ariya at jhu.edu (Ariya Rastrow)
Date: Fri, 24 Feb 2012 17:35:10 -0500
Subject: [SRILM User List] NgramCountLM Bug?
Message-ID: 

Hi,

I had a question about NgramCountLM (the Jelinek-Mercer interpolation
method).  It seems to me there is a bug in the way the \lambda parameters
are being estimated in the code.  The problem is that the expectations for
the \lambda's (using EM) are being collected by iterating through the
N-grams of the held-out text.  However, the count of each N-gram is not
being taken into account (even though, for calculating the log-probability
of the held-out data, the wordProb is multiplied by the count of the
N-gram) during the call to LM::countsProb(...) by NgramCountLM::estimate().
In other words, the statistics for the \lambda's are being collected as if
each event were a singleton in the held-out data.  The fix to this would be
to pass *count from LM::countsProb(...) to NgramCountLM::wordProbTrain(...)
such that the posteriors of the \lambda's get multiplied by that count.

Thanks,
Ariya

From stolcke at icsi.berkeley.edu  Fri Feb 24 19:42:40 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 24 Feb 2012 19:42:40 -0800
Subject: [SRILM User List] NgramCountLM Bug?
In-Reply-To: 
References: 
Message-ID: <4F4858B0.1030207@icsi.berkeley.edu>

On 2/24/2012 2:35 PM, Ariya Rastrow wrote:
> In other words, the statistics for the \lambda's are being collected as if
> each event were a singleton in the held-out data.  The fix to this would be
> to pass *count from LM::countsProb(...) to NgramCountLM::wordProbTrain(...)
> such that the posteriors of the \lambda's get multiplied by that count.

Good catch!  That is indeed a bug.  Attached is a patch that should do the
right thing.
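Schematically, the change is what you describe (the identifier names here
are approximate -- a paraphrase of the fix, not the literal patch):

  /* NgramCountLM::wordProbTrain(), now receiving the held-out ngram
     count: weight the EM statistics by it */
  lambdaCounts[i] += count * posterior;    /* was:  += posterior */

so that an ngram occurring N times in the held-out data contributes N
times to the \lambda posteriors, just as it does to the log-likelihood.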
Andreas

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ngramcountlm.patch
URL: 

From amber.wilcox.ohearn at gmail.com  Sat Feb 25 08:57:35 2012
From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn)
Date: Sat, 25 Feb 2012 09:57:35 -0700
Subject: [SRILM User List] make-big-lm merge-batch-counts mv error?
Message-ID: 

Just a quick follow-up:  I'm now trying to put this all together, but I'm
getting the following error:

[amber]$ make-big-lm -debug 1 -kndiscount3 -unk -name test_lm \
    -read counts_3/merge-iter7-1.ngrams.gz -vocab test.vocab
+ make-kn-counts no_max_order=1 max_per_file=10000000 order=3
  kndiscount1=0 kndiscount2=0 kndiscount3=1 kndiscount4=0 kndiscount5=0
  kndiscount6=0 kndiscount7=0 kndiscount8=0 kndiscount9=0
  output=test_lm.kndir/kncounts
+ merge-batch-counts test_lm.kndir
final counts in
mv: missing destination file operand after `test_lm.kncounts.gz'
Try `mv --help' for more information.

Any ideas about what I'm missing?

Thanks again,
Amber

From amber.wilcox.ohearn at gmail.com  Sun Feb 26 16:24:12 2012
From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn)
Date: Sun, 26 Feb 2012 17:24:12 -0700
Subject: [SRILM User List] make-big-lm merge-batch-counts mv error?
In-Reply-To: 
References: 
Message-ID: 

I finally figured out my error here.  I had passed make-big-lm an order-3
counts file, not an order *up to and including* 3 counts file.  In
response, make-kn-counts silently generated no output, and then there was
no file to mv.

Amber

From stolcke at icsi.berkeley.edu  Sun Feb 26 16:38:44 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Sun, 26 Feb 2012 16:38:44 -0800
Subject: [SRILM User List] make-big-lm merge-batch-counts mv error?
In-Reply-To: 
References: 
Message-ID: <4F4AD094.2050808@icsi.berkeley.edu>

On 2/26/2012 4:24 PM, L. Amber Wilcox-O'Hearn wrote:
> I finally figured out my error here.  I had passed make-big-lm an order-3
> counts file, not an order *up to and including* 3 counts file.  In
> response, make-kn-counts silently generated no output, and then there was
> no file to mv.

Good.  Also, you want to use -kndiscount, not -kndiscount3.  With the
latter, you would only apply KN discounting to trigrams, but that doesn't
really make sense, since KN discounting relies on modifying the lower-order
ngram distributions.

Andreas

From dmytro.prylipko at ovgu.de  Mon Feb 27 05:45:12 2012
From: dmytro.prylipko at ovgu.de (Dmytro Prylipko)
Date: Mon, 27 Feb 2012 14:45:12 +0100
Subject: [SRILM User List] Observed omit event
Message-ID: 

Hi,

I would like to clarify how to properly evaluate a language model with an
observed hidden event (), omitted from context.

I have manually created the counts file, where this event had been skipped
from the context, and have built an LM from that.  Also, I have added this
line to the end of the LM file:

  -observed -omit

My question is whether it is necessary to specify a hidden vocabulary with
the -hidden-vocab option.  Which command line is correct:

ngram -lm 3-gram.omit.lm -ppl test.txt -order 3 -vocab wlist -hidden-vocab df.defs

or just

ngram -lm 3-gram.omit.lm -ppl test.txt -order 3 -vocab wlist

Thanks.

Yours,
Dmytro Prylipko.

From amber.wilcox.ohearn at gmail.com  Mon Feb 27 17:10:42 2012
From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn)
Date: Mon, 27 Feb 2012 18:10:42 -0700
Subject: [SRILM User List] make-big-lm merge-batch-counts mv error?
In-Reply-To: <4F4AD094.2050808@icsi.berkeley.edu>
References: <4F4AD094.2050808@icsi.berkeley.edu>
Message-ID: 

On Sun, Feb 26, 2012 at 5:38 PM, Andreas Stolcke wrote:
> Good.  Also, you want to use -kndiscount, not -kndiscount3.  With the
> latter, you would only apply KN discounting to trigrams, but that doesn't
> really make sense, since KN discounting relies on modifying the lower-order
> ngram distributions.

Oh, wow.  Thanks for pointing that out!

Amber

From chenmengdx at gmail.com  Sun Mar 4 16:56:01 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Mon, 5 Mar 2012 08:56:01 +0800
Subject: [SRILM User List] How to process the disfluency words when building LM
Message-ID: 

Hello, I tried to make a language model from some non-native spontaneous
speech transcriptions.  However, there are lots of "strange words" in the
corpus, because the transcriber tried to transcribe as close as possible to
the real pronunciation.
For example, some transcriptions are as follows:

she taught english there and she gave english lesson to a secondary school students in *boli bolivi bolivia*
*er* what's wrong *er* he asked she asked
her her mother would *em er* her she took her mother in her own house and the baby *em* *moven bester*

So I want to ask how I should process these "strange words" that don't
exist, such as boli, bolivi, er, em, moven, bester, etc.  If I replace them
with the correct words, the language model will be unsuitable for the
non-native spontaneous speech task.  If I keep them, their counts and
probabilities are too small, and the dictionary is also hard to generate.

Are there any suggestions?

Thanks!

From reham.imamu at gmail.com  Mon Mar 5 07:17:15 2012
From: reham.imamu at gmail.com (Reham Al-Majed)
Date: Mon, 5 Mar 2012 18:17:15 +0300
Subject: [SRILM User List] disambig with Class-based n gram
In-Reply-To: 
References: 
Message-ID: 

Hello,

I've built a class-based n-gram by:

1- defining my classes
2- using replace-words-with-classes
3- using ngram-count to estimate the LM

I want to use this class-based n-gram model with the disambig tool.  The
options (-factored and -count-lm) interpret the LMs as factored and
count-based LMs ... what about class-based?  How do I tell disambig to
interpret the LM as class-based?

I'm trying to use my class-based LM as an ordinary n-gram model, however
the output for a sample test seems strange ... words in the test sample
are always disambiguated using the last word in the mapping file!

Actually, I want the words to be disambiguated using the LM probabilities
only, without considering the probabilities in the mapping file.  I use
the options -lmw 1 and -mapw 0 but the output is still the same ...

In short, my questions are:

1- Is it possible to use a class-based n-gram with the disambig tool?  Or
should I build my own disambiguator using the output of the ngram tool?

2- How do I make the disambig tool use the probabilities of the LM ONLY?

Your help is really greatly appreciated ...

Thanks in Advance,
Reham

From stolcke at icsi.berkeley.edu  Mon Mar 5 10:09:32 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 05 Mar 2012 10:09:32 -0800
Subject: [SRILM User List] disambig with Class-based n gram
In-Reply-To: 
References: 
Message-ID: <4F55015C.60301@icsi.berkeley.edu>

On 3/5/2012 7:17 AM, Reham Al-Majed wrote:
> 1- Is it possible to use a class-based n-gram with the disambig tool?  Or
> should I build my own disambiguator using the output of the ngram tool?

Unfortunately, disambig currently does not support the use of class-based
ngram LMs (what is implemented by ngram -classes).  Two workarounds are

1) if feasible, expand the class-ngram LM into a word-ngram LM (using
ngram -expand-classes -- a sketch follows below).

2) rewrite the class-ngram as a factored LM.  This will require some
investment into understanding the much more general FLM mechanism.
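For 1), the command would look something like this (an untested sketch --
the file names are placeholders):

  ngram -vocab wlist -classes your.classes -lm class.lm -order 3 \
      -expand-classes 3 -write-lm word.lm

The expanded word.lm is then a plain word-ngram model that disambig can
read with its -lm option.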
> 2- How do I make the disambig tool use the probabilities of the LM ONLY?

disambig -mapw 0 will do that.

Andreas

From stolcke at icsi.berkeley.edu  Tue Mar 6 12:34:20 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 06 Mar 2012 12:34:20 -0800
Subject: [SRILM User List] Observed omit event
In-Reply-To: <4F4BB9E4.7070403@icsi.berkeley.edu>
References: <4F4BB9E4.7070403@icsi.berkeley.edu>
Message-ID: <4F5674CC.1060300@icsi.berkeley.edu>

The attached source patch will fix the behavior of ngram -hidden-vocab so
that the vocab file can contain event property specifications as described
in the man page.  Previously, only the names of the hidden event words
were read from that file, and all were treated as default hidden events.

The patch also fixes a couple of unrelated bugs in HiddenNgram.cc.
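With the patch, the properties can go directly into the -hidden-vocab
file.  For example, a df.defs along these lines (the event token <df> is a
placeholder -- use whatever event name your LM actually contains):

  <df> -observed -omit

used with your original command:

  ngram -lm 3-gram.omit.lm -ppl test.txt -order 3 -vocab wlist -hidden-vocab df.defs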
Andreas

On 2/27/2012 9:14 AM, Andreas Stolcke wrote:
> On 2/27/2012 5:45 AM, Dmytro Prylipko wrote:
>> My question is whether it is necessary to specify a hidden vocabulary
>> with the -hidden-vocab option.  Which command line is correct:
>>
>> ngram -lm 3-gram.omit.lm -ppl test.txt -order 3 -vocab wlist -hidden-vocab df.defs
>>
>> or just
>>
>> ngram -lm 3-gram.omit.lm -ppl test.txt -order 3 -vocab wlist
>
> If you append the hidden vocab definitions to the LM file, you only need
> to tell ngram that it IS a hidden-event LM that you're reading.  You can
> achieve that by adding -hidden-vocab /dev/null .
>
> Andreas

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: hidden-ngram.patch
URL: 

From reham.imamu at gmail.com  Tue Mar 6 13:11:06 2012
From: reham.imamu at gmail.com (Reham Al-Majed)
Date: Wed, 7 Mar 2012 00:11:06 +0300
Subject: [SRILM User List] disambig with FLM
Message-ID: 

Thanks a lot for your reply,

I'm trying to build an FLM with the following FLM specification file:

## normal trigram LM
1

W : 2 W(-1) W(-2) FLMCount.count FLMLM.lm 3
	W1,W2 W2 wbdiscount interpolate
	W1 W1 wbdiscount interpolate
	0 0 wbdiscount

I generate my FLM model using the following command:

fngram-count -factor-file FLMDes -debug 2 -text TrainFLM -lm FLMLM.lm
    -write-counts FLMcount.count -no-virtual-begin-sentence -nonull

It runs without errors.  I then measure the ppl of the generated FLM with
the following command:

fngram -factor-file FLMDes -debug 2 -ppl FLMTest -nonull

Unfortunately, when I tried to test the main step I got an error :(
I searched the mailing list archive but I didn't find a similar problem.

The command I used to test disambig with my FLM model was:

disambig -text FLMTest -map 3.map -factored -lm FLMLM.lm

The output of this command was:

No known factors found in Aa
No known factors found in AA
No known factors found in aa
No known factors found in Bb
No known factors found in bb
No known factors found in BB
No known factors found in CC
No known factors found in cc
No known factors found in Cc
FLMLM.lm: line 2: Error: couldn't form int for number of factored LMs in
when reading FLM spec file

I don't know what is meant by "No known factors found in ......", and I
wonder about the error message "couldn't form int for number of factored
LMs when reading FLM spec file" .... As you can see above in my FLM
specification file, I did specify the number of FLM specifications!

Some notes that may help you solve my problem:

-- I built my model to test disambig with FLM before using it in my
project, so it was built with training data of only 28 sentences,
138 words.

-- The mapping file (named 3.map) used to test disambig was:

W-aa	Aa 0.5 AA 0.4 aa 0.1
W-bb	Bb 0.6 bb 0.1 BB 0.3
W-cc	CC 0.7 cc 0.1 Cc 0.2

-- The FLMTest file contains only one sentence:

W-aa W-bb W-cc

Am I doing something wrong?

Your help and support is really greatly appreciated ..  I have a
graduation project that needs a disambiguator for a highly inflected
language, and I'm worried that I won't be able to use your disambig
program with an FLM model :(

Best Regards,
Reham

On 5 March 2012 21:09, Andreas Stolcke wrote:
> Unfortunately, disambig currently does not support the use of class-based
> ngram LMs (what is implemented by ngram -classes).  Two workarounds are
> 1) if feasible, expand the class-ngram LM into a word-ngram LM (using
> ngram -expand-classes).
> 2) rewrite the class-ngram as a factored LM.  This will require some
> investment into understanding the much more general FLM mechanism.
>
> Andreas
From vinay.amnsit at gmail.com  Thu Mar 8 22:39:22 2012
From: vinay.amnsit at gmail.com (Vinay Shashidhar)
Date: Fri, 9 Mar 2012 12:09:22 +0530
Subject: [SRILM User List] Posterior Probability : HTK
Message-ID: 

Hi Guys,

I have read a lot of papers describing posterior probabilities as more
robust, speaker-independent features, but how does one calculate them?

I am using HTK and am doing forced alignment.  All I get is the likelihood
scores.

Thanks.  Looking forward to your help!

regards
Vinay

From stolcke at icsi.berkeley.edu  Fri Mar 9 16:02:07 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 09 Mar 2012 16:02:07 -0800
Subject: [SRILM User List] Posterior Probability : HTK
In-Reply-To: 
References: 
Message-ID: <4F5A99FF.4040105@icsi.berkeley.edu>

The first step is to compute posterior probabilities for arcs and nodes in
your lattice, using the forward-backward algorithm.  The posterior
probability is the sum of the scores of all paths going through an
arc/node, normalized by the sum over all paths through the lattice.  This
is implemented by the lattice-tool -write-posteriors option (the output
format is different from HTK format, though).  It is important to scale
the combined acoustic/language model scores; check the -posterior-scale
option.

Often one wants posterior probabilities at the word level, combining all
word hypotheses that occur at the same "position" in the lattice.  For
this you can build a word confusion network, or "word mesh".  This is done
by the lattice-tool -write-mesh option.

For an introduction to these concepts you might want to check the article
http://www.speech.sri.com/cgi-bin/run-distill?ftp:papers/CSL2000-consensus.ps.gz,
but note that the confusion network algorithm in SRILM is not the same as
described in there.
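A sketch of the command-line usage (untested; the lattice name and scale
value are placeholders -- see the lattice-tool(1) man page for the exact
argument conventions):

  lattice-tool -in-lattice lat.slf -read-htk -posterior-scale 15 \
      -write-posteriors lat.post

or, for word-level posteriors via a confusion network:

  lattice-tool -in-lattice lat.slf -read-htk -posterior-scale 15 \
      -write-mesh lat.mesh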
Andreas

From stolcke at icsi.berkeley.edu  Fri Mar 9 16:28:17 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Fri, 09 Mar 2012 16:28:17 -0800
Subject: [SRILM User List] How to process the disfluency words when building LM
In-Reply-To: 
References: 
Message-ID: <4F5AA021.4080408@icsi.berkeley.edu>

On 3/4/2012 4:56 PM, Meng Chen wrote:
> Hello, I tried to make a language model from some non-native spontaneous
> speech transcriptions.  However, there are lots of "strange words" in the
> corpus, because the transcriber tried to transcribe as close as possible
> to the real pronunciation.

First, such words are not strange at all, and occur even for native
speakers when speaking spontaneously.  "er" and "em" are called "filled
pauses", and "boli" etc. "word fragments".  Both are associated with a
more general class of spontaneous speech phenomena called "disfluencies".
For an overview see
http://www.speech.sri.com/cgi-bin/run-distill?papers/icslp96-dfs-swb.ps.gz .

> So I want to ask how I should process these "strange words" that don't
> exist, such as boli, bolivi, er, em, moven, bester, etc.

Filled pauses are usually modeled as any other words, though you might
normalize their spellings.  There are usually just two forms, with and
without nasal (usually spelled "um" and "uh", respectively).  You should
normalize alternative spellings like "ah", "eh", "er", etc. and map them
to the standard form to avoid fragmenting your data.  Often people use a
dedicated vowel phone for pronunciations of these words, because they are
more variable in quality and duration than the standard schwa phone.

Fragments, especially short ones, are hard to recognize because they are
very confusable.  First, you should use a spelling convention that
distinguishes them from full words, usually with a final hyphen, e.g.,
"boli-".  For LM training purposes you might want to delete them entirely,
and represent them with a garbage model in acoustic training, to avoid
contaminating the models for regular words.  At SRI we tried modeling the
most frequent word fragments in the AM and LM, but even those (especially
because they tend to have just one or two phones) are not recognized well,
and removing them from the LM was best for overall word recognition
accuracy.

Andreas

From rico.sennrich at gmx.ch  Mon Mar 12 06:10:00 2012
From: rico.sennrich at gmx.ch (Rico Sennrich)
Date: Mon, 12 Mar 2012 14:10:00 +0100
Subject: [SRILM User List] nan in language model
Message-ID: <1331557800.12711.25.camel@rico-work>

Hi list,

Occasionally, I get 'nan' as a probability or backoff weight in LMs
trained with SRILM.  This is not expected in an ARPA file, and it
eventually leads to crashes / undefined behaviour in other programs that
use the model.

Here are some statistics:

\data\
ngram 1=2054819
ngram 2=40441708
ngram 3=187680929
ngram 4=382878635
ngram 5=519867931

probability nan:
1 0
2 0
3 0
4 0
5 1233183

backoff nan:
1 0
2 0
3 0
4 415865
5 0

Here are the training parameters:

make-batch-counts file-list.txt 10 cat /wrk/smt/tmp -order 5

make-big-lm -kndiscount -interpolate -order 5 -read \
tmp/file-list.txt-1.ngrams.gz -unk -lm hugelm.gz

This happened with SRILM 1.5.9 and 1.6.0-beta, and stderr didn't show any
errors/warnings.

best wishes,
Rico

From stolcke at icsi.berkeley.edu  Mon Mar 12 09:33:15 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Mon, 12 Mar 2012 09:33:15 -0700
Subject: [SRILM User List] nan in language model
In-Reply-To: <1331557800.12711.25.camel@rico-work>
References: <1331557800.12711.25.camel@rico-work>
Message-ID: <4F5E254B.9040103@icsi.berkeley.edu>

On 3/12/2012 6:10 AM, Rico Sennrich wrote:
> Occasionally, I get 'nan' as a probability or backoff weight in LMs
> trained with SRILM.  This is not expected in an ARPA file, and it
> eventually leads to crashes / undefined behaviour in other programs that
> use the model.

It's certainly not supposed to happen.  In your case it looks like 5-grams
end up with nan probabilities, which would then lead to BOWs also being
computed as NaNs.  I have never seen this, actually.
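(To pull out the offending entries for inspection, something like this
one-liner should work -- a sketch, adjust the file name:

  gzip -dcf hugelm.gz | awk '$1 == "nan" || $NF == "nan"' | head

since in the ARPA format the probability is the first field on each ngram
line, and the backoff weight, when present, is the last.)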
It would help to try a few things:

- see if it only happens with -kndiscount.
- try to elicit the problem with a smaller amount of input data (e.g.,
  including only the ngrams that have the NaNs in the probabilities).
- see if those ngram counts have any special properties.

Andreas

From john at dowding.net  Tue Mar 13 19:34:29 2012
From: john at dowding.net (John Dowding)
Date: Tue, 13 Mar 2012 19:34:29 -0700
Subject: [SRILM User List] distance between two language models
Message-ID: <01b701cd018a$fca63cc0$f5f2b640$@dowding.net>

Hi,

I have an application where I need to create LMs for a large number of
categories of text (thousands).  I'd like to be able to combine the LMs in
cases where two (or more) categories are sufficiently similar.

Does SRILM provide a way to compute the distance between two LMs?  Is
there another approach I should consider?

Thanks
John

From amber.wilcox.ohearn at gmail.com  Wed Mar 14 06:45:17 2012
From: amber.wilcox.ohearn at gmail.com (L. Amber Wilcox-O'Hearn)
Date: Wed, 14 Mar 2012 07:45:17 -0600
Subject: [SRILM User List] distance between two language models
In-Reply-To: <01b701cd018a$fca63cc0$f5f2b640$@dowding.net>
References: <01b701cd018a$fca63cc0$f5f2b640$@dowding.net>
Message-ID: 

On Tue, Mar 13, 2012 at 8:34 PM, John Dowding wrote:
> Does SRILM provide a way to compute the distance between two LMs?  Is
> there another approach I should consider?

I would use KL divergence, or a related measure.

-- 
http://scholar.google.com/citations?user=15gGywMAAAAJ

From stolcke at icsi.berkeley.edu  Wed Mar 14 09:34:23 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Wed, 14 Mar 2012 09:34:23 -0700
Subject: [SRILM User List] distance between two language models
In-Reply-To: 
References: <01b701cd018a$fca63cc0$f5f2b640$@dowding.net>
Message-ID: <4F60C88F.6020003@icsi.berkeley.edu>

On 3/14/2012 6:45 AM, L. Amber Wilcox-O'Hearn wrote:
> On Tue, Mar 13, 2012 at 8:34 PM, John Dowding wrote:
>> Does SRILM provide a way to compute the distance between two LMs?  Is
>> there another approach I should consider?
> I would use KL divergence, or a related measure.

Exactly, but computing the KL divergence between two ngram models exactly
would require some work.  You'd have to iterate over all ngrams occurring
in either model (including those handled by backoff) and sum up
p1(w,h) log p2(w|h).

Of course, an empirical estimate of the KL divergence is easy: to estimate
the cross-entropy, you just run ngram -ppl on a sample of the source for
model 2, computing probabilities using model 1.
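Concretely (a sketch, with sample2.txt standing in for held-out text from
the source of model 2):

  ngram -lm model1.lm -ppl sample2.txt
  ngram -lm model2.lm -ppl sample2.txt

The difference between the two per-word log probabilities is then an
empirical estimate of the extra cost of using model 1 in place of model 2
on that source -- i.e., of the KL divergence, up to the approximation that
model 2 matches the source.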
Andreas

From songbaoqiang at gmail.com  Tue Mar 27 03:10:56 2012
From: songbaoqiang at gmail.com (vincent sung)
Date: Tue, 27 Mar 2012 18:10:56 +0800
Subject: [SRILM User List] From China
Message-ID: 

I need help from SRI International.  I want to use SRI technology to build
a project for Chinese speakers to learn English.  Can anyone tell me whom
I should contact?  Thanks for your patience.  I'm in Beijing.

I have a Business Plan, and if anyone is interested I will share my BP
with you.

My Email: songbaoqiang at gmail.com
Skype: songbaoqiang

From stolcke at icsi.berkeley.edu  Tue Mar 27 07:29:19 2012
From: stolcke at icsi.berkeley.edu (Andreas Stolcke)
Date: Tue, 27 Mar 2012 07:29:19 -0700
Subject: [SRILM User List] From China
In-Reply-To: 
References: 
Message-ID: <4F71CEBF.9060308@icsi.berkeley.edu>

This list is not appropriate for this type of inquiry.  You probably want
to try http://www.eduspeak.com/utils/contact.php .

Andreas

From chenmengdx at gmail.com  Sat Mar 31 20:00:35 2012
From: chenmengdx at gmail.com (Meng Chen)
Date: Sun, 1 Apr 2012 11:00:35 +0800
Subject: [SRILM User List] Question of replace-words-with-classes
Message-ID: 

Hi, I ran into a problem when training a class-based language model with
the replace-words-with-classes command.  My commands are as follows:

- ngram-class -vocab wlist -text training_set -numclasses 200 -incremental
  -classes output.classes

- replace-words-with-classes classes=output.classes training_set >
  training_set_classes

After these two steps, I found that there are both words and classes in
training_set_classes.  These words are OOVs with respect to wlist, and I
don't need them at all.  Shouldn't these words belong in CLASS-00001?

So I want to ask how to handle this situation.  Does SRILM provide a
script to map these OOVs to CLASS-00001, or do I need to write one myself?

Thanks!

Meng Chen