  Speech Technology and Research Laboratory

Speaker Recognition and TalkPrinting

Project Summary

Standard approaches to automatic speaker recognition use spectrum-related features based on very short time slices of speech. Models based on such information lack robustness to channel mismatches and fail to capture longer-range characteristics of how a person talks, including the speaker's word patterns and patterns in speech prosody (the timing, pausing, and intonation of speech). The goal of our project is to discover "TalkPrint" features, that is, features that capture these habitual variations in speaking style, and to model them in conjunction with standard features to improve automatic speaker recognition.

One core technical challenge in this work is to design long-range features (which by definition occur less frequently than very short-range features) that provide robust additional information even for short (e.g., 30-second) training and test spurts of speech. A second crucial challenge is to develop methods for feature selection and for model combination at the feature level that can cope with large numbers of interrelated features, odd feature space distributions, inherently missing features (such as pitch when a person is not voicing), and heterogeneous feature types. A third issue is how to employ TalkPrint features successfully for a new language and across languages, since traditional speech recognition and the features derived from it are inherently language-dependent. We are also investigating how to make TalkPrinting systems more robust to channel variability.
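To make the missing-feature problem concrete, here is a minimal sketch in Python. It is an illustration only, not the modeling used in the project. It assumes a hypothetical per-frame pitch track in which unvoiced frames are marked as None, and it computes a few region-level summary values while keeping "undefined" distinct from zero. The function name pitch_summary and the chosen statistics are assumptions made for this example.

```python
import math
from typing import Optional, Sequence


def pitch_summary(f0_hz: Sequence[Optional[float]]) -> dict:
    """Summarize a per-frame pitch track; None marks an unvoiced frame.

    Hypothetical helper for illustration: the statistics chosen here are
    assumptions, not the features used in the work described above.
    """
    # Keep only frames where pitch is actually defined.
    voiced = [f for f in f0_hz if f is not None and f > 0]
    if not voiced:
        # No voiced frames: pitch statistics are undefined, not zero,
        # so report them as missing rather than inventing a value.
        return {"voiced_fraction": 0.0, "mean_log_f0": None, "log_f0_range": None}
    logs = [math.log(f) for f in voiced]
    return {
        "voiced_fraction": len(voiced) / len(f0_hz),
        "mean_log_f0": sum(logs) / len(logs),
        "log_f0_range": max(logs) - min(logs),
    }


# Example: two voiced frames out of four.
summary = pitch_summary([None, 100.0, 200.0, None])
```

A downstream model then has to decide how to treat the None entries (for example, by modeling "missing" as its own category), which is one reason heterogeneous and partially missing features complicate model combination.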

Recently we developed a series of novel techniques for speaker modeling, in both the stylistic and the acoustic realm, as well as a new method for model combination. Many of these techniques leverage a tight integration with our prosody modeling and large-vocabulary speech recognition efforts. We have evaluated our techniques against state-of-the-art speaker recognition systems in the annual NIST Speaker Recognition Evaluations, with excellent results. A web-based overview of SRI's speaker verification process, modeled after the NIST task, is also available.

Investigators

Murat Akbacak
Harry Bratt
Lukas Burget
Luciana Ferrer
Yun Lei
Martin Graciarena
Nicolas Scheffer
Elizabeth Shriberg
Andreas Stolcke

Publications

A. Stolcke, M. Graciarena, & L. Ferrer (2012), Effects of Audio and ASR Quality on Cepstral and High-level Speaker Verification Systems, Proc. Odyssey Speaker and Language Recognition Workshop, pp. 298-303, Singapore.

A. Stolcke, A. Mandal, & E. Shriberg (2012), Speaker Recognition With Region-Constrained MLLR Transforms, Proc. IEEE ICASSP, pp. 4397-4400, Kyoto.

N. Scheffer, Y. Lei, & L. Ferrer (2011), Factor analysis back ends for MLLR transforms in speaker recognition, Proc. Interspeech, pp. 257-260, Florence.

M. Kockmann, L. Ferrer, L. Burget, & J. H. Cernocky (2011), iVector fusion of prosodic and cepstral features for speaker verification, Proc. Interspeech, pp. 265-268, Florence.

M. H. Sanchez, L. Ferrer, E. Shriberg, & A. Stolcke (2011), Constrained Cepstral Speaker Recognition Using Matched UBM and JFA Training, Proc. Interspeech, pp. 737-740, Florence.

M. Akbacak, D. Vergyri, A. Stolcke, N. Scheffer, & A. Mandal (2011), Effective Arabic Dialect Classification Using Diverse Phonotactic Models, Proc. Interspeech, pp. 737-740, Florence. (PDF)

N. Scheffer, L. Ferrer, M. Graciarena, S. Kajarekar, E. Shriberg & A. Stolcke (2011), The SRI NIST 2010 Speaker Recognition Evaluation System, Proc. IEEE ICASSP, pp. 5292-5295, Prague.

E. Shriberg & A. Stolcke (2011), Language-independent constrained cepstral features for speaker recognition, Proc. IEEE ICASSP, pp. 5296-5299, Prague.

M. Kockmann, L. Ferrer, L. Burget, E. Shriberg, & J. Cernocky (2011), Recent Progress in Prosodic Speaker Verification, Proc. IEEE ICASSP, pp. 4556-4559, Prague.

M. Graciarena, M. Delplanche, E. Shriberg & A. Stolcke (2011), Bird Species Recognition Combining Acoustic and Sequence Modeling, Proc. IEEE ICASSP, pp. 341-344, Prague.

A. Stolcke, M. Akbacak, L. Ferrer, S. Kajarekar, C. Richey, N. Scheffer, & E. Shriberg (2010), Improving Language Recognition with Multilingual Phone Recognition and Speaker Adaptation Transforms, Proc. Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, pp. 256-262. (PDF)

L. Ferrer, N. Scheffer, & E. Shriberg (2010), A Comparison of Approaches for Modeling Prosodic Features in Speaker Recognition, Proc. IEEE ICASSP, Dallas, Texas, pp. 4414-4417. (PDF)

S. S. Kajarekar (2010), Across-phone Variability and Diagonal Term in Joint Factor Analysis, Proc. IEEE ICASSP, Dallas, Texas, pp. 4406-4409.

N. Scheffer & R. Vogt (2010), On the Use of Speaker Superfactors for Speaker Recognition, Proc. IEEE ICASSP, Dallas, Texas, pp. 4410-4413.

L. Ferrer, K. Sonmez, & E. Shriberg (2009). An anticorrelation kernel for subsystem training in multiple classifier systems, Journal of Machine Learning Research, Vol. 10, pp. 2079-2114.

E. Shriberg, S. Kajarekar, N. Scheffer (2009). Does Session Variability Compensation in Speaker Recognition Model Intrinsic Variation Under Mismatched Conditions?, Proc. Interspeech, Brighton, UK, pp. 1551-1554.

M. Graciarena, T. Bocklet, E. Shriberg, A. Stolcke, S. Kajarekar (2009). Feature-Based and Channel-Based Analyses of Intrinsic Variability in Speaker Verification, Proc. Interspeech, Brighton, UK, pp. 2015-2018.

S. S. Kajarekar, N. Scheffer, M. Graciarena, E. Shriberg, A. Stolcke, L. Ferrer, & T. Bocklet (2009), The SRI NIST 2008 Speaker Recognition Evaluation System, Proc. IEEE ICASSP, Taipei, pp. 4205-4208. (PDF)

T. Bocklet and E. Shriberg (2009), Speaker Recognition Using Syllable-Based Constraints for Cepstral Frame Selection, Proc. ICASSP, Taipei, Taiwan, pp. 4525-4528.

S. S. Kajarekar, L. Ferrer, A. Stolcke, & E. Shriberg (2008), Voice-Based Speaker Recognition Combining Acoustic and Stylistic Features, in N. K. Ratha & V. Govindaraju (eds.), Advances in Biometrics: Sensors, Algorithms and Systems, pp. 183-201, Springer, London. (Abstract, PDF)

E. Shriberg, M. Graciarena, H. Bratt, A. Kathol, S. Kajarekar, H. Jameel, C. Richey, & F. Goodman (2008), Effects of Vocal Effort and Speaking Style on Text-Independent Speaker Verification, Proc. Interspeech, pp. 609-612, Brisbane, Australia.

L. Ferrer (2008), Modeling Prior Belief for Speaker Verification SVM Systems, Proc. Interspeech, pp. 1385-1388, Brisbane, Australia. (PDF)

E. Shriberg & A. Stolcke (2008), The Case for Automatic Higher-Level Features in Forensic Speaker Recognition, Proc. Interspeech, pp. 1509-1512, Brisbane, Australia. (PDF)

S. S. Kajarekar (2008), Phone-based Cepstral Polynomial SVM System for Speaker Recognition, Proc. Interspeech, pp. 845-848, Brisbane, Australia. (PDF)

L. Ferrer, M. Graciarena, A. Zymnis, & E. Shriberg (2008), System Combination Using Auxiliary Information for Speaker Verification, Proc. IEEE ICASSP, pp. 4853-4857, Las Vegas. (PDF)

A. Stolcke, S. Kajarekar, & L. Ferrer (2008), Nonparametric Feature Normalization for SVM-based Speaker Verification, Proc. IEEE ICASSP, pp. 1577-1580, Las Vegas. (PDF)

E. Shriberg, L. Ferrer, S. Kajarekar, N. Scheffer, A. Stolcke, & M. Akbacak (2008), Detecting Nonnative Speech Using Speaker Recognition Approaches. Proc. Odyssey Speaker and Language Recognition Workshop, Stellenbosch, South Africa.

L. Ferrer, K. Sonmez, & E. Shriberg (2008), An Anticorrelation Kernel for Improved System Combination in Speaker Verification. Proc. Odyssey Speaker and Language Recognition Workshop, Stellenbosch, South Africa. (PDF)

A. Stolcke & S. Kajarekar (2008), Recognizing Arabic Speakers with English Phones. Proc. Odyssey Speaker and Language Recognition Workshop, Stellenbosch, South Africa. (PDF)

A. Stolcke, S. Kajarekar, L. Ferrer, & E. Shriberg (2007), Speaker Recognition with Session Variability Normalization Based on MLLR Adaptation Transforms, IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 1987-1998. Special issue on speaker and language recognition. (PDF, abstract)

E. Shriberg & L. Ferrer (2007), A Text-Constrained Prosodic System for Speaker Verification, Proc. Interspeech/Eurospeech, pp. 1226-1229, Antwerp. (PDF)

L. Ferrer, K. Sonmez, and E. Shriberg (2007), A Smoothing Kernel for Spatially Related Features and Its Application to Speaker Verification, Proc. Interspeech/Eurospeech, pp. 738-741, Antwerp. (PDF)

G. Tur, E. Shriberg, A. Stolcke, & S. Kajarekar (2007), Duration and Pronunciation Conditioned Lexical Modeling for Speaker Verification, Proc. Interspeech/Eurospeech, pp. 2049-2052, Antwerp. (PDF)

S. Kajarekar & A. Stolcke (2007), NAP and WCCN: Comparison of Approaches Using MLLR-SVM Speaker Verification System, Proc. IEEE ICASSP, vol. 4, pp. 249-252, Honolulu, Hawaii. (PDF)

L. Ferrer, E. Shriberg, S. Kajarekar, & K. Sonmez (2007), Parameterization of Prosodic Feature Distributions for SVM Modeling in Speaker Recognition, Proc. IEEE ICASSP, vol. 4, pp. 233-236, Honolulu, Hawaii. (PDF)

M. Graciarena, S. Kajarekar, A. Stolcke, E. Shriberg (2007), Noise Robust Speaker Identification for Spontaneous Arabic Speech, Proc. IEEE ICASSP, vol. 4, pp. 245-248, Honolulu, Hawaii. (PDF)

A. Stolcke, E. Shriberg, L. Ferrer, S. Kajarekar, K. Sonmez, & G. Tur (2007), Speech Recognition as Feature Extraction for Speaker Recognition, Proc. SAFE 2007: Workshop on Signal Processing Applications for Public Security and Forensics, pp. 39-43, Washington, D.C. (PDF)

A. O. Hatch, S. Kajarekar, & A. Stolcke (2006), Within-Class Covariance Normalization for SVM-based Speaker Recognition. Proc. ICSLP, pp. 1471-1474, Pittsburgh. (PDF)

S. S. Kajarekar, H. Bratt, E. Shriberg, & R. de Leon (2006), A Study of Intentional Voice Modifications for Evading Automatic Speaker Recognition. Proc. IEEE Odyssey 2006 Speaker and Language Recognition Workshop, San Juan, Puerto Rico. (PDF)

A. Stolcke, L. Ferrer, & S. Kajarekar (2006), Improvements in MLLR-Transform-based Speaker Recognition. Proc. IEEE Odyssey 2006 Speaker and Language Recognition Workshop, pp. 1-6, San Juan, Puerto Rico. (PDF)

L. Ferrer, E. Shriberg, S. S. Kajarekar, A. Stolcke, K. Sonmez, A. Venkataraman, & H. Bratt (2006), The Contribution of Cepstral and Stylistic Features to SRI's 2005 NIST Speaker Recognition Evaluation System. Proc. IEEE ICASSP, vol. 1, pp. 101-104, Toulouse. (PDF)

A. O. Hatch and A. Stolcke (2006), Generalized Linear Kernels for One-Versus-All Classification: Application to Speaker Recognition. Proc. IEEE ICASSP, vol. 5, pp. 585-588, Toulouse. (PDF)

S. S. Kajarekar (2005), Four Weightings and a Fusion: A Cepstral-SVM System for Speaker Recognition. Proc. IEEE Speech Recognition and Understanding Workshop, pp. 17-22, San Juan, Puerto Rico. (PDF)

A. O. Hatch, A. Stolcke, & B. Peskin (2005), Combining Feature Sets with Support Vector Machines: Application to Speaker Recognition. Proc. IEEE Speech Recognition and Understanding Workshop, pp. 75-79, San Juan, Puerto Rico. (PDF)

E. Shriberg, L. Ferrer, S. Kajarekar, A. Venkataraman, & A. Stolcke (2005), Modeling Prosodic Feature Sequences for Speaker Recognition. Speech Communication 46(3-4), 455-472. Special issue on Quantitative Prosody Modelling for Natural Speech Description and Generation.

A. Stolcke, L. Ferrer, S. Kajarekar, E. Shriberg, & A. Venkataraman (2005), MLLR Transforms as Features in Speaker Recognition. Proc. Eurospeech, Lisbon, pp. 2425-2428. (PDF)

L. Ferrer, K. Sonmez, & S. Kajarekar (2005), Class-dependent Score Combination for Speaker Recognition. Proc. Eurospeech, Lisbon, pp. 2173-2176. (PDF)

S. S. Kajarekar, L. Ferrer, E. Shriberg, K. Sonmez, A. Stolcke, A. Venkataraman, & J. Zheng (2005), SRI's 2004 NIST Speaker Recognition Evaluation System, Proc. IEEE ICASSP, Philadelphia, vol. 1, pp. 173-176. (PDF)

A. O. Hatch, B. Peskin, & A. Stolcke (2005), Improved Phonetic Speaker Recognition Using Lattice Decoding, Proc. IEEE ICASSP, Philadelphia, vol. 1, pp. 169-172. (PDF)

E. Shriberg, L. Ferrer, A. Venkataraman, & S. Kajarekar (2004), SVM Modeling of "SNERF-Grams" for Speaker Recognition. Proc. Intl. Conf. on Spoken Language Processing, pp. 1409-1412, Jeju, Korea. (PDF)

S. Kajarekar, L. Ferrer, K. Sonmez, J. Zheng, E. Shriberg, & A. Stolcke (2004), Modeling NERFs for Speaker Recognition. Proc. Odyssey 04 Speaker and Language Recognition Workshop, pp. 51-56, Toledo, Spain. (PDF)

S. Kajarekar, L. Ferrer, A. Venkataraman, K. Sonmez, E. Shriberg, A. Stolcke, & R. R. Gadde (2003), Speaker Recognition using Prosodic and Lexical Features. Proc. IEEE Speech Recognition and Understanding Workshop, pp. 19-24, St. Thomas, U.S. Virgin Islands. (PDF)

L. Ferrer, H. Bratt, V. R. R. Gadde, S. Kajarekar, E. Shriberg, K. Sonmez, A. Stolcke, & A. Venkataraman (2003), Modeling Duration Patterns for Speaker Recognition. Proc. Eurospeech, pp. 2017-2020, Geneva. (PDF)

S. Kajarekar, K. Sonmez, L. Ferrer, V. Gadde, A. Venkataraman, E. Shriberg, A. Stolcke, & H. Bratt (2003), "TalkPrinting": Improving Speaker Recognition by Modeling Stylistic Features. In H. Chen, R. Miranda, D.D. Zeng, C. Demchak, J. Schroeder, & T. Madhusudan (eds.), Intelligence and Security Informatics. First NSF/NIJ Symposium, ISI 2003, Springer Lecture Notes in Computer Science Series, Volume 2665, pp. 350-354. © 2003 Springer-Verlag. (PDF)

K. Sonmez, L. Heck, & M. Weintraub (2000), Multiple Speaker Tracking and Detection: Handset Normalization and Duration Scoring, Digital Signal Processing, 10(1/2/3), 133-143.

L. P. Heck, Y. Konig, M. K. Sonmez, & M. Weintraub (2000), Robustness to Telephone Handset Distortion in Speaker Recognition by Discriminative Feature Design, Speech Communication, 31(2-3), 181-192.

K. Sonmez, L. Heck & M. Weintraub (1999), Speaker Tracking and Detection with Multiple Speakers, Proc. Eurospeech, vol. 5, pp. 2219-2222, Budapest. (PDF)

H. Murthy, F. Beaufays, L. P. Heck, & M. Weintraub (1999), Robust Text-Independent Speaker Identification over Telephone Channels, IEEE Trans. on Speech and Audio Processing 7(5), 554-568. (PDF)

K. Sonmez, E. Shriberg, L. Heck & M. Weintraub (1998), Modeling Dynamic Prosodic Variation for Speaker Verification, Proc. Intl. Conf. on Spoken Language Processing, vol. 7, pp. 3189-3192, Sydney. (PDF)

Y. Konig, L. Heck, M. Weintraub, & K. Sonmez (1998), Nonlinear Discriminant Feature Extraction for Robust Text-Independent Speaker Recognition, Proc. RLA2C-ESCA Speaker Recognition and its Commercial and Forensic Applications, pp. 72-75, Avignon, France. (PDF)

L. Heck & Y. Konig (1998), Discriminative Training of Minimum Cost Speaker Verification Systems, Proc. RLA2C-ESCA Speaker Recognition and its Commercial and Forensic Applications, pp. 93-96, Avignon, France. (PDF)

K. Sonmez, L. Heck, M. Weintraub & E. Shriberg (1997), A lognormal tied mixture model of pitch for prosody-based speaker recognition, Proc. Eurospeech, vol. 3, pp. 1391-1394, Rhodes, Greece. (PDF)

L. Julia, L. P. Heck, & A. Cheyer (1997), A Speaker Identification Agent, Proc. AVBPA'97, Crans Montana, Switzerland. (PDF)

L. P. Heck & M. Weintraub (1997), Handset-Dependent Background Models for Robust Text-Independent Speaker Recognition, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, vol. 2, pp. 1071-1074, Munich. (PDF)

F. Beaufays & M. Weintraub (1997), Model Transformation for Robust Speaker Recognition from Telephone Data, Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, vol. 2, pp. 1063-1066, Munich. (PDF)

L. P. Heck & J. H. McClellan (1993), Subspace Techniques for Large-Scale Feature Selection, Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, volume 4, pp. 17-20, Minneapolis.

D. A. Reynolds & L. P. Heck (1991), Integration of Speaker and Speech Recognition Systems, Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 869-872, Toronto.

Presentations

L. Ferrer, M. Graciarena, S. Kajarekar, N. Scheffer, E. Shriberg, & A. Stolcke, The SRI NIST SRE10 Speaker Verification System, NIST Speaker Recognition Evaluation Workshop, June 24, 2010, Brno, Czech Republic.

A. Stolcke, lectures given at the Winter School in Speech and Audio Processing (WiSSAP'09), IIT Kanpur, India, January 9-12, 2009:

  1. Higher-Level Features for Speaker Recognition
  2. Phonetic Speaker Recognition
  3. MLLR Transform and Constrained Cepstral Modeling

A. Stolcke, Machine Learning for Speaker Recognition, NIPS Workshop on Speech and Language: Learning-based Methods and Systems, Dec. 12, 2008, Whistler, B.C.

M. Graciarena, S. Kajarekar, N. Scheffer, E. Shriberg, A. Stolcke, L. Ferrer, & T. Bocklet, The SRI NIST SRE08 Speaker Verification System, NIST Speaker Recognition Evaluation Workshop, June 16, 2008, Montreal.

S. Kajarekar, L. Ferrer, M. Graciarena, E. Shriberg, K. Sönmez, A. Stolcke, G. Tur, & A. Venkataraman, SRI’s NIST 2006 Speaker Recognition Evaluation System, NIST Speaker Recognition Evaluation Workshop, June 2006, San Juan, PR.

©2011 SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025-3493
SRI International is an independent, nonprofit corporation.

Last modified Jun 24, 2012