Evaluating Automatic Speaker Recognition systems: An overview of the NIST Speaker Recognition Evaluations (1996-2014)

Authors

  • Joaquin Gonzalez-Rodriguez, ATVS-Biometric Recognition Group, Universidad Autónoma de Madrid

DOI:

https://doi.org/10.3989/loquens.2014.007

Keywords:

automatic speaker recognition, discrimination and calibration, assessment, benchmark

Abstract


Automatic Speaker Recognition systems offer attractive properties, such as processing speed and repeatability of results, in contrast to speaker recognition by humans. But they are usable only if they are reliable. Testability, that is, the ability to extensively evaluate the quality of the speaker detector's decisions, therefore becomes critical. Over the last 20 years, the US National Institute of Standards and Technology (NIST) has organized a series of text-independent Speaker Recognition Evaluations (SRE), providing the appropriate speech data and evaluation protocols. These evaluations have become not just a periodic benchmark test, but also a meeting point for a collaborative community of scientists who have been deeply involved in the evaluation cycle, enabling tremendous progress in an especially complex task, where the speaker information is spread across different information levels (acoustic, prosodic, linguistic…) and is strongly affected by speaker-intrinsic and speaker-extrinsic variability factors. In this paper, we outline how the evaluations progressively challenged the technology by including new speaking conditions and sources of variability, and how the scientific community responded to those demands. Finally, we show that the NIST SREs are not free of drawbacks, and discuss future challenges for speaker recognition assessment.

Published

2014-06-30

How to Cite

Gonzalez-Rodriguez, J. (2014). Evaluating Automatic Speaker Recognition systems: An overview of the NIST Speaker Recognition Evaluations (1996-2014). Loquens, 1(1), e007. https://doi.org/10.3989/loquens.2014.007

Section

Articles