Evaluating Automatic Speaker Recognition systems: An overview of the NIST Speaker Recognition Evaluations (1996-2014)
DOI:
https://doi.org/10.3989/loquens.2014.007
Keywords:
automatic speaker recognition, discrimination and calibration, assessment, benchmark
Abstract
Automatic speaker recognition systems offer attractive properties, such as processing speed and repeatability of results, in contrast to speaker recognition by humans. However, they are usable only if they are reliable. Testability, the ability to extensively evaluate the quality of the speaker detector's decisions, therefore becomes critical. Over the last 20 years, the US National Institute of Standards and Technology (NIST) has organized a series of text-independent Speaker Recognition Evaluations (SRE), providing the necessary speech data and evaluation protocols. These evaluations have become not just a periodic benchmark test but also a meeting point for a collaborative community of scientists who have been deeply involved in the evaluation cycle, enabling tremendous progress in an especially complex task in which speaker information is spread across different levels (acoustic, prosodic, linguistic…) and is strongly affected by intrinsic and extrinsic speaker variability factors. In this paper, we outline how the evaluations progressively challenged the technology by including new speaking conditions and sources of variability, and how the scientific community responded to those demands. Finally, we show that NIST SREs are not free of drawbacks, and we discuss future challenges for speaker recognition assessment.
License
Copyright (c) 2014 Consejo Superior de Investigaciones Científicas (CSIC)
This work is licensed under a Creative Commons Attribution 4.0 International License.
© CSIC. Manuscripts published in both the printed and online versions of this Journal are the property of Consejo Superior de Investigaciones Científicas, and citing this source is a requirement for any partial or full reproduction. All contents of this electronic edition, except where otherwise noted, are distributed under a “Creative Commons Attribution 4.0 International” (CC BY 4.0) License. The basic information and the legal text of the license are available online. The CC BY 4.0 License must be expressly indicated in this way when necessary.
Self-archiving in repositories, personal webpages, or similar sites of any version other than the one published by the Editor is not allowed.