Comparison of intensity-based methods for automatic speech rate computation
Keywords: Prosody, speech rate, syllable count, automatic assessment
Automatic computation of speech rate is necessary in the many applications that require this prosodic feature but have no manual transcription or time alignment available. Several tools have been developed to this end, but little research has yet examined how well they generalize to other languages.
In the present work, we take two off-the-shelf tools designed for automatic speech rate computation and already tested for Dutch and English (v1, which finds syllable nuclei from intensity peaks preceded by an intensity dip, and v3, which relies on intensity peaks surrounded by dips), apply them to read and spontaneous Spanish speech, and test which of them performs better. Precision and normalized mean squared error show that v3 performs better than v1. Recall, however, is higher for v1, which suggests that a more fine-grained analysis of sensitivity and specificity is needed to select the best option for a given application.
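The two peak-picking criteria described above can be illustrated with a minimal Python sketch. It is not the code of either tool: the function names, the 2 dB default and the toy contour are assumptions made for illustration, and the sketch leaves out any thresholding, voicing checks or pause handling that the actual tools may apply.

```python
from typing import List


def local_maxima(contour: List[float]) -> List[int]:
    """Indices of frames higher than the previous frame and not lower than the next."""
    return [
        i for i in range(1, len(contour) - 1)
        if contour[i - 1] < contour[i] >= contour[i + 1]
    ]


def find_nuclei(contour: List[float], min_dip_db: float = 2.0,
                require_following_dip: bool = False) -> List[int]:
    """Keep intensity peaks whose neighbouring dip(s) are at least min_dip_db deep.

    With require_following_dip=False only the dip before the peak is checked
    (the "preceded by a dip" criterion); with True the dip after the peak is
    checked as well (the "surrounded by dips" criterion).
    The 2 dB default is an assumed value, not one taken from the article.
    """
    peaks = local_maxima(contour)
    nuclei = []
    for k, p in enumerate(peaks):
        # Dip before: lowest value between the previous candidate peak
        # (or the start of the contour) and this peak.
        left = peaks[k - 1] if k > 0 else 0
        if contour[p] - min(contour[left:p]) < min_dip_db:
            continue
        if require_following_dip:
            # Dip after: lowest value between this peak and the next
            # candidate peak (or the end of the contour).
            right = peaks[k + 1] if k + 1 < len(peaks) else len(contour) - 1
            if contour[p] - min(contour[p + 1:right + 1]) < min_dip_db:
                continue
        nuclei.append(p)
    return nuclei


def speech_rate(n_syllables: int, duration_s: float) -> float:
    """Speech rate in syllables per second."""
    return n_syllables / duration_s


# Toy intensity contour in dB (made-up values, one per analysis frame).
contour = [50, 60, 55, 59, 50, 62, 60.5, 63, 50]
preceding_only = find_nuclei(contour)
surrounded = find_nuclei(contour, require_following_dip=True)
print(preceding_only, surrounded)
```

In this toy contour the stricter criterion rejects the peak at frame 5, whose following dip is only 1.5 dB deep, so it yields three nuclei where the looser criterion yields four.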
Copyright (c) 2023 Consejo Superior de Investigaciones Científicas (CSIC)
This work is licensed under a Creative Commons Attribution 4.0 International License. © CSIC. Manuscripts published in both the printed and online versions of this Journal are the property of Consejo Superior de Investigaciones Científicas, and quoting this source is a requirement for any partial or full reproduction.
All contents of this electronic edition, except where otherwise noted, are distributed under a “Creative Commons Attribution 4.0 International” (CC BY 4.0) License. The CC BY 4.0 License must be expressly indicated in this way when necessary.
Self-archiving in repositories, personal webpages or similar of any version other than the one published by the Editor is not allowed.
Ministerio de Ciencia, Innovación y Universidades
Grant number PGC2018-094233-B-C21