Comparison of intensity-based methods for automatic speech rate computation

Wendy Elvira-García; Mireia Farrús; Juan María Garrido Almiñana

doi:10.3989/loquens.2022.e090

Authors

Wendy Elvira-García Universitat de Barcelona https://orcid.org/0000-0001-7002-9851
Mireia Farrús Universitat de Barcelona https://orcid.org/0000-0002-7160-9513
Juan María Garrido Almiñana Universidad Nacional de Educación a Distancia (UNED) https://orcid.org/0000-0002-3310-8582

Keywords:

Prosody, speech rate, syllable count, automatic assessment

Abstract

Automatic computation of speech rate is needed in a wide range of applications that rely on this prosodic feature but have no manual transcription or time alignments available. Several tools have been developed to this end, but little research has yet examined to what extent they extend to other languages.

In the present work, we take two off-the-shelf tools designed for automatic speech rate computation and already tested on Dutch and English (v1, which finds syllable nuclei as intensity peaks preceded by an intensity dip, and v3, which relies on intensity peaks surrounded by dips), and we apply them to read and spontaneous Spanish speech. We then test which of them performs better. Precision and normalized mean squared error show that v3 performs better than v1. Recall, however, favours v1, which suggests that a more fine-grained analysis of sensitivity and specificity is needed to select the best option for a given application.
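The difference between the two peak criteria can be illustrated with a minimal sketch. It is not the code of either tool: the function names, the toy intensity contour and the 2 dB dip threshold are assumptions made for illustration, and the published tools may apply further checks that are not modelled here.

```python
from typing import List


def local_peaks(intensity: List[float]) -> List[int]:
    """Return the indices of local maxima in an intensity contour (dB)."""
    return [
        i
        for i in range(1, len(intensity) - 1)
        if intensity[i - 1] < intensity[i] >= intensity[i + 1]
    ]


def find_nuclei(
    intensity: List[float],
    min_dip_db: float = 2.0,  # assumed threshold, not taken from the paper
    require_following_dip: bool = False,
) -> List[int]:
    """Select candidate syllable nuclei among the intensity peaks.

    require_following_dip=False: the peak must be preceded by a dip
    (the criterion the abstract attributes to v1).
    require_following_dip=True: the peak must also be followed by a dip
    (the "surrounded by dips" criterion the abstract attributes to v3).
    """
    peaks = local_peaks(intensity)
    nuclei = []
    for k, p in enumerate(peaks):
        # Lowest point between the previous peak (or the start) and this peak.
        start = peaks[k - 1] if k > 0 else 0
        accepted = intensity[p] - min(intensity[start:p + 1]) >= min_dip_db
        if accepted and require_following_dip:
            # Lowest point between this peak and the next peak (or the end).
            end = peaks[k + 1] if k + 1 < len(peaks) else len(intensity) - 1
            accepted = intensity[p] - min(intensity[p:end + 1]) >= min_dip_db
        if accepted:
            nuclei.append(p)
    return nuclei


def speech_rate(n_nuclei: int, duration_s: float) -> float:
    """Speech rate as detected nuclei per second."""
    return n_nuclei / duration_s


# Toy contour in dB, one value per analysis frame (hypothetical data).
contour = [50, 60, 55, 62, 61.5, 63, 50]
preceded_only = find_nuclei(contour)
surrounded = find_nuclei(contour, require_following_dip=True)
```

On this toy contour the stricter, two-sided criterion accepts fewer peaks than the one-sided one. A stricter detector of this kind would be expected to trade recall for precision, which is in line with the pattern reported in the abstract, although the sketch itself proves nothing about the actual tools.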
