1. INTRODUCTION
Automatic computation of speech rate has several applications in speech technologies, such as the automatic evaluation of prosody. Several studies have explored, for example, its use for the automatic evaluation of speech fluency (Cucchiarini, Strik, & Boves, 1998, 2000a, 2000b, 2002; Neumeyer, Franco, Digalakis, & Weintraub, 2000; Zechner, Higgins, Xia, & Williamson, 2009; Honig, Batliner, Weilhammer, & Nöth, 2010, among others). Usual approaches to speech rate computation use a phonetic aligner to obtain the necessary phonetic segmentation. This is because the performance of speech recognition systems, when available, is not good enough to guarantee that the obtained phonetic segmentation is reliable.
Phonetic aligners thus appear as an alternative to obtain a more accurate segmentation of the speech chain, but they require the orthographic transcription of the input discourse to be known. When the computation of the speech rate of unrestricted speech, not previously known by the system, is attempted, there are some alternatives that do not require a full phonetic segmentation of the input speech, such as the automatic detection of syllabic nuclei. With the aim of exploring this alternative, the current paper compares the performance of two different methods for the automatic computation of the number of syllabic nuclei, both based on intensity peak detection. The first one is a Praat script described in de Jong and Wempe (2009), and the second one is another Praat script developed by the same authors and other collaborators (de Jong, Pacilly, & Wempe, 2021), in which a different approach to detect intensity peaks is applied. The final goal is to determine which of them performs better in a syllable detection task oriented to speech rate calculation, and to establish whether either of these two methods is adequate for use in an automatic prosody evaluation system for Spanish.
This paper is structured as follows: Section 2 briefly overviews the related work on this topic, Section 3 describes the experimental setup, Section 4 presents the assessment results, and finally, Sections 5 and 6 sketch the discussion and conclusions, respectively.
2. RELATED WORK
Most of the studies that deal with automatic computation of speech rate are based on the transcriptions obtained, either manually or automatically, from the speech material. They differ mainly in the units used to compute speech rate. Most of them count the number of syllables within a specific segment of speech, reporting speech rate as the number of syllables per second, while some other works also provide other measures. Verhasselt and Martens (1996), for instance, define speech rate as the number of phones per second and compute it over the sentences of the TIMIT corpus. Pfitzinger (1996) also used the number of phones per second as a speech rate measure, over a total of 240 sentences spoken by eight different speakers.
The literature on automatic speech rate computation tools that do not require transcriptions, which is the goal of the current paper, is scarce. One of the most relevant works in this respect is Pfau and Ruske (1998), in which speech rate is computed by means of vowel detection, based on loudness in vowel regions, which tends to be higher than in consonant regions. Similarly, the method of Pellegrino, Farinas and Rouas (2004) is based on an unsupervised vowel detection algorithm scalable to any language; validation was carried out on a spontaneous speech subset of the OGI Multilingual Telephone Speech Corpus. In Narayanan and Wang (2005) and Wang and Narayanan (2007), the authors present novel methods for speech rate estimation, measured as the number of syllables per second, analyzing the segments contained between pauses in the Switchboard database (Godfrey & Holliman, 1993). Both methods are based on an extension of signal correlation, essential for syllable detection, by including temporal correlation and prominent spectral sub-bands.
The work described in Dekens, Demol, Verhelst and Verhoeve (2007) is also based on the number of syllables per second; the authors evaluate the performance of several speech rate estimators on a multilingual database covering Dutch, English, French, Romanian and Spanish, using sub-band and time correlation to detect the number of vowels and diphthongs.
However, given that speech rate can be computed from syllables or phones and the total time of speech, any tool that identifies either syllable boundaries or vowels can be used for this task, for example, tools that syllabify conversational speech (Landsiedel et al., 2011; Mary et al., 2018) or tools that locate syllable nuclei (Sabu, Chaudhuri, Rao, & Patil, 2021). Using this last method, de Jong et al. (2007) and de Jong and Wempe (2009) compute speech rate over two corpora of spoken Dutch by identifying intensity peaks that are preceded by dips, each of which is then considered a syllable nucleus. In Sabu et al. (2021), the authors use the TIMIT dataset (Garofolo et al., 1993) and a children's oral reading corpus created ad hoc, for which they identify vowel sonority by means of local peak picking on a frequency-weighted energy contour.
3. EXPERIMENTAL SETUP
3.1. Evaluated tools for speech rate computation
The present paper analyses the performance of two tools distributed under a GNU General Public License (de Jong & Wempe, 2009; de Jong et al., 2021). Both of them are Praat scripts that use intensity in order to find syllable nuclei. More specifically, they extract an intensity object using the following parameters: 'minimum pitch' set to 50 Hz and the autocorrelation method. After this point, their behavior differs.
The first tool (v1), described in de Jong and Wempe (2009), applies a predefined threshold (2 dB above the median intensity of the total sound file) to find peaks preceded by a dip in intensity (see Figure 1). Then, out of those peaks, it discards those that are unvoiced.
The second tool (v3) relies on a different method (de Jong et al., 2021). It detects every intensity peak above 25 dB and below 95 % of the highest peak (in order to disregard loud bursts in the signal). Then, it measures the intensity surrounding the peak, and if there is a dip of at least 2 dB on both sides, the peak is labelled as a syllable nucleus.
Therefore, the main difference between the tools is that one (v1) considers as syllable nuclei those intensity peaks preceded by a dip, while the other (v3) considers as syllable nuclei those intensity peaks surrounded by intensity dips on both sides. The two criteria yield the same judgments most of the time, but not always. Discrepancies between v1 and v3 usually involve approximants, whose dip is short enough for the syllable to be merged with the next one, and laterals (and, to a lesser degree, nasals) in coda position (Figure 2).
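As a rough illustration of the two criteria, the following sketch (ours, not taken from the scripts; the function name and the simplified dip logic are hypothetical, and the actual Praat scripts add voicing checks and absolute intensity thresholds) labels local maxima of an intensity contour as syllable nuclei under a v1-like and a v3-like rule:

```python
def detect_nuclei(contour, min_dip=2.0, both_sides=False):
    """Toy nucleus detection on an intensity contour (values in dB).

    v1-like rule (default): a local maximum is a nucleus if it rises
    at least `min_dip` dB above the preceding valley.
    v3-like rule (both_sides=True): the peak must also rise at least
    `min_dip` dB above the following valley.
    """
    peaks = [i for i in range(1, len(contour) - 1)
             if contour[i - 1] < contour[i] >= contour[i + 1]]
    nuclei, prev = [], 0
    for k, p in enumerate(peaks):
        nxt = peaks[k + 1] if k + 1 < len(peaks) else len(contour)
        dip_before = contour[p] - min(contour[prev:p + 1])
        dip_after = contour[p] - min(contour[p:nxt])
        if dip_before >= min_dip and (not both_sides or dip_after >= min_dip):
            nuclei.append(p)
        prev = p
    return nuclei

# The middle peak is followed by only a shallow (1 dB) dip, as can
# happen before an approximant:
contour = [30, 65, 45, 64, 63, 66, 30]
print(detect_nuclei(contour))                   # [1, 3, 5]: v1 keeps it
print(detect_nuclei(contour, both_sides=True))  # [1, 5]:    v3 drops it
```

The example makes the asymmetry concrete: the preceding-dip rule accepts the middle peak, while the surrounding-dips rule rejects it.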
3.2. Materials
In order to test which method (peak preceded by a dip, or peak surrounded by dips) offers the best performance in Spanish, we used a subcorpus of the AHUMADA corpus (Ortega-García, González-Rodríguez, & Marrero-Aguiar, 2000) selected for the VILE project (Albalá et al., 2008; Battaner Moro et al., 2005), consisting of recordings of 30 male speakers, with a total of 3.5 hours of speech, recorded in three different sessions on different days (M1, M2, M3) and in two different conditions: read speech (26984 vowels) and spontaneous speech (35366 vowels).
The read subcorpus consists of a phonologically and syllabically balanced text of approximately one minute, read at a normal speech rate. All speakers read the same text in the three sessions.
The spontaneous subcorpus consists of at least one minute of speech in which speakers describe a picture, their last holidays, a well-known board game, or simply something familiar to them.
This material was manually annotated at the phoneme, syllable and word levels for the VILE project (Albalá et al., 2008; Battaner Moro et al., 2005). The annotation procedure involved three steps: first, a team of phoneticians orthographically transcribed the intonational groups following the guidelines described in Llisterri, Machuca and Ríos (2017); second, EasyAlign (Goldman, 2011) was used to automatically align the annotation; finally, a human annotator revised the automatic segmentation.
3.3. Evaluation metrics
One of the main challenges when assessing systems dealing with the automatic computation of speech rate is the diversity and sparseness of evaluation metrics. The metrics used in the literature to evaluate speech rate estimators vary among the different works and include a wide range of measures such as the relative prediction error, the correlation coefficient between the estimated and actual syllables, the syllable error rate, the vowel error rate, the linear regression coefficient, the mean error, the standard deviation error, and the F-score, among others. Moreover, these metrics are computed either over the number of syllables (or phones) as units of measurement, or directly over the speech rate measurement.
In the current paper, we present two different evaluations to compare the two tools addressed. Firstly, we show a performance analysis based on common metrics used for classification problems: accuracy, precision, recall, and F-score. For this assessment, we have considered the tier where vowel intervals (syllable nuclei) and consonant intervals (non-syllable nuclei) are labelled.
Additionally, we provide the root mean square error (RMSE) and normalized root mean square error (NRMSE) for the assessment analysis, based on the syllable annotation tier and, more specifically, the number of syllables of each file in the VILE corpus.
3.3.1. Performance metrics
For the first evaluation, we compare both tools using the standard performance metrics in classification problems:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad \mathrm{F\text{-}score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where

TP = True Positives (detected syllable nuclei)
TN = True Negatives
FP = False Positives
FN = False Negatives

-
Precision: defined as the number of correctly detected syllable nuclei over the estimated cases, that is:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

-
Recall: defined as the number of correctly detected syllable nuclei over the actual cases, that is:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
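As a sanity check, these metrics can be computed directly from the confusion counts. A minimal sketch (ours, for illustration), applied to the read-speech counts for v1 reported in Table 1:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # quality: detected nuclei that are real
    recall = tp / (tp + fn)     # quantity: real nuclei that are detected
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# v1 counts for read speech (Table 1):
acc, prec, rec, f1 = classification_metrics(tp=18137, tn=29740,
                                            fp=2034, fn=8847)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# 0.815 0.899 0.672 0.769, i.e. the v1 read column of Table 2
```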
3.3.2. Assessment metrics
Farrús et al. (2021) explored the adequacy of several metrics commonly used in the literature, such as: (a) the correlation coefficient between the actual number of syllables, or speech rate, and the estimated number of syllables, or speech rate; (b) the mean error, defined as the mean of the error in absolute values; (c) the standard deviation error, defined as the standard deviation of the previous mean; (d) the coefficient of variation, defined as (standard deviation error)/(mean error); (e) the mean square error (MSE); (f) the root mean square error (RMSE); and (g) the normalized root mean square error (NRMSE), normalized by the mean and defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \qquad \mathrm{NRMSE} = \frac{\mathrm{RMSE}}{\bar{y}}$$

where

N is the number of observations
$y_i$ is the ith reference (actual) value
$\hat{y}_i$ is its corresponding estimated value
$\bar{y}$ is the mean of the measured data
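These two assessment metrics are straightforward to implement; a minimal sketch (ours, for illustration) over per-file syllable counts:

```python
import math

def rmse(actual, estimated):
    """Root mean square error between reference and estimated values."""
    n = len(actual)
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(actual, estimated)) / n)

def nrmse(actual, estimated):
    """RMSE normalized by the mean of the measured (reference) data,
    which allows comparing models computed over different scales."""
    return rmse(actual, estimated) / (sum(actual) / len(actual))

# Two files with 10 reference syllables each, estimated as 8 and 14:
print(rmse([10, 10], [8, 14]))   # 3.162...
print(nrmse([10, 10], [8, 14]))  # 0.316...
```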
Farrús et al. (2021) concluded that correlation coefficients were not adequate for this kind of assessment, and that, instead, the use of the relative error as a unit for the different metrics should be encouraged, since it homogenizes the assessment based on the number of syllables and speech rate, apart from exhibiting consistent and coherent results. In the current paper, we evaluate the performance of both tools by computing the number of syllables, the speech rate, and the relative error. Moreover, as suggested in our previous study, we compare both tools by means of RMSE as an assessment metric, together with its normalized value (NRMSE) for a better comparison between models computed over different scales.
4. SYSTEM COMPARISON
4.1. Performance analysis
The two tools analyzed provide the number of syllables detected via a TextGrid with a point tier (in which the syllable nuclei are indicated as points in time). The Spanish databases are labeled sound-by-sound using interval tiers. In order to make the results comparable, we combined the automatic point tier with the manual interval tier. We considered that a system succeeded when either there is a point within the time range of a manual interval labelled as a vowel (true positive, TP) or there is no point within the time range of a manual interval labelled as a consonant (true negative, TN). A system fails when there is a point within a consonant time range (false positive, FP), when there is more than one point within a vowel time range (as many false positives as surplus points), or when there is no point within a vowel time range (false negative, FN).
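The matching procedure can be sketched as follows (the function and the 'V'/'C' labels are illustrative simplifications, not the actual VILE annotation scheme):

```python
def score_detection(intervals, points):
    """Combine a manual interval tier with an automatic point tier.

    intervals: list of (start, end, label) tuples, with label 'V'
    (vowel / syllable nucleus) or 'C' (consonant).
    points: times of automatically detected syllable nuclei.
    Returns the confusion counts (TP, TN, FP, FN).
    """
    tp = tn = fp = fn = 0
    for start, end, label in intervals:
        hits = sum(1 for t in points if start <= t < end)
        if label == 'V':
            if hits == 0:
                fn += 1         # missed nucleus
            else:
                tp += 1         # one point inside a vowel interval
                fp += hits - 1  # surplus points count as false positives
        else:
            if hits == 0:
                tn += 1         # correctly dismissed consonant
            else:
                fp += hits      # point(s) inside a consonant interval
    return tp, tn, fp, fn
```

For example, with intervals `[(0, 1, 'C'), (1, 2, 'V'), (2, 3, 'C'), (3, 4, 'V'), (4, 5, 'V')]` and points `[1.5, 1.8, 2.5, 4.2]`, the sketch yields TP = 2, TN = 1, FP = 2 (one surplus point in a vowel, one point in a consonant) and FN = 1.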
This comparison method is accurate for our purpose (computing the number of syllables detected by the script). However, it would not be accurate for tasks where the interest is the actual center of the syllable nucleus, since the method counts as correct any point that falls within the vowel range, without taking into account whether the script has placed the point at the vowel mid-point.
Table 1 shows the comparison of the results obtained by both tools in terms of performance analysis. It details the number of syllable nuclei (vowels) correctly detected (True Positives), wrongly detected (False Positives), missed (False Negatives) and correctly dismissed (True Negatives) by the two tools (v1 and v3) in the two analyzed conditions (read data and spontaneous data). The same results expressed in percentage are illustrated in Figure 3.
| | read v1 | read v3 | spontaneous v1 | spontaneous v3 |
|---|---|---|---|---|
| TP | 18137 | 16989 | 23208 | 21671 |
| FP | 2034 | 1147 | 3655 | 1851 |
| FN | 8847 | 9995 | 12158 | 13695 |
| TN | 29740 | 30601 | 37990 | 39049 |
Figure 3 shows that v1 results are better for detected syllables (a higher value) and missed syllables (a lower value), whereas v3 performs better for true negatives (a higher value) and false positives (a lower value), taking as a reference the number of manually annotated vowels (nuclei) and consonants (non-nuclei) in both subcorpora (26984 nuclei for read speech, 35366 for spontaneous speech).
Table 2 shows the main performance metrics for tools v1 and v3 for read and spontaneous speech, with the best result highlighted in bold. Results show that, in general, v3 is the better tool when we consider precision, while v1 shows a better performance in recall and F-score. Accuracy reveals contradictory results, with the best result for v1 in read speech and for v3 in spontaneous speech. However, accuracy is a discouraged metric in heavily imbalanced cases (a big difference between the number of false positives and false negatives or missed cases) (e.g., Mortaz, 2020), and in those cases performance analysis should rely on the F-score.
| | read v1 | read v3 | spontaneous v1 | spontaneous v3 |
|---|---|---|---|---|
| accuracy | **0.815** | 0.810 | 0.795 | **0.796** |
| precision | 0.899 | **0.937** | 0.864 | **0.921** |
| recall | **0.672** | 0.629 | **0.656** | 0.613 |
| F-score | **0.769** | 0.753 | **0.746** | 0.733 |
4.2. Assessment metrics
In this section, we present the assessment metrics obtained for the following units of analysis: number of syllables, speech rate, and relative error. Speech rate is defined as the number of syllables per second, and the relative error is defined as:

$$E_{rel} = \frac{\left|\#\mathrm{syll}_a - \#\mathrm{syll}_m\right|}{\#\mathrm{syll}_m}$$

where $\#\mathrm{syll}_a$ is the estimated (automatic) count of syllables, and $\#\mathrm{syll}_m$ is the actual (manual) count. Since speech rate is obtained by dividing the number of syllables by the entire speech duration, and the length of the spurt analyzed is the same in both evaluations (automatic and manual), the relative error applied to the number of syllables and to the speech rate coincides, making it a homogenized measurement.
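This cancellation is easy to verify numerically; in the following sketch (the duration and counts are illustrative values, not corpus data), the relative error is identical whether computed over counts or over rates:

```python
def relative_error(estimated, actual):
    """Relative error of an estimate with respect to the reference."""
    return abs(estimated - actual) / actual

# The spurt duration cancels out, so the relative error on the syllable
# count equals the relative error on the speech rate (syllables/second).
duration = 60.0                    # seconds (illustrative value)
syll_auto, syll_manual = 180, 200  # illustrative counts
err_count = relative_error(syll_auto, syll_manual)
err_rate = relative_error(syll_auto / duration, syll_manual / duration)
print(err_count, err_rate)  # both 0.1, up to floating point
```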
In Table 3, we show the total number of syllables obtained from the manual transcriptions over the entire corpus, as well as the number of syllables obtained with both automatic tools (v1 and v3) for the read and spontaneous modalities. The results clearly show that v3 misses more syllable nuclei than v1, although both tools underestimate the actual number of syllables.
| | manual | v1 | v3 |
|---|---|---|---|
| read | 27005 | 20165 | 18155 |
| spontaneous | 35408 | 27491 | 24003 |
Tables 4 and 5 show the root mean square error (RMSE) and normalized root mean square error (NRMSE) respectively, obtained for both tools, the different units of analysis (number of syllables, speech rate, and error rate), and both read and spontaneous modalities.
| | read v1 | read v3 | spontaneous v1 | spontaneous v3 |
|---|---|---|---|---|
| #syllables | **78.3** | 99.8 | **99.2** | 135.3 |
| speech rate | **1.420** | 1.789 | **3.604** | 4.032 |
| error rate | **0.261** | 0.333 | **0.233** | 0.326 |
| | read v1 | read v3 | spontaneous v1 | spontaneous v3 |
|---|---|---|---|---|
| #syllables | 1.030 | **1.015** | 1.128 | **1.068** |
| speech rate | 1.050 | **1.033** | 1.136 | **1.105** |
| error rate | 1.031 | **1.015** | 1.063 | **1.030** |
The best result for each assessment metric, for both read and spontaneous speech, is highlighted in bold. The results mainly show that, while tool v1 performs better when evaluated by means of RMSE, tool v3 performs better if we consider NRMSE.
5. DISCUSSION
The performance analysis (see 4.1) shows that both tools are reliable at finding syllable nuclei (precision > 0.8 and recall > 0.5 in all cases). Also, both tools perform better with read speech than with spontaneous speech. However, they share a common problem in classification tasks: an imbalanced classification, with more false negatives than false positives, which complicates the assessment. For our data, this is a foreseeable result, given that a syllable nucleus not preceded or followed by an intensity dip is more common in speech than a voiced intensity peak corresponding to a consonant. This means that, if we want the tool to correctly disregard peaks that are not vowels, we need a more restrictive system, which v3 provides by requiring both a preceding and a following intensity dip in order to consider an interval a syllable nucleus; such a system yields a better result in accuracy and precision, which is exactly what Table 2 shows. Precision is a measure of quality, meaning that the vowels marked as vowels by v3 are more likely to be real vowels. However, if we consider the global result of correctly identified and disregarded syllable nuclei (the quantity), a less restrictive rule (i.e., v1, which only considers preceding intensity dips) performs better, as illustrated by the recall and F-score in Table 2.
For the aim of this paper, which is the automatic computation of speech rate, quantity measures can prove more relevant than quality measures, given that, when computing speech rate, we are not interested in knowing whether a particular segment is a syllable nucleus, but rather in obtaining a count of syllable nuclei as close as possible to the actual one. That is, if a false positive is later compensated by a missed nucleus, the system is still accurate. This is exactly what happens when, for a real syllable, the automatic tool places the syllable nucleus within the onset instead of in the actual nucleus, but then does not also label the vowel as a nucleus.
In Table 3, we can clearly see that the number of syllables counted by v1 is closer to the actual number of syllables than the number counted by v3. In other words, v3 misses a larger number of syllables, which results in larger RMSE values for v3 in both the read and the spontaneous modalities (Table 4). These results are consistent with those shown in Table 1, also illustrated in Figure 3: the number of detected syllables (true positives) and of false positives is greater in v1. The number of missed syllables (false negatives) also contributes to enlarging the underestimation in the syllable count.
However, the NRMSE metric (Table 5) shows otherwise: the RMSE values normalized by the mean of the measured data are better for v3 than for v1. Although v3 fails more often than v1 in detecting syllables, its failure is more stable. This is supported by other metrics such as the standard error (the standard deviation of the mean error) and the coefficient of variation, or relative standard deviation, defined as (standard deviation)/(mean). For both measurements, v3 shows a better performance than v1 in both the read and the spontaneous modalities.
On the one hand, this shows that, although v3 misses considerably more syllables, such failure could be better compensated by a correction factor. On the other hand, since v3 is more restrictive in its syllable detection conditions (an intensity dip is required on both sides of the vowel, not only on one side as in v1), we can also be more confident that the syllables detected by v3 correspond to actual syllable nuclei, whereas those detected by v1 more often come from false nuclei. This is also supported by the larger number of true negatives obtained by v3 in Table 1 for both modalities.
6. CONCLUSIONS
The results presented and discussed in the previous sections indicate, on the one hand, that neither method of syllable detection is fully reliable yet for a speech rate analysis task: both detect a number of syllables that is remarkably lower than the number obtained from a manual annotation. However, v1 seems to offer a better performance for this task than v3, as its number of detected syllables is closer to the manually obtained value. This compensates for the fact that v1 is less precise in detecting actual syllables, which is secondary in a speech rate calculation task as long as the number of detected syllables is close enough to the number of manually annotated ones.
On the other hand, the results also show that, although v3 generally detects fewer true syllables than v1, it seems more adequate for tasks in which it is important that the detected syllables correspond to actual syllables, such as automatic acoustic measurements of corpora involving the detection of syllabic nuclei.