Comparison of intensity-based methods for automatic speech rate computation

Abstract: Automatic computation of speech rate is a necessary task in a wide range of applications that require this prosodic feature when a manual transcription and time alignments are not available. Several tools have been developed to this end, but not enough research has been conducted yet on the extent to which they are scalable to other languages. In the present work, we compare the performance of two intensity-based Praat tools for the automatic detection of syllable nuclei, oriented to speech rate computation in Spanish.


INTRODUCTION
Automatic computation of speech rate has several applications in speech technologies, such as the automatic evaluation of prosody. Several studies have explored, for example, its use for the automatic evaluation of speech fluency (Cucchiarini, Strik, & Boves, 1998, 2000a, 2000b, 2002; Neumeyer, Franco, Digalakis, & Weintraub, 2000; Zechner, Higgins, Xia, & Williamson, 2009; Hönig, Batliner, Weilhammer, & Nöth, 2010, among others). Usual approaches to speech rate computation use a phonetic aligner to obtain the necessary phonetic segmentation. This is because the performance of speech recognition systems, if available, is not good enough to guarantee a reliable phonetic segmentation.
Phonetic aligners thus appear as an alternative to obtain a more accurate segmentation of the speech chain, but they require the orthographic transcription of the input discourse to be known. If speech rate is to be computed for unrestricted speech, not previously known by the system, there are alternatives that do not require a full phonetic segmentation of the input speech, such as the automatic detection of syllabic nuclei. With the aim of exploring this alternative, the current paper compares the performance of two different methods for the automatic computation of the number of syllabic nuclei, both based on intensity peak detection. The first one is the Praat script described in de Jong and Wempe (2009); the second one is a later Praat script developed by the same authors and other collaborators (de Jong, Pacilly, & Wempe, 2021), which applies a different approach to detect intensity peaks. The final goal is to determine which of them performs better in a syllable detection task oriented to speech rate calculation, and to establish whether either method is adequate for use in an automatic prosody evaluation system for Spanish. This paper is structured as follows: Section 2 briefly overviews related work, Section 3 describes the experimental setup, Section 4 presents the assessment results, and Sections 5 and 6 present the discussion and conclusions, respectively.

RELATED WORK
Most studies dealing with automatic computation of speech rate are based on transcriptions obtained, either manually or automatically, from the speech material. They mainly differ in the units used to compute speech rate. Most are based on counting the number of syllables within a specific segment of speech, expressing speech rate as the number of syllables per second, while some works also provide other measures. Verhasselt and Martens (1996), for instance, define speech rate as the number of phones per second and compute it over the sentences of the TIMIT corpus. Pfitzinger (1996) also used the number of phones per second as a speech rate measure over a total of 240 sentences spoken by eight different speakers.
The literature on tools for automatic speech rate computation without transcriptions, which is the goal of the current paper, is scarce. One of the most relevant works in this respect is Pfau and Ruske (1998), in which speech rate is computed by means of vowel detection, based on loudness in vowel regions, which tends to be higher than in consonant regions. Similarly, the method of Pellegrino, Farinas and Rouas (2004) is based on an unsupervised vowel detection algorithm scalable to any language; it was validated on a spontaneous speech subset of the OGI Multilingual Telephone Speech Corpus. In Narayanan and Wang (2005) and Wang and Narayanan (2007), the authors present novel methods for speech rate estimation, measured as the number of syllables per second, analyzing the segments contained between pauses in the Switchboard database (Godfrey & Holliman, 1993). Both methods are based on an extension of signal correlation, essential for syllable detection, by including temporal correlation and prominent spectral sub-bands.
The work described in Dekens, Demol, Verhelst and Verhoeve (2007) is also based on the number of syllables per second, and the authors evaluate the performance of several speech estimators on a multilingual database covering Dutch, English, French, Romanian and Spanish, by using sub-band and time correlation to detect the number of vowels and diphthongs.
However, given that speech rate can be computed from syllables or phones and the total speech time, any tool that identifies either syllable boundaries or vowels can be used for this task, for example, tools that syllabify conversational speech (Landsiedel et al., 2011; Mary et al., 2018) or tools that locate syllable nuclei (Sabu, Chaudhuri, Rao, & Patil, 2021). Using this latter method, de Jong et al. (2007) and de Jong and Wempe (2009) compute speech rate over two corpora of spoken Dutch by identifying intensity peaks preceded by dips, each of which is considered a syllable nucleus. In Sabu et al. (2021), the authors use the TIMIT dataset (Garofolo et al., 1993) and a children's oral reading corpus created ad hoc, for which they identify vowel sonority by means of local peak picking on a frequency-weighted energy contour.

Evaluated tools for speech rate computation
The present paper analyses the performance of two tools distributed under a GNU General Public License (de Jong & Wempe, 2009; de Jong et al., 2021). Both are Praat-based scripts that use intensity to find syllable nuclei. More specifically, they extract an intensity object using the following parameters: 'Minimum pitch' set to 50 Hz and the autocorrelation method. After this point, their behavior differs.
The first tool (v1), described in de Jong and Wempe (2009), applies a predefined threshold (2 dB above the median intensity of the total sound file) to find peaks preceded by a dip in intensity (see Figure 1). Then, out of those peaks, it discards those that are unvoiced.
The second tool (v3) relies on a different method (de Jong et al., 2021). It detects every intensity peak above 25 dB and below 95% of the highest peak (in order to disregard loud bursts in the signal). Then it measures the intensity surrounding each peak; if there is a dip of at least 2 dB on both sides, the peak is labelled as a syllable nucleus (de Jong et al., 2021).
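To make the two detection rules concrete, the following minimal Python sketch applies a v1-style rule (a dip before the peak) and a v3-style rule (dips on both sides) to a sampled intensity contour. This is not the actual Praat code: the function names, the fixed 2 dB dip threshold, and the toy dB contour are our own simplifications, and the real scripts additionally apply the median-based threshold and the voicing filter described above.

```python
def local_peaks(contour):
    """Indices of local maxima in a sampled intensity contour (dB values)."""
    return [i for i in range(1, len(contour) - 1)
            if contour[i - 1] < contour[i] >= contour[i + 1]]

def dip_depth(contour, lo, hi, peak_val):
    """Depth (dB) of the deepest dip between positions lo (inclusive) and hi."""
    segment = contour[lo:hi]
    return peak_val - min(segment) if segment else 0.0

def nuclei_v1(contour, min_dip=2.0):
    """v1-style rule: keep peaks preceded by a dip of at least min_dip dB."""
    kept, prev = [], 0
    for p in local_peaks(contour):
        if dip_depth(contour, prev, p, contour[p]) >= min_dip:
            kept.append(p)
        prev = p  # measure the next dip from this peak onwards
    return kept

def nuclei_v3(contour, min_dip=2.0):
    """v3-style rule: keep peaks with a dip of at least min_dip dB on BOTH sides."""
    peaks = local_peaks(contour)
    kept = []
    for i, p in enumerate(peaks):
        lo = peaks[i - 1] if i > 0 else 0
        hi = peaks[i + 1] if i + 1 < len(peaks) else len(contour)
        before = dip_depth(contour, lo, p, contour[p])
        after = dip_depth(contour, p, hi, contour[p])
        if before >= min_dip and after >= min_dip:
            kept.append(p)
    return kept
```

On a contour such as [40, 60, 50, 62, 61], the final peak has no following dip, so the v1-style rule keeps both peaks while the v3-style rule keeps only the first, which mirrors the discrepancy pattern described for the two tools.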
Therefore, the main difference between the tools is that v1 considers as syllable nuclei those intensity peaks preceded by a dip, while v3 considers those intensity peaks surrounded by intensity dips on both sides. This difference results in the same judgments most of the time, but not always. Discrepancies between v1 and v3 are usually related to approximants, whose dip is short enough to be merged with the next one, and to laterals (and, to a lesser degree, nasals) in coda position (Figure 2).

Figure 1: Intensity curve of the Spanish phrase "La baba" (the slime) with a syllable nucleus and its preceding and following dips highlighted.

Materials

In order to test which method (preceding dip or surrounding dips) offers the best performance in Spanish, we used a subcorpus of the AHUMADA corpus (Ortega-Garcia, Gonzalez-Rodriguez, & Marrero-Aguiar, 2000) selected for the VILE project (Albalá et al., 2008; Battaner Moro et al., 2005), consisting of recordings of 30 male speakers, with a total of 3.5 hours of speech, recorded in three different sessions on different days (M1, M2, M3) and under two different conditions: read speech (26,984 vowels) and spontaneous speech (35,366 vowels).

The read subcorpus consists of the reading of a phonologically and syllabically balanced text of approximately one minute, read at a normal speech rate. All speakers read the same text in the three sessions.

The spontaneous subcorpus consists of at least one minute of speech describing a picture, the speaker's last holidays, a well-known board game, or simply something familiar to them.

This material was manually annotated at the phoneme, syllable and word levels for the VILE project (Albalá et al., 2008; Battaner Moro et al., 2005). The annotation procedure involved three steps: first, a team of phoneticians orthographically transcribed intonational groups following the guidelines described in Llisterri, Machuca and Ríos (2017); second, EasyAlign (Goldman, 2011) was used to automatically align the annotation; finally, a human annotator revised the automatic segmentation.

Evaluation metrics

One of the main challenges when assessing systems for the automatic computation of speech rate is the diversity and sparseness of evaluation metrics. The metrics used in the literature to evaluate speech rate estimators vary across works and include the relative prediction error, the correlation coefficient between the estimated and actual number of syllables, the syllable error rate, the vowel error rate, the linear regression coefficient, the mean error, the standard deviation error, and the F-score, among others. Moreover, these metrics are computed either over the number of syllables (or phones) as units of measurement, or directly over the speech rate measurement.

In the current paper, we present two different evaluations to compare the two tools addressed. Firstly, we show a performance analysis based on common metrics used for classification problems: accuracy, precision, recall, and F-score. For this assessment, we have considered the tier where vowel (syllable nuclei) and consonant (non-syllable nuclei) intervals are labelled.

Additionally, we provide the root mean square error (RMSE) and normalized root mean square error (NRMSE) for the assessment analysis, based on the syllable annotation tier and, more specifically, the number of syllables of each file in the VILE corpus.

Performance metrics

For the first evaluation, we compare both tools using the standard performance metrics in classification problems:

• Accuracy: defined as the proportion of correctly predicted cases, that is:

(1) accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP = True Positives (detected syllable nuclei), TN = True Negatives, FP = False Positives, and FN = False Negatives.

• Precision: defined as the number of correctly detected syllable nuclei over all detected cases, that is:

(2) precision = TP / (TP + FP)

• Recall: defined as the number of correctly detected syllable nuclei over all actual cases, that is:

(3) recall = TP / (TP + FN)

• F-score: defined as the combination of precision and recall in the following form:

(4) F-score = 2 · (precision · recall) / (precision + recall)

Assessment metrics

Farrús et al. (2021) explored the adequacy of several metrics commonly used in the literature, such as: (a) the correlation coefficient between the actual and estimated number of syllables (or speech rate); (b) the mean error, defined as the mean of the error in absolute values; (c) the standard deviation error, defined as the standard deviation of the previous mean; (d) the coefficient of variation, defined as (standard deviation error)/(mean error); (e) the mean square error (MSE); (f) the root mean square error (RMSE); and (g) the normalized root mean square error (NRMSE), normalized by the mean and defined as:

RMSE = sqrt( (1/N) · Σᵢ (yᵢ − ŷᵢ)² )        NRMSE = RMSE / ȳ

where N is the number of observations, yᵢ is the ith reference (actual) value, ŷᵢ is its corresponding estimated value, and ȳ is the mean of the measured data. Farrús et al. (2021) concluded that correlation coefficients were not adequate for this kind of assessment and that, instead, the use of the relative error as a unit for the different metrics should be encouraged, since it homogenizes the assessment based on the number of syllables and speech rate, apart from exhibiting consistent and coherent results. In the current paper, we evaluate the performance of both tools by computing the number of syllables, the speech rate, and the relative error. Moreover, as suggested in our previous study, we compare both tools by means of RMSE as an assessment metric, together with its normalized value (NRMSE) for a better comparison between models computed over different scales.
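As a compact reference for how these quantities are computed, the following Python sketch implements Equations (1)–(4) together with RMSE and NRMSE. It is a generic implementation written for this exposition, not the evaluation code used in the study.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-score from raw counts (Eqs. 1-4)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)          # quality: how many detections are real nuclei
    recall = tp / (tp + fn)             # quantity: how many real nuclei are detected
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

def rmse(actual, estimated):
    """Root mean square error between reference and estimated values."""
    n = len(actual)
    return math.sqrt(sum((y - y_hat) ** 2
                         for y, y_hat in zip(actual, estimated)) / n)

def nrmse(actual, estimated):
    """RMSE normalized by the mean of the measured (reference) data."""
    mean = sum(actual) / len(actual)
    return rmse(actual, estimated) / mean
```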

Performance analysis
The two tools analyzed provide the number of syllables detected via a TextGrid with a point tier, in which the syllable nuclei are indicated as points in time. The Spanish databases are labeled sound by sound using interval tiers. In order to make the results comparable, we have combined the automatic point tier with the manual interval tier. We considered that the system succeeded when either there is a point within the time range of a manual interval labelled as vowel (true positive, TP) or there is no point within the time range of a manual interval labelled as consonant (true negative, TN). The system fails when there is a point within a consonant time range (false positive, FP), when there is more than one point within a vowel time range (counting as many false positives as surplus points), or when there is no point within a vowel time range (false negative, FN).
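The matching rule described above can be sketched in a few lines of Python. The data structures are hypothetical stand-ins for the TextGrid tiers: each manual interval is a (start, end, label) tuple with label "V" for vowel or "C" for consonant, and the automatic points are times in seconds.

```python
def score_points(intervals, points):
    """Count TP/TN/FP/FN by matching automatic points against manual intervals."""
    tp = tn = fp = fn = 0
    for start, end, label in intervals:
        hits = sum(1 for t in points if start <= t < end)
        if label == "V":            # vowel interval = syllable nucleus
            if hits == 0:
                fn += 1             # missed nucleus
            else:
                tp += 1             # one point credited as correct
                fp += hits - 1      # surplus points count as false positives
        else:                       # consonant interval = non-nucleus
            if hits == 0:
                tn += 1
            else:
                fp += hits          # any point inside a consonant is a false positive
    return tp, tn, fp, fn
```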
This comparison method is accurate for our purpose (computing the number of syllables detected by the script). However, it would not be accurate for tasks interested in the actual center of syllable nuclei, since the method counts as correct any point that falls within the vowel range, regardless of whether the script has placed the point at the vowel mid-point.

Table 1 shows the comparison of the results obtained by both tools in terms of performance analysis. It details the number of syllable nuclei (vowels) correctly detected (true positives), wrongly detected (false positives), missed (false negatives) and correctly dismissed (true negatives) by the two tools (v1 and v3) in the two analyzed conditions (read and spontaneous data). The same results expressed as percentages are illustrated in Figure 3, which shows that v1 results are better for detected syllables (a higher value) and missed syllables (a lower value), whereas v3 performs better for true negatives (a higher value) and false positives (a lower value), taking as a reference the number of manually annotated vowels (nuclei) and consonants (non-nuclei) in both subcorpora (26,984 nuclei for read speech, 35,366 for spontaneous speech).

Table 2 shows the main performance metrics for tools v1 and v3 for read and spontaneous speech, with the best result highlighted in bold. The results show that, in general, v3 is the better tool when we consider precision, while v1 shows a better performance in recall and F-score. Accuracy reveals contradictory results, with the best result for v1 in read speech and for v3 in spontaneous speech. However, accuracy is a discouraged metric for heavily imbalanced classes (a big difference between the number of false positives and false negatives or missed cases) (e.g., Mortaz, 2020), and in those cases the performance analysis should rely on the F-score.
Assessment metrics

In this section, we present the assessment metrics obtained for the following units of analysis: number of syllables, speech rate, and relative error. Speech rate is defined as the number of syllables per second, and the relative error is defined as:

relative error = |#syll_a − #syll_m| / #syll_m

where #syll_a is the estimated (automatic) count of syllables, and #syll_m is the actual (manual) count. Since speech rate is obtained by dividing the number of syllables by the entire speech duration, and the length of the spurt analyzed is the same in both evaluations (automatic and manual), the relative errors applied to the number of syllables and to the speech rate coincide, making it a homogenized measurement.

In Table 3, we show the total number of syllables obtained from the manual transcriptions in the entire corpus, as well as the number of syllables obtained by both automatic tools (v1 and v3) for the read and spontaneous modalities. The results clearly show that v3 fails more than v1 when detecting syllable nuclei, although both tools underestimate the actual number of syllables.

Tables 4 and 5 show the root mean square error (RMSE) and the normalized root mean square error (NRMSE), respectively, obtained for both tools, the different units of analysis (number of syllables, speech rate, and relative error), and both read and spontaneous modalities.
The best result between both tools, for each assessment metric and for both read and spontaneous speech, is highlighted in bold. The results mainly show that, while tool v1 performs better when evaluated by means of RMSE, tool v3 performs better if we consider NRMSE.
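The homogenizing property of the relative error, namely that it is identical whether computed over syllable counts or over speech rate because the common duration cancels out, can be checked with a short Python sketch (illustrative only; the function and variable names are our own):

```python
def relative_error(estimated, actual):
    """Relative error of an automatic estimate with respect to the manual value."""
    return abs(estimated - actual) / actual

def speech_rate(n_syllables, duration_s):
    """Speech rate as the number of syllables per second."""
    return n_syllables / duration_s

# Over the same spurt duration, the relative error of the syllable count
# equals the relative error of the resulting speech rate.
```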

DISCUSSION
The performance analysis (see 4.1) shows that both tools are reliable at finding syllable nuclei (precision > 0.8 and recall > 0.5 in all cases). Also, both tools perform better with read speech than with spontaneous speech. However, they share a common problem of classification tasks: an imbalanced classification, with more false negatives than false positives, which complicates the assessment. For our data, this is a foreseeable result, given that a syllable nucleus not preceded or followed by an intensity dip is more usual in speech than a voiced intensity peak corresponding to a consonant. This means that, if we want the tool to correctly disregard peaks that are not vowels, we need a more restrictive system (which v3 is, by requiring both a preceding and a following intensity dip in order to consider an interval a syllable nucleus), and that yields better results in accuracy and precision, which is exactly what Table 2 shows: precision is a measure of quality, meaning that the vowels marked as vowels by v3 are more likely to be real vowels. However, if we consider the global result of correctly identified and disregarded syllable nuclei (the quantity), a less restrictive rule (i.e., v1, which only considers preceding intensity dips) performs better, as illustrated by the recall and F-score values in Table 2.
For the aim of this paper, which is the automatic computation of speech rate, quantity measures can prove more relevant than quality measures given that, when computing speech rate, we are not interested in knowing whether a particular segment is a syllable nucleus, but rather in obtaining a number of syllable nuclei as close as possible to the actual one. That is, if a false positive is later compensated by a missed nucleus, the system is still accurate. This is exactly what happens when, in a real syllable, the automatic tool places the detected nucleus within the onset instead of the actual nucleus and, consequently, does not label the vowel itself as a nucleus.
In Table 3, we can clearly see that the number of syllables counted by v1 is closer to the actual number of syllables than the number counted by v3. In other words, v3 misses a larger number of syllables, which results in larger RMSE values for v3 in both the read and the spontaneous modalities (Table 4). These results are consistent with those shown in Table 1, also illustrated in Figure 3: the number of detected syllables (true positives) and false positives is greater in v1, and the larger number of missed syllables (false negatives) in v3 further enlarges its underestimation of the syllable count.
However, the NRMSE metric (Table 5) shows otherwise: the RMSE values normalized by the mean of the measured data are better for v3 than for v1. In fact, although v3 fails more often than v1 in detecting syllables, its failure is more stable. This is supported by other metrics such as the standard error (the standard deviation of the mean error) and the coefficient of variation (or relative standard deviation), defined as (standard deviation)/mean. For both measurements, v3 shows a better performance than v1 in both the read and spontaneous modalities.
On the one hand, this shows that, although v3 misses a large number of syllables, such failure could be better compensated by a correction factor. On the other hand, since v3 is more restrictive in its detection conditions (an intensity dip is required on both sides of the vowel, and not only before it, as in v1), we can also be more confident that the syllables detected by v3 correspond to actual syllable nuclei, whereas those detected by v1 may more often come from false nuclei. This is also supported by the larger number of true negatives obtained by v3 in Table 1 for both modalities.

CONCLUSIONS
The results presented and discussed in the previous sections indicate, on the one hand, that neither method of syllable detection is fully reliable yet for a speech rate analysis task: both detect a number of syllables that is remarkably lower than the number obtained from the manual annotation. However, v1 seems to offer a better performance for this task than v3, as its number of detected syllables is closer to the manually obtained value. This compensates for the fact that it is less precise in the detection of actual syllables, which is secondary in a speech rate calculation task as long as the number of detected syllables is close enough to the number of manually annotated ones.
On the other hand, the results also show that, although v3 detects in general less true syllables than v1, it seems more adequate for tasks in which it is important that detected syllables correspond to actual syllables, such as automatic acoustic measurements of corpora involving the detection of syllabic nuclei.