Characterizing speech rhythm using spectral coherence between jaw displacement and speech temporal envelope

ABSTRACT: Lower modulation rates in the temporal envelope (ENV) of the acoustic signal are believed to form the rhythmic backbone of speech, facilitating speech comprehension through neuronal entrainment at the δ- and θ-rates (phonetically, these rates are comparable to the foot and syllable rates). The jaw plays the role of a carrier articulator, regulating mouth opening in a quasi-cyclical way whose physical consequence is these low-frequency modulations. This paper describes a method to examine the joint roles of jaw oscillation and the ENV in realizing speech rhythm using spectral coherence. Relative powers in the frequency bands of the coherence corresponding to the δ- and θ-oscillations (notated %δ and %θ, respectively) were quantified as one possible way of revealing the amount of concomitant foot- and syllable-level rhythmicity carried by both the acoustic and articulatory domains. Two English corpora (mngu0 and MOCHA-TIMIT) were used as a proof of concept. For an initial analysis, %δ and %θ were regressed on utterance duration. Results showed that the degrees of foot- and syllable-sized rhythmicity are different and are contingent upon utterance length. Keywords: speech rhythm, spectral coherence, temporal envelope, jaw displacement.


INTRODUCTION
This paper characterizes speech rhythm in terms of the spectral coherence between jaw oscillations and speech temporal envelopes (ENV, henceforth). Two frequency bands in the coherence spectrum, covering the neuronal δ- and θ-rates, were analyzed in particular in terms of their relative contributions to the entire coherence power. These bands have been claimed to correspond to the foot and syllable timescales in speech and have been demonstrated to play a crucial role in neurological speech processing via brainwave-to-ENV entrainment (e.g. Doelling, Arnal, Ghitza, & Poeppel, 2014; Ghitza, 2017; Poeppel & Assaneo, 2020). This paper reports an initial analysis of the relationship between the relative powers of the δ- and θ-bands in the coherence and utterance length, using two English corpora: mngu0 (Richmond, Hoole, & King, 2011) and MOCHA-TIMIT (Wrench, 1999).
Speech rhythm is not evolutionarily redundant; it is functional in the neurological processing of the speech signal. The recurring oscillations in the ENV, which supposedly reflect the rhythmic frames, help the brain parse the incoming speech signal for comprehension. It has been demonstrated that the δ-oscillation (.5-3 Hz, corresponding to foot/stress rates) and the θ-oscillation (3-9 Hz, corresponding to syllable rates) in the auditory cortex entrain to the speech ENV at these modulation rates (Doelling et al., 2014; Ghitza, 2017; Giraud & Poeppel, 2012; Strauß & Schwartz, 2017). These slow neuronal oscillations form a temporal window structure whereby the auditory cortex tracks the speech signal at the foot and syllable rates. Within such longer temporal windows, information encoded at finer timescales (e.g. phonemes, up to ~40 Hz, corresponding to the γ-oscillation) can then be processed to achieve comprehension (Doelling et al., 2014; Giraud & Poeppel, 2012).
The motor knowledge of speech production is arguably indispensable in the neurological processing of speech signals (Strauß & Schwartz, 2017). The jaw performs the role of a carrier articulator responsible for the lower modulation frequencies that may correspond to the rhythmic frames (Strauß & Schwartz, 2017), to which the slower neuronal oscillations can be phase-locked, not only in the auditory cortex but also in the visual cortex (Park, Kayser, Thut, & Gross, 2016). Seeing the speaker's mouth movements helps the listener understand speech, particularly in adverse conditions with excessive noise (Park et al., 2016). The mouth movements give the listener visual access to the rhythmic structure, acting as visual scaffolding. Therefore, the jaw as a carrier articulator plays an important role in both the production and the perception of speech rhythm; the temporal windows facilitating the neuronal entrainment to the speech ENV must be discoverable in the jaw oscillation as well. However, the roles of the jaw and the ENV have so far been studied separately: the jaw displacement has been shown to explain the metrical structure of the utterance well (Erickson, Suemitsu, Shibuya, & Tiede, 2012; Erickson & Kawahara, 2016; Huang & Erickson, 2019), while the ENV has been extensively investigated in terms of its recurring patterns (He, 2018; Tilsen & Arvaniti, 2013; Tilsen & Johnson, 2008) and the synchronization between different modulation rates (Cummins & Port, 1998; Lancia et al., 2019; Leong et al., 2014).
We thus propose to characterize speech rhythm using the spectral coherence between the jaw oscillation and the speech ENV (hereinafter, jaw-env coherence). Spectral coherence is a Fourier-transform-based representation that quantifies the common periodicities in two signals. It evaluates the correlation between the two signals in the frequency domain, hence its advantage over simple correlations in the time domain (which are generally problematic for time-series data, yielding spuriously high correlation coefficients). A similar approach has been attempted before, by calculating the coherence between the ENV and mouth opening size, measured as the number of pixels enclosed by the lip contour or as the inter-lip distance (Chandrasekaran, Trubanova, Stillittano, Caplier, & Ghazanfar, 2009); however, the respective roles of jaw elevation and depression versus peripheral lip gestures could not be disentangled in that way (also see footnote 4). This study, instead, examined the role of the jaw movement alone.

Since the lower frequency components pertaining to the δ- and θ-oscillations are crucial for neurological speech processing in the auditory, visual and motor cortices, the jaw oscillation and the ENV should be coherent in these frequency ranges. The degree of such coherence is measurable as the percentage of the spectral integral bounded by the δ- or θ-band cutoffs out of the entire spectral integral of the jaw-env coherence (notated %δ and %θ; see Eq. (1) in §2.3). These two measures capture the relative amount of power shared by the jaw oscillation and the ENV in the frequency bands corresponding to the neuronal δ- and θ-samplings. Moreover, %δ and %θ are analyzed as a function of utterance length (§3), because the rhythmic structure is likely to evolve into a more complex pattern over time (a 5-sec utterance would intuitively have a more complex rhythmic structure than the 1-sec utterance "Hello!", which contains a single iamb). Higher %δ is expected to be associated with longer utterances, because more sizeable prosodic boundaries (including foot-sized timescales) may be included; for an utterance with a higher %δ, a smaller %θ is expected, because the total power of the jaw-env coherence is fixed, determined by the joint temporal amplitudes of the jaw oscillation and the ENV (in reference to Parseval's theorem of energy conservation).
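The notion of spectral coherence invoked above can be illustrated with a minimal sketch (not the paper's code): two synthetic signals that share a δ-band periodicity show high coherence at that frequency even though their time-domain correlation is weakened by a phase offset. This uses SciPy's Welch-averaged coherence estimator as a stand-in.

```python
# Sketch: magnitude-squared coherence between two noisy signals sharing a
# 2 Hz (delta-band) periodicity, illustrating why coherence in the
# frequency domain beats a simple time-domain correlation.
import numpy as np
from scipy.signal import coherence

fs = 80                       # Hz, matching the down-sampled rate used later
t = np.arange(0, 8, 1 / fs)   # an 8-sec "utterance"
rng = np.random.default_rng(0)

jaw = np.sin(2 * np.pi * 2 * t) + 0.3 * rng.standard_normal(t.size)
env = np.sin(2 * np.pi * 2 * t + 0.8) + 0.3 * rng.standard_normal(t.size)

f, Cxy = coherence(jaw, env, fs=fs, nperseg=128)
peak = f[1:][np.argmax(Cxy[1:])]   # strongest shared periodicity (DC excluded)
```

Despite the constant phase offset of 0.8 rad, the coherence at the shared 2 Hz component stays close to 1.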

The corpora
The mngu0 corpus (Richmond et al., 2011) contains one male English speaker producing over 1,000 utterances, of which 594 in the duration range of [2, 8] sec were chosen for the present study. The 2-sec cutoff ensured that at least one cycle of the lowest δ frequency (.5 Hz) was included; the 8-sec cutoff excluded sentences with medial pauses. The MOCHA-TIMIT corpus (Wrench, 1999) contains three English speakers (1 female, coded "fsew0"; 2 male, coded "maps0" and "msak0") producing the same set of 460 sentences. Altogether 5 sentences shorter than 2 sec were excluded; all utterances were shorter than 6 sec. For both corpora, an electromagnetic articulograph (Carstens AG500 for mngu0, AG100 for MOCHA-TIMIT) was used to record the kinematic trajectories of various articulators (at 200 Hz or 500 Hz temporal resolution) together with the audio speech signal (16-bit @ 16 kHz). All kinematic data were head-corrected and translated to a new Cartesian coordinate system in the midsagittal plane. Sensor history data from the lower incisor were used to represent the jaw movements in this study.

Calculating JAW-ENV coherence
Jaw-env coherences were calculated in three steps using Matlab® R2018b: (i) Obtaining the spectrum of the jaw oscillation function (the matrix FFT_JAW) for each utterance. First, the jaw oscillation time series was estimated as the Euclidean distances of the lower-incisor coordinates to zero (the vector d_JAW). To obtain FFT_JAW, a 512-point fast Fourier transform was applied to d_JAW, which had been offset-removed, down-sampled to 80 Hz, cosine-tapered (α = .1), and zero-padded. The magnitude of FFT_JAW was then linearly normalized to 1 arbitrary unit (arb'U, henceforth).
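Step (i) can be sketched as follows in Python (a reconstruction under stated assumptions, not the original Matlab code; the input coordinates and function name are hypothetical):

```python
# Sketch of step (i): jaw oscillation as the Euclidean distance of the
# lower-incisor coordinates to the origin, then offset removal,
# down-sampling to 80 Hz, cosine tapering (alpha = .1), zero-padding,
# and a 512-point FFT whose magnitude is normalized to 1 arb'U.
import numpy as np
from scipy.signal import resample, windows

def jaw_spectrum(xy, fs_in, fs_out=80, nfft=512, alpha=0.1):
    """xy: (n_samples, 2) lower-incisor coordinates in the midsagittal plane."""
    d_jaw = np.linalg.norm(xy, axis=1)            # Euclidean distance to zero
    d_jaw = d_jaw - d_jaw.mean()                  # offset removal
    n_out = int(round(len(d_jaw) * fs_out / fs_in))
    d_jaw = resample(d_jaw, n_out)                # down-sample to 80 Hz
    d_jaw = d_jaw * windows.tukey(n_out, alpha)   # cosine taper
    fft_jaw = np.fft.rfft(d_jaw, n=nfft)          # zero-pads up to 512 points
    mag = np.abs(fft_jaw)
    return mag / mag.max()                        # normalize to 1 arb'U
```

Fed a synthetic jaw trace with a 4 Hz opening cycle, the returned 257-bin magnitude spectrum (frequency granularity 80/512 ≈ .16 Hz) peaks near 4 Hz.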
(ii) Obtaining the spectrum of the speech ENV (the matrix FFT_ENV). First, a "beat" detection filter (Cummins & Port, 1998; Tilsen & Johnson, 2008) was applied to the speech signal (first-order Butterworth, center frequency = 1,000 Hz, bandwidth = 300 Hz) to keep the vocalic energy while removing the glottal energy and obstruent noise. The filtered signal was then full-wave rectified and further band-pass filtered (fourth-order Butterworth, center frequency = 5 Hz, bandwidth = 10 Hz) to obtain the ENV. To obtain FFT_ENV, the ENV was offset-removed, down-sampled to 80 Hz, cosine-tapered (α = .1), zero-padded, and supplied to a 512-point fast Fourier transform. The magnitude of FFT_ENV was then linearly normalized to 1 arb'U. The object obtained this way is called the beat histogram in music information retrieval (Lykartsis & Lerch, 2015).
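Step (ii) can be sketched similarly (again a reconstruction, not the original code; the 5 Hz-centered, 10 Hz-wide band-pass effectively passes 0-10 Hz and is implemented here as a low-pass, and the exact Matlab filter design may differ):

```python
# Sketch of step (ii): beat-detection band-pass around 1 kHz, full-wave
# rectification, low-frequency filtering to get the ENV, then the same
# down-sample/taper/zero-pad/FFT chain as for the jaw signal.
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample, windows

def env_spectrum(speech, fs, fs_out=80, nfft=512, alpha=0.1):
    # "beat" filter: first-order Butterworth, centre 1000 Hz, bandwidth 300 Hz
    sos_beat = butter(1, [850, 1150], btype="bandpass", fs=fs, output="sos")
    vocalic = sosfiltfilt(sos_beat, speech)
    rectified = np.abs(vocalic)                   # full-wave rectification
    # envelope filter: fourth-order Butterworth, 0-10 Hz passband
    sos_env = butter(4, 10, btype="lowpass", fs=fs, output="sos")
    env = sosfiltfilt(sos_env, rectified)
    env = env - env.mean()                        # offset removal
    n_out = int(round(len(env) * fs_out / fs))
    env = resample(env, n_out) * windows.tukey(n_out, alpha)
    mag = np.abs(np.fft.rfft(env, n=nfft))        # zero-padded 512-point FFT
    return mag / mag.max()                        # 1 arb'U
```

Fed a 1 kHz carrier amplitude-modulated at a syllable-like 3 Hz, the resulting spectrum peaks near 3 Hz.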
(iii) The jaw-env coherence (the matrix COH_JAW-ENV) was calculated as the Hermitian inner product of the Fourier coefficients in FFT_JAW and FFT_ENV, normalized to the individual powers of FFT_JAW and FFT_ENV (a code snippet from Cohen, 2017, was applied); negative frequencies were neglected. (The Hermitian inner product of two signals is simply the multiplication of the Fourier coefficients of the first signal by the complex conjugates, i.e. sign-changed imaginary parts, of the Fourier coefficients of the second; it reveals the covariance between the two signals in the frequency domain.) Figure 1 shows an example of calculating the jaw-env coherence from the spectra of the jaw oscillation and the speech ENV. This process captures the common periodicities in the two signals by evaluating their correlation in the frequency domain.
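Step (iii) can be sketched as follows (the segmenting scheme is my assumption; the paper applies a snippet from Cohen, 2017, whose details are not reproduced here):

```python
# Sketch of step (iii): coherence as the Hermitian inner product of the two
# spectra, normalized by the individual powers and averaged over short
# segments; negative frequencies are discarded by using the real-input FFT.
import numpy as np

def jaw_env_coherence(sig_a, sig_b, nfft=512, seg=160, hop=80):
    """Magnitude-squared coherence of two equally sampled signals."""
    cross = np.zeros(nfft // 2 + 1, dtype=complex)
    pow_a = np.zeros(nfft // 2 + 1)
    pow_b = np.zeros(nfft // 2 + 1)
    for start in range(0, min(len(sig_a), len(sig_b)) - seg + 1, hop):
        fa = np.fft.rfft(sig_a[start:start + seg] * np.hanning(seg), n=nfft)
        fb = np.fft.rfft(sig_b[start:start + seg] * np.hanning(seg), n=nfft)
        cross += fa * np.conj(fb)     # Hermitian inner product term
        pow_a += np.abs(fa) ** 2
        pow_b += np.abs(fb) ** 2
    return np.abs(cross) ** 2 / (pow_a * pow_b + 1e-12)
```

By the Cauchy-Schwarz inequality the result is bounded by 1, reaching it for perfectly coherent inputs.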

Calculating %δ and %θ in jaw-env coherence
Eq. (1) illustrates the conceptual calculation of %δ and %θ: the percentage of the spectral integral bounded by the δ-band cutoffs (f1 = .5 Hz, f2 = 3 Hz) or the θ-band cutoffs (f1 = 3 Hz, f2 = 9 Hz) over the entire spectral integral of the coherence function C(f) up to f_Nyq = 40 Hz. The Nyquist frequency (f_Nyq) of 40 Hz was chosen at the upper γ-band boundary, responsible for processing phonemes and smaller features. Empirically, the frequency granularity is df = 2 × f_Nyq (40 Hz) ÷ number of FFT points (512) = .16 Hz. Because of this frequency discretization, the coherence function C(f) is effectively the matrix COH_JAW-ENV, and the integrals (approximated using Riemann sums) can be calculated through iterations at the step size df in COH_JAW-ENV.
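Eq. (1) itself is not reproduced in this excerpt; from the description it can be reconstructed (an inference, not the original typesetting) as %X = 100 × ∫ from f1 to f2 of C(f) df ÷ ∫ from 0 to f_Nyq of C(f) df, with X ∈ {δ, θ}. A minimal numerical sketch of the Riemann-sum approximation:

```python
# Riemann-sum approximation of %delta / %theta from a discretized coherence
# spectrum C over frequencies f (here df = 2 * 40 / 512 = 0.15625 Hz).
import numpy as np

def band_percentage(C, f, f1, f2):
    df = f[1] - f[0]                               # frequency granularity
    total = np.sum(C) * df                         # full integral up to f_Nyq
    band = np.sum(C[(f >= f1) & (f < f2)]) * df    # integral over [f1, f2)
    return 100 * band / total
```

For a flat coherence spectrum, %δ comes out near (3 − .5)/40 × 100 = 6.25% and %θ near 15%, which is a quick sanity check of the discretization.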

DATA ANALYSES AND RESULTS
For the mngu0 data, simple linear regressions of %δ and %θ on utterance duration were performed using R. The utterance duration was right-skewed and was therefore natural-log transformed. Table 1 and Figure 2 illustrate the results: %δ increased as utterance duration increased, whereas %θ decreased as utterance duration increased, conforming to the expectation.
The MOCHA-TIMIT data were subsequently analyzed to examine whether consistent results would be obtained. Random-slope models were fitted by maximum likelihood (response variables: %δ and %θ; random effects: speaker and utterance; fixed effect: utterance length) using the R package lme4 (v1.1-21; Bates, Mächler, Bolker, & Walker, 2015). The significance of the slope estimates and of the between-speaker variability was tested in particular (see Table 2 and Figure 3): in general, a significant positive slope was found between %δ and utterance length, and a significant negative slope between %θ and utterance length. Moreover, individual differences were significant at the same time.
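The lme4 models are not reproduced here; an analogous sketch in Python with statsmodels, on synthetic data, gives the flavor of the analysis (simplified to a random intercept per speaker rather than the paper's random slopes; all column names and the generated effect sizes are my illustrative assumptions):

```python
# Sketch: mixed model of a band percentage on utterance duration with
# speaker as a grouping factor, fitted by maximum likelihood (reml=False).
# Synthetic data: a common positive slope plus speaker-specific intercepts.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
speakers = np.repeat(["fsew0", "maps0", "msak0"], 60)
dur = rng.uniform(2, 6, speakers.size)            # utterance length (sec)
base = pd.Series(speakers).map(
    {"fsew0": 20.0, "maps0": 24.0, "msak0": 30.0}).to_numpy()
pct_delta = base + 3.0 * dur + rng.normal(0, 1.0, speakers.size)

df = pd.DataFrame({"speaker": speakers, "dur": dur, "pct_delta": pct_delta})
result = smf.mixedlm("pct_delta ~ dur", df, groups=df["speaker"]).fit(reml=False)
slope = result.params["dur"]                      # fixed-effect slope estimate
```

With a strong built-in effect, the recovered fixed-effect slope sits near the true value of 3 and is significant at the paper's α = .01.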

DISCUSSION
This paper introduced a method to characterize speech rhythm using the spectral coherence between the jaw oscillation and the speech ENV, i.e. the jaw-env coherence. It provides a spectro-temporal representation of the common periodicities in the two signals. Two frequency bands corresponding to the brain's δ- and θ-oscillations were analyzed in terms of the percentage of power they account for in the jaw-env coherence, i.e. %δ and %θ. In general, utterance length was found to be a significant predictor of %δ and %θ, yet individual differences must not be neglected. (A more stringent α-level of .01 was chosen for the statistical analyses to reduce the chance of false-positive findings.)

Figure 1: The spectra of the jaw oscillation and the speech ENV (a); the jaw-env coherence calculated from these two spectra (b).

The findings have several implications: (i) The jaw oscillation and the speech ENV possess strong spectral coherence in the low-frequency bands of .5-3 Hz and 3-9 Hz. This upholds the role of jaw movement and the speech ENV in speech rhythmicity. The quasi-cyclical jaw movements constantly change the amount of radiated energy, corresponding to the lower modulation frequencies in the speech signal, to which the listener's auditory cortex entrains at the δ- and θ-rates (Doelling et al., 2014; Ghitza, 2017; Giraud & Poeppel, 2012; Strauß & Schwartz, 2017). The jaw movements also invite neuronal entrainment in the listener's visual cortex (Park et al., 2016). These entrainments play a useful role in speech processing and comprehension.
(ii) Both the .5-3 Hz and the 3-9 Hz band (pertaining to the δ- and θ-rates) are represented in the jaw-env coherence, but to different degrees as measured by %δ and %θ. This suggests that different levels of rhythmicity (including foot-sized and syllable-sized) are present simultaneously but to different extents. The amount of regularity at the larger timescale increases with utterance length for all speakers from the two corpora (Figures 2 and 3). It is possible that longer utterances are more likely to contain larger prosodic boundaries or more extreme intonational accents, which would increase the power pertaining to the δ-band. This may have a functional advantage: higher δ-rate regularity would facilitate sensory chunking of a longer utterance under neuronal δ-sampling (an example of sensory chunking is the use of temporal groupings when memorizing a series of digits or syllables; Boucher, Gilbert, & Jemel, 2019). Smaller units pertaining to faster rates (e.g. syllables, phonemes or even phonological features) could then be processed within each δ-window.

ª Note to Table 2: A likelihood-ratio test between the full model and the speaker-reduced model was used to test between-speaker variability. The AICs of the full models were smaller than those of the reduced models, suggesting that the full models had better fits. The χ² values were calculated as the differences between twice the −LogLik of the full and reduced models (i.e. the differences of the deviances).
(iii) Individual differences are conspicuous in %δ and %θ as a function of utterance duration. The amounts of regularity in the δ- and θ-bands are inversely related for the mngu0 speaker as well as for speaker "fsew0" in MOCHA-TIMIT (Figures 2 and 3), possibly because the δ-band already takes up the majority of the power in the jaw-env coherence in longer utterances, leaving little power for θ-band regularity. For speaker "msak0" in MOCHA-TIMIT, δ-band regularities were prominent even in short sentences (high intercept of "msak0" in Figure 3a), leaving little power for syllable-sized frequencies regardless of utterance length (low intercept and flat slope of "msak0" in Figure 3b). Nevertheless, to investigate individual differences fully, a substantially larger sample is required.
(iv) The results may also explain why early phoneticians (e.g. Abercrombie, 1967; Jones, 1922; Lloyd James, 1940; Pike, 1945), despite having undergone rigorous ear training, would still inaccurately describe languages such as English as possessing isochronous feet. A higher %δ may be a strong cue to foot-sized regularity in both the jaw oscillation and the speech temporal modulation. For all speakers analyzed in this study, a large amount of foot-sized regularity was found. It is likely that the early phoneticians discerned such foot-sized regularity in English, yet unfortunately described it in absolute terms as "stress-timed."

This study also has limitations: (i) In terms of data variance, all speakers in the MOCHA-TIMIT corpus showed larger variances than the mngu0 speaker (cf. Figures 2 and 3). This may be due to the data inconsistency issue of the MOCHA-TIMIT corpus: it has been demonstrated that even for a relatively stationary sensor at the velum, a tremendous amount of data inconsistency existed (Richmond, 2009; Richmond et al., 2011). Technical issues with the early generation of the electromagnetic articulograph may be the culprit (Richmond, 2009).
(ii) The two frequency bands analyzed in this study were informed by the low-frequency neuronal oscillations that have been shown to play a key role in the rhythmic parsing of speech. Beyond treating these two bands as pertaining to the stress rate or the syllable rate, further research is needed to assess whether these frequency cutoffs are justifiable in linguistic/phonetic terms.
(iii) The corpora adopted in this study were small in terms of the number of speakers, and only English was analyzed. This limits the generalizability of the findings.
Figure 3: Regression lines and the 99% confidence intervals (shaded areas) superimposed over the scatterplots showing the relationships between %δ and utterance duration (in sec) (a), and %θ and utterance duration (b), for each of the three speakers in the MOCHA-TIMIT corpus.

For future research, it is imperative to test the method using more speakers from different languages, including those traditionally labeled as "syllable-timed." That they have been described as "syllable-timed" may be due to a high degree of syllable-sized cyclicity in jaw oscillations and speech temporal modulations (measurable as a high %θ in the jaw-env coherence) even in longer sentences. So far, the coherence of the jaw oscillation and the ENV has been investigated on the basis of the power spectra. It will also be interesting to explore coherence based on the phase spectra of multi-domain signals, including acoustic, articulatory and neurological ones, to further probe their temporal relationships in constituting speech rhythmicity at both the production and perception levels.
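As a pointer to the phase-based extension, one common phase measure is the phase-locking value (PLV) between two band-limited signals via the Hilbert transform; the following is my illustrative sketch, not a pipeline used in the paper:

```python
# Phase-locking value (PLV) between two signals in a low-frequency band:
# band-pass both, extract instantaneous phase via the analytic signal,
# then take the magnitude of the mean phase-difference vector (1 = perfect
# locking, 0 = no consistent phase relation).
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def plv(x, y, fs, band=(0.5, 3.0)):
    sos = butter(2, band, btype="bandpass", fs=fs, output="sos")
    phx = np.angle(hilbert(sosfiltfilt(sos, x)))
    phy = np.angle(hilbert(sosfiltfilt(sos, y)))
    return np.abs(np.mean(np.exp(1j * (phx - phy))))
```

Unlike power-based coherence, the PLV is insensitive to amplitude covariation and captures only the consistency of the phase lag, which is what a phase-spectrum analysis of acoustic, articulatory and neurological signals would target.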