Lower modulation rates in the temporal envelope (ENV) of the acoustic signal are believed to form the rhythmic backbone of speech, facilitating speech comprehension through neuronal entrainment at the δ- and θ-rates (phonetically, these rates are comparable to the foot and syllable rates). The jaw plays the role of a carrier articulator, regulating mouth opening in a quasi-cyclical way whose physical consequence is the low-frequency modulation of the signal. This paper describes a method for examining the joint roles of jaw oscillation and the ENV in realizing speech rhythm using spectral coherence. The relative powers in the frequency bands corresponding to the δ- and θ-oscillations in the coherence (notated %δ and %θ, respectively) were quantified as one possible way of revealing the amount of concomitant foot- and syllable-level rhythmicity carried by both the acoustic and articulatory domains. Two English corpora (mngu0 and MOCHA-TIMIT) were used for the proof of concept. For an initial analysis, %δ and %θ were regressed on utterance duration. Results showed that the degrees of foot- and syllable-sized rhythmicity differ and are contingent upon utterance length.

The lowest modulation frequencies in the temporal envelope (ENV) of the acoustic signal are thought to constitute the rhythmic backbone of speech, facilitating its comprehension at the level of neuronal entrainment in terms of the δ and θ ranges (phonetically, these ranges are comparable to the metrical-foot and syllable ranges). The jaw functions as an articulator that regulates mouth opening in a quasi-cyclical way, which corresponds, as a physical consequence, to the low-frequency modulations. This article describes a method for examining the joint role of jaw oscillation and the ENV in the production of speech rhythm using spectral coherence. The relative powers in the frequency bands corresponding to the δ and θ oscillations in the coherence (notated respectively as %δ and %θ) were quantified as one possible way of revealing the amount of concomitant rhythmicity at the metrical-foot and syllable levels carried by the acoustic and articulatory domains. To put this idea to the test, two English corpora (mngu0 and MOCHA-TIMIT) were analyzed in this study. For a first analysis, %δ and %θ were regressed on utterance duration. The results showed that the degrees of foot- and syllable-level rhythmicity differ and depend on the length of the utterance.

This paper characterizes speech rhythm in terms of the spectral coherence between jaw oscillations and speech temporal envelopes (ENV, henceforth). Two frequency bands in the coherence spectrum, covering the neuronal δ- and θ-rates, were analyzed in terms of their relative contributions to the total coherence power. These bands have been claimed to correspond to the foot and syllable timescales in speech and have been demonstrated to play a crucial role in neurological speech processing via brainwave-to-ENV entrainment (e.g. Doelling, Arnal, Ghitza, & Poeppel).

Historically, phoneticians described the rhythm of the world's languages in terms of intuitively isochronous units: stress-timed vs. syllable-timed rhythm^{1} (or, metaphorically, Morse code vs. machine-gun rhythm^{2}) (e.g. Abercrombie).^{3} In terms of phonological theorization, the metrical grid can be constructed based on intuitive assessment of prominence values, exhibiting the rhythmic skeleton of an utterance (e.g. Liberman & Prince).

How did rhythm evolve in speech? From a Darwinian perspective, MacNeilage proposed that the cyclical opening and closing of the mandible provided the evolutionary frame of the syllable.

Speech rhythm is not evolutionarily redundant; it is functional in the neurological processing of the speech signal. The recurring oscillations in the ENV, which supposedly reflect the rhythmic frames, help the brain parse the incoming speech signal for comprehension. It has been demonstrated that the δ-oscillation (.5–3 Hz, corresponding to foot/stress rates) and the θ-oscillation (3–9 Hz, corresponding to syllable rates) in the auditory cortex entrain to the speech ENV at these modulation rates (Doelling et al.).

The motor knowledge of speech production is arguably indispensable in the neurological processing of speech signals (Strauß & Schwartz).^{4} Mouth movements help the listener visually access the rhythmic structure, like visual scaffolding. The jaw, as a carrier articulator, therefore plays an important role in both the production and the perception of speech rhythm; the temporal windows facilitating neuronal entrainment to the speech ENV must be discoverable in the jaw oscillation as well. However, the roles of the jaw and the ENV have been studied disjointedly: jaw displacement has been shown to explain the metrical structure of the utterance well (Erickson, Suemitsu, Shibuya, & Tiede).

We thus propose to characterize speech rhythm using the spectral coherence between the jaw oscillation and the speech ENV (hereinafter, jaw-env coherence).^{5} A similar approach has been attempted, though, by calculating the coherence between the ENV and mouth-opening size, measured as the number of pixels enclosed by the lip contour or as the inter-lip distance (Chandrasekaran, Trubanova, Stillittano, Caplier, & Ghazanfar).

Since the lower frequency components pertaining to the δ- and θ-oscillations are crucial for neurological speech processing in the auditory, visual and motor cortices, the jaw oscillation and the ENV should be coherent in these frequency ranges. The degree of such coherence is measurable as the percentage of the spectral integral bounded by the δ- or θ-band cutoffs out of the entire spectral integral of the jaw-env coherence (notated %δ and %θ; see Eq. (1) in §2.3). These two measures capture the relative amount of power shared by the jaw oscillation and the ENV in terms of regularities at the frequency bands corresponding to the neuronal δ- and θ-samplings. Moreover, %δ and %θ are analyzed as a function of utterance length (§3), because the rhythmic structure is likely to evolve into a more complex pattern over time (a 5-sec utterance would intuitively have a more complex rhythmic structure than a 1-sec utterance such as "Hello!", which contains a single iamb). Higher %δ is expected to be associated with longer utterances, because more sizeable prosodic boundaries (including foot-sized timescales) may be included; for an utterance with higher %δ, a smaller %θ is expected, because the total power of the jaw-env coherence is fixed, determined by the joint temporal amplitudes of the jaw oscillation and the ENV (in reference to Parseval's theorem of energy conservation).

The mngu0 corpus (Richmond et al.) and the MOCHA-TIMIT corpus were used in this study.

Jaw-env coherences were calculated following three steps using Matlab® R2018b:

Obtaining the spectra of the jaw oscillation functions (the matrix S_{JAW}) for each utterance. First, the jaw oscillation time series were estimated as the Euclidean distances of the lower-incisor coordinates to zero (the vector x_{JAW}). To obtain S_{JAW}, a 512-point fast Fourier transform was applied to x_{JAW}, which had been offset-removed, down-sampled to 80 Hz, cosine-tapered (α = .1), and zero-padded. The magnitude of S_{JAW} was then linearly normalized to 1 arbitrary unit (arb'U, henceforth).
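The pre-processing and spectral steps above can be sketched as follows (a Python/NumPy sketch of the Matlab pipeline; the function and variable names are ours, and the input is assumed to be already down-sampled to 80 Hz):

```python
import numpy as np

N_FFT = 512   # FFT length used in this study
FS = 80.0     # analysis sampling rate after down-sampling (Hz)

def cosine_taper(n, alpha=0.1):
    """Tukey (cosine-tapered) window with taper fraction alpha."""
    w = np.ones(n)
    edge = int(np.floor(alpha * (n - 1) / 2.0))
    if edge > 0:
        t = np.arange(edge + 1)
        ramp = 0.5 * (1.0 + np.cos(np.pi * (2.0 * t / (alpha * (n - 1)) - 1.0)))
        w[:edge + 1] = ramp
        w[n - edge - 1:] = ramp[::-1]
    return w

def magnitude_spectrum(x, n_fft=N_FFT, alpha=0.1):
    """Offset removal, cosine tapering, zero-padded FFT, and linear
    normalization of the magnitude to 1 arb'U."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                        # offset removal
    x = x * cosine_taper(len(x), alpha)     # cosine taper (alpha = .1)
    mag = np.abs(np.fft.rfft(x, n=n_fft))   # zero-padded 512-point FFT
    return mag / mag.max()                  # linear normalization to 1 arb'U

# The jaw oscillation time series itself would be, e.g.:
# x_jaw = np.linalg.norm(lower_incisor_xy, axis=1)  # Euclidean distance to zero
```

The same spectral routine applies unchanged to the speech ENV in the next step.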

Obtaining the spectrum of the speech ENV (the matrix S_{ENV}). First, a "beat" detection filter (Cummins & Port) was applied to the acoustic signal to extract the ENV. To obtain S_{ENV}, the ENV was offset-removed, down-sampled to 80 Hz, cosine-tapered (α = .1), zero-padded, and supplied to a 512-point fast Fourier transform. The magnitude of S_{ENV} was then linearly normalized to 1 arb'U. The object obtained this way is called the beat histogram in music information retrieval (Lykartsis & Lerch).
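The exact beat detection filter is not reproduced here; a minimal stand-in, assuming simple full-wave rectification followed by moving-average smoothing, could look like this (Python sketch; the cutoff value is an illustrative assumption):

```python
import numpy as np

def beat_envelope(signal, fs, cutoff=10.0):
    """Crude amplitude-envelope extraction: full-wave rectification followed
    by a moving-average low-pass with an approximately cutoff-Hz window.
    A stand-in for the beat detection filter, not its exact implementation."""
    rect = np.abs(np.asarray(signal, dtype=float))   # full-wave rectification
    win = max(1, int(round(fs / cutoff)))            # smoothing window length
    kernel = np.ones(win) / win
    return np.convolve(rect, kernel, mode="same")    # smoothed envelope (ENV)
```

The resulting ENV would then be down-sampled to 80 Hz and passed through the same spectral routine as the jaw signal.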

The jaw-env coherence (the matrix C_{JAW-ENV}) was calculated as the Hermitian inner product^{6} of the Fourier coefficients in S_{JAW} and S_{ENV}, normalized to the individual powers of S_{JAW} and S_{ENV} (following a code snippet in Cohen).
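This computation can be sketched as follows (Python rather than the original Matlab; it assumes the spectra are stored as segments × frequency matrices of complex Fourier coefficients):

```python
import numpy as np

def jaw_env_coherence(spec_jaw, spec_env):
    """Magnitude-squared coherence per frequency bin. The numerator is the
    Hermitian inner product of the Fourier coefficients across segments;
    the denominator normalizes by the individual powers, so each value
    lies between 0 and 1."""
    num = np.abs(np.sum(spec_jaw * np.conj(spec_env), axis=0)) ** 2
    den = (np.sum(np.abs(spec_jaw) ** 2, axis=0)
           * np.sum(np.abs(spec_env) ** 2, axis=0))
    return num / den
```

A signal is perfectly coherent with itself (coherence 1 at every bin), while two independent signals yield coherence near zero as the number of segments grows.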

Eq. (1) illustrates the conceptual calculation of %δ and %θ: the percentage of the spectral integral bounded by the δ-band cutoffs (f_{1} = .5 Hz, f_{2} = 3 Hz) or the θ-band cutoffs (f_{1} = 3 Hz, f_{2} = 9 Hz) over the entire spectral integral of the coherence function from 0 Hz to f_{Nyq} = 40 Hz. The Nyquist frequency (f_{Nyq}) of 40 Hz was arbitrarily chosen at the upper γ-band boundary, responsible for processing phonemes and smaller features. Empirically, the frequency granularity (Δf) is the sampling rate (80 Hz) ÷ FFT points (512) = .16 Hz. Because of the frequency discretization, the coherence function C_{JAW-ENV} is discrete; the integrals (approximated using Riemann sums) can therefore be calculated through iterations at the step size of Δf over C_{JAW-ENV}.
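The Riemann-sum computation of Eq. (1) can be sketched as follows (Python; it assumes the coherence values are stored in a vector sampled at the step size Δf ≈ .16 Hz, so the common factor Δf cancels out of the ratio):

```python
import numpy as np

DF = 80.0 / 512  # frequency granularity: sampling rate / FFT points ≈ .16 Hz

def band_percentage(coh, f1, f2, df=DF):
    """Percentage of the Riemann-sum integral of the coherence between the
    band cutoffs f1 and f2, relative to the integral over 0 Hz..f_Nyq."""
    freqs = np.arange(len(coh)) * df
    band = (freqs >= f1) & (freqs < f2)
    return 100.0 * np.sum(coh[band]) / np.sum(coh)

# %δ: band_percentage(coherence, .5, 3.0)
# %θ: band_percentage(coherence, 3.0, 9.0)
```

For a flat coherence spectrum, %δ reduces to the bandwidth ratio (3 − .5) / 40 ≈ 6%, which makes the measure easy to sanity-check.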

The spectra of the jaw oscillation and the speech ENV (a); the JAW-ENV coherence calculated from the spectra of the jaw oscillation and the speech ENV (b).

For the mngu0 data, simple linear regressions of %δ and %θ on utterance length were performed using R. The utterance duration was right-skewed and hence natural-log transformed.

Results of linear regression analyses for the mngu0 data.

| Model (Y ~ X) | F (DoFs) | p | β | 99% CI | \|t\| |
|---|---|---|---|---|---|
| %δ ~ ln(utterance duration) | 881.5 (1, 592) | ≪ .01 | 30.66 | 28.00, 33.32 | > 2.576 |
| %θ ~ ln(utterance duration) | 545.5 (1, 592) | ≪ .01 | –24.44 | –27.14, –21.75 | > 2.576 |

F (DoFs) and p pertain to the F-test of overall significance; β, 99% CI and |t| pertain to the t-test of the estimated slope.

Regression lines and the 99% confidence intervals (shaded areas) superimposed over the scatterplots showing the relationships between %δ and log utterance duration (a), and %θ and log utterance duration (b) in the mngu0 corpus. Log duration vis-à-vis linear duration at the abscissa tick marks in both subplots: .75 ln(sec) ⇌ 2.12 sec, 1.0 ln(sec) ⇌ 2.72 sec, 1.25 ln(sec) ⇌ 3.49 sec, 1.5 ln(sec) ⇌ 4.48 sec, 1.75 ln(sec) ⇌ 5.75 sec, and 2.0 ln(sec) ⇌ 7.39 sec.

The MOCHA-TIMIT data were subsequently analyzed to examine whether consistent results would be obtained. Random-slope models were fitted by maximum likelihood (response variables: %δ and %θ; random effects: speaker and utterance; fixed effect: utterance length) using R.

Results of random-slope models for the MOCHA-TIMIT data.

| Response variable | β | 99% CI | \|t\| | AIC (full; reduced) | –LogLik (full; reduced) | χ² (DoF) | p |
|---|---|---|---|---|---|---|---|
| %δ | 8.38 | 5.49, 11.27 | > 2.576 | 10121; 10510 | 5248.8; 5051.5 | 394.47 (3) | ≪ .01 |
| %θ | –2.89 | –5.16, –.62 | > 2.576 | 10038; 10363 | 5009.7; 5175.7 | 331.92 (3) | ≪ .01 |

β, 99% CI and |t| pertain to the fixed effect of utterance length; AIC, –LogLik, χ² (DoF) and p pertain to the random effect of speaker.^{a}

^{a} A likelihood-ratio test between the full model and the speaker-reduced model was used to assess between-speaker variability.

The AICs of the full models were smaller than those of the reduced models, suggesting that the full models fitted better. The χ² values were calculated as twice the difference between the –LogLik of the full and reduced models (i.e. the difference of the deviances).
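As a quick arithmetic check, the reported χ² statistics can be recovered from the tabled –LogLik values (the small discrepancies reflect rounding in the table):

```python
# χ² as the difference of deviances: twice the difference between the
# -LogLik values of the full and the reduced model (numbers from the table).
chi2_delta = 2 * abs(5248.8 - 5051.5)  # %δ model; reported: 394.47
chi2_theta = 2 * abs(5175.7 - 5009.7)  # %θ model; reported: 331.92
```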

Regression lines and the 99% confidence intervals (shaded areas) superimposed over the scatterplots showing the relationships between %δ and utterance duration (in sec) (a), and %θ and utterance duration (b) for each of the three speakers in the MOCHA-TIMIT corpus.

This paper introduced a method to characterize speech rhythm using spectral coherence between jaw oscillation and the speech ENV, i.e. the jaw-env coherence. It provides a spectro-temporal representation of the common periodicities in both signals. Two frequency bands corresponding to the brain δ- and θ-oscillations were analyzed in terms of the percentage of power accounted for by these two bands in jaw-env coherence, i.e. %δ and %θ. In general, utterance length was found to be a significant predictor of %δ and %θ, yet individual differences must not be neglected. The findings have several implications:

The jaw oscillation and the speech ENV possess strong spectral coherence in the low-frequency bands of .5–3 Hz and 3–9 Hz. This upholds the roles of jaw movement and the speech ENV in speech rhythmicity. The semi-cyclical jaw movements constantly change the amount of radiated energy, giving rise to the lower modulation frequencies in the speech signal, to which the auditory cortex of the listener entrains at the δ- and θ-rates (Doelling et al.).

Both the .5–3 Hz and 3–9 Hz bands (pertaining to the δ- and θ-rates) are represented in the jaw-env coherence, but to different degrees as measured by %δ and %θ. This suggests that different levels of rhythmicity (including foot-sized and syllable-sized) are present simultaneously but differ in degree. The amount of regularity at the larger timescale increases as utterance length increases for all speakers from the two corpora.

Individual differences are conspicuous in %δ and %θ as a function of utterance duration. The amounts of regularity in the δ- and θ-bands are inversely proportional for the mngu0 speaker as well as for speaker "fsew" in MOCHA-TIMIT.

The results may also explain why early phoneticians (e.g. Abercrombie) intuitively classified languages into stress-timed and syllable-timed rhythm classes.

This study has limitations too:

In terms of data variance, all speakers in the MOCHA-TIMIT corpus showed bigger variances than the mngu0 speaker (cf. the scatterplots above).

The two frequency bands analyzed in this study were informed by the low-frequency neuronal oscillations that have been shown to play a key role in rhythmic parsing during speech processing. Apart from considering these two bands as pertaining to the stress rate or the syllable rate, further research is needed to assess whether these frequency cutoffs are justifiable in linguistic/phonetic terms.

The corpora adopted in this study were small in terms of the number of speakers, and only English was analyzed. This reduces the generalizability of the study.

For future research, it is imperative to test the method using more speakers from different languages, including those traditionally labeled as "syllable-timed." That they have been described as "syllable-timed" may be due to a high degree of syllable-sized cyclicity in jaw oscillations and speech temporal modulations (measurable as high %θ in the jaw-env coherence) even in longer sentences. So far, the coherence of the jaw oscillation and the ENV has been investigated based on the power spectra. It would also be interesting to examine the coherence based on the phase spectra of multi-domain signals, including acoustic, articulatory and neurological ones, to further explore their temporal relationships in constituting speech rhythmicity at both the production and perception levels.

This study was supported by the Forschungskredit of the University of Zurich (Grant FK-19-069 to YZ and Grant FK-20-078 to LH). It also benefited from a completed project funded by the Swiss National Science Foundation (Grant P2ZHP1_178109 to LH). We thank Alejandra Pesantez for her great help with the Spanish abstract.

Quintessential “stress-timed” languages include the Germanic languages, and “syllable-timed” languages, the Romance languages.

Arthur Lloyd James illustrated the "Morse code" rhythm of English to a foreign student whose native language was Sinhalese in a historical film.

The crux of all these approaches is the consensus of revealing different rhythmicities through different forms of variability in the speech signal. Variability can be quantified via different physical quantities, such as duration (e.g. Dellwo).

Park et al.

In fact, simple correlations for time-series data are problematic in general, yielding spuriously high correlation coefficients.

The Hermitian inner product of two signals (in this case, the jaw oscillation and the speech ENV) is simply the multiplication of the Fourier coefficients of the first signal with the complex conjugates (sign change of the imaginary part) of the Fourier coefficients of the second. It reveals the covariance between the two signals in the frequency domain.
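A minimal numerical illustration (NumPy; the signals are random stand-ins for the jaw oscillation and the ENV):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.fft.rfft(rng.standard_normal(64))  # Fourier coefficients, signal 1
y = np.fft.rfft(rng.standard_normal(64))  # Fourier coefficients, signal 2

# Hermitian inner product per frequency bin: coefficients of the first signal
# times the complex conjugates of the second (frequency-domain covariance).
hermitian = x * np.conj(y)

# With a signal and itself, the product reduces to the real, nonnegative
# power spectrum |x|^2.
self_prod = x * np.conj(x)
```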

A more stringent α-level (= .01) was chosen in statistical analyses to reduce the chance of false positive findings.