Do speakers converge rhythmically? A study on segmental timing properties of Grison and Zurich German before and after dialogical interactions

ABSTRACT: This paper reports the results of a study investigating whether rhythmic features, in terms of segmental timing properties, are subject to speakers' adjustments after exposure to a conversational partner. In the context of dialects in contact, this is crucial for understanding whether rhythmic attributes may bring about language variation and change. In the context of human-machine interaction, it can benefit the design of spoken dialogue systems that aim at human-likeness. To study rhythmic accommodation, we selected a corpus of pre- and post-dialogue recordings performed by 18 speakers of Grison and Zurich German (henceforth GRG and ZHG), two Swiss German dialects characterised by noticeable segmental and suprasegmental differences. To quantify rhythmic convergence, we designed three measures based on the segmental timing differences between the two dialects. We compared the Euclidean distances in the three measures between the GRG and ZHG speakers in a pair before and after two interactions. Results reveal that dyad members do not significantly shift their production of segmental timing features after the dialogues. Neither linguistic nor social factors can account for the observed accommodation pattern. Cross-dialectal segmental timing differences, as captured by the three ratio measures, may be either robust against the influence of the interlocutor's acoustic behaviour or too subtle to be perceived or retained after the interactions.


INTRODUCTION
The way an individual speaks is highly idiosyncratic as it is largely determined by his/her anatomy, sex, age, language background, social status and health conditions (Dellwo et al., 2007).
During social interactions, however, the way individuals sound is also influenced by the characteristics of the interlocutor (e.g., age, dialect, social status), the formality of the communicative setting (formal vs informal) and the quality of background conditions (noisy vs quiet) (Giles & Ogay, 2007). When we address infants rather than adults, for example, we typically speak more slowly, use longer pauses, exaggerate pitch variations and hyper-articulate vowels (see a.o., Fernald et al., 1989; Soderstrom, 2007). Most of these acoustic characteristics, which are used to gain an infant's attention and to facilitate language acquisition, are also present when talking to elderly people (Kemper, 1994) and, to some extent, to second language speakers (Ferguson, 1975), or when the interaction takes place in a noisy environment (Hazan & Baker, 2011), where they foster comprehension.
For the domain of human-human communication, two major theoretical models have been proposed to account for interspeaker adjustments: the social approach of the Communication Accommodation Theory (CAT) (e.g., Giles et al., 1991; Shepard et al., 2001) and the automatic account of the Interactive Alignment Model (IAM) proposed by Pickering & Garrod (2004). The former postulates that speakers express social closeness to or distance from their interlocutors by becoming acoustically more similar (convergence) or more dissimilar (divergence), respectively (Soliz & Giles, 2016). The latter, instead, assumes that convergence in conversation is regulated by a priming mechanism based on the automatic link between perception and production. Evidence in support of CAT can be found in studies showing that social factors, including speakers' perceived friendliness, dominance and attractiveness, or attitudes and stereotypes towards a specific language variety (e.g., Babel et al., 2013; 2014; Schweitzer & Lewandowski, 2014; Michalsky & Schoormann, 2017; Gregory & Webster, 1996), affect the amount and direction of accommodation. IAM, instead, is supported by a line of studies documenting convergence in non-interactive settings (e.g., shadowing tasks) in which participants are not instructed to imitate the model talker or are explicitly requested to avoid imitation (e.g., Goldinger, 1998; Shockley et al., 2004; Walker & Campbell-Kibler, 2015; cf. Dufour & Nguyen, 2013, for a comparison between imitation and shadowing tasks).
Studies on phonetic convergence, however, have pointed out the influence of factors other than social ones on speakers' accommodation behaviour. It has indeed been observed that individuals vary greatly in the amount and direction of convergence depending on the frequency characteristics of the lexical items (Goldinger, 1998; Goldinger & Azuma, 2004; Nielsen, 2011), previous exposure to lexical items (Goldinger, 1998; Goldinger & Azuma, 2004), the cognitive load involved in a task (Abel & Babel, 2017), and the phonetic distance between the interlocutors' language repertoires (Babel, 2012; Walker & Campbell-Kibler, 2015; Walters et al., 2013). The effect of linguistic and phonetic factors is not accounted for by either IAM or CAT. A now more dominant view, which reconciles the social and the automatic perspectives and integrates the effect of linguistic-phonetic factors on accommodation, is the so-called hybrid approach (Babel, 2012; Pardo, 2012; Pardo et al., 2017). In this view, social, linguistic and phonetic factors are seen as catalysts or inhibitors of convergence in that they can boost or diminish the strength of the link between perception and production.
The aim of the present paper is to advance the understanding of the forms of convergence and the factors that evoke it, shifting attention from the typical acoustic correlates of phonetic convergence (i.e., vowel quality, rate, pitch, intensity, voice onset time) to speech rhythm, conceptualised here as the variability of segmental durational characteristics. Rhythmic convergence is studied using a pre-existing dataset designed to study cross-dialectal vowel convergence (cf. 2.1.). This will ultimately make it possible to compare the accommodation behaviour of the same speakers across different measures, and to test whether linguistic factors (cross-dialectal phonetic distance) or social factors (dialect markedness) drive convergence or divergence.

Rhythmic Accommodation
Three basic questions may arise when studying rhythmic accommodation: (a) Can speech rhythm in terms of segmental timing properties be object of adjustments between speakers? (b) In which communicative contexts is it possible to study rhythmic accommodation? and (c) Why is the research on accommodation in segmental timing a worthwhile pursuit?
With respect to (a), speech rhythm research has provided evidence that the durational characteristics of consonantal and vocalic intervals, as well as amplitude envelope characteristics, vary in response to the interlocutor's age and cognitive development. For example, studies on the rhythmic characteristics of infant- compared to adult-directed speech have shown that: (i) English, Catalan and Spanish mothers present less durational variability of consonantal and vocalic intervals, as well as longer vowel durations, when speaking to their children than when addressing adults (Payne et al., 2009); (ii) in Australian English, delta modulations corresponding to prosodic stress are greater in infant- than in adult-directed speech, while theta modulations, which track syllable patterns, dominate the adult-directed speech modulation spectrum (Leong et al., 2017). Not only does speech rhythm vary depending on the interlocutor's characteristics, but the very presence of an interlocutor (i.e., a reading partner) has been shown to influence the degree of rhythmic entrainment in synchronous reading tasks (Cerda-Oñate et al., 2021). In light of these findings, it seems plausible to assume that speakers can also mutually adapt the production of segmental timing features after exposure to a dialogue partner. On the other hand, in view of evidence showing that the timing properties of different speech intervals (e.g., consonants, vowels, voicing) are resistant to different sources of within-speaker variability (speaking style, prosodic and linguistic factors) (Dellwo et al., 2015; Leeman et al., 2014), we cannot exclude that speakers maintain their segmental durational characteristics in post-dialogue productions. The present study tests precisely these two competing hypotheses.
With respect to (b), one of the contexts in which the study of rhythmic accommodation is possible is that of dialects in contact. In this setting, one might examine whether speakers of dialects that are mutually intelligible but present distinct rhythmic features converge rhythmically after being exposed to each other's dialect. In this respect, the linguistic situation of German-speaking Switzerland is an excellent testing ground for studying cross-dialectal rhythmic accommodation. Swiss German dialects, indeed, differ not only in segmental features, speech rate and intonation contours (see Leeman, 2012 for a review), but also in their rhythmic properties. It has been documented that Midland vs Alpine dialects, as well as Eastern vs Western dialects, can be grouped according to their rhythmic characteristics, measured acoustically in terms of the timing variability of consonantal and vocalic intervals (Leeman et al., 2012).
With respect to (c), it has been argued that assessments of phonetic convergence based on a single (supra)segmental feature hardly capture the complexity of the phenomenon (Pardo et al., 2017). Nevertheless, choosing one acoustic attribute over another is still a valid approach when the understanding of the dynamics of sound variation and change is at stake (Pardo et al., 2017), or when decisions must be taken about which aspects of human-human interaction should be modelled in interactive speech systems to achieve human-likeness (Beňuš, 2014). Understanding whether rhythmic properties, in terms of segmental durational characteristics, are subject to mutual adaptation can also be crucial for the interpretation of evidence in forensic phonetic speaker comparisons. Any acoustic adjustment between interlocutors might lead one to mistake within-speaker for between-speaker variability and produce higher error rates in speaker recognition.

Material
To study rhythmic accommodation in a dialect contact situation, we used a corpus of speech material in Zurich and Grison German (henceforth ZHG and GRG), two Swiss German dialects exhibiting crucial segmental and suprasegmental differences (cf. 2.2.) that legitimate the assumption of interspeaker adjustments after exposure to the interlocutor's dialect.
The corpus was designed, collected and annotated by Hanna Ruch to study vowel accommodation between GRG and ZHG (Ruch, 2015). It included speech samples of:
• 2 audio-recorded diapix tasks (i.e., speakers comparing pictures that contain a certain number of differences, cf. Van Engen et al., 2010) performed by 18 pairs of previously unacquainted GRG and ZHG female speakers.
• 18 pre- and 18 post-dialogue recordings (picture naming task and retelling a story based on a comic), performed individually by GRG and ZHG participants.
The diapix tasks were designed to elicit the target words present in the picture naming task and in the story retelling. All tasks were carried out in a single recording session.

Cross-dialectal phonetic differences
Grison and Zurich German present noticeable differences at several linguistic levels (Eckhardt, 1991; Fleischer & Smith, 2006; Christen et al., 2010; Leeman, 2012). Phonetically, these concern the quality of front vowels, the realization of word-initial and post-vocalic k, speech rate and intonation contours. Of interest for the purpose of this study is that GRG and ZHG also exhibit segmental durational differences that lead to a distinct rhythmic organisation of the two dialects. As reported in the literature on acoustic differences between GRG and ZHG (see a.o. Ruch, 2018), these differences concern: a) intervocalic sonorant gemination (henceforth ISG) in words ending in -e; b) open syllable lengthening (henceforth OSL); c) vowel reduction in word-final position (henceforth RedVow).
Given that segmental timing properties are among the acoustic correlates of speech rhythm, in this paper we will refer to the three cross-dialectal differences in ISG, OSL and RedVow as rhythmic differences. Regarding ISG, GRG intervocalic sonorants can be realized either as geminates or as single consonants, while ZHG allows only the singleton realisation. As for OSL, in GRG open syllables can be either lengthened or not, while in ZHG the lengthening tendency has not been documented. With respect to RedVow, in GRG vowels in word-final position are not reduced in quality, and presumably not in duration either, while in ZHG word-final vowels are always reduced (cf. Table 1 for examples of cross-dialectal realizations of ISG, OSL and RedVow). Evidence that the differences in the quality of final vowels also come with distinct timing patterns has been provided in Leeman et al. (2012). There it was shown that the durational variability of vocalic intervals was higher in Midland dialects (to which ZHG belongs) than in the group of Alpine dialects (to which GRG belongs), and this was interpreted in view of the tendency of Alpine dialects to retain full vowels in unstressed position.

Method
To understand whether pairs of GRG and ZHG speakers produce the rhythmic features more similarly after participating in the diapix tasks, the following steps were taken:
• From the pre- and post-dialogue recordings of individual speakers, we extracted the lexical items instantiating the three target rhythmic features (ISG, OSL and RedVow).
• For every item, we measured the duration of individual segments. The raw measures of segment duration served as a basis for the calculation of three ratio measures designed ad hoc to capture inter-dialectal differences in ISG, OSL and RedVow:
- For ISG, we calculated the ratio between the duration of intervocalic sonorants (l, n) in -CCe words (e.g., Sonne, Welle) and that of the corresponding sonorant in -Ce words (l or n from the item Melone).
- For OSL, we calculated the ratio between the duration of stressed vowels in open syllables and that of unstressed vowels within the same item.
- For RedVow, we calculated the ratio between the duration of stressed vowels in open and closed syllables and that of the unstressed vowel within the same item.
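The three ratio measures described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the millisecond unit and the example durations are hypothetical and are not taken from the corpus or from the authors' scripts.

```python
def duration_ratio(numerator_ms: float, denominator_ms: float) -> float:
    """Ratio of two segment durations (hypothetical helper, unit: ms)."""
    if denominator_ms <= 0:
        raise ValueError("denominator duration must be positive")
    return numerator_ms / denominator_ms


def isg_ratio(sonorant_cce_ms: float, sonorant_ce_ms: float) -> float:
    # ISG: sonorant in a -CCe word (e.g. Sonne) over the corresponding
    # sonorant in a -Ce word (e.g. n in Melone).
    return duration_ratio(sonorant_cce_ms, sonorant_ce_ms)


def stressed_unstressed_ratio(stressed_v_ms: float, unstressed_v_ms: float) -> float:
    # OSL and RedVow: stressed vowel over unstressed vowel within the same
    # item; the two measures differ only in which items are included.
    return duration_ratio(stressed_v_ms, unstressed_v_ms)


# Invented durations, for illustration only.
print(isg_ratio(120.0, 80.0))                   # sonorant in Sonne vs n in Melone
print(stressed_unstressed_ratio(150.0, 100.0))  # stressed vs word-final vowel
```

A ratio above 1 means that the numerator segment is longer than the reference segment, so higher ISG or OSL values correspond to stronger gemination or lengthening.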
To determine whether pairs of GRG and ZHG speakers converge, diverge or maintain their rhythmic behaviour after the interaction, we calculated:
• the Euclidean distance within individual pairs in the three ratio measures in pre- and post-dialogue recordings (i.e., dist1 = GRG pre - ZHG pre; dist2 = GRG post - ZHG post);
• the difference between the distance separating the two speakers' productions of a word before the dialogues (dist1) and the distance after the dialogues (dist2).
Accommodation within a pair (DDpair) was calculated as follows: DDpair = dist2 - dist1. A negative difference in distance is evidence of convergence, a positive value indicates divergence, and a value of 0 indicates maintenance.
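As a minimal sketch of this computation, assume that each speaker contributes one ratio value per item, so that the Euclidean distance reduces to an absolute difference. The function names, the tolerance parameter and the example values are hypothetical.

```python
def pair_distance(grg_ratio: float, zhg_ratio: float) -> float:
    # Euclidean distance between two scalar ratio values,
    # which in one dimension is the absolute difference.
    return abs(grg_ratio - zhg_ratio)


def dd_pair(grg_pre: float, zhg_pre: float,
            grg_post: float, zhg_post: float) -> float:
    dist1 = pair_distance(grg_pre, zhg_pre)    # before the dialogues
    dist2 = pair_distance(grg_post, zhg_post)  # after the dialogues
    return dist2 - dist1


def classify(dd: float, tol: float = 1e-9) -> str:
    # Negative: convergence; positive: divergence; zero: maintenance.
    # tol is an assumed numerical tolerance, not a statistical criterion.
    if dd < -tol:
        return "convergence"
    if dd > tol:
        return "divergence"
    return "maintenance"


# Invented ratio values for one pair and one item.
dd = dd_pair(grg_pre=2.0, zhg_pre=1.0, grg_post=1.5, zhg_post=1.25)
print(dd, classify(dd))  # -0.75 convergence
```

Here the pair is 1.0 apart before the dialogues and 0.25 apart afterwards, so DDpair is negative and the item counts as a case of convergence. Whether such a shift is reliable across pairs and items is a matter for the statistical models, not for this per-item label.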

Data analysis and statistics
The present study reports on the data extracted from the picture naming task (henceforth PNT). In view of evidence showing the influence of linguistic factors on accommodation (Goldinger, 1998; Goldinger & Azuma, 2004; Nielsen, 2011), analysing the PNT data has the main advantage of controlling for the effect of item variability in the assessment of: a. cross-dialectal differences before the interactions; b. differences in distance between the ZHG and GRG speakers in a pair before and after the interaction.
The lexical items used in this study and the dialectal features they instantiate are listed in Table 2 in Standard German spelling. To test (a), i.e., whether pairs of GRG and ZHG speakers realised the three durational contrasts differently before the interaction, and thus to make sure that there was room for rhythmic accommodation, we ran three separate Linear Mixed Effects Models with the ratio measures (ISG, OSL and RedVow) as dependent variables, dialect (ZHG and GRG) as fixed factor, and speaker and lexical item as random effects (i.e., random intercepts).
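Under the description above, each of the three models for (a) can be written in the following generic random-intercept form. The notation is an assumption introduced here for illustration and does not appear in the original analysis: y_{si} is the ratio value of speaker s for item i, Dialect_s is an indicator of the speaker's dialect, and u_s and w_i are the random intercepts for speaker and item.

```latex
\begin{align*}
y_{si} &= \beta_0 + \beta_1\,\mathrm{Dialect}_s + u_s + w_i + \varepsilon_{si},\\
u_s &\sim \mathcal{N}(0,\sigma_u^2),\qquad
w_i \sim \mathcal{N}(0,\sigma_w^2),\qquad
\varepsilon_{si} \sim \mathcal{N}(0,\sigma^2).
\end{align*}
```

In this formulation, a significant main effect of Dialect corresponds to a coefficient \beta_1 that differs reliably from zero once speaker and item variability are taken into account.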
In the light of the segmental durational differences between GRG and ZHG mentioned above, we make the following hypotheses regarding the rhythmic behaviour of ZHG and GRG speakers before the interaction:
• The ISG contrast is higher in GRG than in ZHG, given that in GRG intervocalic sonorants can also be pronounced as geminates, while in ZHG they are pronounced only as single consonants.
• The OSL contrast is higher in GRG than in ZHG, given that in GRG open syllables can be lengthened, while in ZHG they typically are not.
• RedVow is higher in ZHG than in GRG, given that in ZHG word-final vowels are reduced, while in GRG they are pronounced as full vowels.
To test (b), i.e., whether pairs of GRG and ZHG speakers produce the three rhythmic features more similarly after the diapix tasks, we compared the Euclidean distances within pairs in ISG, OSL and RedVow before and after the interactions. We ran three separate Linear Mixed Effects Models, with Euclidean distance in ISG, OSL and RedVow as dependent variables and Session (1 = before the interaction; 2 = after the interaction) as fixed factor. Given that the Euclidean distance within pairs may vary before and after the interaction, we first included the random slope of Pair by Session in the random effects structure. However, this model was too complex to be supported by the data. For this reason, we simplified the random effects by including the intercept for the interaction between Session and Pair instead of the random slope. The random part of the model also comprised the intercept for Item.
We hypothesise that if rhythmic features are subject to accommodation, dyad members adjust their rhythmic behaviour such that the Euclidean distance in ISG, OSL and RedVow will be lower after than before the interaction. In view of findings showing that speakers converge more for features that differ most between dialects (MacLeod, 2012; Ruch, 2015; Walker & Campbell-Kibler, 2015; Clopper & Dossey, 2020) and between the speakers and the model talkers (Babel, 2012), we hypothesise that more accommodation is evoked by RedVow than by ISG and OSL. RedVow, indeed, is one of the features that best distinguish the two dialects. ZHG, by contrast, also exhibits open syllable lengthening, though in articulatory contexts other than those of GRG, and presents longer nasal duration in -CCer words, so that OSL and ISG separate the two dialects less sharply. However, given that the realisation of reduced vowels is also a strong dialect marker for ZHG (Ruch, 2018), and in view of evidence of little convergence for features that are dialect markers (Babel, 2010), we cannot exclude that the speakers may diverge or maintain their original behaviour for RedVow.
To test these hypotheses, we ran one Linear Mixed Effects Model with DDpair as dependent variable and Ratio Type (ISG, OSL and RedVow) as fixed factor. The random part of the model comprised the intercept for the interaction between Pair and Ratio, as well as the intercept for Item. Statistical analyses were performed with RStudio (2009-2019) Version 1.2.1335.

Results and Discussion
Regarding (a), i.e., cross-dialectal differences in ISG, OSL and RedVow before the interaction, the results from the pre-dialogue recordings show a significant main effect of Dialect for the three measures (Table 3). As shown in Fig. 1, the scores obtained by GRG speakers in the three ratio measures are higher than those of ZHG speakers. While the results for ISG and OSL are in line with predictions, it is more surprising that RedVow is lower in ZHG than in GRG. One plausible explanation for this finding might be that in the picture naming task, in which speakers were asked to pronounce words in isolation, ZHG speakers do not drastically reduce the duration of unstressed vowels in word-final position, as these vowels are subject to pre-pausal lengthening. In other words, in ZHG the durational difference between stressed vowels and unstressed word-final vowels is not as large as one might expect.
With respect to (b), i.e., the accommodation behaviour in ISG, OSL and RedVow, the results of the statistical analysis reveal no significant main effect of Session (pre- and post-dialogue recordings) on the Euclidean distances (Table 4). In other words, the Euclidean distance between dyad members did not change significantly from before to after the interactions (Fig. 2).
With respect to the hypothesis that RedVow is more prone to convergence than OSL and ISG, contrary to predictions, no significant differences in degree and direction of accommodation (DDpair) were found between the three ratio measures (Table 5). Findings on vowel accommodation between GRG and ZHG or between other dialects show more convergence for phonetically more distant features (Ruch, 2015; MacLeod, 2012; Walker & Campbell-Kibler, 2015; Clopper & Dossey, 2020), and more divergence for acoustic attributes perceived as strong dialect markers (Babel, 2010; Clopper & Dossey, 2020). In the case of ISG, OSL and RedVow, by contrast, interpretations of accommodation based on phonetic distance or degree of dialect markedness do not seem tenable. As shown in Fig. 3, RedVow was neither more nor less prone to accommodation than OSL and ISG. Rather, the values of the three measures cluster around zero, pointing towards rhythmic maintenance.
There are at least two possible explanations for this result. (1) Like the rhythmic metrics analysed in previous research (e.g., Leeman et al., 2014; Dellwo et al., 2015), the three timing measures examined here may be robust against sources of within-speaker variability. Exposure to the distinct rhythmic behaviour of the dialogue partner might not have altered the post-dialogue realization of ISG, OSL and RedVow, unlike what was observed for vowel formants. We cannot exclude, however, that accommodation in segmental timing properties occurred in the more spontaneous tasks of the corpus, which were not the object of the present investigation. For future research, it will be interesting to examine whether the same pattern is replicated when rhythm is examined at the utterance level, using the metrics that have typically been employed in speech rhythm research (see a.o. Ramus, Nespor and Mehler, 1999; Grabe & Low, 2002; Dellwo, 2006; White and Mattys, 2007; He & Dellwo, 2016). (2) Another possible explanation has to do, instead, with the perceptual salience of the cross-dialectal features captured by the three rhythmic measures. Given that differences must be perceptible in order to be imitated (Mitterer & Müssler, 2013), the interspeaker differences in ISG, OSL and RedVow may be too subtle to be perceived or retained after the interaction. This would also be in line with findings from research on Swiss German dialect recognition showing that listeners pay more attention to segmental features than to rhythmic and prosodic features when recognizing the dialectal origin of speakers (see Leemann et al., 2018; for varieties of English see a.o., Fuchs, 2015).
The differences in accommodation behaviour of the same ZHG and GRG speakers across segmental and rhythmic measures confirm the complexity and multi-facetedness of vocal accommodation. As pointed out by Sanker (2015) and Cohen Priva and Sanker (2018), patterns of convergence in one measure within a pair or within a speaker cannot be taken to be representative of pairs and speakers' overall convergence patterns in other measures.

CONCLUSIONS
Based on a corpus of pre- and post-dialogue picture naming tasks performed by 18 speakers of GRG and ZHG, results reveal that pair members, who show significant durational differences before the interaction, do not noticeably shift their production of ISG, OSL and RedVow after being exposed to the interlocutor's dialect. Although the evidence from rhythmic variability in child- and adult-directed speech, as well as from synchronous reading, supports the view that rhythmic features can be subject to interspeaker variation, these adjustments can be unidirectional and irrespective of the rhythmic behaviour of the dialogue partners.