Individual variability in cue weighting for first-language vowels

This study investigates the use of different cues in the discrimination of the Azerbaijani vowels /œ/ and /ɯ/. Given the large overlap between these vowels in f1-f2 space in production, this study searched for other possible cues to their categorization. Twenty native Azerbaijani listeners were tested in a perceptual identification task. Since f2 was weighted more consistently throughout the experiment, we suggest that f2 is the primary cue for discriminating this vowel pair. We observed individual differences in the perceptual weighting of f2 and f3 among the listeners. Although most participants gave more weight to f2, others weighted f3 more heavily than f2 or weighted both cues equally. These findings expand our knowledge of perceptual cue weighting and point to the importance of examining cue weighting at the individual level.


INTRODUCTION
There are multiple acoustic dimensions that define speech categories. During spoken language comprehension, listeners categorize speech sounds based on these continuous acoustic cues. Listeners need to determine which cues are relevant and how much relative importance each cue carries in order to weight it accordingly. Several studies have attempted to identify the acoustic dimensions that are important in the discrimination of different speech sounds. Morrison (2013) provides a review of theories related to dynamic aspects of vowel perception. Strange and Jenkins (2013), in their Dynamic Specification model, propose that the most important cues to vowel identity lie in the spectro-temporal patterns of consonant-vowel and vowel-consonant formant transitions.
Among the early studies on the role of different cues in the perception of vowels is the study by Bennett (1968), which investigated the relative importance of spectral and temporal cues in the discrimination of pairs of English and German vowels. He suggested that the importance of the temporal cue is inversely proportional to the distance between the qualities of a given pair of vowels. His results showed that spectral form is, in general, more important than duration in vowel recognition in both English and German, and it is only when two vowels are very close in quality that the duration cue is more important for their discrimination. Ainsworth (1972) used sets of synthetic vowel sounds that differed in first-formant frequency, second-formant frequency, and duration, and investigated the effect of these cues on the identification of vowels. He found that listeners' judgments depended on all of these factors; however, duration was a relatively more important cue for vowels located in the centre of the f1-f2 space, where a vowel might more readily be confused with one of its neighbours.
According to Idemaru et al. (2012), "whereas any of the acoustic dimensions may play a role in phonetic categorization, they are not necessarily perceptually equivalent". Giving greater perceptual weight to some acoustic dimensions is referred to as cue weighting (Holt & Lotto, 2006; Francis, Kaganovich, & Driscoll-Huber, 2008; Idemaru et al., 2012). Hillenbrand, Clark, and Houde (2000) found that English listeners give more weight to the spectral than the temporal dimension in categorizing the English [i] and [ɪ] vowels. It has also been found that in the discrimination of voiced and voiceless bilabial stops in syllable-initial position, English listeners weight voice onset time (VOT) more strongly and use the fundamental frequency (f0) of the following vowel as a secondary cue (Abramson & Lisker, 1985; Francis et al., 2008). Holt and Lotto (2006) suggest that dimensions that are highly related to category identity need to be weighted more strongly than those less predictive of category identity. These acoustic dimensions are sometimes weighted differently among listeners.

Vowel perception
Through the history of speech research, formants have played an important role in studies of vowel perception and in acoustic descriptions of vowels. Fant (1960) indicated the importance of formant frequencies as the prime determinants of the spectral envelope of oral vowels, suggesting that the complex spectra of vowel-like sounds could be uniquely indexed with relatively few parameters. Since formant amplitude appeared to be redundant with formant frequency (Fant, 1956; Stevens, 1998) and because formant bandwidth appeared to have little influence on perception (Klatt, 1982), the focus in speech perception studies was placed on formant frequencies as correlates of perceptual vowel identification.
Other studies have shown that general spectral shape correlates well with measures of psychoacoustic distance between vowel-like stimuli (Bladon & Lindblom, 1981; Pols, van der Kamp, & Plomp, 1969). In this approach, it is suggested that listeners compare vowel spectra to find the closest match to an internal representation of the corresponding vowel categories. Formant peaks are therefore not treated differently from other spectral properties, and all spectral components are given weight. However, Kiefte and Kluender (2008) proposed that listeners ignore spectral shape properties in the identification of synthetic monophthongs when the target stimuli are embedded in a sentence.
In addition to formant-based and spectral shape approaches, another approach in vowel perception research is the concept of spectral features based on an intermediate representation. Some studies suggest that auditory f2 and f3 are perceptually interrelated. Delattre, Liberman, Cooper, and Gerstman (1952) noted that it was possible to produce acceptable versions of French vowels with the Pattern Playback using only two energy bands: one close to the measured f1, and a second placed at or somewhat above the measured f2 of naturally produced vowels. This higher f2 has been called the effective f2 or f2 prime (Fant, 1973; Fant & Risberg, 1963). Chistovich and Lublinskaya (1979) proposed that formant peaks closer than 3.0-3.5 Bark are merged into a single perceived spectral prominence. Fujimura (1967) studied the perception of high vowels in Swedish to investigate theories of formant integration into f2 prime. Based on his results, he criticized the notion of wide-band integration of f2 and f3 and instead proposed that both f2 and f3 make independent contributions even when they are separated by less than 3.0 Bark. Rosner and Pickering (1994, pp. 151-152) give experimental evidence indicating that higher formants are unlikely to merge auditorily into a single effective perceptual feature. Nearey and Kiefte (2003) used a neural network to model spectral integration similar to that proposed by f2-prime models in order to reduce a large three-dimensional vowel continuum to two effective formants or parameters; however, this attempt was not successful. A three-dimensional formant-based representation performed substantially better in predicting listeners' vowel judgments than any two-dimensional representation that could be discovered with the neural network. This again supports Fujimura's (1967) hypothesis that vowel perception cannot be explained with two parameters alone.

Azerbaijani vowels
Azerbaijani belongs to the western group of the southwestern, or Oghuz, branch of the Turkic language family and is mainly spoken in Azerbaijan and Iran. Among non-Persian languages in Iran, Azerbaijani, with approximately 15-20 million native speakers, has the largest number of speakers (Crystal, 2010). Azerbaijani has nine vowels, /æ ɑ o e œ ɯ u i y/, with no length distinction (Figure 1).

Present study
In most languages, vowels are acoustically differentiated primarily in terms of their first and second formant (f1 and f2) values. In a recent study, Ghaffarvand Mokari and Werner (2016) found a large overlap in f1-f2 space between the Azerbaijani /ɯ/ and /œ/ vowels (Figures 2 and 3). The vowels /ɯ/ and /œ/ are contrastive phonemes in various contexts in Azerbaijani (e.g., /sɯz/ 'groan' versus /sœz/ 'word').
Linear discriminant analysis revealed that f1 and f2 as predictors fail to accurately classify these two vowels. Further inclusion of f0 and duration as predictors also did not improve the classification accuracy. However, adding f3 to the predictors improved classification considerably. It appears that these two vowels are more distinct in terms of f3 (Figure 4). Holt and Lotto (2006) argue that "if there is not much overlap in a specific acoustic dimension, then that dimension would be very informative about category identity and it would be expected to receive more perceptual weight than the other acoustic dimensions". Based on the hypothesis of Fujimura (1967) and the findings of Chistovich and Lublinskaya (1979), and since the distance between f2 and f3, at least in the Azerbaijani /ɯ/ vowel, is more than 3.5 Bark (Ghaffarvand Mokari & Werner, 2016), we treated f2 and f3 as separate perceptual parameters in this study. Studies on how listeners weight perceptual cues in the categorization of L1 vowels are limited, especially when vowels are differentiated only by spectral features. We designed the present study to explore the perceptual categorization of the Azerbaijani /ɯ/ and /œ/ vowels. We first aim to find out how listeners weight f2 and f3 in discriminating the /ɯ/-/œ/ pair, and whether f3 is an important perceptual cue in the discrimination of this pair. We are specifically interested in how individual listeners weight the different acoustic dimensions of these vowels and whether they use different cue-weighting strategies in their discrimination.
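The classification result described above can be illustrated with a small sketch. The code below runs linear discriminant analysis with and without f3 on synthetic formant data; all means and standard deviations here are invented to mimic the reported pattern (heavy f1-f2 overlap, separation along f3) and are not the study's measurements.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 50  # tokens per vowel category

# Hypothetical formant values (Hz): the two vowels overlap heavily in f1-f2
# but are separated along f3 (all values invented for illustration).
oe = np.column_stack([rng.normal(480, 40, n), rng.normal(1400, 120, n),
                      rng.normal(2280, 80, n)])
iu = np.column_stack([rng.normal(470, 40, n), rng.normal(1380, 120, n),
                      rng.normal(2600, 80, n)])
X = np.vstack([oe, iu])
y = np.repeat([0, 1], n)

# Cross-validated classification accuracy with and without f3 as a predictor
acc_f1f2 = cross_val_score(LinearDiscriminantAnalysis(), X[:, :2], y, cv=5).mean()
acc_all = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()
print(f"f1+f2: {acc_f1f2:.2f}, f1+f2+f3: {acc_all:.2f}")
```

With data shaped like this, the two-predictor model stays near chance while the three-predictor model classifies almost perfectly, mirroring the improvement the study reports when f3 is included.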

Participants
Participants were 10 male and 10 female native Azerbaijani speakers in Tabriz, north-west Iran. They were born and raised in Tabriz, used Azerbaijani as their everyday language of communication, and reported no history of hearing or speech problems. They had a mean (SD) age of 30.4 (5.4) years. Informed consent was obtained from all participants.

Stimuli
A 29-year-old male native speaker of Azerbaijani from Tabriz produced several examples of the word /bœl/ 'divide' in isolation. All tokens were recorded in a sound-treated room using a ZOOM H6 recorder positioned approximately 20 cm in front of the speaker. Recordings were made at a sampling rate of 44.1 kHz with 16-bit resolution. One natural production of the token /bœl/ was selected. The selected token had no sudden changes in formants during the periodic portion of the signal, no changes in fundamental frequency, and no clicks. For the resynthesis of the tokens, a periodic portion of the vowel waveform was manually extracted from the /bœl/ token, from the end of the /b/ burst to the last zero crossing of the vowel waveform before the silent gap. The first three formants (f1, f2, and f3) were measured using standard LPC analysis in Praat (version 6.0.21). The first formant was 475 Hz, the second formant was 1386 Hz, and the third formant was 2273 Hz. The average intensity was 70 dB. We resynthesized this token in Praat and created 24 stimuli for the perception experiment. Three sets of tokens were made: (1) by manipulating only f3, (2) by manipulating only f2, and (3) by manipulating both f2 and f3. For each set, eight spectral steps were created, equally spaced along the Bark scale (1 step = 0.22 Bark for f2 and 1 step = 0.19 Bark for f3) (Figure 5).
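Continua with steps of equal Bark size can be generated by converting the endpoint frequencies to Bark, spacing the steps linearly there, and converting back to Hz. The paper does not name its Hz-to-Bark formula, so the Traunmüller (1990) approximation below is an assumption, and the endpoint values in the example call are hypothetical.

```python
import numpy as np

def hz_to_bark(f):
    # Traunmueller (1990) approximation (assumed; the paper does not
    # state which Hz-to-Bark conversion was used)
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_to_hz(z):
    # Algebraic inverse of the approximation above
    return 1960.0 * (z + 0.53) / (26.81 - z - 0.53)

def equal_bark_steps(f_start_hz, f_end_hz, n_steps=8):
    # n_steps formant targets equally spaced on the Bark scale,
    # including both endpoints
    z = np.linspace(hz_to_bark(f_start_hz), hz_to_bark(f_end_hz), n_steps)
    return bark_to_hz(z)

# Hypothetical f2 continuum endpoints (Hz); eight steps, equal in Bark
f2_steps = equal_bark_steps(1386.0, 1750.0)
```

The resulting steps are equidistant in Bark but, because the scale is compressive, progressively wider in Hz toward the high-frequency end.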
We decided to use extreme spectral values within the one standard deviation of the mean for these vowels as absolute exemplars.The values of the two stimuli are based on mean values of the Azerbaijani vowels /oe/ and /ɯ/ in productions of male speakers reported by Ghaffarvand Mokari and Werner (2016).The absolute /oe/like instance was the token with the highest f 2 and the lowest f 3 (the upper-right corner in Figure 5) and the absolute /ɯ/ like instance was the token with the lowest f 2 and the highest f 3 in the continuum (the lower-left corner in Figure 5).Additionally, we asked four Azerbaijani native listeners to approve if synthetic tokens were the exemplars of the intended vowels to use in the experiment.
On each trial of the XAB task, three vowel tokens were played and the listeners were asked to decide whether the first vowel sounded like the second (A) or the third (B). The second and third tokens were the most /œ/-like and the most /ɯ/-like stimuli, and the first token was one of the 24 stimuli. Participants thus had to classify each of the 24 stimuli as one of the two absolute exemplars of the vowels. Following Werker and Logan (1985) and Escudero, Benders, and Lipski (2009), the interval between the three tokens was set to 1.2 seconds in order to ensure language-specific phonological processing. The order of presentation of the A and B stimuli was counterbalanced, leading to 48 different XAB trials, each presented four times. This yielded a total of 192 trials.
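The trial arithmetic above (24 stimuli × 2 counterbalanced answer orders × 4 repetitions = 192) can be enumerated directly; the representation of a trial here is of course a simplification of the actual presentation script.

```python
from itertools import product

n_stimuli = 24
orders = ("AB", "BA")  # counterbalanced order of the two answer tokens

# Each X stimulus is crossed with both answer orders, then repeated 4 times
unique_trials = list(product(range(n_stimuli), orders))  # 48 distinct XAB trials
trial_list = unique_trials * 4                           # 4 repetitions each
print(len(unique_trials), len(trial_list))               # 48 192
```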

Procedure
The listeners were tested individually in a quiet room by native Azerbaijani speakers who gave all instructions in Azerbaijani. Prior to the experiment, the absolute exemplars of the vowels (the most /œ/-like and the most /ɯ/-like tokens of the stimulus set) were played, and participants were asked to pronounce each endpoint and name three words containing that vowel. This was to ensure that the tokens were easily identifiable as the intended Azerbaijani vowels by native Azerbaijani listeners (Escudero et al., 2009). The listeners responded by clicking on a computer screen displaying the numbers "1", "2", and "3". The number "1" was presented in grey and was non-clickable. The test was carried out on a PC using Praat. The experiment lasted approximately 20 minutes for each participant, with a five-minute break in the middle. All participants' accuracy in the discrimination of the absolute exemplars of Azerbaijani /œ/ and /ɯ/ was above 80%, so none of them was excluded.

Analysis
We performed logistic regression analyses to investigate the listeners' use of the different spectral cues. Equation (1) shows the model including both f2 and f3.
(1) log(odds(œ)) = ln(p(œ)/p(ɯ)) = α + βf2·f2 + βf3·f3
In this equation, α is the intercept of the regression model. The coefficients (β's) show how much a one-step difference in one of the predictors changes the log odds of a participant's response. Hence, following Morrison (2007, 2009), each β is regarded as the participant's reliance on the corresponding cue. Following Escudero et al. (2009), we used equation (2) to compute the relative reliance of the participants on each cue. Values higher than 0.5 mean that f2 is weighted more heavily than f3, and values below 0.5 mean that f3 is weighted more heavily.
(2) cue weighting = βf2 / (βf2 + βf3)
Also, as mentioned by Escudero et al. (2009), a polar-coordinate magnitude can be calculated from the logistic regression coefficients, which indicates boundary crispness. A larger polar-coordinate magnitude indicates a clearer boundary between the two categories (Morrison, 2007). The polar-coordinate magnitude for the model including only f2 was computed as indicated in equation (3).
(3) polar-coordinate magnitude = √(βf2²)
Based on the results of a logistic regression analysis, it is also possible to determine whether an individual cue significantly affects a listener's responses. To this end, we tested whether a logistic regression model that includes a cue as an independent variable predicts the responses significantly better than the null model. For instance, the effect of f2 is evaluated by comparing the fit of a model with f2 as a predictor to the fit of a model with only the intercept. The fit difference between the two models is the ΔG², which is approximately χ²-distributed. The difference in degrees of freedom between the two models gives the degrees of freedom of this ΔG². These results are reported using an α level of 0.05 for each participant.
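As a concrete sketch of equations (1)-(3), the code below fits a logistic regression by maximum likelihood to one simulated listener's responses and then derives the relative cue weighting and a polar-coordinate magnitude (here generalized to the two-cue model as the Euclidean length of the coefficient vector). All data and coefficient values are invented for illustration; this is not the study's data or analysis script.

```python
import math
import numpy as np

def fit_logistic(X, y, n_iter=25):
    # Maximum-likelihood logistic regression via Newton-Raphson.
    # X: (n, p) design matrix whose first column is all ones (intercept);
    # y: 0/1 responses (1 = the listener chose the /oe/ category).
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)  # IRLS weights
        beta += np.linalg.solve((X * w[:, None]).T @ X, X.T @ (y - p))
    return beta

# One simulated listener: 192 trials, relying mainly on f2 (true betas invented)
rng = np.random.default_rng(1)
f2 = rng.integers(1, 9, 192).astype(float)  # continuum steps 1..8
f3 = rng.integers(1, 9, 192).astype(float)
logit = -5.0 + 0.9 * f2 + 0.2 * f3
y = (rng.random(192) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

alpha, beta_f2, beta_f3 = fit_logistic(np.column_stack([np.ones(192), f2, f3]), y)

# Equation (2): relative cue weighting (> 0.5 means heavier reliance on f2)
cue_weight = beta_f2 / (beta_f2 + beta_f3)

# Polar-coordinate magnitude for the two-cue model (Morrison, 2007):
# the length of the coefficient vector; larger = crisper category boundary
polar_magnitude = math.sqrt(beta_f2**2 + beta_f3**2)
```

Fitting this model once per participant and comparing their `cue_weight` values is essentially the per-listener analysis reported in the Results.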

RESULTS
A series of G² comparisons, as described in the method section, was performed to examine which cues were used significantly for vowel categorization. Inclusion of f2 significantly improved the fit of the model for 20 out of 20 participants (p < 0.05), compared to a model without any independent variable; when only f3 was included, the fit of the model significantly improved for 9 out of 20 participants (p < 0.05), compared to a model without any factor. Figure 6 presents a scatterplot of the coefficients of the regression model with f2 and f3 for the 20 participants. Escudero et al. (2009) mention that "the coefficients of the logistic regression analysis show to what extent a one-step difference in one of the predictors causes a change in the log odds of a participant's response" (p. 457).
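The ΔG² model comparison used here can be sketched with the standard library alone, using the identity that the chi-square survival function with one degree of freedom equals erfc(√(x/2)). The response and probability values below are hypothetical, purely to exercise the computation.

```python
import math

def log_lik(y, p):
    # Binomial log-likelihood of 0/1 responses y under fitted probabilities p
    eps = 1e-12
    return sum(yi * math.log(pi + eps) + (1.0 - yi) * math.log(1.0 - pi + eps)
               for yi, pi in zip(y, p))

def delta_g2_pvalue(y, p_full, p_null):
    # Delta-G^2 = 2 * (LL_full - LL_null); with one extra parameter in the
    # full model it is approximately chi-square distributed with df = 1,
    # whose survival function is erfc(sqrt(x / 2))
    g2 = 2.0 * (log_lik(y, p_full) - log_lik(y, p_null))
    return g2, math.erfc(math.sqrt(max(g2, 0.0) / 2.0))

# Hypothetical fitted probabilities: the full model tracks the responses,
# while the null (intercept-only) model predicts 0.5 everywhere
y = [1, 1, 1, 0, 0, 0]
g2, p = delta_g2_pvalue(y, [0.9, 0.9, 0.9, 0.1, 0.1, 0.1], [0.5] * 6)
```

A cue counts as significant for a listener when the resulting p-value falls below the 0.05 threshold, exactly as in the per-participant tests reported above.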
As described in the Analysis section, a cue weighting of 0.5 indicates that the listener weights both cues equally; a value higher than 0.5 indicates that f2 is weighted more heavily than f3, and a value below 0.5 indicates that f3 is weighted more heavily. Figure 7 presents the mean cue weighting for each participant. The closer the coefficients of f2 and f3 (Figure 6) are to each other, the closer the relative cue weighting is to the centre line (both cues weighted equally; Figure 7).
According to Figure 7, most of the listeners weighted f2 more heavily, while some weighted f3 more heavily or weighted both equally. However, there are differences in the amount of weight given to each cue among the listeners. Overall, the reliance on f2 was much stronger than on f3. Some participants' relative cue-weighting scores were close to 1 (f2 only); however, the three participants who gave more weight to f3 did not weight it as strongly as the average of the participants who gave more weight to f2.
Finally, we computed the polar-coordinate magnitudes of the two models (f2 only and f2 + f3) and compared the steepness of their categorization boundaries. As mentioned by Morrison (2007), "the contrast coefficient slope in the logistic space is related to the slope of the sigmoidal curve which represents the rate of change from one category to another in the probability space. The size of the contrast coefficient and the corresponding steepness of the steepest tangent to the sigmoidal curve in the probability space are indicators of the crispness of the boundary between the two categories" (p. 229).
Figure 8 shows the probability of choosing the /œ/ vowel along the eight steps. Compared to the condition in which only f2 changes toward the /œ/ vowel, changing both f2 and f3 makes the probability of an /œ/ response rise more steeply toward 1.
There was a significant difference between the coefficients of these two models (t = -2.03, p = 0.05), and the inclusion of f3 made the curve steeper compared to the model with only f2 (Figure 8). The mean polar-coordinate magnitudes were 0.83 for the f2 + f3 model and 0.74 for the f2-only model.

DISCUSSION
The current study examined the perceptual weighting of different acoustic dimensions in the discrimination of the Azerbaijani /œ/ and /ɯ/ vowels. Given the large overlap in f1-f2 vowel space in the production of these two vowels (Ghaffarvand Mokari & Werner, 2016), this study explored whether other cues play a role in their discrimination. To the best of our knowledge, this is the first study of cue weighting in the discrimination of native vowels based solely on spectral information. Our results revealed individual differences in cue weighting in the perception of the Azerbaijani /œ/ and /ɯ/ vowels. Although f2 was the more important cue for the discrimination of this vowel pair, f3 also played a role: most listeners relied primarily on f2, while some relied on f3 or on both cues.
Overall, one of the important findings of the present study is that f2 remains the main cue in the distinction of these two vowels despite their large overlap in f2 values. One explanation would be that listeners use a perceptual vowel-intrinsic normalization process that does not require information from other vowels. According to Adank et al. (2004), "vowel-intrinsic normalization models have been considered to be more suitable as models for human vowel perception" because they "can normalize a single vowel from a speaker without information about other vowels from that speaker" (p. 3105). The individual differences we observed are in line with those reported in previous studies. Regarding the discrimination of voiced and voiceless stops, Stevens and Klatt (1974) report that some listeners relied more on VOT than on f1 onset frequency, and Haggard et al. (1970) found that some listeners were more sensitive to f0 than to VOT in distinguishing voiced and voiceless stops. More recently, Idemaru et al. (2012) found considerable variability between Japanese listeners' perceptual weighting of absolute and relative durations in the discrimination of Japanese singleton and geminate stop categories.
In addition, our results revealed that the categorization boundary was steeper when both f2 and f3 were included in the model than when only f2 was included. This is in line with the study by Hazan and Rosen (1991), who observed that listeners' identification functions were uniformly steep in the full-cue condition.
Some previous studies have indicated that f2 and f3 might be perceptually merged into one percept (f2 prime; Delattre et al., 1952; Fant & Risberg, 1963; Fox, Jacewicz, & Chang, 2011). Chistovich and Lublinskaya (1979) proposed that close formant peaks are merged into a single perceived spectral prominence. However, other studies suggest that a two-dimensional view of formants cannot explain vowel perception and that at least three spectrally prominent regions (corresponding to f1, f2, and f3) are necessary (Fujimura, 1967; Rosner & Pickering, 1994). Further studies are needed to investigate whether f2 and f3 covary in the perceptual discrimination of the Azerbaijani /œ/ and /ɯ/ vowels and whether they can be merged into one perceptual dimension. Francis et al. (2008) suggest that listeners normally rely on primary cues (e.g., on VOT in the discrimination of the English stop voicing contrast) in ideal listening conditions; however, they adjust their cue weighting toward secondary cues under less-than-ideal conditions, for instance when listening to speech in noise or to multiple speakers. If f2 is the primary cue in the discrimination of the Azerbaijani /œ/ and /ɯ/ vowels, it can be assumed that listeners will rely on the f3 cue in noisy rather than ideal listening conditions.
Our results also revealed individual differences in categorization gradiency in the presence of different cues. To explain individual differences in speech perception, Kong and Edwards (2016) hypothesized that gradiency is related to general cognitive control. They tested this hypothesis by correlating measures of gradiency with performance on measures of inhibition and task shifting, and found little support for this claim. Kapnoula (2016) likewise did not find consistent relationships between gradiency and measures of executive function. Idemaru et al. (2012) speculate that their observed individual cue-weighting patterns may be due to the similar informativeness of the acoustic dimensions, which allows listeners to freely use either source of information, perhaps varying across time in which information they use.
Future research may look into the relation between reliance on spectral cues in production and in perception. One would expect that individuals who give more weight to f3 in the discrimination of the Azerbaijani /œ/ and /ɯ/ vowels also produce them with larger f3 differences. In summary, we observed individual differences in cue-weighting strategies among native listeners. Although there are a few studies on individual differences in cue weighting, the source of these differences remains to be discovered in future research.

Figure 2 :
Figure 2: Distribution of the Azerbaijani vowels in f1 × f2 (Bark) space, based on the productions of the 23 female participants in the study by Ghaffarvand Mokari and Werner (2016). The ellipses represent two standard deviations from the mean.

Figure 3 :
Figure 3: Scatterplot of the /ɯ/ and /œ/ vowels based on the productions of the 23 female participants in the study by Ghaffarvand Mokari and Werner (2016). Axes represent f1 and f2 values in Hz.

Figure 4 :
Figure 4: 3D scatterplot of the /ɯ/ and /œ/ vowels based on the productions of the 23 female participants in the study by Ghaffarvand Mokari and Werner (2016). Axes represent f1, f2, and f3 values in Hz.

Figure 6 :
Figure 6: Scatterplot of coefficients from the logistic regression analysis, showing the reliance on f2 and f3.

Figure 7 :
Figure 7: Relative cue weighting of f2 and f3 per participant.

Figure 8 :
Figure 8: Sigmoidal curves in the probability space for the contrast coefficient values of the f2 and f2 + f3 models.