The role of the input frequency in L1 Spanish phonological acquisition. A corpus-based study

This study presents the phonological system exhibited by children (n=59) aged 3;0 to 6;0 and focuses on the role of input frequency. Using a spontaneous child speech corpus of Spanish (CHIEDE) as a data source, together with computational processing techniques (including an automatic phonological transcriber), data relating to the phonological level were retrieved. This resulted in a phonological inventory of Spanish-speaking children, ordered by frequency of use, which may serve as a model for research on typical and atypical child language development. Additionally, the stability of the participants' phonological systems was studied by calculating the variability that the different age groups displayed, and outcomes were compared with other similar corpora. Results obtained from the comparison of the phonological inventory of children and adults show that there is a relationship between frequency of use in adult speech and the order of acquisition of phonemes.


INTRODUCTION
First language (L1 henceforth) acquisition and development have drawn the attention of researchers for centuries. However, the development of new technology over the last few decades has entailed a qualitative change in research (Dolgova & Tyler, 2019; Ellis, 2017; Kern et al., 2014; MacWhinney, 1996). The gradual introduction of new technological tools and the adoption of common methodologies and procedures made the design of the first corpora of child language possible. In those corpora, hundreds of speech recordings from children of different ages were transcribed, providing researchers with an invaluable database for the study of child language (Ellis, 2017). Currently, the international corpus of reference is CHILDES 1 (MacWhinney & Snow, 1985), a multilingual child language corpus which includes samples of Spanish, some of which were used to corroborate results from this study. Regarding the phonological treatment of corpora in particular, the development of the software PHON 2 (Hedlund & Rose, 2020) was a landmark in the study of child language.
Within the field of L1 acquisition research, the description of the phonological development involves four basic concerns (Grunwell, 1981): the great variation from one individual to another; the extension and gradual regularisation of the child's pronunciation system, characterised by unsystematicity; the difficulty determining the starting point of the phonological development; and the need to consider both the input and output in the process of description. Grunwell (1981, p. 167) disapproved of the fact that "studies are to discover when children achieve the correct pronunciation of the sounds of their language". She considered that the question about when the sounds of speech are learnt was ill posed, due to factors such as the wide range of individual variation, or the fact that a child does not acquire each phoneme separately. Therefore, research on phonological acquisition must not focus so much on the precise moment at which a child acquires a certain phoneme, but on the search for patterns by describing large samples of speech language. "We need models of usage and its effects upon acquisition" (Ellis, 2017, p. 48).
This subject matter of phonological acquisition has been largely aimed at improving research on language disorders. From a detailed study of a child's normal linguistic development and the establishment of patterns in language behaviour it is possible to detect atypical phenomena in the development of an individual. According to Ingram (1976), the knowledge about patterns of typical language development gives us the clues for the treatment of pathologies. Corpus linguistics plays a pertinent role in this regard, since corpora are a huge source for the analysis of natural language, whether to elaborate what Acosta and Ramos (1998) demanded, namely a phonological inventory, or to study the role of input in the acquisition process by examining child-directed speech (CDS) in natural contexts.
1 https://childes.talkbank.org/
2 https://www.phon.ca/phon-manual/index.html
The present study is based on CHIEDE (Garrote, 2010), a cross-sectional corpus in which n=59 children aged 3;0-6;0 participated. The corpus was recorded, transcribed and, subsequently, tagged by means of automatic processing techniques (phonological and morphosyntactic tagging software), and then manually checked to correct possible tagging errors. This methodology facilitates the retrieval of linguistically annotated data (parts of speech, morphological, and phonological information) to quantify linguistic features. It is descriptive work, following an observational method based on performance, on external empirical data, and not on competence and experimentation.
This paper presents a phonological study of L1 Spanish children with the aim of showing the phonological development displayed by the participants. Taking into account the participants' age, our purpose was not to establish the order of acquisition of phonemes, but to describe the typical phonological development of Spanish-speaking children from 3;0 to 6;0 years old, based on the frequency of occurrence of phonemes (providing a phonological inventory), and to highlight the role of input frequency as a facilitator in the acquisition of phonemes (even those traditionally considered more complex). Three questions are considered: (1) Is the phonological system completely acquired at 3;0? (2) Is 4;0 a turning point in the acquisition process, as many linguistic studies claim (Bosch, 1983; Díez-Itza & Martínez López, 2004; Maratsos, 1974)? And finally, and most importantly, (3) To what extent is input frequency relevant in this process? The goal is to clarify these questions through a review of some of the most significant theories and research, and the analysis of data from different corpora.

PREVIOUS RESEARCH
Morphology and syntax are the linguistic levels that research on L1 acquisition has addressed most extensively. Studies carried out on child language have mainly focused on the acquisition of lexical and grammatical structure, to the detriment of phonology, semantics, or pragmatics. According to Vihman et al. (2009, p. 164), "The role of phonology in the development of linguistic knowledge is often given short shrift by researchers interested in word learning". Consequently, phonological studies on acquisition are less frequent (Polo, 2016). Moreover, the vast majority focus on the English language. Though research has gradually been carried out on other languages, it is "heavily biased toward Indo-European languages of Western Europe with the bulk of research still concentrated on English" (Stoll, 2009, p. 89).
One of the pioneering works on phonological development was Stampe's (1969), for whom the language acquisition process is based upon an innate mechanism that children use to simplify adult words. By means of these mechanisms or processes (unstressed syllable deletion, cluster reduction, merging vowels into /a/), the child goes from what Stampe called a "language-innocent state" to the adult production.
Loquens, 9(1-2), December 2022, e089, eISSN 2386-2637. https://doi.org/10.3989/loquens.2022
Later, Ingram (1976) adopted Stampe's theory for clinical phonology research. Following the Piagetian stages (Piaget, 1926) of cognitive development and their corresponding linguistic periods, Ingram established a parallelism with the phonological level, thus locating the evolution of the different phonemes and phonological skills at distinct stages from the sensorimotor stage (0;0-1;6) to the formal operational stage (12;0-16;0).
However, crosslinguistic studies on acquisition beyond the early period (around one year of age) have proved that it is not possible to establish clear stages of development applicable to every language. For instance, Durgunoğlu and Öney (1999, p. 283) examined the "effects of language-specific influences on the development of phonological awareness" and explained how structural phonetic differences among languages mean differences in the child's development of phonology. In a similar line, Bleses, Basbøll, Lum and Vach (2010) set up a ranking of 7 languages based on the complexity of their phonetic systems (vowel/consonant ratio) and concluded that the most complex one was the Danish phonemic system, followed by the Swedish, the Dutch, the French, the English (American), the Galician and the Croatian. Bernhardt and Stemberger (2017), comparing typical development with protracted phonological development, showed that in four languages, Mandarin, Arabic, Slovene and European Spanish, the WWM 3 scores for 4-year-old children were 80-85% (85.4% for European Spanish).
Despite differences across languages, McLeod and Crowe (2018), after reviewing 64 studies involving more than 26,000 children and 27 languages, concluded that 93% of consonants were correctly produced by 5 years old. Along the same lines, Stoel-Gammon (2006, p. 646) stated that "By the age of 3 years, the level of intelligibility increases to 75%, and by age 4, it is 100%", meaning that, though not adult-like yet, the child phonological system is sufficiently developed to be intelligible.
In Spain, the theories set out first by Stampe and then by Ingram were later introduced by authors such as Bosch (1983) and Díez-Itza (1995). For both researchers, the phonological acquisition period is placed between approximately one and a half years old and six to seven years old, with an intermediate division around four years old (Bosch, 1983). This means that one cannot talk about total control of the complete phonological system until the age of six or seven, when the child masters certain complicated phonemes and their combination in more complex syllables. In spite of that, as mentioned before (Bernhardt & Stemberger, 2017; Stoel-Gammon, 2006), by the age of 4 years intelligibility is complete.
Spanish studies have mostly focused on what Díez-Itza and Martínez López (2004) call periodo temprano 'early period', that is, until about three years old. These authors consider it necessary to increase research on the periodo tardío 'late period', i.e., from three to six years old. They determined three stages in phonological acquisition: expansión 'expansion', the stage until 3;0, characterised by a progressive diminution of phonological processes (such as unstressed syllable deletion, cluster reduction, etc.), after which there would be a standstill; estabilización 'stabilisation', from three to four years old and initially defined by a considerable decrease of processes, which increase again at around four years old (showing a U-shape developmental pattern); and resolución 'resolution' from the age of five years onwards, when phonological processes are residual. Díez-Itza and Martínez López's (2004) intention was to confirm whether the age of four clearly becomes a universal milestone of transition towards subsequent periods, as has been repeatedly assumed by descriptive studies. In fact, at the age of four years children's language is characterised, from the standpoint of phonology, by an increased speech rate, which means more coarticulation and the lengthening of utterances and conversational turns (Díez-Itza & Martínez López, 2004). Many scholars agree on a transition point at four years old (Bosch, 1983; Díez-Itza & Martínez López, 2004) regarding phonological acquisition, but also other linguistic levels. For example, Maratsos (1974), analysing the acquisition of the passive structure, concluded that children show a U-shape developmental pattern around four years old, as the rate of passive comprehension decreased in comparison to younger children. Also, Garrote (2010) found that it was around 4;0 that children produced more non-targeted speech as a consequence of rule overgeneralisation errors.
Bosch (1983), based on studies by Serra (1983) and Melgar de González (1976), summarised the most problematic phonemes during the acquisition process of Spanish: the trill /r/, fricatives such as /s/, /θ/ and /x/, and the voiced plosive /d/. She concluded that the most difficult place of articulation is the dento-alveolar area, where a great number of sounds are differentiated just by the manner of articulation (Bosch, 1983). López Valero et al. (1989) supported Bosch's findings, concluding that the sounds belatedly acquired in Spanish are /x/, /f/, /r/ and /θ/.
Other authors such as Serra (1983) established the following order of acquisition: nasals, plosives, fricatives, and, finally, liquids and the alveolar trill.
Two studies related to the present one are worth mentioning here, due to the age range (3 to almost 6 years old) and the language (Spanish, though of the Mexican variety). First, Jiménez (1987) found that, by age 5 years, the 120 children forming the sample showed production problems with only two consonants: /s/ and /r/. Second, Acevedo (1993, p. 11) also tested 120 Mexican children. Results showed that sound "mastery occurred by the 4;0-4;5 age group", with the following consonants remaining problematic: /ɲ/, /g/, /f/, /s/, and /x/. Both studies were based on elicitation tasks, not on spontaneous speech.
Taking into account previous research and the above-mentioned claims (Acosta & Ramos, 1998; Grunwell, 1981; MacWhinney, 1996, among others), there is a need for a phonological frequency-based analysis of the linguistic performance of children aged 3;0 to 6;0 (late period), using a spontaneous speech corpus as a data source.

The role of input and frequency
Although input is considered by advocates of nativist theories of a Chomskyan nature as irrelevant, citing the Poverty of Stimulus Argument (Chomsky, 1980), later tendencies such as connectionist models (Menn & Stoel-Gammon, 1996) give the input a key role in the learning process, considering it the source of empirical knowledge from which children, through statistical processing, acquire language. Indeed, "a number of linguists have recently proposed statistical explanations for patterns of phonological productions" (Rose, 2009, p. 329).
In recent years, the cognitive-functional or usage-based model (Tomasello, 2003) has posed the emergence of language as a result of use, from which linguistic patterns arise, and then grammatical constructions are consolidated. From a usage-based approach to language acquisition, "children learn linguistic constructions from the conspiracy of experienced exemplars, with abstract syntactic constructions and their associated meanings emerging from the statistical distribution of form-function correspondences in usage" (Ellis, 2017, p. 46). Zamuner, Gerken and Hammond (2004, p. 1406) based their research on the Specific Language Grammar Hypothesis (SLGH), which states that "language acquisition is best described with respect to the patterns in the input or ambient language". Thus, children will first acquire those phonemes which are more frequent in their language.
Studies based on frequency and likelihood of occurrence have shed some light on the process of language acquisition (Ellis, 2017; Polo, 2016; Rose, 2009; Zamuner et al., 2004). For example, Lleó (2003), in a crosslinguistic study of German and Spanish, found that coda consonants are acquired earlier in languages where codas and coda clusters are common. The same author concluded some years later that "We now know that babbling results from a combination of unmarked sounds and the most frequent sounds produced around the baby" (Lleó, 2012, p. 693). Also, Demuth (2009), after analysing the fact that /t/ (and not voiced /d/) is the first coda consonant acquired by English-speaking children, determined that "although frequency and markedness typically pattern together, children may show a preference for frequency over markedness effects in their early productions" (Demuth, 2009, p. 189). Roark and Demuth (2000) carried out a corpus-based study on prosodic properties of language. Results showed that "young language learners are sensitive to statistical properties of the input, and this influences the course of language development." (Roark & Demuth, 2000, p. 599). For a more complete view of the role of input and frequency in child language acquisition, see Kern et al. (2014), who, in a special issue, crosslinguistically analyse the essential function of these two factors in the process of L1 acquisition, covering distinct linguistic levels.
The present research is framed within usage-based phonology (Polo, 2016) and the SLGH (Zamuner et al., 2004), following Ellis's (2008, p. 95) statement: "language processing is intimately tuned to input frequency and probabilities of mappings at all levels of grain: phonology and phonotactics, reading, spelling, lexis, morphosyntax, formulaic language, language comprehension, grammaticality, sentence production, and syntax. It relies on this prior statistical knowledge".
Notwithstanding, following Rose (2009, p. 346), "while statistics of the input seem to play a central role in infant speech perception, such statistics appear to be only one of the many factors underlying patterns observed in speech production". Therefore, a single approach is not enough to account for language acquisition; it can only be one contribution to the general research scenario.

Contribution of Corpus Linguistics
Investigation of language acquisition has traditionally been based on experiments or tests of a logopedic kind rather than on spontaneous speech (see Acevedo, 1993 or Jiménez, 1987 as examples of research describing the phonological development of Mexican Spanish children ranging in age from 3 to more than 5 years). On the one hand, this may be due to the fact that such studies tend to focus on speech and language disorders and, therefore, the samples in many cases belong to subjects who show atypical language development. These samples are collected in assessment situations where the context tends to be artificially created. On the other hand, another reason for using tests and not speech corpora in child language research is the difficulty of obtaining large samples of spontaneous speech, which poses a major disadvantage to any investigation: we have to find the occasion to make recordings, and these must later be transcribed. This difficulty is compounded by the challenges of working with children, since it is not only necessary to have the permission of parents or guardians, but we must also be particularly respectful of their right to privacy. Ellis (2017) states that usage-based linguistics is supported by findings from Corpus Linguistics, Cognitive Linguistics, and Psycholinguistics. Along the same lines, Dolgova and Tyler (2019, p. 914) claim that Corpus Linguistics studies are an example of the different existing usage-based models, which "reveals frequency patterns and meanings in natural usage contexts". These authors call for the use of corpus linguistics in research from a usage-based perspective: "The usage-based research program necessitates extensive analysis both of the usage from which learners learn and of learner usage as it develops" (Ellis, 2017, p. 41), by means of corpora and computational techniques.
Nonetheless, Ellis (2017, p. 46) warns about the need for complementary sources of information: "Learner language corpora show what learners say; they do not show what they know. Experimental techniques are needed to probe aspects of knowledge and understanding".
The use of corpora for assessing phonological development has been extensively promoted by researchers (Demuth, 2009; Dolgova & Tyler, 2019; Ellis, 2017; MacWhinney, 1996; Stoll, 2009, among others) as a complement to tests carried out in artificial contexts in order to observe the production of selected words. The acquisition of a sound is gradual, and its production is maintained for a certain period, fluctuating between the correct form and the non-targeted alternatives until its fossilisation. However, experimental tasks typically use isolated words as a model of production of a certain sound; during tests, which consist of the child repeating a word or group of words after the adult, immediate imitation can lead to a better pronunciation than the child would achieve outside those contexts. Acosta and Ramos (1998) criticised the historically used assessment procedure that focused on isolated words as opposed to the analysis of spontaneous speech samples.
In addition, corpora can be easily managed to retrieve data using useful automatic or semi-automatic computational tools, which facilitate work and save time. Therefore, corpus linguistics can be either a method in itself or a complement to the traditional approach, especially describing the most unconscious and spontaneous facet of language.
The main contribution of naturalistic language corpora to the study of language acquisition is providing samples of authentic language in real context, an invaluable source for the study of child language. Spontaneous language corpora are preferable for studying the real use of language in children, on occasion combined with corpora made up of texts obtained by means of elicitation tasks or tests as a supplement to evoke those phenomena difficult to find in spontaneous speech, due to low frequency of occurrence, or even to avoidance strategies, that is, words children systematically avoid due to pronunciation difficulties (Stoll, 2009).

METHODOLOGY
The CHIEDE corpus
CHIEDE, a spontaneous child language corpus of Spanish, is made up of approximately 60,000 words. About a third of the corpus consists of child language and the remainder is CDS. The main feature of CHIEDE is the spontaneity of interactions. The corpus is made up of transcribed recordings of communicative situations in their natural context. The recordings were carried out in central Spain, where the linguistic variety is Peninsular Spanish, in a medium-sized town. The speakers are monolingual and belong to families of middle socioeconomic status in terms of income and occupation.
The corpus presents two types of interactions: spontaneous collective interactions, recorded during a daily classroom activity in which the whole group of children and the teacher chatted informally; and dialogues, in which an adult talks with a single child. Figure 1 shows the corpus design 4. Children were grouped according to their year of birth.
CHIEDE contains 58,616 word tokens in 30 text files for a total of 7 hours and 53 minutes of recordings in 30 audio files from n=59 child participants. Table 1 presents figures regarding word tokens, number of utterances, word types and the token/type ratio by age group. Because the corpus was going to be published, strict compliance with the current legal framework was required. Consequently, before recording, parents, teachers and participants were properly informed and asked to sign an informed consent form agreeing to participate in the research. Regarding ethical concerns, all names were anonymised and, on occasion, parts of the recordings were cut and discarded due to sensitive information the children gave about their private lives.
The device used to record the corpus was a Sony DAT (Digital Audio Tape) recorder, which allows for digital recording with professional quality, with a Sony stereo microphone placed in the most suitable spot to capture the sound. Even so, when recording ambient sound, a certain level of background noise is inevitable; it is impossible to obtain studio sound quality. For this reason, sound editing software (Wavelab, https://www.steinberg.net/es/wavelab/) was used to improve the quality of the recordings.
The topics of conversation were varied, but all of them related to the children's everyday lives: what they did yesterday or the previous weekend, describing their family, talking about their friends, their pets, or the things they like to do, etc.
Each recording is aligned with its corresponding orthographic transcription, including a header with metadata or sociolinguistic and contextual information. In addition to the audio and the text files, two other kinds of files are included: those with the sound-text alignment by utterances and those in XML format with morphosyntactic annotation. The files are identified with a name where the age of the child participant is specified.

Procedure
This work was conducted from the perspectives of computational linguistics and corpus linguistics, to assist other disciplines such as phonology and psycholinguistics. The main advantage of working with corpora is to improve and facilitate the empirical work through computational tools that make tasks such as labelling, counting of items and calculation of frequencies faster and more reliable. Undoubtedly, the phonological transcription of a text is a task which requires many working hours. If the orthographic transcription already consumes most of the time devoted to the creation of a corpus, the phonological transcription would at least double that time. Nowadays, software such as PHON (Hedlund & Rose, 2020) facilitates this task. The present study, however, used the software developed by Moreno Sandoval et al. (2008), which, to simplify, transforms "the orthographical representation of a word to its phonemic transcription based on context-dependent rules" (Moreno Sandoval et al., 2008, p. 1098). The reliability of the automatic phonological transcription was high: only 4% of the words transcribed automatically were found to have a transcription error (either phonemic or syllabic). Nonetheless, it was necessary for a group of linguists to carry out a second part of the task (peer review), listening to the audio files, manually correcting the mistakes, and completing those features and nuances absent in an orthographic representation. It must also be clarified that the phonological transcription was a broad one, not a narrow annotation, which would have considerably increased the work.
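To illustrate what "context-dependent rules" can look like in practice, the following is a minimal sketch of a rule-based orthography-to-phoneme converter. It is not the transcriber of Moreno Sandoval et al. (2008): the rule list, the names `RULES` and `transcribe`, and the choice of output symbols are all assumptions made for illustration, and the rule set is deliberately incomplete (no stress, no syllabification, no diphthong handling).

```python
import re

# Hypothetical, incomplete rule set for a broad phonemic transcription of
# Peninsular Spanish. Each rule is (regex, phonemes emitted). Rules are tried
# in order at the current position, so context-sensitive and multi-letter
# rules must come before the general ones.
RULES = [(re.compile(pattern), output) for pattern, output in [
    (r"ch", ("tʃ",)),
    (r"ll", ("ʎ",)),
    (r"rr", ("r",)),
    (r"qu", ("k",)),
    (r"gu(?=[eiéí])", ("g",)),   # <gu> before e/i: the u is silent
    (r"gü", ("g", "u")),
    (r"c(?=[eiéí])", ("θ",)),    # <c> before e/i
    (r"g(?=[eiéí])", ("x",)),    # <g> before e/i
    (r"^r", ("r",)),             # word-initial <r> is the trill
    (r"(?<=[nls])r", ("r",)),    # trill after n, l, s
    (r"r", ("ɾ",)),              # tap elsewhere
    (r"c", ("k",)),
    (r"[bv]", ("b",)),
    (r"z", ("θ",)),
    (r"j", ("x",)),
    (r"ñ", ("ɲ",)),
    (r"h", ()),                  # silent
    (r"y$", ("i",)),             # word-final <y> is vocalic
    (r"y", ("ʝ",)),
    (r"x", ("k", "s")),
    (r"á", ("a",)), (r"é", ("e",)), (r"í", ("i",)),
    (r"ó", ("o",)), (r"ú", ("u",)),
]]


def transcribe(word):
    """Return a list of phoneme symbols for one orthographic word."""
    word = word.lower()
    phonemes = []
    pos = 0
    while pos < len(word):
        for pattern, output in RULES:
            match = pattern.match(word, pos)
            if match:
                phonemes.extend(output)
                pos = match.end()
                break
        else:
            # No rule applies: the letter maps to itself (a, e, p, t, s, ...).
            phonemes.append(word[pos])
            pos += 1
    return phonemes


print(transcribe("perro"), transcribe("pero"), transcribe("cielo"))
```

Scanning left to right and emitting the output of the first matching rule avoids the problem of one substitution feeding a later one (for instance, the trill produced by `rr` being turned into a tap by the general `r` rule). Any real system would still need manual revision against the audio, as described above.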
As the children were not too young with regard to the language acquisition period, most of them exhibited adult-like speech in phonological terms. Just three children from the 3;0 group had typical pronunciation difficulties (not due to any pathology), which were carefully annotated (files ADR3.wav, BRU3.wav, and NAT3.wav, and their corresponding ADR3.txt, BRU3.txt, and NAT3.txt files, which can be consulted on the website mentioned in Note 4).
Finally, to be faithful to the children's production, the phonological transcription was carried out over the actual orthographic transcription, that is, a second orthographic line (introduced by %pho) in which the real production of the child (including errors) was represented, as shown in example (1).

RESULTS
Results 5 are presented in four separate sections. In the first one a frequency-based phonological inventory is provided to address research questions 1 and 3. The next two sections offer data regarding variability between the three age groups. Finally, data from CHIEDE are corroborated by comparing results with three corpora from the CHILDES database.

Data retrieved from the phonological transcription
According to the data collected, Table 2 presents the relative frequency of the total number of phoneme tokens in the three child groups that make up the corpus. The total number of phoneme tokens is 75,535, and the Phonological Mean Length of Utterance (PMLU) (Ingram, 2002) is 12.72 phonemes. In this case, the table does not present the order of acquisition of phonemes (already acquired due to the children's age), but their usage frequency, as data were not longitudinally collected. It can be observed how the phonemes that occupy the final rows in the table are more infrequent in Spanish and therefore their frequency decreases in relation to the most common ones; nevertheless, the figures increase as the children grow older. This shows that from three to five years old, the process of language acquisition is still ongoing and therefore studies on the acquisition of language must not stop at 36 months. However, according to these data, all children show a complete (intelligible) acquisition, even of those phonemes considered as acquired later.
5 Statistical analysis was carried out using the software IBM SPSS Statistics.
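As a rough sketch of how such figures can be derived from a phonologically transcribed corpus, the snippet below counts phoneme tokens, converts the counts to relative frequencies (percentages), and computes a mean utterance length in phonemes. It assumes that PMLU is used here simply as phoneme tokens per utterance; the function name `phoneme_stats` and the toy input are hypothetical and do not come from CHIEDE.

```python
from collections import Counter


def phoneme_stats(utterances):
    """utterances: list of utterances, each a list of phoneme symbols.

    Returns (relative frequencies in %, ordered from most to least frequent,
             mean number of phonemes per utterance).
    """
    counts = Counter(p for utterance in utterances for p in utterance)
    total = sum(counts.values())
    rel_freq = {p: 100 * c / total for p, c in counts.most_common()}
    mean_length = total / len(utterances)
    return rel_freq, mean_length


# Toy data for illustration only.
toy = [["o", "l", "a"], ["a"], ["l", "a"]]
frequencies, pmlu = phoneme_stats(toy)
print(frequencies, pmlu)
```

The same function applied to syllable tokens instead of phoneme tokens would give the syllable-based counterpart of the measure.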
In addition to the phonological data extracted from the children's speech, a fourth column that includes the frequencies of phonemes in the child-directed speech (adults') has been added. Although data are similar for both children and adults, greater similarity can be noticed, especially at the top of the table, between the oldest group (5;0-6;0) and the group of adults than between the 3;0-4;12-year-olds' and the adults' speech.
6 The phoneme /ʎ/ is the default output of the automatic phonological transcriber. However, it must be clarified that the language variety studied (central Spain) presents yeísmo. Thus, the actual phonetic representation of /ʎ/ is /ʝ/.
Analysing absolute frequency means for the three child groups, the asymptotic significance is p = 0.000 7, which denotes significantly different distributions across the three groups. A closer look at the sample (Table 3), comparing the distribution of data from the three groups, shows a significant difference between the youngest (3;0-3;12) and the older (4;0-6;0) groups.
Another feature of the automatic phonological transcriber is the segmentation of words into syllables. In this way, it is possible to quickly and reliably know the total number of syllables that make up our corpus, and their frequency of use. The total number of syllable tokens is 35,086 and the Syllable Mean Length of Utterance (SMLU) is 5.91.
The 25 most frequent syllables are made up of no more than two phonemes, and most of them follow the pattern CV, supporting previous research (Carreira, 1991; Goldstein & Cintrón, 2001; Kehoe & Lleó, 2003). Closed syllables (CVC) or consonant clusters like CCV involve a higher articulatory difficulty and therefore their frequency of use is lower compared to open syllables consisting of no more than two phonemes. The four groups coincide to a large extent (80%): 20 out of the 25 most frequent syllables are the same for children as for adults. From these data, it is possible to easily and accurately calculate PMLU and SMLU for each age group. In Table 4 we observe how figures appreciably increase from three to six years old. Statistics show that there is a significant difference between the groups' means, with p = 0.026 for PMLU and p = 0.025 for SMLU.
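The two observations above (syllable structure and the overlap between the most frequent syllables of two groups) can be sketched as follows. This is an illustrative assumption about how such a comparison might be coded, not the procedure used in the study: syllables are represented as tuples of phoneme symbols, every vowel symbol (including glides in diphthongs) is simply labelled V, and the names `cv_pattern` and `top_overlap` are hypothetical.

```python
from collections import Counter

VOWELS = set("aeiou")


def cv_pattern(syllable):
    """Map a syllable (tuple of phoneme symbols) to its CV skeleton."""
    return "".join("V" if p in VOWELS else "C" for p in syllable)


def top_overlap(syllables_a, syllables_b, n=25):
    """Share of the n most frequent syllable types common to both lists."""
    top_a = {s for s, _ in Counter(syllables_a).most_common(n)}
    top_b = {s for s, _ in Counter(syllables_b).most_common(n)}
    return len(top_a & top_b) / n


# Toy data for illustration only.
child = [("l", "a"), ("l", "a"), ("d", "e"), ("d", "e"), ("k", "e")]
adult = [("l", "a"), ("l", "a"), ("e", "n"), ("e", "n"), ("d", "e")]
print(cv_pattern(("s", "o", "l")), top_overlap(child, adult, n=2))
```

With n=25, a return value of 0.8 would correspond to the 20-out-of-25 coincidence reported above. The function assumes both lists contain at least n distinct syllable types.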
Findings (Table 2) point to a relationship between input frequency and order of acquisition that will be thoroughly analysed in the section devoted to the discussion, revisiting research question 3.

Standard deviation analysis
So far, all data presented belong to the whole corpus divided into age groups. However, to calculate the standard deviation a sub-corpus was extracted in order to balance the participants. As seen in Figure 1, representing the corpus design, CHIEDE is divided into two sub-corpora: collective interactions and dialogues. In the former communicative setting, the number of subjects is about twenty children (see Figure 1 for exact numbers), and they do not all participate equally. When extracting the phoneme inventory for each of the participants, it was observed that while for some of them the number of words was very high (and therefore they presented a high frequency of phonemes), for others figures were considerably lower due to their moderate participation. Hence, a decision was made to use just the dialogues sub-corpus for this task, as only one child participates in each interaction, so the number of conversational turns increases and therefore his/her production in terms of number of words is larger. In addition, it was found that the number of words uttered by the children was similar in each dialogue (Table 5). The balance needed to compare data from different subjects was thus obtained. The total sample consists of 24 children, equally divided into three age groups (3;0-3;12, 4;0-4;12 and 5;0-6;0 years old), each one made up of eight children, four boys and four girls (see Figure 1). The relative frequency was calculated from the automatic count of the absolute frequency of the twenty-three Spanish phonemes, and then the standard deviation across all children in each age group was computed. Table 6 presents the values for each age group.
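The computation described above can be sketched as follows: each child's absolute phoneme counts are converted to relative frequencies, and the mean and sample standard deviation of each phoneme are then taken across the children of one age group. The function names and the toy counts are assumptions for illustration; the study itself used IBM SPSS Statistics.

```python
from statistics import mean, stdev


def relative_frequencies(counts, inventory):
    """Convert one child's absolute counts to percentages over the inventory."""
    total = sum(counts.get(p, 0) for p in inventory)
    return {p: 100 * counts.get(p, 0) / total for p in inventory}


def group_dispersion(children_counts, inventory):
    """children_counts: one {phoneme: absolute count} dict per child.

    Returns {phoneme: (mean, sample standard deviation)} for the group.
    """
    per_child = [relative_frequencies(c, inventory) for c in children_counts]
    result = {}
    for p in inventory:
        values = [rf[p] for rf in per_child]
        # stdev uses the n - 1 denominator (sample standard deviation).
        result[p] = (mean(values), stdev(values))
    return result


# Toy data for illustration only: two children, two phonemes.
toy_group = [{"a": 60, "s": 40}, {"a": 40, "s": 60}]
print(group_dispersion(toy_group, ["a", "s"]))
```

Working on relative rather than absolute frequencies keeps children with slightly different amounts of speech comparable, which is the reason given above for restricting the analysis to the balanced dialogues sub-corpus.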
Looking at the values, the deviation of each phoneme from the mean is appreciable, especially for the figures corresponding to the 4;0-4;12 years old group (11 out of 23 phonemes), which show a higher fluctuation from the mean. By contrast, the 5;0-6;0 years old group displays less variation, although it is notably salient in four cases: /f/, /g/, /p/ and /ɾ/. To show the differences more clearly, these data have been plotted as boxplots (Figures 2, 3 and 4). For the last (least frequent) phonemes, differences are hardly substantial because of their low frequency; for higher values, the degree of variability is noticeable.
The role of the input frequency in L1 Spanish phonological acquisition. A corpus-based study. Loquens, 9(1-2), December 2022, e089, eISSN 2386-2637. https://doi.org/10.3989/loquens.2022.089
In the boxplots, the median line shows three distinct blocks: after the first six most frequent phonemes (/e/, /a/, /o/, /i/, /n/, /s/) there is a marked drop, after which the values stay within a stable range until a second drop in the last and least frequent ones (from /x/ in the 3;0-3;12 and 4;0-4;12 years old groups, and from /θ/ in the 5;0-6;0 years old group). The highest frequency rates are distributed among six phonemes: the vowels /a/, /e/, /i/, and /o/, the nasal /n/, and the fricative /s/ (mean above 7, Table 6). The second block contains the plosives, the vowel /u/, the liquids /l/ and /ɾ/, and the nasal /m/. Finally, the last block (mean below 1, Table 6), in which the frequency of sounds is low, includes the rest of the fricatives, the trill /r/, and the nasal /ɲ/; here the degree of variability decreases because of the low frequency of use.
Although the median line pattern is similar in the three charts, the first two age groups show more striking irregularities, while the plot for the last age group shows a smoother median curve. In the latter case the degree of deviation is lower, showing more consistency.
8 As the results belong to a sample, the symbol representing the mean is x̄ and the symbol representing the standard deviation is s (instead of µ for the mean and σ for the standard deviation, the symbols conventionally used to describe a population).
Again, Friedman's Two-Way Analysis of Variance by Ranks yields an asymptotic significance of p = 0.018. The breakdown by age groups (Table 7) shows significant differences between 5- and 4-year-olds and between 5- and 3-year-olds. However, between the 3- and 4-year-old groups there seems to be no significant difference. This means that the phonological system stabilises in the oldest age group (5;0-6;0 years old), since the figures for standard deviation are lower (as can be seen in 8), given that fluctuation from the mean decreases. At ages 3;0-3;12 and 4;0-4;12 the values vary more, especially for the most frequent phonemes. From 5 years old, however, these differences disappear and the figures stabilise: the distance between the values and the mean decreases, in contrast to the irregularities shown by the other two age groups, especially the 4;0-4;12 years old group. Thus, the idea of a turning point at the age of four in the process of phonological acquisition is reinforced: again, it seems that from that age children's language begins to approach adult use.
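For readers unfamiliar with the Friedman test, the sketch below shows how its statistic and asymptotic p-value can be obtained for three related samples. It is an illustration only. The data layout (one row per phoneme, one column per age group) is an assumption about how the test was set up, the function name is hypothetical, ties are not handled, and the values are invented rather than taken from Table 6.

```python
from math import exp


def friedman_three_groups(blocks):
    """Friedman chi-square statistic and asymptotic p-value for k = 3 conditions.

    `blocks` is a list of (v1, v2, v3) tuples, for example one tuple of standard
    deviations per phoneme. This sketch assumes no ties within a tuple.
    """
    k = 3
    n = len(blocks)
    rank_sums = [0] * k
    for values in blocks:
        if len(values) != k or len(set(values)) != k:
            raise ValueError("each block needs three distinct values in this sketch")
        # Rank the three values within the block: 1 = smallest, 3 = largest.
        order = sorted(range(k), key=lambda j: values[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    q = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    # With k = 3 the statistic is referred to a chi-square distribution with
    # 2 degrees of freedom, whose survival function is exp(-q / 2).
    p = exp(-q / 2)
    return q, p


# Hypothetical standard deviations for (3-year-olds, 4-year-olds, 5-year-olds).
toy_blocks = [(1.1, 1.6, 0.7), (0.9, 1.2, 0.8), (0.6, 0.5, 0.4), (1.3, 1.8, 0.9)]
q, p = friedman_three_groups(toy_blocks)
print(f"Q = {q:.3f}, p = {p:.4f}")
```

A statistical package would normally be used instead, since it also applies the correction for ties and provides the pairwise post hoc comparisons of the kind summarised in Table 7.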

U-shape development at four years old
Linked to the question of whether 4;0 is a turning point in the language acquisition process, and to the standard deviation analysis above, the greater variability found for 4-year-olds in the present study can be described as a sign of a U-shaped (inverted in the chart) development pattern. Figure 5 shows that variability (based on standard deviation) is higher for 11 out of 23 phonemes (47.8%) in the 4;0 group: /e/, /a/, /o/, /s/, /i/, /l/, /p/, /θ/, /x/, /r/, and /f/. Therefore, it can be concluded that, at least in these 11 cases, a U-shape development pattern can be observed. This issue will be thoroughly discussed later.
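As an illustration of how such a count can be obtained, the following Python sketch flags the phonemes whose standard deviation peaks in the middle age group and reports their share of the inventory. The function names and the values are hypothetical, and the criterion used (strictly highest in the 4-year-old group) is an assumption about how "higher" is operationalised.

```python
def peaks_in_middle_group(sd_by_phoneme):
    """Phonemes whose standard deviation is strictly highest in the middle age group.

    `sd_by_phoneme` maps a phoneme to a tuple (s_3yo, s_4yo, s_5yo).
    """
    return [p for p, (s3, s4, s5) in sd_by_phoneme.items() if s4 > s3 and s4 > s5]


def percentage(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


# Hypothetical values, not the figures reported in Table 6.
toy = {
    "a": (1.2, 1.9, 0.8),
    "n": (0.9, 0.7, 0.6),
    "s": (1.0, 1.4, 0.9),
    "m": (0.5, 0.4, 0.6),
}
peaks = peaks_in_middle_group(toy)
print(peaks, percentage(len(peaks), len(toy)))
```

Applied to the full inventory, the same two steps give the number of phonemes with an inverted U-shaped pattern and their proportion out of the twenty-three phonemes.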

Extrapolation of results
Phonological frequencies depend on lexical use and on the lexical selection the child makes (statistical acquisition based on the lexicon, Polo, 2016). "Children who still have a small vocabulary may be very selective in their choice of words, that is, either actively avoid words which are difficult to pronounce or substitute consonants systematically" (Stoll, 2009, p. 94). Therefore, a study such as the one presented here is incomplete if lexical units are not taken into account. To address this, the most frequent lexical units were analysed. In order to reinforce the conclusions, we used not only CHIEDE but also three more corpora from the CHILDES database (MacWhinney & Snow, 1985). In this way, it can be determined whether the results presented here are contextual or, on the contrary, reflect a general tendency. To carry out this test, the methodology was as follows:
• Among the Spanish-language CHILDES corpora, three corpora which shared features with CHIEDE were selected, especially regarding age range. They were the Spanish Díez-Itza Corpus (Díez-Itza, 1995), the Spanish BecaCESNo Corpus (Benedet & Snow, 2004) and the Spanish Marrero Corpus (Albalá & Marrero, 2004).
• From two of them, BecaCESNo and Marrero, those files (transcriptions) in which the child was younger than 3;0 or older than 6;0 years old were discarded, as CHIEDE's participants are within that age range.
• Once the corpora were selected, CLAN, a tool provided by the CHILDES Project (MacWhinney & Snow, 1985), was used to extract the list of different forms (types) and their frequency of use.
• After cleaning up those lists (deleting proper names, as they are contextual, and correcting orthographic mistakes), they were compared and the most frequent lexical units or types common to the four corpora were extracted.
• The 500 most frequent types were selected and the phonological transcriber was applied to them.
Table 8 shows the results. The most relevant figures are those in the last column, in which the coefficient of variation shows the variability of the four samples in relation to the mean. The most homogeneous values belong to the phonemes /a/, /e/, /o/, /s/, /i/, /n/, /l/, /t/, /d/, /b/, and /θ/. On the other hand, /ʧ/, /f/, and /r/ show the most heterogeneous distribution. These phonemes are precisely the least frequent ones not only in CHIEDE, but in the other three corpora too, as well as in the adults' speech, again reinforcing the assumption that input frequency is related to the order of acquisition of phonemes.
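The coefficient of variation reported for the four corpora can be illustrated with a short Python sketch. The function names and the frequency values are hypothetical, and the use of the sample standard deviation is an assumption, since the formula applied is not specified here.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation relative to the mean, expressed as a percentage."""
    return 100 * stdev(values) / mean(values)


def rank_by_homogeneity(freq_by_phoneme):
    """Sort phonemes from the most homogeneous (lowest CV) to the most heterogeneous."""
    cvs = {p: coefficient_of_variation(v) for p, v in freq_by_phoneme.items()}
    return sorted(cvs.items(), key=lambda item: item[1])


# Hypothetical relative frequencies (%) of two phonemes in four corpora.
toy = {"a": [13.1, 12.8, 13.4, 12.9], "f": [0.4, 0.9, 0.6, 1.1]}
for phoneme, cv in rank_by_homogeneity(toy):
    print(f"/{phoneme}/ CV = {cv:.1f}%")
```

Because the coefficient divides by the mean, low-frequency phonemes tend to obtain high values even when their absolute differences across corpora are small, which is worth bearing in mind when reading the last column of a table of this kind.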
Broadly speaking, the differences among the four corpora are not meaningful, as the frequency figures are almost equal. This means that the basic lexical units are not context dependent but generalised, as are the most frequent phonemes. Therefore, the results obtained from the phonological analysis of CHIEDE can be extrapolated.
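The selection of types shared by the four corpora, described in the methodology of this section, can be sketched in Python as follows. This is an illustration, not the procedure actually run with CLAN: the function name, the toy corpora, and the excluded set are hypothetical, and ranking the shared types by the sum of their relative frequencies is an assumption, since the ranking criterion is not stated.

```python
from collections import Counter


def common_frequent_types(corpora, top_n=500, excluded=frozenset()):
    """Most frequent types that occur in every corpus.

    `corpora` maps a corpus name to a list of word tokens (assumed to be already
    normalised). `excluded` holds forms to drop, such as proper names.
    """
    freqs = {}
    for name, tokens in corpora.items():
        counts = Counter(t for t in tokens if t not in excluded)
        total = sum(counts.values())
        freqs[name] = {t: c / total for t, c in counts.items()}
    # Keep only the types attested in all corpora.
    shared = set.intersection(*(set(f) for f in freqs.values()))
    # Rank by summed relative frequency, breaking ties alphabetically.
    ranked = sorted(shared, key=lambda t: (-sum(f[t] for f in freqs.values()), t))
    return ranked[:top_n]


# Hypothetical toy corpora.
toy_corpora = {
    "A": ["que", "no", "que", "Ana"],
    "B": ["no", "que", "si"],
    "C": ["que", "que", "no"],
}
print(common_frequent_types(toy_corpora, top_n=2, excluded={"Ana"}))
```

Using relative rather than absolute frequencies prevents the largest corpus from dominating the ranking, and the resulting list is what would then be passed to the phonological transcriber.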

DISCUSSION
Revisiting the research questions in light of the results, the major findings are summarised here. Regarding the first research question posed in the present study, it can be concluded that, according to the sample, the Spanish phonological system is essentially acquired (in terms of intelligibility) at the age of three years (as shown in Table 2). Acquisition is here understood as development, that is, as a process where phonemes are already organised into patterns (what Velleman and Vihman (2002) call templates) typical of the final stages of development in children, showing that units are rooted. According to Velleman and Vihman (2002, p. 20), "templates serve as a stepping stone in the direction of the adult system, despite the decrease in accuracy that may temporarily result". Vihman (2018, p. 38) also states that "template formation is neither the outcome of a pre-existing principle nor an end in itself, but instead a dynamic (and momentary) child response, in the early stages of acquisition, to the phonological and lexical challenges of the language".
It is generally accepted in Spanish phonological acquisition research that the most problematic phonemes are liquid consonants, the fricatives /s/, /θ/ and /x/, the nasal /ɲ/, and the plosive /d/ (Acevedo, 1993; Bosch, 1983; Jiménez, 1978). However, after analysing these sounds in CHIEDE, it can be observed that both the fricative /s/ and the liquids /l/ and /ɾ/ are among the most frequent phonemes. CHIEDE's participants showed no added difficulty in their use, indicating that, although they may be problematic phonemes at the time of their acquisition, from three years old onwards these three sounds do not present any difficulty for children with typical development; in fact, they are widely used. The rest of the phonemes considered problematic are characterised by a lower use. The higher frequency of certain phonemes over others is a lexical matter: "Thus, when we examine the lexicon (words) of a language, not all sounds have an equal opportunity to appear in all positions." (Bernstein-Ratner, 1994, p. 351). Certain phonemes, such as /r/ or /ɲ/, are less frequent in the Spanish lexicon, and thus their frequency of use is low (as seen in the frequency lists, Tables 2 and 8).
Results from the present study shed light on the existence of a turning point at four years old in the process of L1 acquisition (research question 2). On the one hand, the figures for PMLU and SMLU (Table 4) indicate that from four to five years of age there is a significant increase towards adult language. Furthermore, the standard deviation (Table 6) shows how language becomes stabilised from five years old onwards. It can also be stated that the subjects of this study fit Díez-Itza and Martínez López's (2004) stages, as it seems that from 3;0 to 5;0 years old children are in a period of reorganisation of the phonological system, termed "stabilisation" by the authors; from 5;0 years old onwards, however, children seem to reach the "resolution" stage. The variability shown by the 4;0-4;12 group leads to the conclusion that around four years old there is a landmark which is relevant not only for research on typical language development, but especially for research on speech and language disorders. This turning point is also supported by the U-shape development pattern evidenced by the analysis in Figure 5. Although the 3-year-old group displayed a similar pattern, it appeared in the less frequent phonemes. The 4-year-olds, however, exhibited a higher variation and a U-shape pattern precisely for those phonemes which are acquired earlier and, therefore, should be stable at this age.
The overriding question guiding this research is to what extent input frequency is relevant in the L1 acquisition process (research question 3). In disciplines such as Psycholinguistics, and more specifically in Speech and Language Therapy, it is widely accepted that phonemes which usually pose a problem in the acquisition process, such as the Spanish trill, are characterised by a more difficult physiological articulation (Bosch, 1983; López Valero et al., 1989). However, this idea is conceived from the standpoint of adult speakers whose articulatory system is fossilised. The baby's physiology is ready to adapt to different circumstances, and therefore we cannot tell whether it is difficult for a child to manage his/her articulators to pronounce a sound or whether he/she simply lacks enough examples to learn it. According to Zamuner et al. (2004, p. 1420), "it appears that children are not limited by articulatory or perceptual constraints, but rather that children's errors are largely influenced by their ability to access stored representations". For these reasons, and mainly on the basis of the results obtained from CHIEDE, the relevance of probability and frequency in studies on language ontogenesis is highlighted here, as frequency of use may be an essential indicator of typical development. It is also agreed that at the age of three all vowels are acquired, followed by nasals, approximants, and later plosives. However, at this age, the incomplete acquisition of liquids, fricatives and affricates prevails (Lleó, 2012). Interestingly, this order of acquisition coincides with the order of frequency of phonemes in spontaneous adult speech in Spanish (Table 2).
Studies such as those by Demuth (2009), Ellis (2017), Kern et al. (2014) or Tomasello (2009), among several others, demonstrate the probabilistic relationship between input and language acquisition. The present study is another example of how the input frequency affects language development (in this particular case, phonological acquisition). "Ease of articulation seems to play only a partial role in determining the overall developmental route" (Pye, Ingram & List, 1987, p. 182).
Another factor influencing phonological learning is phonological neighbourhoods, that is, sets of phonologically similar words. Studies such as those by Zamuner (2009) showed that the words which are acquired first have denser neighbourhoods than those acquired later. Maekawa and Storkel (2006) also highlighted the importance of phonotactic probability and neighbourhood density. These authors concluded that "[...] phonotactic probability, density and frequency appeared to predict expressive vocabulary development but with individual variation across children" (Maekawa & Storkel, 2006, p. 457).
Likewise, Pierrehumbert (2003) referred to various studies that have shown that children are sensitive to statistical patterns of sound. This stands in opposition to the idea of a universal inventory from which the individual selects the necessary elements to design his/her phonological system. The main counter-argument she stated is that this theory does not explain why children take so much time from when they acquire or distinguish an element as belonging to their own language until they master its production in an adult manner. Phonetic knowledge is gradually acquired and it is updated through experience. "Acquiring the phonetic encoding system of a language involves acquiring probability distributions over the phonetic space" 9 (Pierrehumbert, 2003, p. 184).
This last idea leads us to consider how crucial the roles of probability and frequency of use are in the process of language acquisition. Bernstein Ratner (1994) suggested that the elements children acquire earlier are the most frequent both in adult speech and in all languages throughout the world, while the phonemes that present a higher learning difficulty are precisely those that are less represented.
In this study (Table 2), the frequency of use that the oldest age group shows is very similar to that shown by adults in spontaneous speech, whereas the differences between the other two groups of children and the adult one are larger. According to data from CHIEDE, the most common sounds in adult speech are precisely those which, based on previous research (Bosch, 1993; López Valero et al., 1989; Serra, 1983), are acquired earlier and more easily, i.e., vowels and nasals in the first place, followed by plosives and liquids. Lower positions on the frequency list are occupied by fricatives, which are precisely the last and most problematic in the acquisition process.
The same phenomenon occurs in other languages. For example, the English sounds identified as more complicated to learn (Grunwell, 1981) are those which have a lower frequency rate in adult language (Mines et al., 1978). Among these phonemes are the voiceless dental fricative /θ/ and the voiceless and voiced postalveolar affricates /ʧ/ and /ʤ/. There is an undeniable relationship between the less frequent phonemes in adult language and those which are more problematic in the acquisition process.
The evidence so far leads us to emphasise the importance of input frequency in the study of L1 acquisition and its relation to the most problematic phonemes. Accordingly, the importance of place and manner of articulation as the sole factor causing the delayed acquisition of certain phonemes should be played down (Rose, 2009). As stated by Menn and Stoel-Gammon (1996, p. 352), "A theory of child phonology cannot ignore word frequency although current adult phonological theory has no place for this notion".

Limitations
Despite the fact that the children in CHIEDE showed a complete (intelligible) acquisition of phonemes at 3 years old, this situation must be regarded with caution, since the participants represent only a part of the whole population of Spanish-speaking children. Giving priority to sub-corpora balance (between the three different age groups' language production) limited the number of participants per age group. Nevertheless, the comparison of CHIEDE's data to those from three different corpora supports, to some extent, the findings in the present study.
As Grunwell (1981) stated, language acquisition is characterised by great variation from one individual to another. However, data from CHIEDE may serve as a paradigmatic pattern of linguistic behaviour for research on child language.
Another potential limitation could be the grouping of participants. As 4 years old is hypothesised as a critical age, speakers could have been grouped by different age limits to analyse the range 3;5-4;5. However, a balanced distribution of children in three groups prevailed here. Otherwise, age ranges and number of participants per group would be unbalanced. In addition, it would also be relevant to consider the role of the gender factor for future research.
Concerning the characteristics of the transcription, further research is suggested on issues such as the distribution of phonemes and syllable structure, and the description of clusters or allophones. This would involve a narrow transcription, which exceeds the scope of this research. Indeed, as mentioned, recording conditions were not ideal due to the ambient sound.
Finally, it would be interesting to extend this experiment to other languages, particularly to other Spanish dialects and varieties, and observe to what extent patterns coincide.

CONCLUSIONS
Research on language acquisition beyond English and crosslinguistically has thrived during the last decades, although many questions remain unsolved. There is a need for large cross-sectional spontaneous speech corpora that are sufficiently representative and linguistically annotated. Furthermore, standards must be established to facilitate analysis and comparison. As Stoll (2009, p. 91) complained, "the use of different data sets, different methods or different criteria for coding makes it difficult to compare across languages". Also, corpus-based analysis of the late acquisition period, that is, beyond 36 months of age, should be increased, as most of the existing corpora do not include child participants beyond that initial period of language development. The use of representative corpora and computational tools enriches research on language acquisition and is a reliable method for the study of frequency, which, as several investigations reveal, is a significant factor throughout the acquisition process.
The findings from the present study contribute to current research on Spanish-speaking children's phonological acquisition in three ways:
• Providing a phonological inventory which may serve as a model for future research on typical and atypical child language development (from 3 years old onwards).
• Contributing to the assumption that 4 years old is a turning point in the process of language acquisition, as the variability analysis of the frequency of phonemes in CHIEDE shows.
• Corroborating the importance of the role of input frequency as a factor to take into consideration when analysing child language.
From a methodological point of view, we encourage language acquisition research based on natural language corpora. Corpus Linguistics and Computational Linguistics are essential in language analysis, especially from a usage-based approach, as commented above and shown in this research. In addition to the three contributions mentioned above, the findings of this research have practical implications for Clinical Linguistics and Speech and Language Therapy, as they can be used as a paradigm for the assessment of child language.