Measuring a decade of progress in Text-to-Speech

The Blizzard Challenge offers a unique insight into progress in text-to-speech synthesis over the last decade. By using a very large listening test to compare the performance of a wide range of systems that have been constructed using a common corpus of speech recordings, it is possible to make some direct comparisons between competing techniques. By reviewing over a hundred papers describing all entries to the Challenge since 2005, we can make a useful summary of the most successful techniques adopted by participating teams, as well as drawing some conclusions about where the Blizzard Challenge has succeeded, and where there are still open problems in cross-system comparisons of text-to-speech synthesisers.


INTRODUCTION
The last ten years have seen considerable improvements in the quality of speech generated by text-to-speech (TTS) systems, and we have evidence for this from the Blizzard Challenge and the associated summary papers by the organisers (Black and Tokuda, 2005a; Bennett, 2005; Bennett and Black, 2006; Fraser and King, 2007; Karaiskos et al., 2008; King and Karaiskos, 2009, 2011, 2012; Prahallad et al., 2013).

The Blizzard Challenge
Inspired by corresponding evaluation methods in automatic speech recognition (ASR), the Blizzard Challenge (or "Blizzard" for short; see http://www.synsig.org/index.php/Blizzard_Challenge) set out to provide direct comparisons between systems in a way that was not possible before. As we will briefly describe in Section 1.2, TTS systems are generally rather complex and even messy (to the point of being impossible to optimise in any formal sense) because they rely on a large and disparate collection of linguistic resources and data in order to achieve the difficult transformation from written to spoken language. Blizzard performs cross-system comparisons, and tries to make them as meaningful as possible.
Blizzard is an annual event, started in 2005, in which typically 10 to 20 groups independently build synthetic voices from a common speech corpus and then submit synthetic speech samples to a common evaluation, which uses a large pool of listeners. We will summarise the methodology used by Blizzard in Section 2 and at the end of the paper in Section 4 we will provide a critique of this methodology's strengths and weaknesses. In between, the core of this paper in Section 3 lists the key findings from nearly a decade of Blizzard Challenges: this means identifying the techniques used by the most successful systems which went on to be widely adopted. This is certainly not a survey of the entire field of speech synthesis; for that, you might turn to Taylor (2009) for a comprehensive textbook or to Suendermann et al. (2006) for a discussion of open challenges. Rather, this is a view taken through the lens of the Blizzard Challenge, the only place where direct comparisons across a wide range of systems can be seen.

The typical architecture of a Text-to-Speech system
In order to understand the scope of the Blizzard Challenge, and in particular what it is able to evaluate and what it so far has not attempted to evaluate, we need to describe a typical TTS system architecture. Almost invariably, systems are divided into two components. The first is a linguistic processor, or "front end", which takes unnormalised text and produces from it a "linguistic specification". This will contain information such as a phonetic string, syllabification of that string, some representation of prosody (e.g., accents and boundaries), and so on. The second component is a waveform generator that takes as input this linguistic specification and creates a corresponding speech waveform.
The methods used within the front end are many and various, including both human-created resources such as text normalisation rules or pronunciation dictionaries, and learned-from-data models such as those needed to predict the pronunciation of words not in the dictionary. There are really only two good methods available for the waveform generator: either fragments of recorded speech are selected from a database and concatenated (the unit selection approach), or a (statistical) model learned from that database is used to generate synthetic waveforms via a vocoder. The database is a critical component, and great care is usually taken both in selecting what should be contained in it, and then in recording a professional speaker under ideal studio conditions.
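As a rough illustration of the two-component split described above, the following Python sketch wires a toy front end to a toy waveform generator. Everything in it is hypothetical: the `LinguisticSpec` fields, the two-word lexicon, the spelling-based fallback for unknown words, and the stand-in generator that simply emits silence are assumptions made for illustration, and are not taken from any Blizzard entry.

```python
from dataclasses import dataclass, field

# Hypothetical toy lexicon; a real front end relies on a large pronunciation dictionary.
LEXICON = {"hello": ["h", "@", "l", "ou"], "world": ["w", "@@", "l", "d"]}


@dataclass
class LinguisticSpec:
    """Illustrative linguistic specification: a phone string plus word-final flags."""
    phones: list = field(default_factory=list)
    boundaries: list = field(default_factory=list)  # True where a word ends


def front_end(text):
    """Normalise the text and convert each word to phones."""
    spec = LinguisticSpec()
    for token in text.lower().split():
        word = "".join(ch for ch in token if ch.isalpha())  # crude normalisation
        if not word:
            continue
        # Unknown words fall back to their letters (a stand-in for a learned model).
        phones = LEXICON.get(word, list(word))
        spec.phones.extend(phones)
        spec.boundaries.extend([False] * (len(phones) - 1) + [True])
    return spec


def waveform_generator(spec, samples_per_phone=80):
    """Stand-in generator: returns silence of the right length, not real speech."""
    return [0.0] * (len(spec.phones) * samples_per_phone)


def synthesise(text):
    """End-to-end pipeline: text -> linguistic specification -> waveform."""
    return waveform_generator(front_end(text))
```

The point of the sketch is only the interface: the waveform generator sees nothing but the linguistic specification, which is why differences in the front end and in the generator are so hard to evaluate separately.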

What this means for any attempt at cross-system evaluation
It is now clear that making comparisons between different TTS systems is not going to be easy, because their performance rests on so many sub-components, any of which could be responsible for differences in the generated speech. In particular, if two systems employ recordings of different speakers then all comparisons may be rendered meaningless because listeners may simply prefer one speaker over the other. It is this factor that Blizzard first set out to control, by using the same speaker in all systems to be compared. Blizzard also controls the database content, by distributing a single shared corpus of speech recordings, often provided by an established company or research group; for example, a corpus was released by ATR for the 2007 Challenge (Ni et al., 2007).

THE BLIZZARD CHALLENGE METHODOLOGY
Given the complicated nature of the front end, and the fact that the content of the linguistic specification varies from one system to another, it is hard to design an evaluation that targets the front end specifically. Likewise, since the waveform generation component may be carefully tuned to use one particular form of linguistic specification (particularly in the case of unit selection systems), it is hard to evaluate that in isolation too. So, the Blizzard Challenge is obliged to take a holistic approach and it generally evaluates entire end-to-end systems.

Common data
The methodology used in the Challenge is described by Black and Tokuda (2005b) and we will only summarise it briefly here. First, a language (or in some years multiple languages) is selected and common data sets are defined. The data minimally comprise recorded speech from a single speaker alongside text transcriptions. Optionally, alignments between the text and speech are provided, possibly including phonetic segmentation or other linguistic annotations on the text such as syllabification, up to and including a complete linguistic specification. Rules on the use of the data, and on what additional resources may or may not be employed by participants, are defined and refined each year.

Open participation
An open invitation for participation is sent to the speech synthesis research community, and teams register. During a defined time period, usually of a few months, each team builds their system using the common data. At the end of this period, a set of previously-unseen test material is circulated and teams return the corresponding synthetic speech from their systems.

Evaluation using a listening test
The organisers conduct a large-scale listening test, typically with over 500 listeners, and provide the results to the teams. The Challenge concludes with a workshop and published papers summarising these results.

Anonymity
In order to encourage industry participation, a system of anonymity is adopted so that, although the names of all participating teams are made public at the end of the Challenge, the results are presented without showing the correspondence between team names and results in any publication. Individual teams of course know which results are for their system; they may choose to reveal this in their own publications, but this is not required.

TECHNIQUES EMPLOYED BY PARTICIPATING SYSTEMS
We now proceed to the main point of this paper: a kind of 'executive summary' of the techniques used by participating teams that have proved most successful and have therefore been widely adopted. The Blizzard Challenge cannot claim to have caused the emergence of new techniques: its claim is more limited and concerns providing independent evidence about the relative merits of competing techniques. This evidence is sometimes more compelling than that found in individual papers because of the direct comparisons made between 'best in class' systems, and the comparisons with natural speech, rather than the usual comparisons made between a single proposed method and a baseline system which is usually also created by the same researchers. The best example of this kind of evidence is the landmark finding that a statistical parametric synthesiser was as intelligible as natural speech and more intelligible than all unit selection systems.
Whilst the first two techniques listed in the next part of the paper (Section 3.1) emerged well before the start of the Blizzard Challenge, they have continued to perform well and can each claim to be "better" than the other along some dimension of the evaluation. Indeed, another good example of the strong evidence that the Challenge provides concerns the relative naturalness and intelligibility of unit selection and statistical parametric approaches.
Applications of TTS. Most TTS systems aim at some non-existent 'general purpose' application, but the Blizzard Challenge has also witnessed more targeted systems, such as personalised synthesis for clinical applications (Bunnell et al., 2005, 2010), something that Yamagishi's adaptive systems (Yamagishi et al., 2008) are also being used for. Recent Challenges have used audiobooks as a source of transcribed speech recordings and part of the evaluation has involved synthesis of paragraph-sized texts, roughly approximating a TTS audiobook application. The Challenge places no constraints on resources other than the few months allowed to build the system and the few days to synthesise the test material: most Blizzard entries are resource-hungry (in terms of memory and/or compute) server-based research systems. There have been only occasional entries that are small footprint / low compute, such as that described by Baumgartner et al. (2012), which would be appropriate for embedded applications.

Waveform generation
Unit selection generates the most natural speech
Consistently, in every Challenge, the system rated as the most natural by listeners has generated the speech signal by concatenating recorded samples of speech. The size of these units has varied somewhat, as have the methods for selecting and concatenating them, but it is striking that listeners consistently rate recorded speech containing inevitable concatenation artefacts as sounding more natural than speech generated using a vocoder. Nevertheless, whilst listeners might say such speech is more natural, they generally find it harder to understand than speech from a vocoder driven by a statistical parametric model. The Challenge has seen many 'classical' unit selection systems that closely follow Hunt and Black (1996). The most prototypical of these is the Festival system, with its 'multisyn' unit selection engine (Clark et al., 2005, 2006; Richmond et al., 2007). This system was adopted as a benchmark in later Challenges, allowing some limited comparisons across different years of the Challenge to be made (e.g., was a system better or worse than Festival?). Another classical unit selection system, which like Festival has its roots in earlier ATR systems, is Ximera.
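Classical unit selection in the style of Hunt and Black (1996) is usually described as a search for the sequence of database units that minimises the sum of target costs (how well each unit matches the linguistic specification) and join costs (how well adjacent units concatenate). The Python sketch below shows that search as a dynamic programme. It is a minimal illustration, not the implementation of Festival or of any Blizzard entry: the function name `select_units`, the representation of a unit as an `(f0, corpus_index)` pair, and both toy cost functions are assumptions.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi search for the candidate sequence with the lowest total cost.

    targets[i] is the specification for position i; candidates[i] is the list
    of database units that could fill it. Returns (best_sequence, total_cost).
    """
    # best[i][j] = (lowest cumulative cost ending in candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            cost, prev = min(
                (best[i - 1][k][0] + join_cost(p, c), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], c), prev))
        best.append(row)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy costs (assumptions): a unit is (f0, corpus_index); the target is a desired f0.
def toy_target_cost(target_f0, unit):
    return abs(target_f0 - unit[0])


def toy_join_cost(left, right):
    # Units that were adjacent in the recorded corpus join for free.
    return 0 if right[1] == left[1] + 1 else 5


targets = [100, 110, 120]
candidates = [
    [(98, 0), (100, 10)],
    [(112, 1), (110, 20)],
    [(119, 2), (120, 30)],
]
sequence, cost = select_units(targets, candidates, toy_target_cost, toy_join_cost)
```

In this toy example the search prefers the three units that were contiguous in the corpus, even though each matches its target slightly less well than an isolated alternative, because avoiding joins outweighs the small target mismatch.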

The inevitable creep of statistical techniques
As soon as the good performance of HMM-based (Section 3.1.2) and later hybrid (Section 3.1.3) synthesisers was demonstrated, many unit selection systems entered into the Challenge started to adopt statistical methods. Jess was initially a classical unit selection system in its first appearances (Carson-Berndsen, 2006, 2007) but later added an HMM-based prosody model (Cahill et al., 2011). OpenMary also evolved from unit selection (Schroeder et al., 2006; Schroeder and Hunecke, 2007) by adding a statistical join model (Schroeder et al., 2008) and continues to participate in the Challenge with both unit selection (Schröder et al., 2009; Charfuelan et al., 2013) and HMM-based (Section 3.1.2) systems. The I2R system likewise has evolved from classical unit selection (Dong et al., 2008, 2009, 2010) to a system employing HMM-guided unit selection (Dong et al., 2011; Lee et al., 2013). Predating these, though, is the clunits system (Black and Taylor, 1997), first entered in 2008; see Section 3.1.3.
Newcomers can build great unit selection systems too. Many unit selection systems entered into the Challenge do not actually perform any better than Festival, so have to be seen mainly as a learning exercise for the participating teams and not a contribution to knowledge. However, the ability to build excellent unit selection systems can be developed independently, as demonstrated by a couple of 'newcomers' (from a speech synthesis community point of view). One notable entry into three of the Blizzard Challenges is the classical unit selection system IVONA (Osowski and Kaszczuk, 2006; Osowski, 2007; Kaszczuk and Osowski, 2009) from a previously little-known Polish company. This system achieved outstanding results; the company was subsequently acquired by Amazon. Another previously little-known company has also entered very respectable unit selection systems into the Challenge: Lessac's method uses a unit called the Lesseme (a kind of phonetic/prosodic-context-dependent unit) to very good effect (Nitisaroj et al., 2010, 2011). The reason that the Lesseme works is probably that it hardcodes some of the key target cost features into the unit type, rather than being radically different from more common units like diphones. What do we learn from such systems? We see that unit selection continues to be the obvious choice if building a commercial product; that, with the right engineers, it delivers very high naturalness. The executive summary is pretty clear: if we don't care about controllability, expressivity, or having a library of many voices, and we have the time, money and the right people to do the engineering, then we should choose unit selection every time.
Taking a little more risk. The above systems were entered into the Challenge principally to benchmark them against other systems, although participants that take part more than once generally report that the Challenge has helped them improve their systems. On the other hand, some participants in the Challenge use it as an opportunity to take a little more risk and try new ideas. Cerevoice experimented with compressed waveforms (Aylett et al., 2007) in one Challenge, and a form of data cleaning based on genre pruning in another (Andersson et al., 2008). Some have even used Blizzard as a way to develop research methodology (Kominek et al., 2005).
Voice conversion. Blizzard requires that the entered voice sounds close to the provided speaker, which usually means building a voice on that data. Only two unit selection entries have done differently, by starting from an existing voice. The IBM system of 2005 used speaker transformation (Hamza et al., 2005), and a system based on the Festival front end with the AhoTTS waveform concatenator, which first entered in 2008 (Sainz et al., 2008), also applied voice conversion in 2009 (Sainz et al., 2009).
Non-uniform units. For engineering simplicity, most systems employ a single unit type such as the diphone or half-phone, but a few try to extend this to non-uniform units. Examples from the Blizzard Challenge include Ding and Alhonen (2007), Yang et al. (2006) which also employs an HMM-generated prosody target, the DSSP system (Latacz et al., 2008) which later added a statistical target cost (Latacz et al., 2009) and trainable context-dependent target cost weights (Latacz et al., 2010), and a system using syllable-sized units plus back-off (Raghavendra et al., 2008).
Learning the unit type and constructing synthetic units. Almost all unit selection systems used expert-defined types (e.g., diphones) as the acoustic unit. Two exceptions to this are the IBM unit selection systems which use HMM state-sized units (a fraction of a phone) and employ HMM state clustering to identify classes of interchangeable units (Eide et al., 2006; Fernandez et al., 2008).
Another departure from the usual type of unit is Toshiba's 'plural unit selection and fusion' approach which constructs units by automatically merging together several recorded instances (Buchholz et al., 2007; Li et al., 2009). Other systems also try to overcome the limitations of units available in the original recordings by constructing additional units either through concatenation (Aylett et al., 2006) or using HMMs (Aylett and Pidcock, 2009), in an offline procedure known as 'bulking'. It's worth re-iterating at this point that we are only concerned in this paper with systems entered into the Blizzard Challenge, and are not attempting to trace ideas back to their inventors.

Statistical parametric methods generate the most intelligible speech
In contrast to the unit selection approach, systems which employ statistical parametric models to drive a vocoder are generally rated as less natural-sounding by listeners. Nevertheless, the same listeners can transcribe this 'less natural' speech more accurately than unit selection output. The Blizzard Challenge has witnessed the most important period of progress for statistical parametric models. The first Challenge already saw the use of the high-quality vocoder that has become the most widely used (STRAIGHT) and explicit duration models (hidden semi-Markov models: HSMMs) (Zen and Toda, 2005). Subsequent years saw systems employing a vast array of enhancements, such as MGC-LSP acoustic features, which combine the benefits of cepstral and all-pole representations of the spectral envelope, and global variance (GV) (Zen et al., 2006), minimum generation error training (MGE) (Ling et al., 2006, 2007), formant enhancement (Oura et al., 2009), trajectory training (Maia et al., 2009), the use of GV during training along with trainable mixed excitation (Shiga et al., 2010), minimum generation error linear regression (MGELR) model adaptation (Oura et al., 2010), adjustments to the perceptual scales used to represent acoustic features (Yamagishi and Watts, 2010), deterministic annealing expectation maximisation (Hashimoto et al., 2011) and 'chapter-adaptive training' to cope with changes in recording conditions within audiobook training data (Takaki et al., 2013).

Adaptive models. High intelligibility might be a very attractive property, but was discovered in the course of evaluation and was not specifically designed or claimed as a feature of these systems. On the other hand, a 'killer feature' of the statistical parametric framework, that is designed right into the system and is one of the main claims of proponents of the statistical approach, is the ability to modify the underlying model parameters. This is most commonly achieved using adaptation techniques borrowed from ASR, then subsequently extended for TTS. Blizzard entries have used supervised speaker adaptation (Yamagishi et al., 2008) as well as unsupervised adaptation (i.e., with word transcripts obtained using ASR), as an effective way to leverage pre-existing recordings of other speakers when constructing a voice for that year's target speaker. As with unit selection, not all of these entries are better than the HTS benchmark (employed alongside the Festival benchmark, to give an additional point of calibration from year to year). So, whilst statistical parametric methods might rightly claim to be more 'automatic' than unit selection, nevertheless a high degree of expertise and engineering skill is still required to obtain good results.
Improvements to the vocoder through source modelling. The hypothesis that the vocoder is the limiting factor in the naturalness of statistical parametric speech synthesis has led to various attempts to construct improved vocoders. Within the Blizzard Challenge, the most prominent strand of research in this area has focussed on improving the excitation source either by modelling residual signals (Maia et al., 2008, 2009), with a parametric glottal waveform model (Andersson et al., 2009) or by using sampled glottal pulse waveforms as in the GlottHMM system (Suni et al., 2010, 2011, 2012).

Hybrid systems: unit selection guided by a statistical parametric model
In the first few years of the Challenge, it became clear that statistical parametric systems consistently had the better intelligibility, whereas unit selection systems consistently had better naturalness. Although never formally proven, it is widely thought that this better naturalness was a result of using recorded waveforms; in other words, it is a local property of the signal that is partly independent of concatenation artefacts. Conversely, it is widely thought that the intelligibility of statistical parametric systems is a result of their ability to more accurately generate context-dependent speech units (as opposed to the out-of-context units of unit selection). An obvious next step was then to retain unit selection as the method for waveform generation (thus ensuring a natural-sounding signal) but to select the units using a statistical parametric model (thus taking advantage of its ability to predict the acoustic properties of units-in-context that did not occur in the available recorded corpus but that were needed at synthesis time).
Probabilistic models for unit selection. The hand-crafted nature of the join and target cost functions used in classical unit selection is often seen as unsatisfactory, since they must be tuned by ear and it is not possible to be sure that optimal values of the various parameters (e.g., weights on linguistic features) have been found.
Overcoming this limitation has been a long-standing goal in unit selection research. Within the Blizzard Challenge, we have observed a number of systems tackling this problem. Sakai and Shu (2005) and Sakai (2006) describe a system evolved from MIT's Envoice in which probabilistic models replace almost all hand-tunable parameters. Likewise, the 'clunits' method, first entered to the Challenge in 2008 (Oliveira et al., 2008), builds clustering trees which group together acoustically interchangeable units which share a subset of linguistic feature values. Other attempts at trainable unit selection include the two early entries from µXac (Rozak, 2008), followed by the much improved system described in Rozak (2009). Lessac also entered systems in which an acoustic target, in this case from a Hierarchical Mixture of Experts, guided the selection of units. The weakness of most attempts to employ learned-from-data models in unit selection is perhaps that they pay attention only to acoustic similarity and do not involve human perceptual judgements. This is probably why a hand-tuned target cost is still better, if correctly constructed and tuned by an expert: it accounts for perceptual judgements.
Hybrid systems. We define 'hybrid' systems as those which employ a statistical parametric model (which is in itself capable of generating speech in conjunction with a vocoder) to guide the selection of units from the database, which are subsequently concatenated. There is of course not a clear dividing line: for example, the unit selection system described by Wilhelms-Tricarico et al. (2012) uses a powerful statistical model to predict an acoustic target trajectory, but without any intention of generating speech from it. The first proposal of a hybrid system observed in the Blizzard Challenge was from Kominek and Black (2006), who mentioned both 'clunits' and HTS as candidates for the statistical parametric model, but actually used their own 'ClusterGen' method as the statistical parametric component; this is rather similar to decision tree-clustered HMM states, as used in HTS. The system was refined and entered again in 2007 (Black et al., 2007).
Subsequently, the 'hybridisation' of HMM-based synthesis with unit selection was developed and placed on a formal mathematical foundation in which the probabilistic nature of the HMMs was made use of. The sequence of highly successful entries from USTC and their spinout iFlytek is strong evidence that this technique does indeed combine benefits of unit selection and statistical parametric models (Ling et al., 2007, 2008; Lu et al., 2009; Jiang et al., 2010). Subsequent systems of theirs experimented with Lessemes as the modelling unit (Chen et al., 2011), channel- and expressiveness-related labels for audiobook data (Ling et al., 2012), automatic weight learning based on an objective quality model (Chen et al., 2013) and vocal tract resonance (VTR) trajectory-guided unit selection.
In parallel to the USTC/iFlytek system evolution, Microsoft Research Asia (MSRA) have entered similar systems. The rather elegant name of 'trajectory tiling' was coined by them and featured in their 2010 entry to the Challenge (Qian et al., 2010). It alludes to a method used in computer graphics in which a parametric model (e.g., a wireframe or skeleton) is given a 'skin' composed from sampled images. The skeleton is convenient for the artist to manipulate and is flexible enough to produce any desired pose, whilst the detailed skin convinces the viewer that the object is real and not computer-generated. In speech, the corresponding advantages are that the underlying statistical parametric model is able to generate any speech sound in any context (the 'trajectory'), whilst the overlaid samples ('tiles') provide the necessary details to convince the listener that the signal is natural speech.
In later years, more groups have adopted various forms of the hybrid approach, including the NTNU (Meen and Svendsen, 2010), BUCEADOR, and SHRC-Ginkgo systems (Yu et al., 2013).

Linguistic features
It is impossible, for the reasons discussed in Section 1.2, to make many meaningful comparisons across the linguistic processors employed in the Blizzard Challenge. The differences are numerous and their effects on the speech output are impossible to quantify. This has not prevented us from drawing very concrete conclusions about waveform generation, though, because we observe the same patterns in intelligibility and naturalness across multiple systems (employing different front ends) and across several years of the Challenge.
All we can do with regard to the linguistic features predicted by each system from the text input is to highlight exceptional or unusual features employed by some systems.
Unsupervised features. It should be clear that typical front ends are knowledge-rich and are both difficult and expensive to construct. To sidestep this, the system described by Watts et al. (2013) attempted to predict features from text without requiring any human expertise or pre-built resources such as pronunciation dictionaries. The method failed on English, but was reasonably successful on several more well-behaved languages.
Wider and deeper features. With the introduction of audiobook data in the Challenge, the opportunity arose to use information beyond the current sentence. This has been tried in several ways, including simply appending it as additional contextual features to HMMs (Takaki et al., 2012). Wider context may also be used to separate out disparate data, such as with the channel- and expressiveness-related labels of Ling et al. (2012), or the 'chapter-adaptive training' to cope with changes in recording conditions within audiobook training data used by Takaki et al. (2013).
Whilst many believe that a 'deeper' analysis of the text should yield useful features, it has proven very hard to obtain measurable improvement in the output speech. A possible exception to this is the excellent system described by Yu et al. (2013), which uses syntactic parser features for an audiobook synthesis task.

Positive contributions
In addition to the unquantifiable warm feeling of improved speech synthesis community cohesion and a spirit of sharing techniques and data, the Blizzard Challenge can claim a couple of concrete contributions in its own right.

Advances in objective measures
Although not directly used to rank the systems within the Challenge, objective measures of speech quality have made some progress over the last decade. Most notable is the work of Falk et al. (2008), Hinterleitner et al. (2010) and Norrenbrock et al. (2012) who have collectively pursued instrumental (that is, signal-based rather than listener-based) measures; these have begun to show useful results. These measures attempt to replicate the judgements that listeners would provide for a given set of speech signals. The Blizzard Challenge has been able to provide a substantial training set of signals-plus-listener-ratings on which objective measures can be tuned and additional independent data sets on which their effectiveness can be tested.
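One simple way to check how well an objective measure replicates listener judgements is to correlate its per-system scores with the mean listener ratings for the same systems. The Python sketch below computes the Pearson correlation coefficient for that purpose. It is an illustrative assumption rather than the procedure used in the cited work; the function name `pearson` and the example numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-system values: objective scores and mean listener ratings.
objective_scores = [2.1, 2.9, 3.4, 4.0]
mean_listener_ratings = [2.0, 3.1, 3.3, 4.2]
r = pearson(objective_scores, mean_listener_ratings)
```

A value of r close to 1 on held-out systems would suggest that the measure ranks systems much as listeners do, although a high correlation on the tuning data alone says little.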

Spinoffs and related evaluations
The Blizzard Challenge was itself inspired by the long tradition of common evaluation tasks from the field of ASR, and has in turn inspired others to use this methodology to measure (and hopefully promote) progress in other fields. The Hurricane Challenge (Cooke et al., 2013) evaluated methods for improving the intelligibility of natural or synthetic speech in the presence of additive noise, and its organisation closely followed the Blizzard model, with an open invitation to the community to participate, a common data set and set of rules, and a large centralised listening test run by the organisers. The Albayzin Challenges in 2010 (Díaz et al., 2011) and 2012 included a replication of the Blizzard Challenge, using a Spanish corpus.

What to evaluate
Naturalness and intelligibility remain the main evaluation criteria for speech synthesis, with judgements being elicited from listeners on a Likert scale (Likert, 1932). Naturalness remains poorly defined, although listeners do seem to have a clear idea of what is being asked of them given the consistency of their judgements. Intelligibility is measured, as noted in Section 4.2.2, in a particularly unrealistic, or 'ecologically invalid', way. Blizzard also adds an evaluation of speaker similarity to the mix. This was introduced initially only as a check that participants were using the provided recordings and not entering pre-built systems. With the advent of speaker-adaptive approaches, and for unit selection entries employing voice conversion, speaker similarity became a useful dimension of the evaluation in its own right.
Despite continued calls by the organisers, few researchers in the community have risen to their challenge to propose new and better listening test designs, and in particular to propose what to evaluate. The only exception to this is Hinterleitner et al. (2011), who proposed a multi-dimensional test for evaluating synthetic audiobooks. Their method was adopted by the Blizzard Challenge organisers in those later years where audiobook data was used.

How to evaluate
Playing synthetic speech to listeners and asking them to make some response (e.g., provide a rating for a specified property) or perform a task (e.g., transcribe the words they heard) is the bread and butter of synthetic speech evaluation. Whilst objective measures have their place in single-system tuning or in identifying gross differences between systems, a listening test remains the only sure way to demonstrate the superiority of one's proposed new method.
The problem of evaluating synthetic speech via listening tests is not a solved one. It is intrinsically difficult for two reasons. First, it is not clear exactly what properties to evaluate. Second, it is hard to know how to evaluate the chosen properties, and one can never be certain that all of the listeners have correctly performed the task you expected of them.
Blizzard takes a simple approach to alleviating these worries. The instructions given to listeners are generally simple and do not require any training or high level of knowledge on the listeners' part. A large number of listeners is employed, thus minimising the effect of individuals who fail to follow these instructions. The statistical tests for significant differences are deliberately conservative in order to avoid false claims. Of course, the flip-side of this is that it is possible Blizzard fails to identify interesting differences some of the time.
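To illustrate what a conservative test looks like in practice, the following is a minimal sketch and not the analysis actually used in Blizzard. It assumes paired ratings (the same listeners and sentences for every system), swaps in a two-sided exact sign test for simplicity, and applies a Bonferroni correction across all system pairs. The function names and example ratings are hypothetical.

```python
from itertools import combinations
from math import comb


def sign_test_p(a, b):
    """Two-sided exact sign test on paired ratings; tied pairs are dropped."""
    wins = sum(x > y for x, y in zip(a, b))
    losses = sum(x < y for x, y in zip(a, b))
    n = wins + losses
    if n == 0:
        return 1.0  # every pair tied: no evidence of a difference
    k = min(wins, losses)
    # Probability of k or fewer successes under a fair coin, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def significant_pairs(ratings, alpha=0.05):
    """Return the system pairs that differ after Bonferroni correction.

    `ratings` maps a (hypothetical) system name to its list of paired ratings.
    """
    pairs = list(combinations(sorted(ratings), 2))
    if not pairs:
        return []
    threshold = alpha / len(pairs)  # Bonferroni: split alpha over all comparisons
    return [
        (s, t) for s, t in pairs
        if sign_test_p(ratings[s], ratings[t]) < threshold
    ]


# Hypothetical 5-point naturalness ratings from ten paired trials.
example = {
    "A": [5] * 10,
    "B": [1] * 10,
    "C": [5, 4, 5, 4, 5, 5, 4, 5, 5, 5],
}
print(significant_pairs(example))
```

Dividing the significance level by the number of pairwise comparisons reduces the risk of false claims, at the cost of missing some real differences, which is the trade-off described above.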
The listening tests typically used by the TTS research community lack ecological validity in many ways. They take place in an unusual setting (quiet, comfortable listening booths with high-quality sound reproduction and no distractions) and ask listeners to perform tasks they would never do in everyday life. For example, in order to test the intelligibility of systems, listeners are asked to transcribe, by typing on a computer keyboard, the individual words they heard. It is hard to think of a real application where this would be done. Worse, the sentences played to listeners are deliberately hard to comprehend, often being devoid of meaning (Benoit and Grice, 1996). This is done to remove the ceiling effect: in other words, many synthesisers could be close to 100% intelligible if predictable, meaningful sentences were used.
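One common way to score such a transcription task is word error rate (WER): the word-level edit distance between what was played and what the listener typed, divided by the number of words played. The following is a minimal sketch under that assumption. It uses lowercased whitespace tokenisation and makes no allowance for punctuation, spelling variants or homophones, all of which a real scoring procedure would need to handle. The function name and example sentences are hypothetical.

```python
def word_error_rate(reference: str, transcript: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = transcript.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j words of the transcript.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word typed by the listener
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# A meaningless sentence of the kind used to avoid the ceiling effect.
played = "the table walked through the blue truth"
typed = "the table walked through the blue tooth"
print(word_error_rate(played, typed))
```

Because insertions are counted, WER can exceed 1.0 when a listener types many more words than were played.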
Does the lack of ecological validity matter though? In some respects it certainly is not a problem: if our synthesiser is as intelligible as natural speech when using difficult, meaningless sentences then we would be confident that it would be at least as intelligible using normal sentences. That is, the laboratory testing situation can uncover effects that would shrink into insignificance in the real world and the only danger is that we are identifying rather small differences. We still have confidence that we can identify the best system, although we may over-estimate how much better than the next system it actually is.
But in other respects the lack of ecological validity is much more serious. The idealised environment is the most serious issue: real end users do not operate in quiet environments free of distractions. The 2009 Challenge included a condition in which the synthetic speech was corrupted by a simulated telephone channel (King and Karaiskos, 2009) and the Hurricane Challenge mentioned in Section 4.1.2 addressed the problem of speech-in-noise much more rigorously. The tasks used are also a problem, since listeners are allowed to perform them under no significant constraints on their attention or time. There is doubtless still much to learn from experimental psychology, including the use of distractors to disguise the true purpose of the experiment, or methods which can introduce realistic levels of cognitive load into our subjects.
Despite these widely-recognised potential problems with how TTS is generally evaluated, there have been few attempts to innovate. Perhaps this is for the simple reason that any alternative would almost certainly yield far fewer data points per hour of testing time than current paradigms, and so be less practical and more costly. But perhaps it is just plain laziness: researchers prefer to spend their time inventing exciting new methods for synthesising speech, not worrying about whether they are actually measuring the quality of their work in the best way, especially when the burden of some of that evaluation can be offloaded to an external Challenge.

Open issues
4.3.1. Whole system vs. component-level evaluations
As we mentioned in Section 2, Blizzard only attempts end-to-end system evaluations. Moreover, it also bundles in the data preparation stages such as alignment with the text and optional hand-corrections performed by some participants. In other words, it evaluates the totality of the system components and the engineering skill and effort needed to make it work well on a new database. Conclusions about which method is "best" are therefore inevitably filtered through the level of expertise and available resources of the team implementing that method. This may be a partial explanation of the "failure" of some entries: the idea had merit, but the implementation was flawed. The availability of resources for checking and correcting the data varies widely between participants. To quantify the effect this has on overall quality, one year's Challenge did release hand-checked alignments but this was found to be of limited use because it does not guarantee consistency across systems, since some may use a different phonetic inventory or pronunciation dictionary. Some participants have themselves investigated the benefits of manual annotations (Chu et al., 2006).
Providing linguistic specifications may appear to be one way to isolate the waveform generation component, but it would not be possible for some participants to modify their systems to use an externally-provided linguistic specification.

Common data, but what else?
The core of the Blizzard Challenge is the shared corpus which all participants are required to use. Its size has varied over the years, generally getting larger over time, and several years have seen specific sub-challenges involving restricted corpus sizes. As we have mentioned a number of times throughout this paper, a common corpus only 'levels the playing field' to some degree and there remain many other uncontrolled factors which may explain differences between systems. It is probably impossible to entirely separate out the effectiveness of a proposed technique from the skill of the engineer who implements it. Simple techniques, implemented by experts, can perform very well. Certainly, complex techniques poorly implemented are not likely to succeed. Within a single year of the Challenge, then, it is hard to say for sure that one technique is better than another.
But, by looking over several years of Challenges, as we have done here, we can start to find independently-constructed systems being entered that use a common technique. When we see several of these performing well, then it becomes more reasonable to say that this is a good technique. Clear examples of this (if implemented skilfully) include unit selection, which almost guarantees a good naturalness score, HMM-based methods, which almost guarantee good intelligibility, and hybrid systems which maintain the high naturalness of unit selection and start to approach the intelligibility of HMM systems.

Too much at stake leads to too little risk
As the Challenge became more and more established, and a firm fixture in the calendar, awareness of it began to rise outside the immediate circle of participating researchers. A negative effect of this is that participation in the Challenge has become a more public affair: poorly-performing entries no longer go unnoticed but instead start to attract attention. For the research labs in large corporations, this presents a major barrier to participation in the Challenge, since their management/lawyers/marketing department are likely to say "Of course you can enter the Blizzard Challenge, provided that you win." It is often said that one learns more from mistakes than successes, and Blizzard is no exception. The organisers of Blizzard are always at pains to point out that it is not a competition, and there are no winners and losers; that is, 'mistakes' are encouraged. It is to be hoped that all participants resist the temptation to play it safe with their entries, and that normally risk-averse corporations see the benefits of taking part. They can easily mitigate the risks simply by describing their entry as a highly experimental research idea and not as a production system.