Measuring a decade of progress in Text-to-Speech

Authors

  • Simon King The Centre for Speech Technology Research, The University of Edinburgh

DOI:

https://doi.org/10.3989/loquens.2014.006

Keywords:

text-to-speech synthesis, evaluation, The Blizzard Challenge

Abstract


The Blizzard Challenge offers a unique insight into progress in text-to-speech synthesis over the last decade. By using a very large listening test to compare the performance of a wide range of systems that have been constructed using a common corpus of speech recordings, it is possible to make some direct comparisons between competing techniques. By reviewing over a hundred papers describing all entries to the Challenge since 2005, we can make a useful summary of the most successful techniques adopted by participating teams, as well as drawing some conclusions about where the Blizzard Challenge has succeeded, and where there are still open problems in cross-system comparisons of text-to-speech synthesisers.

References

Andersson, J. S., Badino, L., Watts, O. S., & Aylett, M. P. (2008). The CSTR/Cereproc Blizzard entry 2008: The inconvenient data. In Blizzard Challenge Workshop 2008.

Andersson, J. S., Cabral, J. P., Badino, L., Yamagishi, J., & Clark, R. A. J. (2009). Glottal source and prosodic prominence modelling in HMM-based speech synthesis for the Blizzard Challenge 2009. In Blizzard Challenge Workshop 2009.

Aylett, M. P., Andersson, J. S., Badino, L., & Pidcock, C. J. (2007). The Cerevoice Blizzard entry 2007: are small database errors worse than compression artifacts? In Blizzard Challenge Workshop 2007.

Aylett, M. P., & Pidcock, C. J. (2009). The CereProc Blizzard entry 2009: Some dumb algorithms that don't work. In Blizzard Challenge Workshop 2009.

Aylett, M. P., Pidcock, C. J., & Fraser, M. E. (2006). The Cerevoice Blizzard entry 2006: A prototype database unit selection engine. In Blizzard Challenge Workshop 2006.

Bennett, C. L. (2005). Large scale evaluation of corpus-based synthesizers: Results and lessons from the Blizzard Challenge 2005. In Blizzard Challenge Workshop 2005 (special session of Interspeech 2005), Lisbon.

Bennett, C. L., & Black, A. W. (2006). Blizzard Challenge 2006: Results. In Blizzard Challenge Workshop 2006.

Benoit, C., Grice, M., & Hazan, V. (1996). The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using Semantically Unpredictable Sentences. Speech Communication, 18, 381–392. http://dx.doi.org/10.1016/0167-6393(96)00026-X

Black, A. W., Bennett, C. L., Blanchard, B. C., Kominek, J., Langner, B., Prahallad, K., & Toth, A. (2007). CMU Blizzard 2007: a hybrid acoustic unit selection system from statistically predicted parameters. In Blizzard Challenge Workshop 2007.

Black, A. W., Bennett, C. L., Kominek, J., Langner, B., Prahallad, K., & Toth, A. (2008). CMU Blizzard 2008: Optimally using a large database for unit selection synthesis. In Blizzard Challenge Workshop 2008. http://dx.doi.org/10.1145/1394504

Black, A., & Taylor, P. (1997). Automatically clustering similar units for unit selection in speech synthesis. In Proc. Eurospeech (Vol. 2, pp. 601–604). Rhodes, Greece.

Black, A. W., & Tokuda, K. (2005a). The Blizzard Challenge – 2005: Evaluating corpus-based speech synthesis on common datasets. In Blizzard Challenge Workshop 2005 (special session of Interspeech 2005), Lisbon.

Black, A. W., & Tokuda, K. (2005b). The Blizzard Challenge - 2005: Evaluating corpus-based speech synthesis on common datasets. In Proc Interspeech 2005, Lisbon.

Bonafonte, A., Adell, J., Agüero, P. D., Erro, D., Esquerra, I., Moreno, A., Pérez, J., & Polyakova, T. (2007). The UPC TTS system description for the 2007 Blizzard Challenge. In Blizzard Challenge Workshop 2007.

Bonafonte, A., Moreno, A., Adell, J., Aguero, P. D., Banos, E., Erro, D., Esquerra, I., Perez, J., & Polyakova, T. (2008). The UPC TTS system description for the 2008 Blizzard Challenge. In Blizzard Challenge Workshop 2008.

Buchholz, S., Braunschweiler, N., Morita, M., & Webster, G. (2007). The Toshiba entry for the 2007 Blizzard Challenge. In Blizzard Challenge Workshop 2007.

Chalamandaris, A., Tsiakoulis, P., Karabetsos, S., & Raptis, S. (2013). The ILSP/INNOETICS text-to-speech system for the Blizzard Challenge 2013. In Blizzard Challenge Workshop 2013.

Charfuelan, M. (2012). MARY TTS HMM-based voices for the Blizzard Challenge 2012. In Blizzard Challenge Workshop 2012.

Charfuelan, M., Pammi, S., & Steiner, I. (2013). MARY TTS unit selection and HMM-based voices for the Blizzard Challenge 2013. In Blizzard Challenge Workshop 2013.

Chen, L.-H., Ling, Z.-H., Song, Y. J. Y., Xia, X.-J., Zu, Y.-Q., Yan, R.-Q., & Dai, L.-R. (2013). The USTC system for Blizzard Challenge 2013. In Blizzard Challenge Workshop 2013.

Chen, L.-H., Yang, C.-Y., Ling, Z.-H., Jiang, Y., Dai, L.-R., Hu, Y., & Wang, R.-H. (2011). The USTC system for Blizzard Challenge 2011. In Blizzard Challenge Workshop 2011.

Clark, R. A. J., Richmond, K., Strom, V., & King, S. (2006). Multisyn voice for the Blizzard Challenge 2006. In Blizzard Challenge Workshop 2006.

Cooke, M., Mayo, C., & Valentini-Botinhao, C. (2013). Intelligibility-enhancing speech modifications: the Hurricane Challenge. In Proc. Interspeech, Lyon, France.

Cotescu, M. (2011). PUB entry in the Blizzard Challenge 2011. In Blizzard Challenge Workshop 2011.

Díaz, F. C., Pazó, F. J. M., Arza, M., Fernández, L. D., Bonafonte, A., Navas, E., & Sainz, I. (2011). Albayzín 2010: A Spanish text to speech evaluation. In Proc Interspeech 2011, Florence, Italy.

Ding, F., & Alhonen, J. (2007). Non-uniform unit selection through search strategy for Blizzard Challenge 2007. In Blizzard Challenge Workshop 2007.

Ding, F., & Alhonen, J. (2008). NTTS participation in the Blizzard Challenge 2008. In Blizzard Challenge Workshop 2008.

Dong, M., Cen, L., Chan, P., Huang, D., Zhu, D., Ma, B., & Li, H. (2009). I2R text-to-speech system for Blizzard Challenge 2009. In Blizzard Challenge Workshop 2009.

Dong, M., Chan, P., Cen, L., Ma, B., & Li, H. (2010). I2R text-to-speech system for Blizzard Challenge 2010. In Blizzard Challenge Workshop 2010.

Dong, M., Lee, S. W., Chan, P., & Cen, L. (2011). I2R text-to-speech system for Blizzard Challenge 2011. In Blizzard Challenge Workshop 2011.

Dong, M., Zhu, D., Ma, B., & Li, H. (2008). I2R's submission to Blizzard Challenge 2008. In Blizzard Challenge Workshop 2008.

Eide, E., Fernandez, R., Hoory, R., Hamza, W., Kons, Z., Picheny, M., Sagi, A., Shechtman, S., & Shuang, Z. W. (2006). The IBM submission to the 2006 Blizzard text-to-speech Challenge. In Blizzard Challenge Workshop 2006.

Fraser, M., & King, S. (2007). The Blizzard Challenge 2007. In Blizzard Challenge Workshop 2007.

Hashimoto, K., Takaki, S., Oura, K., & Tokuda, K. (2011). Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011. In Blizzard Challenge Workshop 2011.

Hinterleitner, F., Möller, S., Falk, T. H., & Polzehl, T. (2010). Comparison of approaches for instrumentally predicting the quality of text-to-speech systems: Data from Blizzard Challenges 2008 and 2009. In Blizzard Challenge Workshop 2010.

Hunt, A., & Black, A. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP-96 (pp. 373–376). Atlanta, Georgia. http://dx.doi.org/10.1109/icassp.1996.541110

Jiang, Y., Ling, Z.-H., Lei, M., Wang, C.-C., Heng, L., Hu, Y., Dai, L.-R., & Wang, R.-H. (2010). The USTC system for Blizzard Challenge 2010. In Blizzard Challenge Workshop 2010.

Jun, W.-S., Na, D.-S., Kim, S.-W., Kim, M., Lee, J.-W., & Lee, J.-S. (2007). The Voice-text text-to-speech system for the Blizzard Challenge 2007. In Blizzard Challenge Workshop 2007.

Karaiskos, V., King, S., Clark, R. A. J., & Mayo, C. (2008). The Blizzard Challenge 2008. In Blizzard Challenge Workshop 2008.

Kaszczuk, M., & Osowski, L. (2009). The IVO software Blizzard Challenge 2009 entry: Improving IVONA text-to-speech. In Blizzard Challenge Workshop 2009.

King, S., & Karaiskos, V. (2009). The Blizzard Challenge 2009. In Blizzard Challenge Workshop 2009.

King, S., & Karaiskos, V. (2010). The Blizzard Challenge 2010. In Blizzard Challenge Workshop 2010.

King, S., & Karaiskos, V. (2011). The Blizzard Challenge 2011. In Blizzard Challenge Workshop 2011.

King, S., & Karaiskos, V. (2012). The Blizzard Challenge 2012. In Blizzard Challenge Workshop 2012.

King, S., & Karaiskos, V. (2013). The Blizzard Challenge 2013. In Blizzard Challenge Workshop 2013.

Kominek, J., Bennett, C. L., Langner, B., & Toth, A. R. (2005). The Blizzard Challenge 2005 CMU entry – a method for improving speech synthesis systems. In Blizzard Challenge Workshop 2005 (special session of Interspeech 2005), Lisbon.

Kominek, J., & Black, A. W. (2006). The Blizzard Challenge 2006 CMU entry introducing hybrid trajectory-selection synthesis. In Blizzard Challenge Workshop 2006.

Kumar, H. R. S., Ashwini, J. K., Rajaramand, B. S. R., & Ramakrishnan, A. G. (2013). MILE TTS for Tamil and Kannada for Blizzard Challenge 2013. In Blizzard Challenge Workshop 2013.

Latacz, L., Kong, Y. O., Mattheyses, W., & Verhelst, W. (2008). An overview of the VUB entry for the 2008 Blizzard Challenge. In Blizzard Challenge Workshop 2008.

Latacz, L., Mattheyses, W., & Verhelst, W. (2010). The VUB Blizzard Challenge 2010 entry: Towards automatic voice building. In Blizzard Challenge Workshop 2010.

Lee, S. W., Dong, M., Ang, S. T., & Chew, M. M. (2013). I2R text-to-speech system for Blizzard Challenge 2013. In Blizzard Challenge Workshop 2013.

Li, J., Luan, J., Yi, L., Lou, X., Wang, X., He, L., & Hao, J. (2009). The Toshiba Mandarin TTS system for the Blizzard Challenge 2009. In Blizzard Challenge Workshop 2009.

Li, J., Xu, D., Yi, L., Lou, X., Luan, J., Wang, X., He, L., & Hao, J. (2008). The Toshiba Mandarin TTS system for the Blizzard Challenge 2008. In Blizzard Challenge Workshop 2008.

Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140, 1–55.

Ling, Z.-H., Lu, H., Hu, G.-P., Dai, L.-R., & Wang, R.-H. (2008). The USTC system for Blizzard Challenge 2008. In Blizzard Challenge Workshop 2008.

Ling, Z.-H., Qin, L., Lu, H., Gao, Y., Dai, L.-R., Wang, R.-H., Jiang, Y., Zhao, Z.-W., Yang, J.-H., Chen, J., & Hu, G.-P. (2007). The USTC and iFlytek speech synthesis systems for Blizzard Challenge 2007. In Blizzard Challenge Workshop 2007.

Ling, Z.-H., Wu, Y.-J., Wang, Y.-P., Qin, L., & Wang, R.-H. (2006). USTC system for Blizzard Challenge 2006: An improved HMM-based speech synthesis method. In Blizzard Challenge Workshop 2006.

Ling, Z.-H., Xia, X.-J., Song, Y., Yang, C.-Y., Chen, L.-H., & Dai, L.-R. (2012). The USTC system for Blizzard Challenge 2012. In Blizzard Challenge Workshop 2012.

Louw, J. A., Schlünz, G. I., van der Walt, W., de Wet, F., & Pretorius, L. (2013). The Speect text-to-speech system entry for the Blizzard Challenge 2013. In Blizzard Challenge Workshop 2013.

Lu, H., Ling, Z.-H., Lei, M., Wang, C.-C., Zhao, H.-H., Chen, L.-H., Hu, Y., Dai, L.-R., & Wang, R.-H. (2009). The USTC system for Blizzard Challenge 2009. In Blizzard Challenge Workshop 2009.

Maia, R., Ni, J., Sakai, S., Toda, T., Tokuda, K., Shimizu, T., & Nakamura, S. (2008). The NICT/ATR speech synthesis system for the Blizzard Challenge 2008. In Blizzard Challenge Workshop 2008.

Maia, R., Toda, T., Sakai, S., Shiga, Y., Ni, J., Kawai, H., Tokuda, K., Tsuzaki, M., & Nakamura, S. (2009). The NICT entry for the Blizzard Challenge 2009: an enhanced HMM-based speech synthesis system with trajectory training considering global variance and state-dependent mixed excitation. In Blizzard Challenge Workshop 2009.

Nitisaroj, R., Wilhelms-Tricarico, R., Mottershead, B., Nitisaroj, R., Baumgartner, M., Reichenbach, J., & Marple, G. (2011). The Lessac Technologies system for Blizzard Challenge 2011. In Blizzard Challenge Workshop 2011.

Nitisaroj, R., Wilhelms-Tricarico, R., Mottershead, B., Reichenbach, J., & Marple, G. (2010). The Lessac Technologies system for Blizzard Challenge 2010. In Blizzard Challenge Workshop 2010.

Norrenbrock, C. R., Hinterleitner, F., Heute, U., & Möller, S. (2012). Towards perceptual quality modeling of synthesized audiobooks - Blizzard Challenge 2012. In Blizzard Challenge Workshop 2012.

Oliveira, L. C., Paulo, S., Figueira, L., & Mendes, C. (2008). The INESC-ID Blizzard entry: Unsupervised voice building and synthesis. In Blizzard Challenge Workshop 2008.

Oura, K., Hashimoto, K., Shiota, S., & Tokuda, K. (2010). Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2010. In Blizzard Challenge Workshop 2010.

Oura, K., Wu, Y.-J., & Tokuda, K. (2009). Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2009. In Blizzard Challenge Workshop 2009.

Prahallad, K., Vadapalli, A., Elluru, N., Mantena, G., Pulugundla, B., Bhaskararao, P., Murthy, H. A., King, S., Karaiskos, V., & Black, A. W. (2013). The Blizzard Challenge 2013 – Indian language task. In Blizzard Challenge Workshop 2013.

Qian, Y., Yan, Z.-J., Wu, Y.-J., Soong, F. K., Zhang, G., & Wang, L. (2010). An HMM trajectory tiling (HTT) approach to high quality TTS - Microsoft entry to Blizzard Challenge 2010. In Blizzard Challenge Workshop 2010.

Raghavendra, E., Desai, S., Yegnanarayana, B., Black, A. W., & Prahallad, K. (2008). Blizzard 2008: Experiments on unit size for unit selection speech synthesis. In Blizzard Challenge Workshop 2008.

Raptis, S., Chalamandaris, A., Tsiakoulis, P., & Karabetsos, S. (2010). The ILSP text-to-speech system for the Blizzard Challenge 2010. In Blizzard Challenge Workshop 2010.

Raptis, S., Chalamandaris, A., Tsiakoulis, P., & Karabetsos, S. (2011). The ILSP text-to-speech system for the Blizzard Challenge 2011. In Blizzard Challenge Workshop 2011.

Raptis, S., Chalamandaris, A., Tsiakoulis, P., & Karabetsos, S. (2012). The ILSP text-to-speech system for the Blizzard Challenge 2012. In Blizzard Challenge Workshop 2012.

Richmond, K., Strom, V., Clark, R. A., Yamagishi, J., & Fitt, S. (2007). Festival Multisyn voices for the 2007 Blizzard Challenge. In Blizzard Challenge Workshop 2007.

Rozak, M. (2007). Text-to-speech designed for a massively multiplayer online role-playing game (MMORPG). In Blizzard Challenge Workshop 2007.

Rozak, M. (2008). Circumreality functionality delta: Blizzard Challenge 2007 to 2008. In Blizzard Challenge Workshop 2008.

Sainz, I., Erro, D., Navas, E., Adell, J., & Bonafonte, A. (2011). BUCEADOR hybrid TTS for Blizzard Challenge 2011. In Blizzard Challenge Workshop 2011.

Scholtz, P., Visagie, A., & du Preez, J. (2008). Statistical speech synthesis for the Blizzard Challenge 2008. In Blizzard Challenge Workshop 2008.

Schroeder, M., Charfuelan, M., Pammi, S., & Türk, O. (2008). The MARY TTS entry in the Blizzard Challenge 2008. In Blizzard Challenge Workshop 2008.

Schroeder, M., & Hunecke, A. (2007). MARY TTS participation in the Blizzard Challenge 2007. In Blizzard Challenge Workshop 2007.

Schroeder, M., Hunecke, A., & Krstulovic, S. (2006). OpenMary - open source unit selection as the basis for research on expressive synthesis. In Blizzard Challenge Workshop 2006.

Schroeder, M., Pammi, S., & Türk, O. (2009). Multilingual MARY TTS participation in the Blizzard Challenge 2009. In Blizzard Challenge Workshop 2009.

Suendermann, D., Höge, H., & Black, A. (2010). Challenges in speech synthesis. In F. Chen & K. Jokinen (Eds.), Speech Technology (pp. 19–32). Springer US. http://dx.doi.org/10.1007/978-0-387-73819-2_2

Suni, A., Raitio, T., Vainio, M., & Alku, P. (2010). The glottHMM speech synthesis entry for Blizzard Challenge 2010. In Blizzard Challenge Workshop 2010.

Suni, A., Raitio, T., Vainio, M., & Alku, P. (2011). The glottHMM speech synthesis entry for Blizzard Challenge 2011: Utilizing source unit selection in HMM-based speech synthesis for improved excitation generation. In Blizzard Challenge Workshop 2011.

Suni, A., Raitio, T., Vainio, M., & Alku, P. (2012). The glottHMM entry for Blizzard Challenge 2012: Hybrid approach. In Blizzard Challenge Workshop 2012.

Takaki, S., Sawada, K., Hashimoto, K., Oura, K., & Tokuda, K. (2012). Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2012. In Blizzard Challenge Workshop 2012.

Takaki, S., Sawada, K., Hashimoto, K., Oura, K., & Tokuda, K. (2013). Overview of NITECH HMM-based speech synthesis system for Blizzard Challenge 2013. In Blizzard Challenge Workshop 2013.

Tao, J., Li, Y., Pan, S., Zhang, M., Sun, H., & Wen, Z. (2009). The WISTON text-to-speech system for Blizzard Challenge 2009. In Blizzard Challenge Workshop 2009.

Tao, J., Pan, S., Li, Y., Wen, Z., & Wang, Y. (2010). The WISTON text to speech system for Blizzard Challenge 2010. In Blizzard Challenge Workshop 2010.

Tao, J., Yu, J., Huang, L., Liu, F., Jia, H., & Zhang, M. (2008). The WISTON text to speech system for Blizzard 2008. In Blizzard Challenge Workshop 2008.

Taylor, P. (2009). Text-to-speech synthesis. Cambridge, UK: Cambridge University Press. http://dx.doi.org/10.1017/CBO9780511816338

Toda, T., Kawai, H., Hirai, T., Ni, J., Nishizawa, N., Yamagishi, J., Tsuzaki, M., Tokuda, K., & Nakamura, S. (2006). Developing a test bed of English text-to-speech system XIMERA for the Blizzard Challenge 2006. In Blizzard Challenge Workshop 2006.

Watts, O., Stan, A., Mamiya, Y., Suni, A., Burgos, J., & Montero, J. (2013). The Simple4All entry to the Blizzard Challenge 2013. In Blizzard Challenge Workshop 2013.

Wilhelms-Tricarico, R., Mottershead, B., Reichenbach, J., & Marple, G. (2012). The Lessac Technologies hybrid concatenated system for Blizzard Challenge 2012. In Blizzard Challenge Workshop 2012.

Wilhelms-Tricarico, R., Reichenbach, J., & Marple, G. (2013). The Lessac Technologies hybrid concatenated system for Blizzard Challenge 2013. In Blizzard Challenge Workshop 2013.

Wouters, J. (2007). SVOX participation in Blizzard 2007. In Blizzard Challenge Workshop 2007.

Yamagishi, J., Lincoln, M., King, S., Dines, J., Gibson, M., Tian, J., & Guan, Y. (2009). Analysis of unsupervised and noise-robust speaker-adaptive HMM-based speech synthesis systems toward a unified ASR and TTS framework. In Blizzard Challenge Workshop 2009.

Yamagishi, J., & Watts, O. (2010). The CSTR/EMIME HTS system for Blizzard Challenge. In Blizzard Challenge Workshop 2010.

Yamagishi, J., Zen, H., Toda, T., & Tokuda, K. (2007). Speaker-independent HMM-based speech synthesis system - HTS-2007 system for the Blizzard Challenge 2007. In Blizzard Challenge Workshop 2007.

Yamagishi, J., Zen, H., Wu, Y.-J., Toda, T., & Tokuda, K. (2008). The HTS-2008 system: Yet another evaluation of the speaker-adaptive HMM-based speech synthesis system in the 2008 Blizzard Challenge. In Blizzard Challenge Workshop 2008.

Yu, Y., Zhu, F., Li, X., Liu, Y., Zou, J., Yang, Y., Yang, G., Fan, Z., & Wu, X. (2013). Overview of SHRC-Ginkgo speech synthesis system for Blizzard Challenge 2013. In Blizzard Challenge Workshop 2013.

Zen, H., & Toda, T. (2005). An overview of Nitech HMM-based speech synthesis system for Blizzard Challenge 2005. In Blizzard Challenge Workshop 2005 (special session of Interspeech 2005), Lisbon.

Zen, H., Toda, T., & Tokuda, K. (2006). The Nitech-NAIST HMM-based speech synthesis system for the Blizzard Challenge 2006. In Blizzard Challenge Workshop 2006.

Zhang, B., Alhonen, J., Guan, Y., & Tian, J. (2010). Multilingual TTS system of Nokia entry for Blizzard 2010. In Blizzard Challenge Workshop 2010.

Zhang, Z., Xian, X., Luo, L., & Wu, X. (2009). PKU Mandarin speech synthesis system for Blizzard 2009. In Blizzard Challenge Workshop 2009.

Published

2014-06-30

How to Cite

King, S. (2014). Measuring a decade of progress in Text-to-Speech. Loquens, 1(1), e006. https://doi.org/10.3989/loquens.2014.006

Section

Articles