An attribute detection based approach to automatic speech processing

Sabato Marco Siniscalchi; Chin-Hui Lee

doi:10.3989/loquens.2014.005

Authors

Sabato Marco Siniscalchi, University of Enna “Kore”
Chin-Hui Lee, Georgia Institute of Technology

Keywords:

speech attribute detection, knowledge-rich systems, artificial neural networks, hidden Markov models

Abstract

State-of-the-art automatic speech and speaker recognition systems are often built within a pattern-matching framework. This framework has been shown to achieve low recognition error rates on a variety of resource-rich tasks, provided that plentiful speech and text examples are available for building the statistical acoustic and language models and that the speaker, acoustic, and language conditions follow a rigid protocol. However, because of their “black-box”, top-down approach to knowledge integration, such systems cannot easily leverage the rich set of knowledge sources already available in the literature on speech, acoustics, and languages. In this paper, we present a bottom-up approach to knowledge integration, called automatic speech attribute transcription (ASAT), which is intended to be “knowledge-rich”, so that new and existing knowledge sources can be verified and integrated into current spoken language systems to improve recognition accuracy and system robustness. Because the ASAT framework offers a “divide-and-conquer” strategy and a “plug-and-play” game plan, it should facilitate a cooperative speech processing community to which every researcher can contribute, with a view to improving speech processing capabilities that are currently not easily accessible to researchers in the speech science community.
