Title An Anthropomorphic Perspective for Audiovisual Speech Synthesis
Author Samuel Silva, António J. S. Teixeira
Booktitle Proc. BIOSIGNALS
Address Porto, Portugal
Month February
Year 2017
In speech communication, both the auditory and visual streams play an important role, providing a degree of redundancy (e.g., lip movement) and conveying complementary information (e.g., to emphasize a word). The current common approach to audiovisual speech synthesis, generally based on data-driven methods, yields good results but relies on models controlled by parameters that do not relate to how humans produce speech; these parameters are hard to interpret and add little to our understanding of the human speech production apparatus. Modelling the actual system by adopting an anthropomorphic perspective would open a myriad of novel research paths. This article proposes a conceptual framework to support research and development of an articulatory-based audiovisual speech synthesis system. The core idea is that the speech production system is modelled to produce articulatory parameters with anthropomorphic meaning (e.g., lip opening), which drive the synthesis of both the auditory and visual streams. A first instantiation of the framework for European Portuguese illustrates its viability and constitutes an important tool for research in speech production and for the deployment of audiovisual speech synthesis in multimodal interaction scenarios, which are of the utmost relevance for current and future complex services and applications.