HMM-BASED INCREMENTAL SPEECH SYNTHESIS

Maël Pouget 1,2, Thomas Hueber 1,2, Gérard Bailly 1,2
1 CNRS/GIPSA-Lab, Grenoble, France
2 Univ. Grenoble Alpes/GIPSA-Lab, Grenoble, France
[email protected]

Abstract

Incremental speech synthesis aims at delivering the synthetic voice while the sentence is still being typed. The main challenges are the online estimation of the target prosody from a partial knowledge of the sentence's syntactic structure, the online phonetization and part-of-speech estimation, and the timing of the delivery. This thesis aims at solving these challenges and at implementing a fully functional incremental text-to-speech synthesizer. This thesis abstract briefly presents the problem we address, the work achieved so far, and future work.

Index Terms: HMM-based speech synthesis, incremental, TTS, HTS, prosody

1. Introduction

Text-To-Speech (TTS) systems are increasingly used in various fields of application, such as personal assistants on smartphones, text-message reading, robotics, or as a prosthetic voice for speech-impaired people. The latter is the main application targeted by this Ph.D. thesis. In its conventional implementation, a TTS system delivers the synthetic voice only after the user has entered a whole utterance. A conventional TTS system therefore implies an intrinsic delay of one sentence (or at least a group of several words) when interacting with someone, which may affect the fluency and the naturalness of the conversation. To tackle this problem, this Ph.D. thesis investigates an alternative paradigm called "incremental speech synthesis", which has recently emerged in the literature. Incremental Text-To-Speech (iTTS) systems aim at starting delivery of the synthetic voice before the full sentence context becomes available, e.g. while the user is still typing the text to vocalize.
Contrary to a conventional TTS system, the synthesis follows the text input, word after word (potentially with a delay of one word). This 'synthesis-while-typing' approach is illustrated in Figure 1 (bottom). By reducing the latency between text input and speech output, iTTS should enhance the interactivity of communication. In particular, it should improve the user experience of people with communication disorders who use a TTS system in their daily life as a substitute voice. Besides, iTTS could be chained with incremental speech recognition systems in order to design highly responsive speech-to-speech conversion systems (for applications in automatic translation, silent speech interfaces, real-time enhancement of pathological voice, etc.).

The concept of incremental speech synthesis was initially formulated in [1] in the context of dialogue systems. However, in the proposed proof-of-concept, the speech was delivered incrementally but generated in a non-incremental way. In [2], Baumann & Schlangen proposed the first complete software architecture dedicated to incremental speech processing (including recognition, dialogue management and TTS modules). Another proof-of-concept, based on the reactive HMM-based parameter generation system MAGE [3], was described in [4].

The design of iTTS systems faces three main challenges:

1. Incremental Natural Language Processing (NLP): extracting the linguistic information needed for the generation of the audio waveform (e.g. grammatical/syntactic structure, phonetization) from an 'incomplete' sentence (e.g. "My name is", with no further words available).

2. Incremental waveform generation: generating the speech waveform from a set of potentially incomplete or inaccurate linguistic features. In this Ph.D. thesis, we address this problem in the framework of HMM-based speech synthesis, with a special focus on the estimation of the prosodic content.

3. Time management: when should the synthetic voice chunks be delivered to maintain fluency and naturalness?

So far, our work has mainly focused on the second point. In [5], we proposed a procedure to train an HMM-based voice adapted to incremental processing, i.e. able to infer an acceptable prosodic content from an incomplete sentence. Our research plan for addressing these three technological challenges is briefly summarized in the next section.

Figure 1: Classical versus incremental Text-To-Speech

2. Toward a fully incremental TTS

Text-to-Speech conversion is a two-step process (in the context of HMM-based speech synthesis). The text is first analyzed by an NLP module and converted into relevant symbolic descriptors (phonemes, syllables, parts-of-speech), which are then used to infer acoustic parameters. In the following subsections, we describe our current work on an iTTS system for the French language.

2.1. Incremental NLP

In its conventional implementation, a phonetizer and a morphological analyzer (such as the ones used in our system for French [6]) consider a complete sentence to determine the pronunciation and grammatical nature of each word. To achieve incremental NLP, we will first analyze the 'stability' of our NLP module, i.e. the minimum lookahead needed to have no ambiguity on the linguistic nature and function of a specific word. To decrease this lookahead, we will investigate the prediction of the most likely linguistic features (such as parts-of-speech) of the next words. To that purpose, we intend to adapt some state-of-the-art techniques used in statistical language modeling for ASR (such as n-gram models) to incremental speech synthesis.

2.2. Incremental waveform generation

In this section, we briefly describe our strategy to train an HTS [7] voice in the context of incremental speech synthesis. Parameter inference is realized using the HTS toolkit adapted for French. For inferring parametric trajectories, HTS uses clues about the structure of the sentence.
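One such clue is, for instance, the identity of the next phone, which may not yet be available in incremental mode. As a minimal, purely illustrative sketch (the feature names and tree below are hypothetical, not the actual HTS question set), a decision-tree question can be given an explicit 'unknown' branch so that a context-dependent model is reached even when the right context is missing:

```python
# Toy sketch (NOT the actual HTS implementation): a decision-tree node whose
# question on a contextual feature has a dedicated 'unknown' branch, so that
# a context-dependent model can be selected even when the right context
# (e.g. the next phone) has not been typed yet.

UNKNOWN = None  # marker for a contextual feature that is not yet available

def select_model(node, features):
    """Walk the toy tree until a leaf (a model name) is reached."""
    while isinstance(node, dict):
        value = features.get(node["feature"], UNKNOWN)
        if value is UNKNOWN:
            node = node["unknown"]        # branch for missing context
        elif value in node["in_set"]:
            node = node["yes"]
        else:
            node = node["no"]
    return node

# Single question: "is the next phone a vowel?", plus an 'unknown' sub-branch.
tree = {
    "feature": "next_phone",
    "in_set": {"a", "e", "i", "o", "u"},
    "yes": "model_vowel_right_context",
    "no": "model_consonant_right_context",
    "unknown": "model_unknown_right_context",
}

print(select_model(tree, {"next_phone": "a"}))  # model_vowel_right_context
print(select_model(tree, {}))                   # model_unknown_right_context
```

In a real voice, each node would carry an HTS-style question over full-context labels and each leaf a clustered state distribution; the point of the sketch is only the third, 'unknown' answer to every question.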
Since the process is incremental, some of these clues may or may not be known (for instance, if nothing beyond the end of the current word is known, the next phone is unknown when we reach the end of the word). In [8], Baumann proposed to tackle this problem at synthesis time, by substituting those missing values with default ones computed on the training set. In [5] (Interspeech 2015), we proposed to address this problem when training the HTS voice: in our approach, contextual models include a case where a contextual feature is potentially unknown, as illustrated in Figure 2.

Figure 2: Representation of the decision tree when facing an unknown feature

2.3. Time management

Several strategies can be envisioned to decide when synthetic voice chunks have to be delivered. Among others, we will investigate the following:

- Delivering the synthetic voice 'word by word'. In that scenario, the synthetic audio chunks could be time-stretched to follow the text input.
- Delivering the synthetic voice 'phrase by phrase' (i.e. nominal/verbal groups): this will require the phrase segmentation to be done incrementally by the NLP module.
- Customizing the system to the user's voice: modeling the user-specific temporal dependencies between text input and vocalization. In a preliminary calibration phase, the user is invited to speak out loud what he/she is currently typing.

3. References

[1] J. Edlund, "Incremental speech synthesis," in Proceedings of the Swedish Language Technology Conference, Stockholm, Sweden, 2008, pp. 53–54.
[2] T. Baumann and D. Schlangen, "The INPROTK 2012 release," in Proceedings of the NAACL-HLT Workshop on Future Directions and Needs in the Spoken Dialog Community: Tools and Data, Stroudsburg, PA, USA, 2012, pp. 29–32.
[3] M. Astrinaki, N. d'Alessandro, and T. Dutoit, "MAGE: A Platform for Performative Speech Synthesis: New Approach in Exploring Applications Beyond Text-To-Speech," in Proceedings of The Listening Talker Workshop, Edinburgh, Scotland, 2012, p. 53.
[4] Y. Rybarczyk, T. Cardoso, J. Rosas, L. Camarinha-Matos, N. d'Alessandro, J. Tilmanne, M. Astrinaki, T. Hueber, R. Dall, T. Ravet, A. Moinet, H. Cakmak, O. Babacan, A. Barbulescu, V. Parfait, V. Huguenin, E. Kalaycı, and Q. Hu, "Reactive Statistical Mapping: Towards the Sketching of Performative Control with Data," in Innovative and Creative Developments in Multimodal Interaction Systems, vol. 425, Springer Berlin Heidelberg, 2014, pp. 20–49.
[5] M. Pouget, T. Hueber, G. Bailly, and T. Baumann, "HMM training strategy for incremental speech synthesis," in Proceedings of Interspeech, Dresden, Germany, 2015.
[6] M. Alissali and G. Bailly, "COMPOST: a client-server model for applications using text-to-speech systems," in Proceedings of the European Conference on Speech Communication and Technology, Berlin, Germany, 1993, pp. 2095–2098.
[7] "The HTS toolkit." [Online]. Available: http://hts.sp.nitech.ac.jp/
[8] T. Baumann, "Decision tree usage for incremental parametric speech synthesis," in Proceedings of ICASSP, Florence, Italy, 2014, pp. 3819–3823.