
HMM-BASED INCREMENTAL SPEECH SYNTHESIS
Maël Pouget 1,2, Thomas Hueber 1,2, Gérard Bailly 1,2
1 CNRS/GIPSA-Lab, Grenoble, France
2 Univ. Grenoble Alpes/GIPSA-Lab, Grenoble, France
[email protected]
Abstract
Incremental speech synthesis aims at delivering the
synthetic voice while the sentence is still being typed. The main
challenges are the online estimation of the target prosody from
a partial knowledge of the sentence's syntactic structure, the
online phonetization and part-of-speech estimation, and the
timing of the delivery. This thesis aims at solving these
challenges, resulting in the implementation of a fully
functioning incremental text-to-speech synthesizer. This thesis
abstract briefly presents the problem we address, the work
achieved so far, and future work.
Index Terms: HMM-based speech synthesis, incremental,
TTS, HTS, prosody
1. Introduction
Text-To-Speech (TTS) systems are increasingly used in
several fields of application, such as smartphone personal
assistants, text-message reading, robotics, or as a prosthetic
voice for speech-impaired people. The latter is the main
application targeted by this Ph.D. thesis. In its
conventional implementation, a TTS system delivers the
synthetic voice after the user has entered a whole utterance.
A conventional TTS system therefore implies an intrinsic delay
of one sentence (or at least a group of several words) when
interacting with someone, which may affect the fluency and
naturalness of the conversation. To tackle this problem, this
Ph.D. thesis investigates an alternative paradigm called
"incremental speech synthesis", which has recently emerged in
the literature.
Incremental Text-To-Speech (iTTS) systems aim at starting
to deliver the synthetic voice before the full sentence context
becomes available, e.g. while a user is still typing the text to
vocalize. Contrary to a conventional TTS system, the synthesis
follows the text input word after word (potentially with a delay
of one word). This 'synthesis-while-typing' approach is
illustrated at the bottom of Figure 1. By reducing the latency
between text input and speech output, iTTS should enhance the
interactivity of communication. In particular, it should improve
the user experience of people with communication disorders who
use a TTS system in their daily life as a substitute voice.
Besides, iTTS could be chained with incremental speech
recognition systems in order to design highly responsive
speech-to-speech conversion systems (for applications in
automatic translation, silent speech interfaces, real-time
enhancement of pathological voice, etc.).
The concept of incremental speech synthesis was initially
formulated in [1] in the context of dialogue systems. However,
in the proposed proof-of-concept, the speech was delivered
incrementally but generated in a non-incremental way. In [2],
Baumann & Schlangen proposed the first complete software
architecture dedicated to incremental speech processing
(including recognition, dialogue management and TTS modules).
Another proof-of-concept, based on the reactive HMM-based
parameter generation system MAGE [3], was described in [4].
The design of iTTS systems faces three main challenges:
1. Incremental Natural Language Processing (NLP):
extracting the linguistic information needed for the
generation of the audio waveform (e.g.
grammatical/syntactic structure, phonetization) from
an 'incomplete' sentence (e.g. "My name is", with no
further words available).
2. Incremental waveform generation: generating the
speech waveform from a set of potentially incomplete
or inaccurate linguistic features. In this Ph.D.
thesis, we address this problem in the framework of
HMM-based speech synthesis, with a special focus on
the estimation of the prosodic content.
3. Time management: when should the synthetic voice
chunks be delivered in order to maintain fluency and
naturalness?
So far, our work has mainly focused on the second point. In
[5], we proposed a procedure to train an HMM-based voice
adapted to incremental processing, i.e. able to infer an
acceptable prosodic content from an incomplete sentence.
Our research plan for addressing these three challenges
is briefly summarized in the next section.
Figure 1: Classical versus incremental Text-To-Speech
2. Toward a fully incremental TTS
In the context of HMM-based speech synthesis, Text-To-Speech
conversion is a two-step process. The text is first analyzed by
an NLP module and converted into relevant symbolic descriptors
(phonemes, syllables, parts-of-speech), which are then used to
infer the acoustic parameters. In the following subsections, we
describe our current work on an iTTS system for the French
language.
2.1. Incremental NLP
In their conventional implementation, a phonetizer and a
morphological analyzer (such as the ones used in our system for
the French language [6]) consider a complete sentence to
determine the pronunciation and grammatical nature of each
word. To achieve incremental NLP, we will first analyze the
'stability' of our NLP module, i.e. the minimum lookahead
needed to resolve all ambiguity on the linguistic nature and
function of a given word. To decrease this lookahead, we will
investigate the prediction of the most likely linguistic features
(such as parts-of-speech) of the next words. To that purpose, we
intend to adapt state-of-the-art techniques used in statistical
language modeling for ASR (such as n-gram models) to
incremental speech synthesis.
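As a minimal illustration of this idea (not the actual model used in our system), the following sketch trains a toy bigram model over part-of-speech tags and predicts the most likely tag of the next, not-yet-typed word. The tag inventory and training corpus are invented for the example:

```python
from collections import Counter, defaultdict

def train_bigrams(tag_sequences):
    """Count POS-tag bigrams over a corpus of tagged sentences."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        for prev, nxt in zip(tags, tags[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next_tag(counts, last_tag):
    """Return the most frequent successor of last_tag, or None."""
    if last_tag not in counts:
        return None
    return counts[last_tag].most_common(1)[0][0]

# Hypothetical tagged corpus, for illustration only.
corpus = [
    ["PRON", "VERB", "DET", "NOUN"],
    ["PRON", "VERB", "ADJ"],
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
]
model = train_bigrams(corpus)
print(predict_next_tag(model, "VERB"))  # most likely tag after a verb: DET
```

In a real system, such a predicted tag would only be used provisionally, and revised once the next word is actually typed.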
2.2. Incremental waveform generation
In this section, we briefly describe our strategy for training an
HTS [7] voice in the context of incremental speech synthesis.
Parameter inference is realized using the HTS toolkit
adapted for French. For inferring parametric trajectories, HTS
uses clues about the structure of the sentence. Since the process
is incremental, some of these clues may or may not be known at
synthesis time (for instance, if nothing beyond the end of the
current word is known, the identity of the next phone is unknown
when we reach the end of that word). In [8], Baumann proposed to
tackle this problem at synthesis time, by substituting the
missing values with default ones computed on the training set. In
[5] (Interspeech 2015), we proposed to address this problem when
training the HTS voice: in our approach, the contextual models
include a case where a contextual feature is potentially
unknown, as illustrated in Figure 2.
Figure 2: Representation of the decision tree when
facing an unknown feature
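The principle behind Figure 2 can be sketched as a context-clustering tree whose questions admit a third "unknown" answer, so that a model can still be selected when a contextual feature is not yet available. This is a simplified illustration, not the actual HTS data structures; the feature and model names are hypothetical:

```python
class Node:
    """A node of a toy context-clustering tree."""
    def __init__(self, feature=None, children=None, leaf=None):
        self.feature = feature          # feature queried at this node
        self.children = children or {}  # answers: True / False / "unknown"
        self.leaf = leaf                # model id at a terminal node

def select_model(node, context):
    """Walk the tree; a missing feature takes the 'unknown' branch."""
    while node.leaf is None:
        answer = context.get(node.feature, "unknown")
        node = node.children[answer]
    return node.leaf

# One question with an explicit branch for the unknown case.
tree = Node("next_phone_is_vowel", {
    True: Node(leaf="model_A"),
    False: Node(leaf="model_B"),
    "unknown": Node(leaf="model_default"),
})

print(select_model(tree, {"next_phone_is_vowel": True}))  # model_A
print(select_model(tree, {}))                             # model_default
```

Training the "unknown" branch explicitly, rather than falling back to defaults at synthesis time, lets the voice itself learn how to behave when the right context is missing.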
2.3. Time management
Several strategies can be envisioned to decide when the synthetic
voice chunks should be delivered. Among others, we will
investigate the following:
- Delivering the synthetic voice 'word by word'. In that
scenario, the synthetic audio chunks could be time-stretched
to follow the text input.
- Delivering the synthetic voice 'phrase by phrase' (i.e.
nominal/verbal groups): this will require the phrase
segmentation to be done incrementally by the NLP module.
- Customizing the system to the user's voice: modeling the
user-specific temporal dependencies between text input and
vocalization. In a preliminary calibration phase, the user is
invited to speak out loud what he/she is currently typing.
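The first strategy can be made concrete with a small sketch: the stretch factor for each chunk is the ratio between the user's typing interval and the chunk's natural duration, clamped to a range that preserves intelligibility. The numbers and bounds below are illustrative assumptions, not measured values:

```python
def stretch_factor(chunk_duration_s, typing_interval_s,
                   min_f=0.8, max_f=1.5):
    """Ratio by which to time-stretch a synthesized chunk so its
    playback roughly matches the typing pace, clamped to [min_f, max_f]."""
    factor = typing_interval_s / chunk_duration_s
    return max(min_f, min(max_f, factor))

# Chunk lasts 0.40 s but the next word arrives after 0.50 s:
print(stretch_factor(0.40, 0.50))  # slow down slightly
# Next word arrives very fast; clamp to avoid over-compression:
print(stretch_factor(0.40, 0.10))  # 0.8 (clamped)
```

A real system would apply this factor with a pitch-preserving time-scale modification of the audio, and could adapt the bounds per user during the calibration phase mentioned above.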
3. References
[1] J. Edlund, "Incremental speech synthesis," in Proceedings of
the Swedish Language Technology Conference, Stockholm, Sweden,
2008, pp. 53–54.
[2] T. Baumann and D. Schlangen, "The INPROTK 2012 release," in
Proceedings of the NAACL-HLT Workshop on Future Directions and
Needs in the Spoken Dialog Community: Tools and Data,
Stroudsburg, PA, USA, 2012, pp. 29–32.
[3] M. Astrinaki, N. d'Alessandro, and T. Dutoit, "MAGE: A
Platform for Performative Speech Synthesis, New Approach in
Exploring Applications Beyond Text-To-Speech," in Proceedings of
The Listening Talker Workshop, Edinburgh, Scotland, 2012, p. 53.
[4] Y. Rybarczyk, T. Cardoso, J. Rosas, L. Camarinha-Matos,
N. d'Alessandro, J. Tilmanne, M. Astrinaki, T. Hueber, R. Dall,
T. Ravet, A. Moinet, H. Cakmak, O. Babacan, A. Barbulescu,
V. Parfait, V. Huguenin, E. Kalaycı, and Q. Hu, "Reactive
Statistical Mapping: Towards the Sketching of Performative
Control with Data," in Innovative and Creative Developments in
Multimodal Interaction Systems, vol. 425, Springer Berlin
Heidelberg, 2014, pp. 20–49.
[5] M. Pouget, T. Hueber, G. Bailly, and T. Baumann, "HMM
training strategy for incremental speech synthesis," in
Proceedings of Interspeech, Dresden, Germany, 2015.
[6] M. Alissali and G. Bailly, "COMPOST: a client-server model
for applications using text-to-speech systems," in Proceedings
of the European Conference on Speech Communication and
Technology, Berlin, Germany, 1993, pp. 2095–2098.
[7] "The HTS toolkit." [Online]. Available:
http://hts.sp.nitech.ac.jp/.
[8] T. Baumann, "Decision tree usage for incremental parametric
speech synthesis," in Proceedings of ICASSP, Florence, Italy,
2014, pp. 3819–3823.