Consistent Estimation of Fujisaki’s Intonation Model Parameters
Pablo Daniel Agüero and Antonio Bonafonte
TALP Research Center
Signal Theory and Communications Department
Universitat Politècnica de Catalunya, Barcelona, Spain
[email protected]
Abstract
This paper presents a novel method to estimate an intonation model based on the representation of the fundamental frequency contour proposed by Fujisaki. Unlike other methods, this approach does not find the commands sentence by sentence; instead, it estimates the commands by simultaneously analysing all the phrase and accent components in the corpus that belong to the same linguistic class. The classes are defined top-down using CART, based on linguistic features derived from the text. This method avoids unvoiced interpolation, stylization and the separation of the high- and low-frequency components of the f0 contours; all this preprocessing can produce ambiguities and noisy commands. The resulting commands are very consistent because all the components of a given linguistic class are represented by the same command. Furthermore, for a given class definition, we present a closed formulation to derive the amplitudes of the commands, assuming that the time instants are known. This formulation can also be used in sentence-by-sentence analysis to speed up the extraction.
1. Introduction
The intonation model is an important component in text-to-speech systems. A correct fundamental frequency contour increases the intelligibility and naturalness of the system.
Several intonation models have been proposed in the literature. In this paper we work with Fujisaki's intonation model, which is based on a physiological explanation of the generation of the fundamental frequency contour [1]. The physical model consists of two critically damped second-order filters (whose cutoff frequencies depend on α and β). Impulses and pulses (phrase and accent commands, respectively) are used as excitations to produce the long-term (Gp(t)) and short-term (Ga(t)) components that can be observed in the fundamental frequency contour. A scheme is shown in figure 1.
The mathematical formulation is shown in equations 1, 2 and 3.
\ln F_0(t) = \ln f_b + \sum_{i=1}^{I} A_{pi}\, G_p(t - T_{0i}) + \sum_{j=1}^{J} A_{aj} \left[ G_a(t - T_{1j}) - G_a(t - T_{2j}) \right]    (1)

G_p(t) = \begin{cases} \alpha^2 t\, e^{-\alpha t} & t \geq 0 \\ 0 & t < 0 \end{cases}    (2)

G_a(t) = \begin{cases} \min\left[ 1 - (1 + \beta t) e^{-\beta t},\, \gamma \right] & t \geq 0 \\ 0 & t < 0 \end{cases}    (3)

This work has been partially sponsored by the European Union under grant FP6-506738 (TC-STAR project, http://www.tc-star.org) and the Spanish Government under grant TIC2002-04447-C02 (ALIADO project, http://gps-tsc.upc.es/veu/aliado).
Figure 1: Fujisaki’s intonation model
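To make the model concrete, the following is a minimal NumPy sketch of the responses in equations 2 and 3 and the superposition of equation 1. The function names are ours, and the default α, β and γ values are typical choices from the Fujisaki literature, not values taken from this paper:

```python
import numpy as np

def Gp(t, alpha=2.0):
    """Phrase-command response (eq. 2): impulse response of a
    critically damped second-order filter; zero for t < 0."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    m = t >= 0
    out[m] = alpha**2 * t[m] * np.exp(-alpha * t[m])
    return out

def Ga(t, beta=20.0, gamma=0.9):
    """Accent-command response (eq. 3): step response clipped at gamma."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    m = t >= 0
    out[m] = np.minimum(1.0 - (1.0 + beta * t[m]) * np.exp(-beta * t[m]), gamma)
    return out

def ln_f0(t, ln_fb, phrase_cmds, accent_cmds, alpha=2.0, beta=20.0):
    """Superposition of equation 1. phrase_cmds: list of (Ap, T0);
    accent_cmds: list of (Aa, T1, T2)."""
    t = np.asarray(t, dtype=float)
    y = np.full_like(t, ln_fb)
    for Ap, T0 in phrase_cmds:
        y = y + Ap * Gp(t - T0, alpha)
    for Aa, T1, T2 in accent_cmds:
        y = y + Aa * (Ga(t - T1, beta) - Ga(t - T2, beta))
    return y
```

For example, `ln_f0(np.linspace(0, 3, 300), np.log(120.0), [(0.5, 0.0)], [(0.4, 0.8, 1.2)])` yields a log-f0 contour with one phrase command and one accent command.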
The extraction of the impulse and pulse parameters from a given fundamental frequency contour is not an easy task. A closed-form solution does not exist for the model. Usually a first estimation is performed, which is then optimized using gradient-descent techniques. Several combinations of different pulses and impulses can approximate the original contour, depending on the initial values of the variables given to the gradient-descent algorithm. As a result, the extracted parameters can be inconsistent: sentences with similar structure and contour may have different sets of impulses and pulses.
In addition, extracting the parameters sentence by sentence requires interpolation techniques to fill the unvoiced segments with frequency values. Extracting parameters from a contour with missing data is not advisable, because the error cannot be evaluated at those missing points. As a consequence, the extracted commands could take an arbitrary shape on those unvoiced segments, increasing the inconsistency of the parameterization.
Several papers have addressed the problem of parameter extraction [2][3][4][5][6]. In general, these algorithms perform an initial sentence-by-sentence parameterization of the fundamental frequency contours of the database and then generate an intonation model that predicts a set of parameters based on linguistic information extracted from the text. These parameters are used to synthesize a suitable fundamental frequency contour for the utterance. Agüero et al. [7] proposed a new approach that combines parameter extraction and model generation into a loop, improving the performance of the intonation model.
A recent paper by Silva and Netto [8] proposed a closed-form estimation of the amplitude commands of Fujisaki's intonation model, assuming that the time instants of the commands are known. A closed-form solution for a group of parameters is useful because it reduces the convergence time of gradient-descent algorithms towards the optimal solution.
In order to obtain the solution, Silva and Netto proposed (based on the work of Mixdorff [4]) to decompose the fundamental frequency contour into two components: HFC (high-frequency components, related to accent commands) and LFC (low-frequency components, related to phrase commands). This is done using a third-order high-pass Butterworth filter with a cutoff frequency of 0.5 Hz. Then, independent closed-form solutions are derived for each component, sentence by sentence.
This technique may not work properly for several reasons:
• The filters may not separate the two components properly, because tuning the cutoff frequency is a difficult task. In fact, it is not possible to separate the components by filtering, because the two components can overlap in the frequency domain.
• The extracted parameters may depend on the initial stylization using MOMEL [9]. The initial stylization is necessary due to the continuity constraint required by the filters. This stylization step may cause different commands to be derived for similar contours (inconsistency).
• The sentence-by-sentence extraction approach may produce inconsistent sets of parameters: similar contours with different sets of parameters.
The approach proposed by Silva and Netto is interesting, but it is applied to both components (HFC and LFC) separately, using a sentence-by-sentence approach.
In this paper we propose a closed-form estimation of all the amplitude commands (pulses and impulses) using a global optimization (all sentences are optimized simultaneously), avoiding the problems that may arise with previous approaches: the separation of the two components using filters, the sentence-by-sentence extraction of parameters, and the interpolation of the voiceless segments of the fundamental frequency contour.
This approach is combined with a joint extraction and prediction scheme that we have applied to another intonation model [10]. It allows a more consistent parameterization and increases the performance of the machine learning technique used to predict the commands from the text for TTS. In our case we use classification and regression trees [11], because they provide the necessary flexibility to handle discrete and continuous features.
Section 2 explains the closed-form formulation, and section 3 describes the intonation model training algorithm. Section 4 contains the results of the intonation model applied to different domains. Finally, section 5 presents the conclusions of this work and outlines future work in the area.
2. Closed-form determination of amplitude
parameters
In the Fujisaki’s intonation model it is not possible to obtain a
closed-form solution for all the parameters of the model. However, it is possible to obtain an optimal solution for the amplitude commands assuming that the time instants are known (T0 ,
T1 and T2 ).
The optimal values of the time instants can be found using
grid search or gradient descent techniques. In our case we used
gradient descent, because it provides a more accurate solution.
Figure 2: Update loop for parameter optimization.
The update loop is shown in figure 2. It alternates the closed-form optimization of the amplitude values with an update of the time instants according to the gradient.
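A minimal sketch of such an alternating loop, under our own assumptions: `build_matrices` and `solve_amplitudes` are hypothetical helpers (the latter stands for the closed-form solve of equation 6, sketched after that equation below), and a plain finite-difference step stands in for the gradient computation:

```python
import numpy as np

def fit_commands(f0, T, build_matrices, solve_amplitudes,
                 n_iter=100, lr=1e-3, eps=1e-4):
    """Alternate closed-form amplitude estimation with a gradient-descent
    update of the command time instants packed in the vector T."""
    for _ in range(n_iter):
        Gp, Ga = build_matrices(T)                    # command responses for current times
        lnfb, Ap, Aa = solve_amplitudes(f0, Gp, Ga)   # closed-form step (eq. 6)

        def err(Tc):                                  # squared error for candidate times
            Gpc, Gac = build_matrices(Tc)
            r = f0 - (lnfb + Gpc @ Ap + Gac @ Aa)
            return float(r @ r)

        e0, grad = err(T), np.zeros_like(T)           # finite-difference gradient
        for k in range(len(T)):
            Tk = T.copy()
            Tk[k] += eps
            grad[k] = (err(Tk) - e0) / eps
        T = T - lr * grad                             # gradient step on the time instants
    return lnfb, Ap, Aa, T
```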
The Fujisaki’s intonation model formulation can be expressed as in equation 4.
fˆ0 = (lnfb )u + Gp Ap + Ga Aa
(4)
where:
M : number of points in the f0 contour.
fb : scalar that represents the base frequency.
u : vector of ones (M × 1).
Gp : phrase command component matrix (M × I).
Ap : phrase command amplitude vector (I × 1).
Ga : accent command component matrix (M × J).
Aa : accent command amplitude vector (J × 1).
The f0 vector is the concatenation of all the contours of the training set. The Gp and Ga matrices contain the corresponding components of the phrase and accent commands, respectively.
In our case we apply a global strategy, but the same equations can be applied to the sentence-by-sentence case by building a system of equations for each sentence (in that case the voiceless segments of each fundamental frequency contour must be interpolated).
In order to find the optimal parameters of the model (lnfb ,
Ap and Aa ) we minimize the mean squared error of the approximation shown in equation 5.
e^2 = (f_0 - \hat{f}_0)^T (f_0 - \hat{f}_0)    (5)
Taking the derivatives with respect to lnfb , Ap and Aa , we
obtain a system of equations that can be expressed in matrix
notation as in equation 6. The solution of this system of equations provides the optimal values of the amplitudes of phrase
commands (Ap ), accent commands (Aa ) and base frequency
(lnfb ).
\begin{bmatrix} u^T f_0 \\ G_p^T f_0 \\ G_a^T f_0 \end{bmatrix} = \begin{bmatrix} u^T u & u^T G_p & u^T G_a \\ G_p^T u & G_p^T G_p & G_p^T G_a \\ G_a^T u & G_a^T G_p & G_a^T G_a \end{bmatrix} \begin{bmatrix} \ln f_b \\ A_p \\ A_a \end{bmatrix}    (6)
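For illustration, a minimal NumPy sketch of this solve (our own code, not the authors'). Note that the rows of u, Gp and Ga correspond only to voiced frames, which is how the global strategy avoids unvoiced interpolation:

```python
import numpy as np

def solve_amplitudes(f0, Gp, Ga):
    """Closed-form solution of the normal equations (eq. 6).
    f0: (M,) log-f0 values at voiced frames, all sentences concatenated.
    Gp: (M, I) phrase-command components; Ga: (M, J) accent-command components."""
    M = len(f0)
    X = np.hstack([np.ones((M, 1)), Gp, Ga])        # design matrix [u | Gp | Ga]
    # lstsq solves (X^T X) theta = X^T f0, i.e. equation 6, in a numerically stable way
    theta, *_ = np.linalg.lstsq(X, f0, rcond=None)
    I = Gp.shape[1]
    return theta[0], theta[1:1 + I], theta[1 + I:]  # ln fb, Ap, Aa
```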
3. Intonation model training algorithm
In this section we explain the training algorithm, which analyses the whole corpus, performing simultaneously a joint extraction and prediction of Fujisaki's intonation model parameters.
Two regression trees are grown: one related to minor phrases, the other to accent groups. Each node splits the data using binary questions based on linguistic features, as in [10].

Figure 3: Classes obtained with the linguistic clustering using CART.

We assume we know the segmentation of each utterance into minor phrases and accent groups. Each minor phrase is modeled by one phrase command, and each accent group is modeled by one accent command. The time instant of the phrase
command (T0 ) is defined with respect to the initial boundary of
the minor phrase. The initial time instant of the accent command (T1 ) is defined with respect to the initial boundary of the
stressed syllable, and the duration of the accent command is the
other time parameter (T2). The criterion to grow the trees is the minimization of the MSE.
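As a small illustration of these timing conventions (our own sketch; the class and field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class PhraseCommand:
    T0_offset: float   # onset relative to the minor-phrase initial boundary
    Ap: float          # phrase command amplitude

@dataclass
class AccentCommand:
    T1_offset: float   # onset relative to the stressed-syllable initial boundary
    duration: float    # the other time parameter: T2 = T1 + duration
    Aa: float          # accent command amplitude

def absolute_times(phrase_start, syllable_start, pc, ac):
    """Map the relative timing parameters to absolute command times."""
    T0 = phrase_start + pc.T0_offset
    T1 = syllable_start + ac.T1_offset
    T2 = T1 + ac.duration
    return T0, T1, T2
```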
Thus, each leaf of a tree collects a set of fundamental frequency contours from the training corpus that must be approximated with a command response (command class). Figure 3 shows an example of a contour and an assignment of classes given the linguistic features related to accent groups and minor phrases.
Due to the superpositional nature of the intonation model,
each partition of a tree affects the optimal solutions of the parameters of both trees. Therefore, the optimization must be
jointly performed for phrase and accent commands.
The steps of the algorithm are as follows (a schematic sketch is given after the list):
1. Each tree (accent group tree and minor phrase tree) has a unique root node, which groups all the accent and phrase commands, respectively. For these two classes the optimal solution is found: it approximates all contours with the same phrase command for each minor phrase, and with the same accent command for each accent group. The procedure used to find the parameters that provide a globally optimal approximation to all the fundamental frequency contours is explained in section 2. Note that at this step Gp and Ga are matrices of dimension (M × 1), built as the concatenation of all the phrase and accent commands.
2. All possible questions are examined in the leaves. For each question, the optimal parameters for phrase and accent commands are determined (as shown in section 2), and the approximation error is obtained. The dimension of Gp and Ga increases by one.
3. The splitting questions for phrase and accent command
trees are chosen. The selection criterion of the optimal
node question is the minimization of the approximation
error.
4. Then, the global optimal values for Fujisaki’s filters (α
and β) are searched using a grid of values.
5. The process is iterated from the second step until a minimum number of elements in the leaves is reached or the differential gain in accuracy is lower than a threshold.
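A schematic rendering of this greedy loop, purely as our own sketch: `candidate_splits`, `build_matrices`, `split` and `grid_search_alpha_beta` are hypothetical helpers, and `solve_amplitudes` is the closed-form solve of section 2:

```python
def grow_trees(f0, phrase_tree, accent_tree, min_leaf=20, min_gain=1e-4):
    """Greedy joint growth of the phrase and accent trees (steps 2-5)."""
    current_error = float("inf")
    while True:
        best = None
        for tree in (phrase_tree, accent_tree):            # step 2: try every question
            for leaf, question in tree.candidate_splits(min_leaf):
                Gp, Ga = build_matrices(phrase_tree, accent_tree,
                                        candidate=(tree, leaf, question))
                lnfb, Ap, Aa = solve_amplitudes(f0, Gp, Ga)  # closed-form, section 2
                r = f0 - (lnfb + Gp @ Ap + Ga @ Aa)
                err = float(r @ r)
                if best is None or err < best[0]:
                    best = (err, tree, leaf, question)
        if best is None or current_error - best[0] < min_gain:
            break                                          # step 5: stopping criteria
        current_error, tree, leaf, question = best         # step 3: apply best question
        tree.split(leaf, question)
        grid_search_alpha_beta(f0, phrase_tree, accent_tree)  # step 4: refit alpha, beta
    return phrase_tree, accent_tree
```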
As discussed previously, the global optimization avoids the interpolation step of the stylization process, which can cause a bias in the parameter extraction. Another advantage of the global optimization is the consistency of the parameters: non-consistent parameters increase the dispersion and limit the prediction capabilities of machine learning techniques.
The closed-form formulation also allows a faster training of the intonation model, compared to the gradient-descent optimization of all the parameters in our previous paper [7]. In each optimization, 40% of the parameters are obtained using the closed-form formulation, which contributes to speeding up the training.
4. Experimental results
The experiments were performed for a Spanish female voice using a corpus of 750 sentences. It includes declarative sentences from our previous work [7] and also prompts recorded for a dialogue system.
The utterances were manually segmented into phones. The fundamental frequency contour was obtained from the laryngograph channel. The training set for the experiments comprised 70% of the corpus and the test set the remaining 30%.
The intonation model training technique presented in this paper is compared with another intonation model, based on Bézier curves.
4.1. Bézier intonation model
The goal of the comparison is to show that a model with constrained shapes, such as Fujisaki's intonation model, can obtain objective results similar to those of a technique with more freedom of approximation. Cardeñoso et al. [12] obtained very good results in their paper. We have extended their work to use a superpositional approach.
In the superpositional Bézier intonation model [10], accent groups and minor phrases are modeled using a superposition of two components, each modeled with Bézier curves. The global optimization of the parameters enables the separation of the effects of each component. The joint extraction and prediction algorithm presented in section 3 is also applied to this model.
The advantages of the superpositional Bézier intonation model are the possibility of obtaining a closed-form solution for all the optimal polynomial coefficients and the accuracy of the approximation to the fundamental frequency contour. The accuracy depends on the degree of the polynomial; a fourth-degree polynomial is able to capture the most relevant information of the fundamental frequency contours.
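For illustration, a minimal least-squares fit of a fourth-degree Bézier curve to one voiced f0 segment (our own sketch under these assumptions, not the authors' implementation):

```python
import numpy as np
from math import comb

def bernstein_basis(t, degree=4):
    """Bernstein basis matrix for normalized times t in [0, 1]."""
    t = np.asarray(t, dtype=float)
    return np.stack([comb(degree, k) * t**k * (1.0 - t)**(degree - k)
                     for k in range(degree + 1)], axis=1)   # shape (len(t), degree+1)

def fit_bezier(times, f0, degree=4):
    """Closed-form least-squares control points for one f0 segment."""
    t = (times - times[0]) / (times[-1] - times[0])   # normalize time to [0, 1]
    B = bernstein_basis(t, degree)
    coeffs, *_ = np.linalg.lstsq(B, f0, rcond=None)   # optimal control-point values
    return coeffs
```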
Figure 4: Fujisaki’s approximation of a sentence. Vertical axis
in fundamental frequency contour is in log-scale.
The use of Fujisaki’s intonation model is encouraged based
on its physiological basis. The results obtained in this paper
show that the shapes of the model approximate as good as
Bézier intonation model the fundamental frequency contours.
Figure 4 shows the approximation of a contour using Fujisaki’s
intonation model.
4.2. Objective evaluation
The global results for both approaches are shown in Table 1. The joint parameter extraction and prediction algorithm achieves high performance for both intonation models, taking into account RMSE and correlation measures. The high prediction results stem from the consistency of the parameterization achieved by the global optimization procedure. The advantage of this training approach over two-stage approaches was shown in our previous paper [7], where both the RMSE and the Pearson correlation coefficient improved: RMSE = 21.79 → 18.67, ρ = 0.68 → 0.73.
The intonation model based on Bézier curves performs slightly better. Although Fujisaki's intonation model imposes restrictions on the shape of the commands that reduce the flexibility of the approximation, it achieves essentially the same performance as the Bézier intonation model.
Method                    RMSE [Hz]      ρ
Superpositional Bézier       20.9       0.764
Fujisaki                     21.2       0.761

Table 1: Global results for test data.
4.3. Subjective evaluation
In order to obtain a perceptual measure of the quality of the intonation we performed a listening test. Twelve evaluators were asked to judge the naturalness of the intonation of several sentences on a five-point scale (1: unnatural, 5: natural).
The sentences were produced by resynthesis using Praat [13]. In this way the naturalness is affected only by the predicted intonation contour and the quality of the resynthesis.
The perceptual evaluation yields a MOS of 4.7 for sentences with natural intonation, 3.1 for the Bézier intonation model and 3.2 for Fujisaki's intonation model. Natural intonation was included in the test as a reference for the evaluators and also to verify their competence.
This evaluation shows that the performance achieved in the RMSE and correlation coefficient measures is not high enough to obtain natural contours. The cause of these results is missing input features that cannot be extracted from plain text. The use of resynthesis may also affect the judgement when evaluating synthetic speech.
5. Conclusions and future work
This paper presents a closed-form formulation to calculate optimal command amplitudes for Fujisaki's intonation model, assuming that the time instants are known. The time instants are optimized using a gradient-descent technique. The closed-form formulation allows a faster training and a higher accuracy in the parameter estimation.
The optimization is performed jointly over all the fundamental frequency contours, which brings several advantages: both the stylization step (with its continuity constraint) and the separation of the components using filters are avoided.
This formulation is combined with a novel approach to train intonation models that merges parameter extraction and model generation into a loop. The approach has been successfully applied to Fujisaki's intonation model and to the Bézier intonation model.
The experiments support the theoretical advantages of this approach (consistency, and the avoidance of some assumptions regarding continuity and component separation), with objective results similar to those of a more flexible intonation model (Bézier).
Perceptual experiments show that the predicted contours
have lower scores of naturalness than natural contours. Further
work must be done to improve the naturalness of the contours
and the quality of the resynthesis.
6. References
[1] H. Fujisaki, S. Ohno, and S. Narusawa, “Physiological
mechanisms and biomechanical modeling of fundamental
frequency control for the common Japanese and the standard Chinese,” Proceedings of the 5th Seminar on Speech
Production, pp. 145–148, 2000.
[2] B. Möbius, “Components of a quantitative model of German intonation,” Proceedings of ICPhS, pp. 108–115,
1995.
[3] H. Fujisaki, S. Narusawa, and M. Maruno, “Preprocessing of fundamental frequency contours of speech
for automatic parameter extraction,” Proceedings of ICSLP, pp. 722–725, 2000.
[4] H. Mixdorff, “A novel approach to the fully automatic
extraction of Fujisaki model parameters,” Proceedings of
ICASSP, pp. 1281–1284, 2000.
[5] E. Navas, I. Hernaez, and J. Sanchez, “Basque intonation
modelling for text to speech conversion,” Proceedings of
ICSLP, pp. 2409–2412, 2002.
[6] P. S. Rossi, “A method for automatic extraction of
Fujisaki-model parameters,” Speech Prosody, pp. 615–
618, 2002.
[7] P. D. Agüero, K. Wimmer, and A. Bonafonte, “Joint extraction and prediction of Fujisaki’s intonation model parameters,” Proceedings of ICSLP, 2004.
[8] S. de S. Silva and S. L. Netto, “Closed-form estimation
of the amplitude commands in the automatic extraction of
the Fujisaki’s model,” Proceedings of ICASSP, pp. 621–
624, 2004.
[9] D. Hirst, A. Di Cristo, and R. Espesser, “Levels of representation and levels of analysis for the description of intonation systems,” Prosody: Theory and Experiment. Dordrecht: Kluwer Academic Publishers, 2000.
[10] P. D. Agüero and A. Bonafonte, “Intonation modeling for
TTS using a joint extraction and prediction approach,” 5th
ISCA Speech Synthesis Workshop, pp. 67–72, 2004.
[11] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Chapman & Hall, 1984.
[12] V. C. Payo and D. E. Mancebo, “A strategy to solve data
scarcity problems in corpus based intonation modelling,”
Proceedings of ICASSP, pp. 665–668, 2004.
[13] P. Boersma and D. Weenink, “Praat: doing phonetics by
computer,” http://www.fon.hum.uva.nl/praat/.