Consistent Estimation of Fujisaki's Intonation Model Parameters

Pablo Daniel Agüero and Antonio Bonafonte
TALP Research Center, Signal Theory and Communications Department
Universitat Politècnica de Catalunya, Barcelona, Spain
[email protected]

Abstract

This paper presents a novel method to estimate an intonation model based on the representation of the fundamental frequency contour proposed by Fujisaki. Unlike other methods, this approach does not find the commands sentence by sentence; instead, it estimates the commands by simultaneously analysing all the phrase and accent components in the corpus that belong to the same linguistic class. The classes are defined top-down using CART, based on linguistic features derived from the text. The method avoids unvoiced interpolation, stylization and the separation of the high- and low-frequency components of the f0 contours; all this preprocessing can produce ambiguities and noisy commands. Furthermore, the commands are very consistent, because all the components of a given linguistic class are represented by the same command. In addition, for a given class definition, we present a closed formulation to derive the amplitudes of the commands assuming that the time instants are known. This can also be used in sentence-by-sentence analysis to speed up the extraction.

1. Introduction

The intonation model is an important component of text-to-speech systems. A correct fundamental frequency contour increases the intelligibility and naturalness of the system. Several intonation models have been proposed in the literature. In this paper we work with Fujisaki's intonation model.

Fujisaki's intonation model is based on a physiological explanation of the generation of the fundamental frequency contour [1]. The physical model consists of two critically damped second-order filters (whose cutoff frequencies depend on α and β). Impulses and pulses (phrase and accent commands, respectively) are used as excitations to produce the long-term (Gp(t)) and short-term (Ga(t)) components that can be observed in the fundamental frequency contour. A scheme is shown in figure 1. The mathematical formulation is given in equations 1, 2 and 3.

\ln F_0(t) = \ln f_b + \sum_{i=1}^{I} A_{pi}\, G_p(t - T_{0i}) + \sum_{j=1}^{J} A_{aj}\, \bigl[ G_a(t - T_{1j}) - G_a(t - T_{2j}) \bigr]   (1)

G_p(t) = \begin{cases} \alpha^2 t\, e^{-\alpha t} & t \ge 0 \\ 0 & t < 0 \end{cases}   (2)

G_a(t) = \begin{cases} \min\bigl[ 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \bigr] & t \ge 0 \\ 0 & t < 0 \end{cases}   (3)

(This work has been partially sponsored by the European Union under grant FP6-506738 (TC-STAR project, http://www.tc-star.org) and the Spanish Government under grant TIC2002-04447-C02 (ALIADO project, http://gps-tsc.upc.es/veu/aliado).)

Figure 1: Fujisaki's intonation model.

The extraction of the impulse and pulse parameters from a given fundamental frequency contour is not an easy task. A closed-form solution does not exist for the model. Usually a first estimation is performed and then refined using gradient-descent techniques. Several combinations of pulses and impulses can approximate the original contour, depending on the initial values given to the gradient-descent algorithm. As a result, the extracted parameters can be inconsistent: sentences with similar structure and contour may end up with different sets of impulses and pulses. In addition, extracting the parameters sentence by sentence requires interpolation techniques to fill the unvoiced segments with frequency values. It is not advisable to fit parameters to a contour with missing data, because the error cannot be evaluated at those missing points.
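For illustration only, the following minimal sketch implements equations (1)-(3) numerically; the 10 ms frame rate and all parameter and command values are arbitrary examples, not values taken from this paper.

```python
# Minimal sketch of Fujisaki's model (equations 1-3). The frame rate and all
# parameter/command values below are illustrative assumptions.
import numpy as np

def phrase_response(t, alpha=2.0):
    """Gp(t): impulse response of the phrase-control filter (equation 2)."""
    return np.where(t >= 0, alpha ** 2 * t * np.exp(-alpha * t), 0.0)

def accent_response(t, beta=20.0, gamma=0.9):
    """Ga(t): step response of the accent-control filter, clipped at gamma (equation 3)."""
    r = np.minimum(1.0 - (1.0 + beta * t) * np.exp(-beta * t), gamma)
    return np.where(t >= 0, r, 0.0)

def fujisaki_lnf0(t, lnfb, phrase_cmds, accent_cmds, alpha=2.0, beta=20.0):
    """Equation 1: base value plus phrase (impulse) and accent (pulse) contributions."""
    lnf0 = np.full_like(t, lnfb)
    for Ap, T0 in phrase_cmds:                      # phrase commands: (amplitude, onset)
        lnf0 += Ap * phrase_response(t - T0, alpha)
    for Aa, T1, T2 in accent_cmds:                  # accent commands: (amplitude, onset, offset)
        lnf0 += Aa * (accent_response(t - T1, beta) - accent_response(t - T2, beta))
    return lnf0

t = np.arange(0.0, 3.0, 0.01)                       # 3 s utterance, 10 ms frames
lnf0 = fujisaki_lnf0(t, np.log(180.0),
                     phrase_cmds=[(0.5, 0.0)],
                     accent_cmds=[(0.4, 0.6, 1.0), (0.3, 1.8, 2.3)])
f0 = np.exp(lnf0)                                   # contour in Hz
```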
As a consequence, the extracted commands could take any shape on those unvoiced segments, and the inconsistency of the parameterization would increase.

Several papers have addressed the problem of parameter extraction [2][3][4][5][6]. In general, these algorithms perform an initial sentence-by-sentence parameterization of the fundamental frequency contours of the database and then generate an intonation model that predicts a set of parameters based on linguistic information extracted from the text. These parameters are used to synthesize a suitable fundamental frequency contour for the utterance. Agüero et al. [7] propose a new approach that combines parameter extraction and model generation into a single loop; with this approach the performance of the intonation model is better.

A recent paper by Silva and Netto [8] proposed a closed-form estimation of the command amplitudes of Fujisaki's intonation model, assuming that the time instants of the commands are known. A closed-form solution for a group of parameters is useful because it reduces the time that gradient-descent algorithms need to converge to the optimal solution. In order to obtain the solution, Silva and Netto proposed (based on the work of Mixdorff [4]) to decompose the fundamental frequency contour into two components: HFC (high-frequency components, related to accent commands) and LFC (low-frequency components, related to phrase commands). This is done using a third-order high-pass Butterworth filter with a cutoff frequency of 0.5 Hz. Then, independent closed-form solutions are derived for each component, sentence by sentence. This technique may not work properly for several reasons:

• The filters may not separate the two components properly, because tuning the cutoff frequency is a difficult task. In fact, it is not possible to separate the components by filtering, because the two components can overlap in the frequency domain.

• The extracted parameters may depend on the initial stylization using MOMEL [9]. This initial stylization is necessary because of the continuity constraint required by the filters, and it may cause different commands to be derived for similar contours (inconsistency).

• The sentence-by-sentence extraction may produce inconsistent parameter sets: similar contours with different sets of parameters.

The approach proposed by Silva and Netto is interesting, but it is applied to the two components (HFC and LFC) separately, using a sentence-by-sentence approach. In this paper we propose a closed-form estimation of all the command amplitudes (pulses and impulses) using a global optimization (all sentences are optimized simultaneously), avoiding the problems that may arise with previous approaches: the separation of the two components using filters, the sentence-by-sentence extraction of parameters, and the interpolation of the unvoiced segments of the fundamental frequency contour. This approach is combined with a joint extraction and prediction scheme that we have already applied to another intonation model [10]. It allows a more consistent parameterization and increases the performance of the machine learning technique used to predict the commands from the text for TTS. In our case we use classification and regression trees [11], because they provide the necessary flexibility to handle discrete and continuous features.

Section 2 explains the closed-form formulation, and section 3 describes the intonation model training algorithm. Section 4 contains the results of the intonation model applied to different domains. Finally, section 5 presents the conclusions of this work and future work in the area.
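As a point of reference, the following is a minimal sketch of the HFC/LFC filtering step of [4][8] discussed above, i.e. the step that the global approach proposed here avoids; the 100 Hz frame rate, the zero-phase filtering and the requirement of a continuous (interpolated) ln F0 contour are assumptions of this sketch, not details taken from those papers.

```python
# Rough sketch of the HFC/LFC decomposition used in prior work. The contour
# must already be continuous (interpolated/stylized); fs = 100 Hz assumes
# 10 ms f0 frames, and zero-phase filtering is an assumption of this sketch.
import numpy as np
from scipy.signal import butter, filtfilt

def split_hfc_lfc(lnf0, fs=100.0, cutoff=0.5, order=3):
    """Split a continuous ln F0 contour into high- and low-frequency parts."""
    b, a = butter(order, cutoff, btype='highpass', fs=fs)
    hfc = filtfilt(b, a, lnf0)     # high-frequency part, related to accent commands
    lfc = lnf0 - hfc               # low-frequency part, related to phrase commands
    return hfc, lfc
```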
2. Closed-form determination of amplitude parameters

In Fujisaki's intonation model it is not possible to obtain a closed-form solution for all the parameters of the model. However, it is possible to obtain an optimal solution for the command amplitudes assuming that the time instants (T0, T1 and T2) are known. The optimal values of the time instants can be found using grid search or gradient-descent techniques. In our case we used gradient descent, because it provides a more accurate solution.

Figure 2: Update loop for parameter optimization.

The update loop is shown in figure 2. It combines the closed-form optimization of the amplitude values with the update of the time instants according to the gradient. Fujisaki's intonation model can be expressed as in equation 4:

\hat{f}_0 = (\ln f_b)\, u + G_p A_p + G_a A_a   (4)

where:
M : number of points in the f0 contour.
fb : scalar that represents the base frequency.
u : vector of ones (M × 1).
Gp : phrase command component matrix (M × I).
Ap : phrase command amplitude vector (I × 1).
Ga : accent command component matrix (M × J).
Aa : accent command amplitude vector (J × 1).

The f0 vector is the concatenation of all the contours of the training set, and the Gp and Ga matrices contain the corresponding components of the phrase and accent commands, respectively. In our case we apply a global strategy, but the same equations can be applied to the sentence-by-sentence case by building a system of equations for each sentence (in that case the unvoiced segments of each fundamental frequency contour must be interpolated).

In order to find the optimal parameters of the model (ln fb, Ap and Aa), we minimize the squared error of the approximation, shown in equation 5:

e^2 = (f_0 - \hat{f}_0)^T (f_0 - \hat{f}_0)   (5)

Taking the derivatives with respect to ln fb, Ap and Aa, we obtain a system of equations that can be expressed in matrix notation as in equation 6. The solution of this system provides the optimal values of the phrase command amplitudes (Ap), the accent command amplitudes (Aa) and the base frequency (ln fb).

\begin{bmatrix} u^T f_0 \\ G_p^T f_0 \\ G_a^T f_0 \end{bmatrix} =
\begin{bmatrix} u^T u & u^T G_p & u^T G_a \\ G_p^T u & G_p^T G_p & G_p^T G_a \\ G_a^T u & G_a^T G_p & G_a^T G_a \end{bmatrix}
\begin{bmatrix} \ln f_b \\ A_p \\ A_a \end{bmatrix}   (6)
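A minimal sketch of how equation (6) can be solved numerically is given below; it assumes that the component matrices Gp (M × I) and Ga (M × J) have already been built for the current time instants, and the variable and function names are illustrative.

```python
# Sketch of the closed-form amplitude estimation: equation (6) is the system
# of normal equations of the linear model in equation (4), so it can be solved
# directly once Gp and Ga are known. Names are illustrative.
import numpy as np

def solve_amplitudes(lnf0, Gp, Ga):
    """Return (ln fb, Ap, Aa) minimizing the squared error of equation (5).

    lnf0: (M,) concatenated target contour in the log domain; Gp: (M, I); Ga: (M, J).
    """
    u = np.ones((lnf0.shape[0], 1))
    X = np.hstack([u, Gp, Ga])                     # design matrix [u | Gp | Ga]
    theta = np.linalg.solve(X.T @ X, X.T @ lnf0)   # equation (6); np.linalg.lstsq is a more robust option
    lnfb = theta[0]
    Ap = theta[1:1 + Gp.shape[1]]
    Aa = theta[1 + Gp.shape[1]:]
    return lnfb, Ap, Aa
```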
3. Intonation model training algorithm

In this section we explain the training algorithm, which analyses the whole corpus and performs simultaneously a joint extraction and prediction of Fujisaki's intonation model parameters. Two regression trees are grown: one related to minor phrases and the other to accent groups. Each node splits the data using binary questions based on linguistic features, as in [10]. We assume that the segmentation of each utterance into minor phrases and accent groups is known. Each minor phrase is modeled by one phrase command, and each accent group is modeled by one accent command. The time instant of the phrase command (T0) is defined with respect to the initial boundary of the minor phrase. The initial time instant of the accent command (T1) is defined with respect to the initial boundary of the stressed syllable, and the other time parameter (T2) is given by the duration of the accent command.

The criterion used to grow the trees is the minimization of the MSE. Thus, each leaf of a tree collects a set of fundamental frequency contours from the training corpus that must be approximated with a single command response (command class). Figure 3 shows an example of a contour and an assignment of classes given the linguistic features related to accent groups and minor phrases.

Figure 3: Classes obtained with the linguistic clustering using CART.

Due to the superpositional nature of the intonation model, each partition of a tree affects the optimal solutions of the parameters of both trees. Therefore, the optimization must be performed jointly for phrase and accent commands. The steps of the algorithm are:

1. Each tree (accent group tree and minor phrase tree) has a single root node, which groups all the accent and phrase commands. For these two classes the optimal solution is found: it approximates all contours with the same phrase command for each minor phrase and the same accent command for each accent group. Section 2 explains the procedure used to find the parameters that provide a globally optimal approximation to all the fundamental frequency contours. Note that at this step Gp and Ga are matrices of dimension (M × 1), built from all the phrase and accent commands respectively.

2. All possible questions are examined in the leaves. For each question, the optimal parameters for phrase and accent commands are determined (as shown in section 2) and the approximation error is obtained; the dimension of Gp and Ga increases by one. A schematic sketch of this step is given at the end of this section.

3. The splitting questions for the phrase and accent command trees are chosen. The selection criterion for the optimal node question is the minimization of the approximation error.

4. Then, the globally optimal values of Fujisaki's filter parameters (α and β) are searched using a grid of values.

5. The process is iterated from the second step until a minimum number of elements per leaf is reached or the gain in accuracy falls below a threshold.

As discussed previously, the global optimization avoids the interpolation step of the stylization process, which can cause a bias in the parameter extraction. Another advantage of global optimization is the consistency of the parameters: inconsistent parameters increase the dispersion and limit the prediction capabilities of machine learning techniques. The closed-form formulation allows faster training of the intonation model compared to the gradient-descent optimization of all parameters in our previous paper [7]. In each optimization, 40% of the parameters are obtained using the closed-form formulation, which speeds up the training.
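To make step 2 more concrete, the following is a schematic sketch of how the class-level component matrices can be built when all the commands in a leaf share one amplitude; the command representation and helper names are illustrative assumptions, not code from the paper.

```python
# Sketch of step 2: commands are grouped into classes (tree leaves) and each
# class shares one amplitude, so each column of Gp/Ga is the summed response
# of all commands assigned to that class over the concatenated contour.
import numpy as np

def build_class_matrix(t, commands, class_of, n_classes, response):
    """t: (M,) concatenated time axis; commands: list of time-parameter tuples;
    class_of[k]: leaf (class) index of command k; response(t, cmd): its Gp/Ga curve."""
    G = np.zeros((t.shape[0], n_classes))
    for k, cmd in enumerate(commands):
        G[:, class_of[k]] += response(t, cmd)
    return G

# For each candidate question, the class assignments are recomputed, Gp and Ga
# are rebuilt with build_class_matrix, the amplitudes are obtained from the
# linear system of equation (6) (e.g. with the solve_amplitudes sketch above),
# and the question with the lowest global squared error is kept (step 3);
# alpha and beta are then re-tuned on a grid of values (step 4).
```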
4. Experimental results

The experiments were performed on a Spanish female voice using a corpus of 750 sentences. It includes the enunciative sentences of our previous work [7] and also prompts recorded for a dialogue system. The utterances were manually segmented into phones, and the fundamental frequency contours were obtained from the laryngograph channel. The training set contained 70% of the corpus and the test set the remaining 30%. The intonation model training technique presented in this paper is compared with another intonation model based on Bézier curves.

4.1. Bézier intonation model

The goal of the comparison is to show that a model with constrained shapes, such as Fujisaki's intonation model, can obtain objective results similar to those of a technique with greater freedom of approximation. Cardeñoso et al. [12] obtained very good results with this kind of model, and we have extended their work to use a superpositional approach. In the superpositional Bézier intonation model [10], accent groups and minor phrases are modeled using a superposition of two components, each of them modeled with Bézier curves. The global optimization of the parameters enables the separation of the effects of each component. The joint extraction and prediction algorithm presented in section 3 is also applied to this model. The advantages of the superpositional Bézier intonation model are the possibility of obtaining a closed-form solution for all the optimal polynomial coefficients and the accuracy of the approximation to the fundamental frequency contour. The accuracy depends on the degree of the polynomial; a fourth-degree polynomial is able to capture the most relevant information of the fundamental frequency contours.

The use of Fujisaki's intonation model is encouraged by its physiological basis. The results obtained in this paper show that the shapes of the model approximate the fundamental frequency contours as well as the Bézier intonation model does. Figure 4 shows the approximation of a contour using Fujisaki's intonation model.

Figure 4: Fujisaki's approximation of a sentence. The vertical axis of the fundamental frequency contour is in log scale.

4.2. Objective evaluation

The global results for both approaches are shown in Table 1. The joint parameter extraction and prediction algorithm performs well for both intonation models in terms of RMSE and correlation. The high prediction accuracy results from the consistency of the parameterization achieved by the global optimization procedure. The better performance of this training approach over two-stage approaches was shown in our previous paper [7], where the RMSE and Pearson correlation coefficient were improved: RMSE = 21.79 → 18.67, ρ = 0.68 → 0.73. The intonation model based on Bézier curves has a slightly better performance. Although Fujisaki's intonation model has restrictions on the shape of the commands that reduce the flexibility of the approximation, it achieves practically the same performance as the Bézier intonation model.

Method                    RMSE [Hz]   ρ
Superpositional Bézier    20.9        0.764
Fujisaki                  21.2        0.761

Table 1: Global results for test data.

4.3. Subjective evaluation

In order to obtain a perceptual measure of the quality of the intonation, we performed a listening test. Twelve evaluators were asked to judge the naturalness of the intonation of several sentences using a five-point scale (1: unnatural, 5: natural). The sentences were produced by resynthesis using Praat [13], so that the naturalness is only affected by the predicted intonation contour and the quality of the resynthesis. The perceptual evaluation yields a MOS of 4.7 for sentences with natural intonation, 3.1 for the Bézier intonation model and 3.2 for Fujisaki's intonation model. Natural intonation was included in the test as a reference and also to verify the competence of the evaluators. This evaluation shows that the performance in terms of RMSE and correlation is not high enough to obtain natural contours. One cause of these results is missing input features that cannot be extracted from plain text. The use of resynthesis may also affect the judgement when evaluating synthetic speech.
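For completeness, a minimal sketch of how the objective measures in Table 1 (RMSE in Hz and Pearson correlation ρ between natural and predicted contours) can be computed is given below; restricting the comparison to voiced frames is an assumption of this sketch.

```python
# Sketch of the objective measures of section 4.2: RMSE in Hz and Pearson
# correlation between natural and predicted f0. Evaluating only voiced frames
# is an assumption of this sketch.
import numpy as np

def f0_metrics(f0_ref, f0_pred, voiced):
    """f0_ref, f0_pred: (M,) contours in Hz; voiced: boolean mask of voiced frames."""
    r, p = f0_ref[voiced], f0_pred[voiced]
    rmse = np.sqrt(np.mean((r - p) ** 2))
    rho = np.corrcoef(r, p)[0, 1]
    return rmse, rho
```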
5. Conclusions and future work

This paper presents a closed-form formulation to calculate optimal command amplitudes for Fujisaki's intonation model, assuming that the time instants are known; the time instants are optimized using a gradient-descent technique. The closed-form formulation allows faster training and higher accuracy in the parameter estimation. The optimization is performed jointly for all the fundamental frequency contours, which has several advantages: the stylization step (with its continuity constraint) and the separation of the components using filters are both avoided. This formulation is combined with a novel approach to training intonation models that combines parameter extraction and model generation into a loop. The approach has been successfully applied to Fujisaki's intonation model and to a Bézier intonation model.

The experiments support the theoretical advantages of this approach (consistency and the avoidance of some assumptions regarding continuity and component separation), with objective results similar to those of a more flexible intonation model (Bézier). Perceptual experiments show that the predicted contours obtain lower naturalness scores than natural contours. Further work must be done to improve the naturalness of the contours and the quality of the resynthesis.

6. References

[1] H. Fujisaki, S. Ohno, and S. Narusawa, "Physiological mechanisms and biomechanical modeling of fundamental frequency control for the common Japanese and the standard Chinese," Proceedings of the 5th Seminar on Speech Production, pp. 145–148, 2000.
[2] B. Möbius, "Components of a quantitative model of German intonation," Proceedings of ICPhS, pp. 108–115, 1995.
[3] H. Fujisaki, S. Narusawa, and M. Maruno, "Preprocessing of fundamental frequency contours of speech for automatic parameter extraction," Proceedings of ICSLP, pp. 722–725, 2000.
[4] H. Mixdorff, "A novel approach to the fully automatic extraction of Fujisaki model parameters," Proceedings of ICASSP, pp. 1281–1284, 2000.
[5] E. Navas, I. Hernaez, and J. Sanchez, "Basque intonation modelling for text to speech conversion," Proceedings of ICSLP, pp. 2409–2412, 2002.
[6] P. S. Rossi, "A method for automatic extraction of Fujisaki-model parameters," Speech Prosody, pp. 615–618, 2002.
[7] P. D. Agüero, K. Wimmer, and A. Bonafonte, "Joint extraction and prediction of Fujisaki's intonation model parameters," Proceedings of ICSLP, 2004.
[8] S. de S. Silva and S. L. Netto, "Closed-form estimation of the amplitude commands in the automatic extraction of the Fujisaki's model," Proceedings of ICASSP, pp. 621–624, 2004.
[9] D. Hirst, A. Di Cristo, and R. Espesser, "Levels of representation and levels of analysis for the description of intonation systems," in Prosody: Theory and Experiment. Dordrecht: Kluwer Academic Publishers, 2000.
[10] P. D. Agüero and A. Bonafonte, "Intonation modeling for TTS using a joint extraction and prediction approach," 5th ISCA Speech Synthesis Workshop, pp. 67–72, 2004.
[11] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Chapman & Hall, 1984.
[12] V. C. Payo and D. E. Mancebo, "A strategy to solve data scarcity problems in corpus based intonation modelling," Proceedings of ICASSP, pp. 665–668, 2004.
[13] P. Boersma and D. Weenink, "Praat: doing phonetics by computer," http://www.fon.hum.uva.nl/praat/.