16 Phonetic and Timing Considerations in a Swiss High German TTS System

Beat Siebenhaar, Brigitte Zellner Keller, and Eric Keller
Laboratoire d'Analyse Informatique de la Parole (LAIP)
Université de Lausanne, CH-1015 Lausanne, Switzerland
[email protected], [email protected], Eric.Keller@imm.unil.ch

Introduction

The linguistic situation of German-speaking Switzerland shows many differences from the situation in Germany or in Austria. The Swiss dialects are used by everybody in almost every situation; even members of the highest political institution, the Federal Council, speak their local dialect in political discussions on TV. By contrast, spoken Standard German is not a high-prestige variety. It is used for reading aloud, in school, and in contact with people who do not know the dialect. Thus spoken Swiss High German has many features distinguishing it from German and Austrian variants.

If a TTS system respects the language of the people to whom it has to speak, this will improve the acceptability of speech synthesis. Therefore a German TTS system for Switzerland has to consider these peculiarities. As the prestigious dialects are not generally written, the Swiss variant of Standard German is the best choice for a Swiss German TTS system.

At the Laboratoire d'analyse informatique de la parole (LAIP) of the University of Lausanne, such a Swiss High German TTS system is under construction. The dialectal variant to be synthesised is the implicit Swiss High German norm such as might be used by a Swiss teacher. In the context of the linguistic situation of Switzerland this means an adaptation of TTS systems to linguistic reality.

The design of the system closely follows the French TTS system developed at LAIP since 1991, LAIPTTS-F.[1] On a theoretical level, the goal of the German system, LAIPTTS-D, is to see if the assumptions underlying the French system are also applicable to other languages, especially to a lexical stress language such as German. Some considerations on the phonetic and timing levels in designing LAIPTTS-D will be presented here.

[1] Information on LAIPTTS-F can be found at http://www.unil.ch/imm/docs/LAIP/LAIPTTS.html

[Running head: Improvements in Speech Synthesis]

The Phonetic Alphabet

The phonetic alphabet used for LAIPTTS-F corresponds closely to the SAMPA[2] convention. For the German version, this convention had to be extended (a) to cover Swiss phonetic reality; and (b) to aid the transcription of stylistic variation:

1. Long and short variants of vowels represent distinct phonemes in German. There is no simple rule for converting long vowels into short ones. Therefore they are treated as different segments.

2. Lexical stress has a major effect on vowels, but again no simple relation with duration could be identified. Consequently, stressed and non-stressed vowels are treated as different segments, while consonants in stressed or non-stressed syllables are not. Lexical stress, therefore, is a segmental feature of vowels.

3. The phonemes /@l/, /@m/, /@n/ and /@r/ are usually pronounced as syllabic consonants [lt], [mt], [nt] and [6t]. These are shorter than the combination of /@/ and the respective consonant, but longer than the consonant itself.[3] In formal styles, schwa and consonant replace most syllabic consonants, but this is not a 1:1 relation. These findings led to the decision to define the syllabic consonants as special segments.

4. Swiss speakers tend to be much more sensitive to the orthographic representation than German speakers are.
   On the phonetic level, the phonetic set had to be enlarged by a sign for an open /EH/, which is the normal realisation of the grapheme <ä> (Siebenhaar, 1994).

These distinctions result in a phonetic alphabet of 83 distinct segments: 27 consonants, 52 vowels and 4 syllabic consonants. That is almost double the 44 segments used in the French version of LAIPTTS.

[2] Specifications at http://www.phon.ucl.ac.uk/home/sampa/home.htm
[3] [@n] mean 110.2 ms, [nt] mean 90.4 ms; [@m] mean 118.3 ms, [mt] mean 86.8 ms; [@l] mean 100.1 ms, [lt] mean 80.9 ms; [@r] mean 84.4 ms, [rt] mean 58.5 ms

The Timing Model

As drawn up for French (Zellner, 1996; Keller et al., 1997), the LAIP approach to TTS synthesis is first to design a timing model and only then to model the fundamental frequency. The task of the timing component is to compute the temporal structure from an annotated phonetic string. In the case of LAIPTTS-D, this string contains the orthographic punctuation marks, marks for word stress, and the distinction between grammatical and lexical words. The timing model has two components. The first one groups prosodic phrases and identifies pauses; the other calculates segmental durations.

The Design of French LAIPTTS and its Adaptation to German

A series of experiments involving multiple general linear models (GLM) for determinants of French segment duration established seven significant factors that could easily be derived from text input:

(a) the durational class of the preceding segment;
(b) the durational class of the current segment;
(c) the durational class of the subsequent segment;
(d) the durational class of the next segment but one;
(e) the position in the prosodic group of the syllable containing the current segment;
(f) the grammatical status of the word containing the current segment; and
(g) the number of segments in the syllable containing the current segment.

'Durational class' refers to one of nine clusters of typical segmental durations. These factors have been implemented in LAIPTTS-F. In the move to multilingual TTS synthesis, LAIPTTS-D should optimally be based on a similar analysis. Nevertheless, some significant changes had to be considered. The general structure of the German system and its differences from the French system are discussed below.

Database

Ten minutes of read text from a single speaker were manually labelled. The stylistic variants of the text were news, addresses, isolated phrases, fast and slow reading. As the raw segment duration is not normally distributed, the log transformation was chosen for the further calculations. This gave a distribution that was much closer to normal.

Factors Affecting Segmental Duration

To produce a general linear model for timing, the factors with statistical relevance were established in a parametric regression. Most of the factors mentioned in the literature were considered. Non-significant factors were excluded step by step. Table 16.1 shows the factors finally retained in the model of segmental duration in German, compared to the French system.

The Segmental Aspect

Most TTS systems base their analysis and synthesis of segment durations on phonetic characteristics of the segments and on supra-segmental aspects. For the segmental aspects of LAIPTTS-F, Keller and Zellner (1996) chose a different approach. They grouped the segments according to their mean durations and their articulatory definitions. Zellner (1998, pp. 85 ff.) goes one step further and leaves out the articulatory aspect. This grouping is quite surprising. There are categories containing only one segment, for example [S] in fast speech or [o] in normal speech, which have a statistically different length from all other segments. Other groups contain segments as different as [e, a, b, m and t].
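
The effect of the log transformation mentioned under Database can be illustrated with a small sketch. The durations below are hypothetical values invented for illustration (they are not the LAIP recordings), and sample skewness is used here only as one simple indicator of how far a distribution is from symmetric; the chapter does not say which diagnostic the authors used.

```python
import math

def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical segment durations in ms (illustrative only, right-skewed).
raw_ms = [20, 25, 30, 35, 40, 50, 60, 80, 120, 300]
log_ms = [math.log(d) for d in raw_ms]

raw_skew = skewness(raw_ms)
log_skew = skewness(log_ms)
print(f"skewness of raw durations: {raw_skew:.2f}")
print(f"skewness of log durations: {log_skew:.2f}")
```

For right-skewed data of this kind the skewness of the log values is noticeably smaller than that of the raw values, which is the sense in which the transformed distribution is "closer to normal".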

Table 16.1 Factors affecting segmental duration in German and French

 1. German: Durational class of the current segment
    French: Durational class of the current segment
 2. German: Type of segment preceding the current segment
    French: Durational class of the segment preceding the current segment
 3. German: Type of subsequent segment
    French: Durational class of the subsequent segment
 4. German: –
    French: Durational class of the next segment but one
 5. German: Type of syllable containing the current segment
    French: Number of segments in the syllable containing the current segment
 6. German: Position of the segment in the syllable
    French: Position of the segment in the syllable
 7. German: Lexical stress
    French: Syllable containing schwa
 8. German: Grammatical status of the word containing the current segment
    French: Grammatical status of the word containing the current segment
 9. German: Location of the syllable in the word
    French: –
10. German: Position in the prosodic group of the syllable containing the current segment
    French: Position in the prosodic group of the syllable containing the current segment

For three reasons, this classification could not be applied directly to German. First, there are more segments in German than in French. Second, there are the phonological differences of long and short vowels. Third, there are major differences in German between stressed and unstressed vowels. Therefore a more traditional approach of using phonetically different classes was employed initially. Any segment was defined by two parameters, containing 17 or 14 phonetic categories (cf. Riedi, 1998, pp. 50–2). Using these segmental parameters and the parameters for the syllable, word, minor and major prosodic group, a general linear model was built to obtain a timing model. Comparing the real values and the values predicted by the model, a correlation of r = .71 was found. With only 4500 segments, the main problem comes from sparsely populated cells. It was therefore unclear how well the model would generalise.

There were two ways to rectify this situation: one was to record considerably more data, and the other was to switch to the Keller/Zellner model and to group the segments only by their duration. It was decided to do both. Some 1500 additional segments were recorded and manually labelled. The whole set was then clustered according to segment durations. Initially, an analysis of the single segments was conducted. Then, step by step, segments with no significant difference were included in the groups. At first articulatory definitions were considered significant, but it emerged, as Zellner (1998) had found, that this criterion could be dropped, and only the confidence intervals between the segments were taken into account. In the end, there were seven groups of segments, and one for pauses. Table 16.2 shows these groups.

There is no 1:1 relation between stressed and non-stressed vowels. In group seven, stressed and unstressed diphthongs coincide: stressed [`a:] and [`EH:] are in this group, while the unstressed versions are in different groups ([a:] is in group six, [EH:] in group five). There is also no 1:1 relation between long and short vowels. Unaccented long and short [a] and [E] show different distributions. Short [a] and [E] are both in group three, but [a:] is in group six while [E:] is in group five.
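
The step-by-step grouping described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' procedure: the greedy merge over segments sorted by mean duration, the normal-approximation 95% interval, the function names `ci95` and `group_by_duration`, and the durations themselves are all hypothetical choices made for this example.

```python
from statistics import mean, stdev

def ci95(durations):
    """Approximate 95% confidence interval of the mean (normal approximation)."""
    m = mean(durations)
    half = 1.96 * stdev(durations) / len(durations) ** 0.5
    return m - half, m + half

def group_by_duration(samples):
    """Greedily merge segments, sorted by mean duration, whose intervals overlap."""
    ordered = sorted(samples, key=lambda seg: mean(samples[seg]))
    groups = []  # each entry: (segment labels, pooled durations of the group)
    for seg in ordered:
        lo, hi = ci95(samples[seg])
        if groups:
            g_lo, g_hi = ci95(groups[-1][1])
            if lo <= g_hi and g_lo <= hi:  # overlap: no clear difference
                groups[-1][0].append(seg)
                groups[-1][1].extend(samples[seg])
                continue
        groups.append(([seg], list(samples[seg])))
    return [labels for labels, _ in groups]

# Hypothetical durations in ms, invented for illustration only.
samples = {
    "r": [30, 35, 40],
    "6": [32, 36, 41],
    "a:": [100, 110, 120],
    "k": [105, 115, 125],
}
groups = group_by_duration(samples)
print(groups)
```

With these invented values the two short segments fall into one group and the two long segments into another. A real analysis would use far more tokens per segment and a proper significance test on log durations.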

Table 16.2 Phoneme classes with mean, standard deviation, coefficient of variation, count and percentage

Group  Mean (ms)  Std. dev. (ms)  Coeff. of var.  Count      %  Segments
1         36.989          16.463           0.445    363   6.09  [r, 6]
2         50.174          23.131           0.461   1634  27.39  [E, I, i, o, U, u, Y, y, @, j, d, l, ?, v, w]
3         64.797          23.267           0.359   1119  18.76  [`I, `Y, `U, `i:, `y:, O, e, EH, a, ú, |, 6t, h, N, n]
4         73.955          22.705           0.307    553   9.27  [`a, `EH, `E, `O, `ú, i:, u:, g, b]
5         91.337          35.795           0.392   1288  21.59  [`i:, `y:, EH:, e:, |:, o:, u:, mt, nt, lt, t, s, z, f, S, Z, x]
6        111.531          38.132           0.342    384   6.44  [`e:, `|:, `o:, `u:, a:, C, p, k]
7        126.951          41.414           0.326    412   6.91  [`aUu, `auI, `OuI, `a:, `EH:, `a~:, `E~:, `ú~:, `o~:, aUu, aIu, OIu, a~:, E~:, ú~:, o~:, pf, ts]
8        620.542         458.047           0.738    212   3.55  Pause

Keller and Zellner (1996) use the same groups for the influence of the previous and the following segments, as do other systems for input into neural networks. Doing the same with the German data led to an overfitting of the model. Most classes showed only small differences and these were not significant, so the same step-by-step procedure for establishing significant factors as for the segmental influence was performed for the influence of the previous and the following segment. Four classes for the previous segment were distinguished, and three for the following segment:

1. For the previous segment the following classes were distinguished: (a) vowels; (b) affricates and pauses; (c) fricatives and plosives; (d) nasals, liquids, syllabic consonants.

2. The following segment showed influences for (a) pauses; (b) vowels, syllabic consonants and affricates; (c) fricatives, plosives, nasals and liquids.

These three segmental factors explain only 49.5% of the variation of the segments, and 62.1% of the variation including pauses.
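
The two context classifications listed above amount to simple lookup tables. The sketch below encodes them as such; the manner-of-articulation labels ("vowel", "affricate", and so on) and the function name `context_classes` are hypothetical names chosen for this illustration, and the mapping from an individual phonetic symbol to its manner label is assumed to be supplied elsewhere.

```python
# Classes (a)-(d) for the previous segment, as listed in the text.
PREVIOUS_CLASS = {
    "vowel": "a",
    "affricate": "b", "pause": "b",
    "fricative": "c", "plosive": "c",
    "nasal": "d", "liquid": "d", "syllabic_consonant": "d",
}

# Classes (a)-(c) for the following segment, as listed in the text.
FOLLOWING_CLASS = {
    "pause": "a",
    "vowel": "b", "syllabic_consonant": "b", "affricate": "b",
    "fricative": "c", "plosive": "c", "nasal": "c", "liquid": "c",
}

def context_classes(previous_manner, following_manner):
    """Return the (previous, following) context class labels for one segment."""
    return PREVIOUS_CLASS[previous_manner], FOLLOWING_CLASS[following_manner]

# A segment preceded by a nasal and followed by a pause.
print(context_classes("nasal", "pause"))
```

Note that the same manner can fall into different classes depending on whether it precedes or follows the current segment: affricates pattern with pauses as a preceding context but with vowels as a following context.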
The model's predicted segmental durations correlated with the measured durations at r = 0.703 for the segments only, or at r = 0.788 including pauses. This simplified model fits as well as the first model with the articulatory definitions of the segments, but it has the advantage that it has only three instead of six variables, and every variable has only three to eight classes, as compared to 14 to 17 in the first model. The second model is therefore more stable.

The last segmental aspect taken into consideration was the segment's position in the syllable. Besides the position relative to the nucleus, Riedi (1998, p. 52) considers the absolute position as relevant. The data used for the present study indicate that this absolute position is not significant. Three positions with significant differences were found: nucleus, onset, offset. A slightly better fit was achieved when liquids and nasals were considered as belonging to the nucleus.

Aspects at the Syllable Level

For French, the number of segments in the syllable is a relevant factor. For German this aspect was not significant, but it was found that the structure of the syllable containing the current segment is important for every segment. Each of the traditional linguistic distinctions V, CV, VC, CVC was significantly distinct from all others.

Although stress was defined as a segmental feature of vowels, it appeared that a supplementary variable at the syllable level was also significant. For French, LAIPTTS-F distinguishes syllables containing a schwa (0) from those with other vowels (1) as nucleus:

   Ce village est parfois encombré de touristes.
   Ce0=vi1=llage1=est1=par1=fois1=en1=com1=bré1=de0=tou1=ristes1

In addition to the French distinction, a distinction between stressed and unstressed vowels was considered, resulting in three stress classes.
LAIPTTS-D distinguishes syllables with schwa (0), non-stressed syllables (1) and stressed syllables (2):

   Dieses Dorf ist manchmal überschwemmt von Touristen.
   Die1=ses0=Dorf2=ist1=manch2=mal1=ü1=ber0=schwemmt2=von1=Tou1=ris2=ten0

This is not as differentiated as other systems because only the main lexical stress is considered, while others also consider stress levels based on syntactic analysis (Riedi, 1998, p. 53; van Santen, 1998, p. 124).

While Riedi (1998, p. 53) considers the number of syllables in the word and the absolute position of the syllable, this was not significant in the present data. The relative position of the syllable was taken into account: monosyllabic words, first, last and medial syllables of polysyllabic words were distinguished.

The marking of the grammatical status of the word containing the current segment is identical to the French system, which simply distinguishes lexical and grammatical words. Articles, pronouns, prepositions and conjunctions, modal and auxiliary verbs are considered as grammatical words; all others are lexical words. This distinction is the basis for the definition of minor prosodic groups.

Position of the Syllable Relative to Minor and Major Breaks

LAIPTTS does not perform syntactic analysis beyond the simple phrase. Only the grammatical status of words and the length of the prosodic group define the boundaries of prosodic groups. This approach means that the temporal hierarchy is independent of accent and fundamental frequency effects. It is generally agreed that the first of a series of grammatical words normally marks the beginning of a prosodic group. A prosodic break between a grammatical and a lexical word is unlikely except for the rare postpositions.
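
The rule that the first of a series of grammatical words opens a prosodic group can be sketched as below. This is a deliberately simplified illustration: it ignores the length criterion and postpositions mentioned in the text, the function name `minor_groups` is hypothetical, and the grammatical/lexical tags on the example sentence are assigned by hand here on the basis of the word classes listed above.

```python
def minor_groups(words):
    """Split (word, is_grammatical) pairs into minor prosodic groups.

    A new group starts at a grammatical word that follows a lexical word,
    i.e. at the first word of each run of grammatical words.
    """
    groups = []
    previous_was_grammatical = False
    for word, is_grammatical in words:
        starts_group = is_grammatical and not previous_was_grammatical
        if starts_group or not groups:
            groups.append([])
        groups[-1].append(word)
        previous_was_grammatical = is_grammatical
    return groups

# Example sentence from the text; True marks a grammatical word (assumed tags).
sentence = [
    ("Dieses", True), ("Dorf", False),
    ("ist", True), ("manchmal", False), ("überschwemmt", False),
    ("von", True), ("Touristen", False),
]
for group in minor_groups(sentence):
    print(" ".join(group))
```

Under these assumptions the sentence falls into three groups, each opened by a grammatical word, and no break is placed between a grammatical word and the lexical word that follows it.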
The relation between syllables and minor breaks was analysed, revealing three significantly different positions: (a) the first syllable of a minor prosodic group; (b) the last syllable of a minor prosodic group; and (c) a neutral position. These classes are the same as in French. In both languages, segments in the last syllable are lengthened and segments in the first syllable are shortened.

These minor breaks define only a small part of the rhythmic structure. The greater part is covered by the position of syllables in relation to major breaks. A first set of major breaks is defined by punctuation marks, and others are inserted to break up longer phrases. Grosjean and Collins (1979) found that people tend to put these major breaks at the centre of longer phrases.[4] The maximal number of syllables within a major prosodic group is 12, but for different speaking rates, this value has to be adapted. In the French system, there are five pertinent positions: first, second, neutral, penultimate and last syllable in a major phrase. In the German data the difference between the second and neutral syllables was not significant. There are thus four classes in German: (a) shortened first syllables, (b) neutral syllables, (c) lengthened second to last syllables, and (d) even more lengthened last syllables.

Reading Styles

Speaking styles influence many aspects of speech, and should therefore be modelled by TTS systems to improve the naturalness of synthetic speech. For this analysis news, short sentences, addresses, slow and fast reading were recorded. To start with, the analysis distinguished all of these styles, but only the timing of fast and slow reading differed significantly from normal reading. Not all segments differ to the same extent between the two speech rates (Zellner, 1998), and only consonants and vowels were distinguished here: this crude distinction needs to be refined in future studies.

Type of Pause

The model was also intended to predict the length of pauses.
These were included in the analysis, with four classes based on the graphic representation of the text: (a) pauses at paragraph breaks; (b) pauses at full stops; (c) pauses at commas; (d) pauses inserted at other major breaks. This coarse classification produces quite good results. As a further refinement, pauses at commas marking the beginning of a relative clause were reduced to pauses of the fourth degree (d), a simple adjustment that can be done at the text level.

[4] Grosjean confirmed these findings in several subsequent articles with various co-authors.

Results

The model achieves a reasonable explanation of segment durations for this speaker. The Pearson correlation reaches a value of r = 0.844, explaining 71.2% of the overall variance. If pauses are excluded, these values drop to a correlation of r = 0.763 and an explained variance of 58.2%. Compared with the values for the segmental information only, this shows that the main information lies in the segment itself, and that a large amount of the variation is still not explained.

[Figure 16.1 Interaction line plot of differences between predicted and measured data (mean and 95% confidence interval), by segment class. x-axis: segment class (1 to 7); y-axis: cell mean of difference between measured and predicted data, log scale.]

The correlations of Riedi (1998) and van Santen (1998) are somewhat better. This might be explained by the fact that (a) they have a database that is three to four times larger; (b) their speakers are professionals who may read more regularly; (c) the input for their database is more structured due to syntactically-based stress values; (d) the neural network approach handles more exceptions than a linear model. The model proposed here produces acceptable durations, although it still needs considerable refinement.
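
The explained-variance figures quoted in the Results follow from squaring the Pearson correlation. The short sketch below reproduces that arithmetic from the rounded correlations given in the text; the function name `explained_variance_percent` is a hypothetical helper introduced for this example.

```python
def explained_variance_percent(r):
    """Share of variance explained by a linear fit with Pearson correlation r."""
    return round(100 * r ** 2, 1)

with_pauses = explained_variance_percent(0.844)
without_pauses = explained_variance_percent(0.763)

print(f"including pauses: {with_pauses}%")
print(f"excluding pauses: {without_pauses}%")
```

Because the published correlations are themselves rounded, squaring them can differ in the last digit from percentages computed on unrounded values.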

[Figure 16.2 Interaction line plot of differences between predicted and measured data (mean and 95% confidence interval), by stress. x-axis: stress type (schwa, stressed, unstressed); y-axis: cell mean of difference between measured and predicted data, log scale.]

[Figure 16.3 Interaction line plot of differences between predicted and measured data (mean and 95% confidence interval), by grammatical status of the word containing the segment. x-axis: grammatical status of word (g, l); y-axis: cell mean of difference between measured and predicted data, log scale.]

Comparing predicted and actual durations, it seems that the longer segment classes are modelled better than the shorter segment classes (Figure 16.1). Segments in stressed syllables are modelled better than those in unstressed syllables (Figure 16.2), and segments in lexical words are modelled better than those in grammatical words (Figure 16.3). It appears that the different styles or speaking rates can all be modelled in the same manner (Figure 16.4). This approach also predicts the number of pauses and their position quite well, although compared to the natural data it introduces more pauses and in some cases a major break is placed too early.

[Figure 16.4 Interaction line plot of differences between predicted and measured data (mean and 95% confidence interval), by style. x-axis: reading style (speed): fast, neutral, slow; y-axis: cell mean of difference between measured and predicted data, log scale.]

Conclusion

For the timing component of a TTS system, the psycholinguistic approach of Keller and Zellner for French can be transferred to German with minor modifications. The results show that refinement of the model should focus on specific aspects. On the one hand, extending the database may improve the results generally. On the other hand, only specific parts of the model need be refined.
Particular attention should be given to intrinsically short segments, and perhaps different timing models could be used for stressed and non-stressed syllables, or for lexical and grammatical words.

Preliminary tests show that the chosen phonetic alphabet makes it easy to produce different styles by varying the extent of assimilation in the phonetic string: there is no need to build completely different timing models for different speaking styles. The integration of different reading speeds into a single timing model already marks an improvement over the linear shortening of traditional approaches (cf. the accompanying audio examples). The fact that LAIP does not yet have its own diphone database and still uses a Standard German MBROLA database forces us to translate our sophisticated output into a cruder transcription for the sound output. This obscures some contrasts we would have liked to illustrate. First results of the implementation of this TTS system are available at www.unil.ch/imm/docs/LAIP/LAIPTTS_D_SpeechMill_dl.htm.

Acknowledgements

This research was supported by the BBW/OFES, Berne, in conjunction with the COST 258 European Action.

References

Grosjean, F. and Collins, M. (1979). Breathing, pausing, and reading. Phonetica, 36, 98–114.
Keller, E. (1997). Simplification of TTS architecture vs. operational quality. Proceedings of EUROSPEECH '97. Paper 735. Rhodes, Greece. September 1997.
Keller, E. and Zellner, B. (1996). A timing model for fast French. York Papers in Linguistics, 17, 53–75.
Keller, E., Zellner, B. and Werner, S. (1997). Improvements in prosodic processing for speech synthesis. Proceedings of Speech Technology in the Public Telephone Network: Where are we Today? (pp. 73–76). Rhodes, Greece.
Riedi, M. (1998). Controlling Segmental Duration in Speech Synthesis Systems. Doctoral thesis. Zürich: ETH-TIK.
Siebenhaar, B. (1994). Regionale Varianten des Schweizerhochdeutschen. Zeitschrift für Dialektologie und Linguistik, 61, 31–65.
van Santen, J. (1998). Timing. In R. Sproat (ed.), Multilingual Text-to-Speech Synthesis: The Bell Labs Approach (pp. 115–139). Kluwer.
Zellner, B. (1996). Structures temporelles et structures prosodiques en français lu. Revue Française de Linguistique Appliquée: la communication parlée, 1, 7–23.
Zellner, B. (1998). Caractérisation et prédiction du débit de parole en français. Une étude de cas. Unpublished doctoral thesis, University of Lausanne. Available: www.unil.ch/imm/docs/LAIP/ps.files/DissertationBZ.ps