16 Phonetic and Timing Considerations in a Swiss High German TTS System
Beat Siebenhaar, Brigitte Zellner Keller, and Eric Keller
Laboratoire d'Analyse Informatique de la Parole (LAIP)
Université de Lausanne, CH-1015 Lausanne, Switzerland
[email protected], [email protected], Eric.Keller@imm.unil.ch
Introduction
The linguistic situation of German-speaking Switzerland shows many differences
from the situation in Germany or in Austria. The Swiss dialects are used by everybody in almost every situation: even members of the highest political institution,
the Federal Council, speak their local dialect in political discussions on TV. By
contrast, spoken Standard German is not a high-prestige variety. It is used for
reading aloud, in school, and in contact with people who do not know the dialect.
Thus spoken Swiss High German has many features distinguishing it from German
and Austrian variants. If a TTS system respects the language of the people to
whom it has to speak, this will improve the acceptability of speech synthesis.
Therefore a German TTS system for Switzerland has to consider these peculiarities.
As the prestigious dialects are not generally written, the Swiss variant of Standard
German is the best choice for a Swiss German TTS system.
At the Laboratoire d'analyse informatique de la parole (LAIP) of the University
of Lausanne, such a Swiss High German TTS system is under construction. The
dialectal variant to be synthesised is the implicit Swiss High German norm such as
might be used by a Swiss teacher. In the context of the linguistic situation of
Switzerland this means an adaptation of TTS systems to linguistic reality. The
design of the system closely follows the French TTS system developed at LAIP
since 1991, LAIPTTS-F.¹ On a theoretical level the goal of the German system,
LAIPTTS-D, is to see if the assumptions underlying the French system are also
applicable to other languages, especially to a lexical stress language such as
German. Some considerations on the phonetic and timing levels in designing
LAIPTTS-D will be presented here.
¹ Information on LAIPTTS-F can be found at http://www.unil.ch/imm/docs/LAIP/LAIPTTS.html
Improvements in Speech Synthesis
The Phonetic Alphabet
The phonetic alphabet used for LAIPTTS-F corresponds closely to the SAMPA²
convention. For the German version, this convention had to be extended (a) to
cover Swiss phonetic reality; and (b) to aid the transcription of stylistic variation:
1. Long and short variants of vowels represent distinct phonemes in German.
There is no simple relation to change long into short vowels. Therefore they are
treated as different segments.
2. Lexical stress has a major effect on vowels, but again no simple relation with
duration could be identified. Consequently, stressed and non-stressed vowels are
treated as different segments, while consonants in stressed or non-stressed syllables are not. Lexical stress, therefore, is a segmental feature of vowels.
3. The phoneme sequences /@l/, /@m/, /@n/ and /@r/ are usually pronounced as syllabic consonants [l=], [m=], [n=] and [6=]. These are shorter than the combination of /@/ and the
respective consonant, but longer than the consonant itself.³ In formal styles,
schwa and consonant replace most syllabic consonants, but this is not a 1:1
relation. These findings led to the decision to define the syllabic consonants as
special segments.
4. Swiss speakers tend to be much more sensitive to the orthographic representation than German speakers are. On the phonetic level, the phonetic set had to be
enlarged by a sign for an open /EH/ that is the normal realisation of the grapheme
<ä> (Siebenhaar, 1994).
These distinctions result in a phonetic alphabet of 83 distinct segments: 27 consonants, 52 vowels and 4 syllabic consonants. That is almost double the 44 segments
used in the French version of LAIPTTS.
The Timing Model
As drawn up for French (Zellner, 1996; Keller et al., 1997), the LAIP approach to
TTS synthesis is first to design a timing model and only then to model the fundamental frequency. The task of the timing component is to compute the temporal
structure from an annotated phonetic string. In the case of LAIPTTS-D, this string
contains the orthographic punctuation marks, marks for word stress, and the distinction between grammatical and lexical words. The timing model has two components. The first one groups prosodic phrases and identifies pauses; the other
calculates segmental durations.
² Specifications at http://www.phon.ucl.ac.uk/home/sampa/home.htm
³ [@+n] mean = 110.2 ms, [n=] mean = 90.4 ms; [@+m] mean = 118.3 ms, [m=] mean = 86.8 ms; [@+l] mean = 100.1 ms, [l=] mean = 80.9 ms; [@+r] mean = 84.4 ms, [r=] mean = 58.5 ms
The Design of French LAIPTTS and its Adaptation to German
A series of experiments involving multiple general linear models (GLM) for determinants of French segment duration established seven significant factors that could
easily be derived from text input: (a) the durational class of the preceding segment;
(b) the durational class of the current segment; (c) the durational class of the
subsequent segment; (d) the durational class of the next segment but one; (e) the
position in the prosodic group of the syllable containing the current segment; (f)
the grammatical status of the word containing the current segment; and (g) the
number of segments in the syllable containing the current segment. 'Durational
class' refers to one of nine clusters of typical segmental durations.
These factors have been implemented in LAIPTTS-F. In the move to multilingual
TTS synthesis, LAIPTTS-D should optimally be based on a similar analysis.
Nevertheless, some significant changes had to be considered. The general structure
of the German system and its differences from the French system are discussed
below.
Database
Ten minutes of read text from a single speaker were manually labelled. The stylistic
variants of the text were news, addresses, isolated phrases, fast and slow reading.
As the raw segment duration is not normally distributed, the log transformation
was chosen for the further calculations. This gave a distribution that was much
closer to normal.
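As a rough illustration of why the log transformation helps, the following sketch compares the skewness of a small set of durations before and after taking logarithms. The duration values are invented for illustration and are not taken from the LAIP database; `skewness` is a helper defined here, not part of LAIPTTS.

```python
import math

def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical segment durations in ms (illustrative only, not measured data).
durations_ms = [30, 35, 40, 45, 50, 60, 70, 90, 120, 200, 450]
log_durations = [math.log(d) for d in durations_ms]

raw_skew = skewness(durations_ms)
log_skew = skewness(log_durations)
print(f"skewness raw: {raw_skew:.2f}, log: {log_skew:.2f}")
```

For right-skewed data of this kind the log values are noticeably less skewed, which is what is meant by a distribution that is closer to normal.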
Factors Affecting Segmental Duration
To produce a general linear model for timing, the factors with statistical relevance
were established in a parametric regression. Most of the factors mentioned in the
literature were considered. Non-significant factors were excluded step by step. Table
16.1 shows the factors finally retained in the model of segmental duration in
German, compared to the French system.
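The way such a model turns categorical factors into a duration can be sketched as follows, assuming a purely additive model in the log domain. The factor names loosely follow Table 16.1, but the intercept and all effect sizes are hypothetical values chosen for illustration; the fitted LAIPTTS-D coefficients are not reported in this chapter.

```python
import math

# Hypothetical effects on log-duration (natural log of ms); NOT the fitted
# LAIPTTS-D coefficients, which the text does not report.
INTERCEPT = math.log(60.0)
EFFECTS = {
    "durational_class": {1: -0.5, 4: 0.2, 7: 0.75},
    "lexical_stress": {"schwa": -0.2, "unstressed": 0.0, "stressed": 0.15},
    "group_position": {"first": -0.1, "neutral": 0.0, "last": 0.3},
}

def predict_duration_ms(levels):
    """Sum the effect of each factor level in the log domain, then exponentiate."""
    log_duration = INTERCEPT
    for factor, level in levels.items():
        log_duration += EFFECTS[factor][level]
    return math.exp(log_duration)

example = {"durational_class": 7, "lexical_stress": "stressed", "group_position": "last"}
print(round(predict_duration_ms(example), 1))
```

Because the effects add up in the log domain, each factor level acts as a multiplicative lengthening or shortening of the predicted duration.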
The Segmental Aspect
Most TTS systems base their analysis and synthesis of segment durations on
phonetic characteristics of the segments and on supra-segmental aspects.
For the segmental aspects of LAIPTTS-F, Keller and Zellner (1996) chose a different approach. They grouped the segments according to their mean durations
and their articulatory definitions. Zellner (1998, pp. 85 ff.) goes one step further and leaves out the articulatory aspect. This grouping is quite surprising. There are categories containing only one segment, for example [S] in
fast speech or [o] in normal speech, which have a statistically different length
from all other segments. Other groups contain segments as different as [e, a, b, m
and t].
Table 16.1 Factors affecting segmental duration in German and French

German                                          French
Durational class of the current segment         Durational class of the current segment
Type of segment preceding the current segment   Durational class of the segment preceding
                                                  the current segment
Type of subsequent segment                      Durational class of the subsequent segment
–                                               Durational class of the next segment but one
Type of syllable containing the current         Number of segments in the syllable
  segment                                         containing the current segment
Position of the segment in the syllable         Position of the segment in the syllable
Lexical stress                                  Syllable containing schwa
Grammatical status of the word containing       Grammatical status of the word containing
  the current segment                             the current segment
Location of the syllable in the word            –
Position in the prosodic group of the           Position in the prosodic group of the
  syllable containing the current segment         syllable containing the current segment

For three reasons, this classification could not be applied directly to German:
First, there are more segments in German than in French. Second, there are the
phonological differences of long and short vowels. Third, there are major differences in German between stressed and unstressed vowels. Therefore a more traditional approach of using phonetically different classes was employed initially. Any
segment was defined by two parameters, containing 17 or 14 phonetic categories (cf.
Riedi, 1998, pp. 50–52). Using these segmental parameters and the parameters for the
syllable, word, minor and major prosodic group, a general linear model was built to
obtain a timing model. Comparing the real values and the values predicted by the
model, a correlation of r = .71 was found. With only 4 500 segments, the main
problem comes from sparsely populated cells. It was therefore unclear how well the model
would generalise. There were two ways to rectify this situation: one was to
record quite a bit more data, and the other was to switch to the Keller/Zellner
model and to group the segments only by their duration. It was decided to do both.
Some 1500 additional segments were recorded and manually labelled. The whole
set was then clustered according to segment durations. Initially, an analysis of the
single segments was conducted. Then, step by step, segments with no significant
difference were included in the groups. At first articulatory definitions were considered significant, but it emerged, as Zellner (1998) had found, that this criterion could be dropped, and only the confidence intervals between the segments were
taken into account. In the end, there were 7 groups of segments, and 1 for pauses.
Table 16.2 shows these groups.
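The grouping step can be approximated by the sketch below. It is a simplification under stated assumptions: segments are sorted by mean duration and merged into the current group whenever their approximate 95% confidence interval overlaps that group's upper bound, whereas the procedure described above tested significance step by step. The per-segment statistics are hypothetical, and the function names are introduced here for illustration.

```python
import math

def confidence_interval(mean, sd, n, z=1.96):
    """Approximate 95% confidence interval for a mean duration."""
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

def group_by_overlap(stats):
    """Merge segments, sorted by mean, while their interval overlaps the current group."""
    groups = []
    group_upper = None
    for label, mean, sd, n in sorted(stats, key=lambda s: s[1]):
        lower, upper = confidence_interval(mean, sd, n)
        if groups and lower <= group_upper:
            groups[-1].append(label)
            group_upper = max(group_upper, upper)
        else:
            groups.append([label])
            group_upper = upper
    return groups

# Hypothetical per-segment statistics: (label, mean in ms, sd in ms, count).
toy_stats = [
    ("r", 37.0, 16.0, 100), ("6", 38.0, 17.0, 100),
    ("E", 50.0, 20.0, 400), ("I", 51.0, 22.0, 400),
    ("S", 95.0, 30.0, 100),
]
print(group_by_overlap(toy_stats))
```

A chained merge of this kind can join segments whose own intervals do not overlap directly, so the resulting groups would still need to be checked against the data.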
There is no 1:1 relation between stressed and non-stressed vowels. Stressed and
unstressed diphthongs coincide in group seven, but stressed [`a:] and [`EH:] are also in this
group, while their unstressed versions are in different groups ([a:] is in group six, [EH:]
in group five). There is also no 1:1 relation between long and short vowels. Unaccented long and short [a] and [EH] show different distributions. Short [a] and [EH]
are both in group three, but [a:] is in group six while [EH:] is in group five.
Table 16.2 Phoneme class with mean, standard deviation, coefficient of variation, count, percentage

Group  Mean (ms)  Standard        Coefficient   Count  %      Segments
                  deviation (ms)  of variation
1       36.989     16.463         0.445           363   6.09  [r, 6]
2       50.174     23.131         0.461         1 634  27.39  [E, I, i, o, U, u, Y, y, @, j, d, l, ?, v, w]
3       64.797     23.267         0.359         1 119  18.76  [`I, `Y, `U, `i:, `y:, O, e, EH, a, ú, |, 6=, h, N, n]
4       73.955     22.705         0.307           553   9.27  [`a, `EH, `E, `O, `ú, i:, u:, g, b]
5       91.337     35.795         0.392         1 288  21.59  [`i:, `y:, EH:, e:, |:, o:, u:, m=, n=, l=, t, s, z, f, S, Z, x]
6      111.531     38.132         0.342           384   6.44  [`e:, `|:, `o:, `u:, a:, C, p, k]
7      126.951     41.414         0.326           412   6.91  [`aU, `aI, `OI, `a:, `EH:, `a~:, `E~:, `ú~:, `o~:, aU, aI, OI, a~:, E~:, ú~:, o~:, pf, ts]
8      620.542    458.047         0.738           212   3.55  Pause

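The derived columns of Table 16.2 can be checked from the others: the coefficient of variation is the standard deviation divided by the mean, and the percentage is each group's count over the total count. The sketch below recomputes both from the values printed in the table; the variable names are introduced here for illustration.

```python
# Rows of Table 16.2: group -> (mean, standard deviation, reported CV, count, reported %).
TABLE_16_2 = {
    1: (36.989, 16.463, 0.445, 363, 6.09),
    2: (50.174, 23.131, 0.461, 1634, 27.39),
    3: (64.797, 23.267, 0.359, 1119, 18.76),
    4: (73.955, 22.705, 0.307, 553, 9.27),
    5: (91.337, 35.795, 0.392, 1288, 21.59),
    6: (111.531, 38.132, 0.342, 384, 6.44),
    7: (126.951, 41.414, 0.326, 412, 6.91),
    8: (620.542, 458.047, 0.738, 212, 3.55),  # pauses
}

total_count = sum(row[3] for row in TABLE_16_2.values())

for group, (mean, sd, reported_cv, count, reported_pct) in TABLE_16_2.items():
    cv = sd / mean                      # coefficient of variation
    pct = 100.0 * count / total_count   # share of all labelled items
    print(group, round(cv, 3), reported_cv, round(pct, 2), reported_pct)
```

The pause class stands out with a coefficient of variation roughly twice that of any segment class, which is consistent with pauses being treated by a separate set of factors.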
Keller and Zellner (1996) use the same groups for the influence of the previous
and the following segments, as do other systems for input into neural networks.
Doing the same with the German data led to an overfitting of the model. Most
classes showed only small differences and these were not significant, so the same
step-by-step procedure for establishing significant factors as for the segmental influence was performed for the influence of the previous and the following segment.
Four classes for the previous segment were distinguished, and three for the
following segment:
1. For the previous segment the following classes were distinguished: (a) vowels;
(b) affricates and pauses; (c) fricatives and plosives; (d) nasals, liquids, syllabic
consonants.
2. The following segment showed influences for (a) pauses; (b) vowels, syllabic
consonants and affricates; (c) fricatives, plosives, nasals and liquids.
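The two context factors listed above amount to simple lookup tables. A minimal sketch, assuming each neighbouring segment has already been assigned one of the broad phonetic types named in the lists (the type labels and function name are hypothetical):

```python
# Classes (a) to (d) for the previous segment, as listed in the text.
PREVIOUS_CLASS = {
    "vowel": "a",
    "affricate": "b", "pause": "b",
    "fricative": "c", "plosive": "c",
    "nasal": "d", "liquid": "d", "syllabic_consonant": "d",
}

# Classes (a) to (c) for the following segment, as listed in the text.
FOLLOWING_CLASS = {
    "pause": "a",
    "vowel": "b", "syllabic_consonant": "b", "affricate": "b",
    "fricative": "c", "plosive": "c", "nasal": "c", "liquid": "c",
}

def context_classes(previous_type, following_type):
    """Return the (previous, following) context classes for one segment."""
    return PREVIOUS_CLASS[previous_type], FOLLOWING_CLASS[following_type]

print(context_classes("pause", "vowel"))
print(context_classes("nasal", "plosive"))
```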
These three segmental factors explain only 49.5% of the variation of the segments,
and 62.1% of the variation including pauses. The model's predicted segmental
durations correlated with the measured durations at r = 0.703 for the segments
only, or at r = 0.788 including pauses. This simplified model fits as well as the first
model with the articulatory definitions of the segments, but it has the advantage
that it has only three instead of six variables, and every variable only has three to
eight classes, as compared to 14 to 17 of the first model. The second model is
therefore more stable.
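The explained variance and the correlation quoted in this paragraph are linked by the usual relation that the proportion of variance explained equals the squared correlation between predicted and measured values. A short check with the reported figures (variable names are introduced here for illustration):

```python
# (description, reported correlation r, reported explained variance in %)
reported = [
    ("segments only", 0.703, 49.5),
    ("including pauses", 0.788, 62.1),
]

for label, r, reported_pct in reported:
    explained_pct = 100.0 * r * r
    print(f"{label}: r^2 = {explained_pct:.1f}% (reported {reported_pct}%)")
```

The small difference for the first value is within what would be expected if r was rounded to three decimals before being reported.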
The last segmental aspect taken into consideration was the segment's position in
the syllable. Besides the position relative to the nucleus, Riedi (1998, p. 52) considers the absolute position as relevant. The data used for the present study indicate
that this absolute position is not significant. Three positions with significant differences were found: nucleus, onset, offset. A slightly better fit was achieved when
liquids and nasals were considered as belonging to the nucleus.
Aspects at the Syllable Level
For French, the number of segments in the syllable is a relevant factor. For
German this aspect was not significant, but it was found that the structure of
the syllable containing the current segment is important for every segment. Each
of the traditional linguistic distinctions V, CV, VC, CVC was significantly distinct
from all others.
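A possible way to derive the syllable type is sketched below. It assumes that each syllable arrives as a list of 'C' and 'V' flags, that long vowels, diphthongs and syllabic consonants count as a single 'V', and that consonant clusters collapse to one C; none of these conventions is spelled out in the text.

```python
def syllable_type(segment_kinds):
    """Map a syllable given as 'C'/'V' flags, e.g. ['C', 'C', 'V', 'C'], to V, CV, VC or CVC."""
    nucleus = segment_kinds.index("V")  # first vocalic segment; raises ValueError if none
    has_onset = nucleus > 0
    has_coda = "C" in segment_kinds[nucleus:]
    return ("C" if has_onset else "") + "V" + ("C" if has_coda else "")

print(syllable_type(["C", "V"]))
print(syllable_type(["C", "C", "V", "C", "C"]))
```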
Although stress was defined as a segmental feature of vowels, it appeared that a
supplementary variable at the syllable level was also significant. For French,
LAIPTTS-F distinguishes syllables containing a schwa (0) from those with other
vowels (1) as nucleus:
Ce village est parfois encombré de touristes.
Ce0/vi1/llage1/est1/par1/fois1/en1/com1/bré1/de0/tou1/ristes1
In addition to the French distinction, a distinction between stressed and unstressed
vowels was considered, resulting in three stress classes. LAIPTTS-D distinguishes
syllables with schwa (0), non-stressed syllables (1) and stressed syllables (2):
Dieses Dorf ist manchmal überschwemmt von Touristen.
Die1/ses0/Dorf2/ist1/manch2/mal1/ü1/ber0/schwemmt2/von1/Tou1/ris2/ten0
This is not as differentiated as other systems because only the main lexical stress is
considered, while others also consider stress levels based on syntactic analysis
(Riedi, 1998, p. 53; van Santen, 1998, p. 124).
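The three-way stress coding can be expressed as a small function. In the sketch below the schwa and stress flags for each syllable of the German example sentence are assigned by hand to mirror the annotation given above; in a running system they would come from the phonetic transcription, and the function name is hypothetical.

```python
def stress_class(nucleus_is_schwa, is_stressed):
    """0 = syllable with schwa, 1 = non-stressed syllable, 2 = stressed syllable."""
    if nucleus_is_schwa:
        return 0
    return 2 if is_stressed else 1

# (syllable, nucleus is schwa, carries main lexical stress), hand-annotated.
SYLLABLES = [
    ("Die", False, False), ("ses", True, False), ("Dorf", False, True),
    ("ist", False, False), ("manch", False, True), ("mal", False, False),
    ("ü", False, False), ("ber", True, False), ("schwemmt", False, True),
    ("von", False, False), ("Tou", False, False), ("ris", False, True),
    ("ten", True, False),
]

classes = [stress_class(schwa, stressed) for _, schwa, stressed in SYLLABLES]
print("/".join(f"{syl}{cls}" for (syl, _, _), cls in zip(SYLLABLES, classes)))
```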
While Riedi (1998, p. 53) considers the number of syllables in the word and
the absolute position of the syllable, this was not significant in the present data.
The relative position of the syllable was taken into account: monosyllabic words,
first, last and medial syllables of polysyllabic words were distinguished.
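The relative position of a syllable in its word reduces to four labels. A minimal sketch, assuming the number of syllables per word is already known and is at least one (the label strings are chosen here for illustration):

```python
def syllable_positions(n_syllables):
    """Label each syllable of a word as mono, first, medial or last."""
    if n_syllables == 1:
        return ["mono"]
    return ["first"] + ["medial"] * (n_syllables - 2) + ["last"]

print(syllable_positions(1))  # e.g. "Dorf"
print(syllable_positions(3))  # e.g. "Tou-ris-ten"
```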
The marking of the grammatical status of the word containing the current
segment is identical to the French system which simply distinguishes lexical
and grammatical words. Articles, pronouns, prepositions and conjunctions,
modal and auxiliary verbs are considered as grammatical words, all others are
lexical words. This distinction is the basis for the definition of minor prosodic
groups.
Position of the Syllable Relative to Minor and Major Breaks
LAIPTTS does not perform syntactic analysis beyond the simple phrase. Only the
grammatical status of words and the length of the prosodic group define the
boundaries of prosodic groups. This approach means that the temporal hierarchy
is independent of accent and fundamental frequency effects. It is generally agreed
that the first of a series of grammatical words normally marks the beginning of a
prosodic group. A prosodic break between a grammatical and a lexical word is
unlikely except for the rare postpositions. The relation between syllables and minor
breaks was analysed, revealing three significantly different positions: (a) the first
syllable of a minor prosodic group; (b) the last syllable of a minor prosodic group;
and (c) a neutral position. These classes are the same as in French. In both languages, segments in the last syllable are lengthened and segments in the first syllable are shortened.
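The grouping rule described above can be sketched as follows. This is a simplification: a new minor prosodic group is opened at the first of a series of grammatical words, and the length criterion that LAIPTTS also applies is ignored. The part-of-speech labels and the tagging of the example sentence are assigned by hand and are assumptions made for illustration.

```python
# Word classes treated as grammatical words, following the list in the text.
GRAMMATICAL_POS = {
    "article", "pronoun", "preposition", "conjunction",
    "modal_verb", "auxiliary_verb",
}

def minor_groups(tagged_words):
    """Open a new group at each grammatical word that follows a lexical word."""
    groups = []
    previous_was_grammatical = False
    for word, pos in tagged_words:
        is_grammatical = pos in GRAMMATICAL_POS
        if not groups or (is_grammatical and not previous_was_grammatical):
            groups.append([])
        groups[-1].append(word)
        previous_was_grammatical = is_grammatical
    return groups

# Hand-tagged example sentence (tags are illustrative).
sentence = [
    ("Dieses", "pronoun"), ("Dorf", "noun"), ("ist", "auxiliary_verb"),
    ("manchmal", "adverb"), ("überschwemmt", "verb"),
    ("von", "preposition"), ("Touristen", "noun"),
]
print(minor_groups(sentence))
```

Within each resulting group, the first and last syllables would then receive the shortened and lengthened positions, and all other syllables the neutral one.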
These minor breaks define only a small part of the rhythmic structure. The greater
part is covered by the position of syllables in relation to major breaks. A first set of
major breaks is defined by punctuation marks, and others are inserted to break up
longer phrases. Grosjean and Collins (1979) found that people tend to put these
major breaks at the centre of longer phrases.⁴ The maximal number of syllables
within a major prosodic group is 12, but for different speaking rates, this value has to
be adapted. In the French system, there are five pertinent positions: first,
second, neutral, penultimate and last syllable in a major phrase. In the German
data the difference between the second and neutral syllables was not significant.
There are thus four classes in German: (a) shortened first syllables, (b) neutral
syllables, (c) lengthened second to last syllables, and (d) even more lengthened last
syllables.
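One way to realise the insertion of additional major breaks is sketched below. It assumes that a phrase is represented by the syllable counts of its minor prosodic groups and that a phrase longer than the maximum is split at the minor-group boundary nearest its centre; the text states only that breaks tend to fall at the centre of longer phrases, so the exact rule is an assumption. The labelling function also assumes that final positions take priority in very short groups.

```python
MAX_SYLLABLES = 12  # value given in the text; to be adapted for other speaking rates

def split_major_group(minor_group_lengths, max_syllables=MAX_SYLLABLES):
    """Recursively split a phrase at the minor-group boundary nearest its centre."""
    total = sum(minor_group_lengths)
    if total <= max_syllables or len(minor_group_lengths) < 2:
        return [minor_group_lengths]
    best_cut, best_distance, running = 1, None, 0
    for i, length in enumerate(minor_group_lengths[:-1], start=1):
        running += length
        distance = abs(running - total / 2)
        if best_distance is None or distance < best_distance:
            best_cut, best_distance = i, distance
    return (split_major_group(minor_group_lengths[:best_cut], max_syllables)
            + split_major_group(minor_group_lengths[best_cut:], max_syllables))

def major_position_labels(n_syllables):
    """German classes: first, neutral, penultimate, last (needs n_syllables >= 1)."""
    labels = ["neutral"] * n_syllables
    labels[0] = "first"
    if n_syllables >= 2:
        labels[-2] = "penultimate"  # assumption: overrides "first" in 2-syllable groups
    labels[-1] = "last"
    return labels

print(split_major_group([3, 5, 2, 4, 3]))
print(major_position_labels(5))
```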
Reading Styles
Speaking styles influence many aspects of speech, and should therefore be modelled
by TTS systems to improve the naturalness of synthetic speech. For this analysis,
news, short sentences, addresses, slow and fast reading were recorded. To start
with, the analysis distinguished all of these styles, but only the timing of fast and
slow reading differed significantly from normal reading. Not all segments differ to
the same extent between the two speech rates (Zellner, 1998), and only consonants
and vowels were distinguished here: this crude distinction needs to be refined in
future studies.
Type of Pause
The model was also intended to predict the length of pauses. These were included
in the analysis, with four classes based on the graphic representation of the text: (a)
pauses at paragraph breaks; (b) pauses at full stops; (c) pauses at commas; (d)
pauses inserted at other major breaks. This coarse classification produces quite
good results. As a further refinement, pauses at commas marking the beginning of
a relative clause were reduced to pauses of the fourth degree (d), a simple adjustment that can be done at the text level.
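The pause classification, including the adjustment for relative clauses, can be written as a short lookup. The boundary labels and the `starts_relative_clause` flag are hypothetical names; how a relative clause is detected at the text level is not specified here.

```python
PAUSE_CLASS = {
    "paragraph": "a",   # pauses at paragraph breaks
    "full_stop": "b",   # pauses at full stops
    "comma": "c",       # pauses at commas
    "inserted": "d",    # pauses inserted at other major breaks
}

def pause_class(boundary, starts_relative_clause=False):
    """Return the pause class (a) to (d) for a boundary type."""
    if boundary == "comma" and starts_relative_clause:
        return "d"  # commas opening a relative clause are reduced to class (d)
    return PAUSE_CLASS[boundary]

print(pause_class("comma"))
print(pause_class("comma", starts_relative_clause=True))
```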
Results
The model achieves a reasonable explanation of segment durations for this speaker.
The Pearson correlation reaches a value of r = 0.844, explaining 71.2% of the
overall variance. If pauses are excluded, these values drop to a correlation of r =
0.763 and an explained variance of 58.2%. Compared with the values for the segmental information only, this shows that the main information lies in the segment
itself, and that a large amount of the variation is still not explained. The correlations of Riedi (1998) and van Santen (1998) are somewhat better. This might be
explained by the fact that (a) they have a database that is three to four times larger;
(b) their speakers are professionals who may read more regularly; (c) the input for
their database is more structured due to syntactically-based stress values; (d) the
neural network approach handles more exceptions than a linear model. The model
proposed here produces acceptable durations, although it still needs considerable
refinement.
⁴ Grosjean confirmed these findings in several subsequent articles with various co-authors.
Figure 16.1 Interaction line plot of differences between predicted and measured data (mean and 95% confidence interval), by segment class. Vertical axis: cell mean of difference between measured and predicted data, log scale; horizontal axis: segment class (1 to 7).
Figure 16.2 Interaction line plot of differences between predicted and measured data (mean and 95% confidence interval), by stress. Vertical axis: cell mean of difference between measured and predicted data, log scale; horizontal axis: stress type (schwa, stressed, unstressed).
Figure 16.3 Interaction line plot of differences between predicted and measured data (mean and 95% confidence interval), by grammatical status of the word containing the segment. Vertical axis: cell mean of difference between measured and predicted data, log scale; horizontal axis: grammatical status of word (g, l).
Comparing predicted and actual durations, it seems that the longer segment
classes are modelled better than the shorter segment classes (Figure 16.1). Segments
in stressed syllables are modelled better than those in unstressed syllables (Figure
16.2), and segments in lexical words are modelled better than those in grammatical
words (Figure 16.3). It appears that the different styles or speaking rates can all
be modelled in the same manner (Figure 16.4). This approach also predicts
the number of pauses and their position quite well, although compared to the
natural data it introduces more pauses and in some cases a major break is placed
too early.
Figure 16.4 Interaction line plot of differences between predicted and measured data (mean and 95% confidence interval), by style. Vertical axis: cell mean of difference between measured and predicted data, log scale; horizontal axis: reading style (fast, neutral, slow).
Conclusion
For the timing component of a TTS system, the psycholinguistic approach of
Keller and Zellner for French can be transferred to German with minor modifications.
The results show that refinement of the model should focus on specific aspects.
On the one hand, extending the database may improve the results generally. On the
other hand, only specific parts of the model need be refined. Particular attention
should be given to intrinsically short segments, and perhaps different timing models
could be used for stressed and non-stressed syllables, or for lexical and grammatical
words.
Preliminary tests show that the chosen phonetic alphabet makes it easy to produce different styles by varying the extent of assimilation in the phonetic string:
there is no need to build completely different timing models for different speaking
styles. The integration of different reading speeds into a single timing model
already marks an improvement over the linear shortening of traditional approaches
(cf. the accompanying audio examples). The fact that LAIP does not yet
have its own diphone database and still uses a Standard German MBROLA database forces us to translate our sophisticated output into a cruder transcription
for the sound output. This obscures some contrasts we would have liked to illustrate.
First results of the implementation of this TTS system are available at www.unil.ch/imm/docs/LAIP/LAIPTTS_D_SpeechMill_dl.htm.
Acknowledgements
This research was supported by the BBW/OFES, Berne, in conjunction with the
COST 258 European Action.
References
Grosjean, F. and Collins, M. (1979). Breathing, pausing, and reading. Phonetica, 36, 98–114.
Keller, E. (1997). Simplification of TTS architecture vs. operational quality. Proceedings of EUROSPEECH '97. Paper 735. Rhodes, Greece. September 1997.
Keller, E. and Zellner, B. (1996). A timing model for fast French. York Papers in Linguistics, 17, 53–75.
Keller, E., Zellner, B. and Werner, S. (1997). Improvements in prosodic processing for speech synthesis. Proceedings of Speech Technology in the Public Telephone Network: Where are we Today? (pp. 73–76) Rhodes, Greece.
Riedi, M. (1998). Controlling Segmental Duration in Speech Synthesis Systems. Doctoral thesis. Zürich: ETH-TIK.
Siebenhaar, B. (1994). Regionale Varianten des Schweizerhochdeutschen. Zeitschrift für Dialektologie und Linguistik, 61, 31–65.
van Santen, J. (1998). Timing. In R. Sproat (ed.), Multilingual Text-to-Speech Synthesis: The Bell Labs Approach (pp. 115–139). Kluwer.
Zellner, B. (1996). Structures temporelles et structures prosodiques en français lu. Revue Française de Linguistique Appliquée: la communication parlée, 1, 7–23.
Zellner, B. (1998). Caractérisation et prédiction du débit de parole en français. Une étude de cas. Unpublished doctoral thesis, University of Lausanne. Available: www.unil.ch/imm/docs/LAIP/ps.files/DissertationBZ.ps