Speech Communication 36 (2002) 169–180
www.elsevier.com/locate/specom
Mid-sagittal cut to area function transformations: Direct
measurements of mid-sagittal distance and area with MRI
A. Soquet a,*, V. Lecuit b, T. Metens c, D. Demolin a
a Laboratoire de Phonologie, Université Libre de Bruxelles, 50 Av. F.D. Roosevelt, CP 175, 1050 Brussels, Belgium
b Laboratoire de Phonétique Expérimentale, Institut des Langues Vivantes et de Phonétique, Université Libre de Bruxelles, Brussels, Belgium
c Unité de Résonance Magnétique, Hôpital Erasme, Université Libre de Bruxelles, Brussels, Belgium
Received 25 June 1999; received in revised form 3 June 2000; accepted 15 September 2000
Abstract
This paper presents a comparative study of transformations used to compute the area of cross-sections of the vocal
tract from the mid-sagittal measurements of the vocal tract. MRI techniques have been used to obtain both mid-sagittal
distances and cross-sections of the vocal tract for French oral vowels uttered by two subjects. The measured cross-sectional areas can thus be compared to the cross-sectional areas computed by the different transformations. The
evaluation is performed with a jackknife method where the parameters of the transformation are estimated from all but
one measurement of a speaker’s vocal tract region and evaluated on the remaining measurement. This procedure allows
the study of both the performance of the different forms of transformation as a function of the vocal tract region and
the stability of the transformation parameters for a given vocal tract region. Three different forms of transformation are
compared: linear, polynomial and power function. The estimation performances are also compared with four existing
transformations. © 2002 Elsevier Science B.V. All rights reserved.
Résumé
This article presents a comparative study of different transformations used to compute the cross-sectional area of the vocal tract from the sagittal distance. The sagittal distances and the cross-sectional areas of the vocal tract were measured on slices obtained by magnetic resonance imaging for the French oral vowels pronounced by two speakers. The measured area can thus be compared with the areas computed by the different transformations. The evaluation is carried out with a jackknife technique: the parameters of the transformation are estimated for a vocal tract region from all the data but one measurement, which is then used to evaluate the transformation. This procedure makes it possible to study both the performance of the transformations and the stability of their parameters for each region of the vocal tract. Three different forms of transformation were compared: linear, polynomial and power function. The performances of four existing transformations are also presented.
© 2002 Elsevier Science B.V. All rights reserved.
Keywords: Mid-sagittal profile; Area function; Articulatory data
* Corresponding author. Tel.: +32-2-650-20-18; fax: +32-2-650-20-07.
E-mail address: [email protected] (A. Soquet).
0167-6393/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0167-6393(00)00084-4
A. Soquet et al. / Speech Communication 36 (2002) 169–180
1. Introduction
Until recently, most articulatory data consisted
of sagittal information either in the form of sagittal
X-ray projection images (Chiba and Kajiyama,
1941; Fant, 1960), or movement of structures in the
oral cavity obtained by point tracking methods
such as the X-ray microbeam (Fujimura et al.,
1973) or magnetometers (Schönle et al., 1987). The
availability of this sagittal information had two
major consequences for speech research: (i) the
development of so-called articulatory models describing the possible geometry of the sagittal cuts of
the vocal tract and often based on sagittal X-ray
projections (Mermelstein, 1973; Maeda, 1978); (ii)
the search for a transformation relating the sagittal distance to the cross-sectional area, derived,
for example, from plaster casts of the oral cavity
(Ladefoged et al., 1971; Sundberg et al., 1987),
from measurements on cadavers (Heinz and Stevens, 1965) or from X-ray computed tomography
(CT) (Johanson et al., 1983; Sundberg et al., 1987).
The generation of area functions from measurements of the sagittal section is an important
step in the study of the relationship between
vocal tract geometry and speech acoustics. Many
authors have proposed transformations aimed at
performing this particular task. The use of imaging
techniques seems to be the solution of choice for
studying these transformations: if the imaging
plane is adequately placed, it allows one to measure both the mid-sagittal distance and the cross-sectional area of live speaking subjects. Sundberg
et al. (1987), for example, used axial CT to study
the pharynges of two subjects.
For this study, we used an MRI sequence that
permits 14 scans of 4 mm thickness to be taken
simultaneously. Important characteristics of this
sequence are that (i) scans can be placed all along
the vocal tract, and (ii) the study of static vocal
tract configurations can be made during only one
sustained phonation and does not require the repeated
phonations reported in numerous studies (see,
for example, Baer et al., 1991; Lakshminarayanan
et al., 1991; Greenwood et al., 1992). On these
scans, both the mid-sagittal distance and the cross-sectional area can be measured. Hence, it is possible to study the transformations between the
mid-sagittal distance and the actual cross-sectional
area.
In this paper, we investigated different forms of
transformation, and compared their performances
with those of four published transformations.
We do not use the published transformations in the
hope that they might be adequate for all subjects:
these transformations are known
to be speaker-specific. The objective here is to
evaluate the performances of unadapted versus
speaker-specific transformations.
Most transformations going from the mid-sagittal distance to the cross-sectional area are based
on the original transformation defined by Heinz
and Stevens (1965), which is
$$A(x) = a\,d(x)^{b}, \tag{1}$$
where d is the mid-sagittal distance, A the cross-sectional area, x the position along the vocal tract
mid-line, and a and b are the two parameters of
the transformation.
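Taking logarithms of Eq. (1) makes the role of the two parameters explicit. This is a standard identity rather than a step stated in the text, and it assumes d(x) > 0 and A(x) > 0.

```latex
\ln A(x) = \ln a + b\,\ln d(x)
```

In this form, a is the area obtained for a unit sagittal distance and b is the slope in log-log coordinates: b = 1 gives an area proportional to d, and b = 2 gives a purely quadratic dependence on d.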
For the pharynx, Johanson et al. (1983) proposed a linear relationship between the square of
the mid-sagittal distance and the cross-sectional
area,
$$A(x) = p_0 + p_2\,d(x)^{2}. \tag{2}$$
In general, the authors adapt the value of the
transformation parameters to the speaker and the
position along the vocal tract mid-line. Some extend this dependence to the mid-sagittal distance
itself (Fant, 1992; Perrier et al., 1992; Beautemps
et al., 1995).
Thus, the transformations rely upon an accurate distinction of the different regions of the vocal
tract. Hence, we divided the vocal tract into eight
regions: larynx, low-pharynx, mid-pharynx, oro-pharynx, velum, hard palate, alveolar region, and
labial region. The limit between the mid- and oro-pharynx is defined as half the distance from the
top of the epiglottis to the velum (Perrier et al.,
1992). The other boundaries are placed according
to the corresponding articulators.
Fig. 1 represents these different regions on a mid-sagittal contour of the vocal tract. This division into eight regions allows accurate implementation of most of the published transformations.
Fig. 1. Representation of the different regions of the vocal tract on a mid-sagittal profile for the vowel [u] uttered by the female subject.
2. Material
2.1. MRI data
The magnetic resonance images were acquired at the Magnetic Resonance Unit of the Hôpital Erasme, Université Libre de Bruxelles, on a 1.5 T MRI system with a quadrature Head–Neck coil (Philips Gyroscan NT ACS, Best, The Netherlands). We used a sequence allowing simultaneous multi-stack acquisition of up to 14 slices of 4 mm thickness in less than 14 s. Slices can be grouped into different stacks, and each stack can have a different orientation.
MR images were acquired for one female speaker (subject 1) and one male speaker (subject 2), both native French speakers living in Brussels. The task of the subjects was to sustain a vowel during the acquisition sequence. The reference was a word that contained the vowel to be pronounced. This reference word was given orally by one of the experimenters a few seconds before the recording session. For both speakers, data were collected for the 10 French oral vowels [i, e, ɛ, a, o, ɔ, u, y, ø, œ].
For each vowel, the 14 MRI scans were distributed along the vocal tract in three stacks. The first stack was in the transverse plane, contained 6 slices and covered the larynx to the mid-pharynx. The second stack was in a coronal–oblique plane, contained 3 slices and covered the oro-pharynx and the velum. The third stack was in the coronal plane, contained 5 slices and covered the hard palate to the labial region.
For each vowel, the three stacks were placed orthogonal to the mid-line of the vocal tract estimated on a mid-sagittal scan of the vowel pronounced by the subject.
Figs. 2 and 3 show the position of the 14 scans on the mid-sagittal profile and the resulting 14 scans, respectively, for the vowel [u] uttered by the female subject and for the vowel [i] uttered by the male subject. On these scans, it is possible to measure both the mid-sagittal distance and the cross-sectional area.
2.2. Measurement of mid-sagittal distance and cross-sectional area
Until now, there has been no automatic and reliable method to determine the mid-sagittal distance and the cross-sectional area of a section of the vocal tract on an MRI scan. Measurements were carried out following a procedure first devised for the treatment of mid-sagittal profiles of the vocal tract (Soquet et al., 1996). Outlines of the sections are traced by hand on a transparent sheet. By means of a digitization tablet, the outlines are entered into the computer and each area is computed by a polygon surface computation algorithm. The digitization process may be biased by human factors. A test on the reproducibility of the measurement of the outlined area was made. Areas were computed for series of 10 repetitions of the same measurement on three different outlines corresponding to three different reference sections of known areas: large, medium and small. The results are displayed in Table 1. The mean and standard deviation are given for each section. Results show that the standard deviation is similar in the three cases and is lower than 0.005 cm². This measurement reproducibility can be considered to be satisfactory for our purposes. Similarly, the sagittal distance was measured on the outlines of the sections.
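The polygon surface computation mentioned above is not specified further. A minimal sketch, assuming the digitized outline is an ordered list of (x, y) vertices in centimetres and assuming the shoelace formula is an acceptable stand-in for the algorithm actually used, could look as follows. The function name `polygon_area` and the sample outlines are hypothetical.

```python
from typing import Sequence, Tuple


def polygon_area(outline: Sequence[Tuple[float, float]]) -> float:
    """Area enclosed by a closed outline (shoelace formula).

    `outline` holds the digitized vertices in order (clockwise or
    counter-clockwise); the last vertex is implicitly joined to the first.
    """
    n = len(outline)
    if n < 3:
        return 0.0  # fewer than three points enclose no area
    twice_area = 0.0
    for i in range(n):
        x0, y0 = outline[i]
        x1, y1 = outline[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


# Hypothetical outlines: a 2 cm x 1 cm rectangle and a right triangle.
rectangle = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(polygon_area(rectangle), polygon_area(triangle))
```

The absolute value makes the result independent of the direction in which the outline was traced on the tablet.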
Fig. 2. Position of the 14 scans on the mid-sagittal profile for the vowel [u] uttered by the female subject along with the corresponding
scans.
Fig. 3. Position of the 14 scans on the mid-sagittal profile for the vowel [i] uttered by the male subject along with the corresponding
scans.
An example of a measurement superimposed on the corresponding MRI slice is given in Fig. 4. In the lower pharynx region, the area and the corresponding sagittal distance were limited by the epiglottis. The contour of the teeth was approximated when necessary during the outlining of the section, using data on tooth size and location obtained from plaster casts and visual estimates.
Table 1
Reproducibility of the digitization process (areas in cm²)
Repetition            Section 1    Section 2    Section 3
1                     2.177        0.507        0.064
2                     2.178        0.502        0.070
3                     2.176        0.508        0.073
4                     2.175        0.508        0.072
5                     2.177        0.511        0.070
6                     2.174        0.503        0.066
7                     2.183        0.507        0.073
8                     2.188        0.507        0.067
9                     2.176        0.503        0.068
10                    2.181        0.509        0.069
Mean                  2.179        0.506        0.069
Standard deviation    0.0044       0.0030       0.0030
Fig. 4. Measurement of the sagittal distance and the cross-sectional area on an MRI scan.
3. Method
3.1. Selection of sagittal to area transformations
Among the many transformations proposed in the literature, we chose four: the transformations designed by Maeda (1978, 1990); Sundberg (1969) and Sundberg et al. (1987); Fant (1992); and Perrier et al. (1992). This choice was motivated by the variety of techniques used by the authors to define their transformation. Maeda’s model is based on the study of 1000 sagittal profiles corresponding to 10 sentences uttered by one French female speaker. Sundberg’s model is based on the study of X-ray tomographic data from one male and one female Swedish speaker and plaster casts from three male and three female Swedish speakers; we used the data from the subject Male 2 (see Sundberg et al., 1987). Fant’s model is based on the study of X-ray lateral views supported by limited X-ray tomographic data from a Swedish male subject. Perrier’s model is based on the study of a vocal tract cast for large sagittal dimensions and on CT scans of the vocal tract constriction regions of one male speaker for the three cardinal vowels [i, a, u] of French.
For the labial region, among the four studied transformations, only Maeda (1978, 1990) and Fant (1992) provide a transformation to convert the lip height to the lip area.
3.2. Speaker-specific transformation
We have investigated three possible forms of the transformation. The first was a linear relationship,
$$A(x) = l_0 + l_1\,d(x), \tag{3}$$
where l0 and l1 are the parameters.
The second was a polynomial transformation,
$$A(x) = p_0 + p_1\,d(x) + p_2\,d(x)^{2}, \tag{4}$$
where p0, p1 and p2 are the parameters. The order of the polynomial was limited to two in order to have enough data to estimate the transformation parameters reliably.
The third transformation was the classical
power function, as in Eq. (1), where the parameters are a and b.
For each transformation, the parameter values
depend on the speaker and on the region in the
vocal tract.
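The three forms can be fitted to a set of (d, A) measurements for one speaker and one region with ordinary least squares. The sketch below is illustrative only: the estimation method is not stated in the text, so least squares via `numpy.polyfit` is an assumption, as is fitting the power function in log-log coordinates (which minimizes the error on ln A rather than on A). The function names and the sample data are hypothetical.

```python
import numpy as np


def fit_linear(d, area):
    """Least-squares fit of A = l0 + l1 * d; returns (l0, l1)."""
    l1, l0 = np.polyfit(d, area, 1)  # polyfit returns the highest power first
    return l0, l1


def fit_polynomial(d, area):
    """Least-squares fit of A = p0 + p1 * d + p2 * d**2; returns (p0, p1, p2)."""
    p2, p1, p0 = np.polyfit(d, area, 2)
    return p0, p1, p2


def fit_power(d, area):
    """Fit of A = a * d**b as a straight line in log-log coordinates; returns (a, b).

    Assumes strictly positive distances and areas.
    """
    b, log_a = np.polyfit(np.log(d), np.log(area), 1)
    return np.exp(log_a), b


# Hypothetical measurements (cm and cm^2) generated from A = 1.8 * d**1.2.
d = np.array([0.4, 0.7, 1.0, 1.4, 1.9, 2.3])
area = 1.8 * d ** 1.2

a, b = fit_power(d, area)
print("power:", a, b)
print("linear:", fit_linear(d, area))
print("polynomial:", fit_polynomial(d, area))
```

With at most ten vowels per region, the polynomial form has three free parameters against two for the other forms, which is the reason given above for limiting its order to two.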
3.3. Evaluation
As the purpose of the transformation is the estimation of the unknown cross-sectional area from
a measured mid-sagittal distance, the evaluation
has to be made on measurements not used for
determining the parameters defining the transformation. Therefore, we performed the evaluation of
each measurement using a jackknife method. The
parameters of the transformation were estimated
from all but one measurement of a vocal tract
region of a speaker. The resulting transformation
was tested on the remaining measurement. This
procedure was repeated for each measurement of a
particular vocal tract region, for each vocal tract
region, each form of transformation and both
speakers.
This procedure has two main characteristics.
First, the transformations are evaluated relative
to their intended purpose, the estimation of the
unknown cross-sectional area from a measured
mid-sagittal distance, and not the best possible fit
to a set of measurements. Second, the stability of
the parameters of each form of transformation for
the different vocal tract regions can be studied to
give an insight on the generality or the over-specificity of the transformation.
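The jackknife procedure described above can be sketched as a leave-one-out loop. This is a minimal illustration under stated assumptions: the helper names (`jackknife_predictions`, `fit_power`, `predict_power`) are hypothetical, the power function is fitted by least squares in log-log coordinates (the estimation method is not given in the text), and the sample data are invented.

```python
import numpy as np


def jackknife_predictions(d, area, fit, predict):
    """Leave-one-out estimates of the area for one speaker and one region.

    For each measurement i, the parameters are estimated from all other
    measurements and the resulting transformation is applied to d[i].
    """
    d = np.asarray(d, dtype=float)
    area = np.asarray(area, dtype=float)
    predicted = np.empty_like(area)
    for i in range(len(d)):
        keep = np.arange(len(d)) != i  # every measurement except the i-th
        params = fit(d[keep], area[keep])
        predicted[i] = predict(d[i], params)
    return predicted


def fit_power(d, area):
    """Fit A = a * d**b in log-log coordinates (assumed method); returns (a, b)."""
    b, log_a = np.polyfit(np.log(d), np.log(area), 1)
    return np.exp(log_a), b


def predict_power(d, params):
    a, b = params
    return a * d ** b


# Hypothetical region data: sagittal distances (cm) and measured areas (cm^2).
d = np.array([0.4, 0.7, 1.0, 1.4, 1.9, 2.3])
area = np.array([0.62, 1.15, 1.84, 2.65, 3.95, 4.80])

predicted = jackknife_predictions(d, area, fit_power, predict_power)
print(predicted)
```

Passing `fit` and `predict` as arguments lets the same loop be reused for the linear and polynomial forms.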
4. Results
4.1. Comparison of the different transformations
In order to compare the performance of the
different transformations described above, we computed for both speakers and for each region of
the vocal tract the mean and the standard deviation of the relative errors. The relative error is
positive if the area is overestimated by the transformation and negative in the opposite case. The
results are presented in Tables 2 and 3 for the female and the male speaker, respectively.
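The sign convention and the summary statistics can be written down in a few lines. This is a sketch under assumptions: the error is taken relative to the measured area and expressed in percent, and the sample standard deviation (`ddof=1`) is used, although the text does not say which estimator was applied. The function name and the numbers are hypothetical.

```python
import numpy as np


def relative_error_stats(estimated, measured):
    """Mean and standard deviation (in percent) of the relative estimation errors.

    The error is positive when the transformation overestimates the area.
    """
    estimated = np.asarray(estimated, dtype=float)
    measured = np.asarray(measured, dtype=float)
    relative = 100.0 * (estimated - measured) / measured
    return relative.mean(), relative.std(ddof=1)  # ddof=1 is an assumption


# Hypothetical areas (cm^2): two overestimates and one underestimate of 10%.
mean_err, std_err = relative_error_stats([1.1, 1.8, 3.3], [1.0, 2.0, 3.0])
print(mean_err, std_err)
```

A small mean with a large standard deviation indicates errors of both signs that cancel on average, so the two figures need to be read together.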
The main observations are as follows:
• As expected, the speaker-specific transformations have, in general, a smaller mean relative
error and standard deviation. It is, however, interesting to note that for the alveolar region
of the female subject, the Sundberg and Fant transformations give a lower mean error and all four published transformations give a
lower standard deviation.
• In general, the power transformation gives
lower mean relative error than the linear and
the polynomial transformations. This tendency
is only contradicted in the oro-pharynx region
for the female subject, where the linear transformation is better and in the lip region for the
male subject, where the polynomial transformation is better.
• The standard deviations are comparable for the
polynomial and the power transformations, and
somewhat larger for the linear one.
• The speaker-specific transformations give low relative estimation errors in the regions between
the mid-pharynx and the hard palate. The other
regions are not modeled correctly.
• The four selected transformations overestimate
the area for the regions between the mid-pharynx and the hard palate (especially for the male
subject). Only Maeda’s transformation provides
a good estimate of the areas in these regions for
the female subject.
4.2. Speaker-specific transformations parameters
Tables 4 and 5 display the mean coefficients
derived from the MRI data for the female and the
male subjects, respectively. The standard deviations are also given.
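One way to obtain such means and standard deviations is to collect the parameters estimated in each leave-one-out fit and summarize their spread. The sketch below assumes that this is how the figures were produced, which the text does not state explicitly; it also assumes a log-log least-squares fit of the power function and the sample standard deviation. The function name and the data are hypothetical.

```python
import numpy as np


def leave_one_out_power_parameters(d, area):
    """Mean and standard deviation of (a, b) over all leave-one-out fits.

    Each fit estimates A = a * d**b from all measurements but one.
    """
    d = np.asarray(d, dtype=float)
    area = np.asarray(area, dtype=float)
    params = []
    for i in range(len(d)):
        keep = np.arange(len(d)) != i
        b, log_a = np.polyfit(np.log(d[keep]), np.log(area[keep]), 1)
        params.append((np.exp(log_a), b))
    params = np.array(params)
    return params.mean(axis=0), params.std(axis=0, ddof=1)


# Hypothetical region data: sagittal distances (cm) and measured areas (cm^2).
d = np.array([0.4, 0.7, 1.0, 1.4, 1.9, 2.3])
area = np.array([0.62, 1.15, 1.84, 2.65, 3.95, 4.80])

mean_ab, std_ab = leave_one_out_power_parameters(d, area)
print("mean (a, b):", mean_ab)
print("std  (a, b):", std_ab)
```

A small spread means that removing any single measurement hardly changes the fitted transformation, which is the sense in which the parameters are called stable.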
It can be seen that the classical power transformation provides stable parameters in every
vocal tract region for both the speakers. The second order polynomial transformation turns out to
be the most sensitive to details in the training set,
especially in the larynx, the alveolar and the labial
regions. This sensitivity is not a good property,
since it indicates that the transformation does not capture a general tendency but fits details peculiar to the training set.
Table 2
Mean and standard deviation in percentage of the relative estimation errors for the different transformations for each region of the vocal tract of the female speaker (standard deviations in brackets)
              Speaker-specific transformations                Transformations of different studies
Region        Linear          Polynomial      Power           Maeda           Sundberg        Fant            Perrier
Larynx        17.8 [57.3]     18.8 [69.7]     8.0 [40.1]      106.6 [83.1]    44.4 [51.5]     81.1 [52.6]     169.3 [129.3]
Low-pharynx   4.8 [29.2]      4.0 [29.1]      2.9 [28.8]      -1.5 [23.8]     -16.0 [24.7]    14.7 [25.3]     11.1 [53.8]
Mid-pharynx   -3.2 [20.2]     2.0 [12.7]      -0.0 [13.4]     3.4 [11.0]      7.9 [16.1]      23.7 [22.3]     23.5 [54.6]
Oro-pharynx   2.2 [38.6]      6.4 [27.4]      2.9 [38.5]      11.7 [37.4]     30.3 [39.1]     30.8 [51.2]     21.5 [66.9]
Velum         6.7 [34.9]      6.9 [33.9]      4.2 [33.7]      6.3 [34.4]      44.7 [45.5]     50.0 [48.6]     30.5 [73.8]
Hard palate   -17.5 [77.1]    3.7 [32.5]      3.2 [29.4]      -4.2 [26.2]     34.9 [36.6]     36.6 [36.5]     -0.3 [31.4]
Alveolar      24.8 [89.1]     8.4 [98.4]      16.3 [82.8]     -23.5 [36.0]    -3.5 [48.8]     -2.2 [48.1]     -47.8 [24.3]
Lips          9.8 [50.3]      8.5 [60.6]      6.4 [44.5]      19.5 [37.7]     -40.0 [30.4]    -40.0 [30.4]    -40.0 [30.4]
Table 3
Mean and standard deviation in percentage of the relative estimation errors for the different transformations for each region of the vocal tract of the male speaker (standard deviations in brackets)
              Speaker-specific transformations                Transformations of different studies
Region        Linear          Polynomial      Power           Maeda           Sundberg        Fant            Perrier
Larynx        9.2 [42.6]      3.7 [54.8]      5.7 [37.3]      3.3 [43.9]      -37.6 [28.4]    -63.7 [23.7]    72.1 [65.1]
Low-pharynx   3.8 [26.7]      5.6 [26.9]      2.7 [25.3]      -2.0 [23.5]     -18.3 [23.0]    18.3 [25.6]     21.7 [43.0]
Mid-pharynx   -5.0 [42.1]     2.5 [20.5]      1.5 [21.6]      33.2 [34.3]     45.1 [46.6]     50.2 [30.5]     42.9 [32.2]
Oro-pharynx   -1.1 [17.2]     0.3 [13.3]      0.1 [13.6]      65.0 [27.7]     71.1 [39.2]     99.1 [48.2]     91.2 [54.0]
Velum         1.5 [14.5]      1.7 [15.1]      0.8 [13.9]      48.5 [24.1]     98.0 [30.3]     109.7 [34.1]    107.4 [63.9]
Hard palate   -4.8 [26.5]     6.4 [31.2]      2.3 [21.2]      28.4 [22.4]     65.1 [35.3]     73.2 [33.8]     84.3 [51.1]
Alveolar      6.9 [38.8]      14.9 [51.3]     8.4 [40.6]      9.0 [37.6]      31.4 [43.2]     35.4 [45.0]     -21.4 [34.0]
Lips          6.1 [42.2]      2.1 [50.4]      5.5 [29.6]      5.5 [43.3]      -60.9 [9.6]     -60.9 [9.6]     -60.9 [9.6]
Table 4
Mean and standard deviation in percentage of the parameter values for the three transformation forms for each region of the vocal tract of the female speaker (standard deviations in brackets)
              Linear                          Polynomial                                      Power
Region        l0              l1              p0              p1              p2              a               b
Larynx        -0.06 [0.18]    1.01 [0.33]     0.83 [1.25]     -1.46 [3.87]    1.92 [2.36]     0.78 [0.08]     0.70 [0.20]
Low-pharynx   -0.24 [0.03]    2.19 [0.02]     -0.49 [0.08]    2.61 [0.11]     -0.15 [0.04]    1.86 [0.02]     1.22 [0.03]
Mid-pharynx   -0.74 [0.10]    2.47 [0.07]     0.58 [0.14]     0.26 [0.26]     0.78 [0.11]     1.74 [0.03]     1.23 [0.03]
Oro-pharynx   -0.02 [0.20]    1.89 [0.11]     1.52 [0.20]     -0.49 [0.31]    0.73 [0.11]     1.99 [0.11]     0.81 [0.08]
Velum         -0.10 [0.07]    1.95 [0.05]     0.16 [0.11]     1.50 [0.20]     0.16 [0.07]     1.84 [0.03]     0.93 [0.03]
Hard palate   -1.00 [0.04]    3.09 [0.05]     -0.18 [0.07]    1.39 [0.19]     0.64 [0.09]     1.82 [0.02]     1.43 [0.01]
Alveolar      -0.29 [0.30]    3.08 [0.31]     -3.34 [1.65]    10.11 [3.23]    -3.71 [1.53]    2.67 [0.11]     1.48 [0.22]
Lips          -0.88 [0.17]    3.41 [0.32]     -1.21 [1.00]    4.48 [3.44]     -0.77 [2.59]    2.42 [0.18]     1.67 [0.10]
Table 5
Mean and standard deviation in percentage of the parameter values for the three transformation forms for each region of the vocal tract of the male speaker (standard deviations in brackets)
              Linear                          Polynomial                                      Power
Region        l0              l1              p0              p1              p2              a               b
Larynx        -3.48 [0.24]    4.44 [0.14]     -4.42 [1.77]    5.64 [2.34]     -0.37 [0.68]    1.11 [0.05]     2.35 [0.07]
Low-pharynx   -1.02 [0.12]    2.87 [0.07]     0.27 [0.33]     0.86 [0.48]     0.70 [0.16]     1.79 [0.05]     1.38 [0.05]
Mid-pharynx   -1.30 [0.05]    2.75 [0.05]     0.20 [0.10]     0.01 [0.20]     1.06 [0.08]     1.34 [0.01]     1.62 [0.02]
Oro-pharynx   -2.87 [0.25]    2.74 [0.10]     1.35 [0.72]     -1.00 [0.60]    0.79 [0.12]     0.73 [0.05]     1.81 [0.07]
Velum         -0.22 [0.07]    1.60 [0.05]     -0.04 [0.16]    1.36 [0.25]     0.07 [0.09]     1.39 [0.01]     1.08 [0.02]
Hard palate   -1.40 [0.11]    2.71 [0.07]     0.36 [0.15]     0.33 [0.23]     0.68 [0.08]     1.34 [0.03]     1.51 [0.03]
Alveolar      -0.46 [0.14]    2.47 [0.13]     0.58 [0.16]     1.05 [0.24]     0.36 [0.09]     1.92 [0.04]     1.20 [0.15]
Lips          -1.63 [0.09]    5.35 [0.19]     -1.89 [0.58]    6.27 [2.06]     -0.75 [1.55]    4.72 [0.28]     2.48 [0.07]
The stability of the parameters of the linear
transformation is better than that of the polynomial transformation and comparable to that of the
power transformation.
It can be observed that, for the power transformation, the low- and the mid-pharyngeal regions turn out to have similar parameters and
could be merged. The same holds for the oro-pharyngeal and the velar regions. In contrast,
the polynomial transformation does not allow
such grouping. Again, this shows the higher sensitivity of the polynomial transformation to details
present in the data.
5. Discussion
5.1. MRI data
The sequence we used to obtain the MRI scans
yielded 14 different images during a single production of each vowel. However some problems
remain. Firstly, there is a limitation on the number
of images for a given acquisition time. This does
not allow the number of images to be increased so
as to obtain a continuous area function during a
single vowel phonation.
Secondly, it was not possible to adjust the position of each individual cut to be perfectly orthogonal to the midline of the vocal tract. This
drawback has to be considered in relation to the
fact that both the area and the corresponding
sagittal distance are measured in a plane that is
perhaps misaligned. The imaging plane was, however, orthogonal to the sagittal plane. Thus, both
the sagittal distance and the area should be scaled
by the same factor, which depends on the cosine of the misalignment
angle. Concerning the transformations, the amplitude of the error caused by the misalignment
will depend on the non-linearity of the relationship.
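A short worked sketch makes this dependence concrete for the power function of Eq. (1). It assumes that the misalignment multiplies both the measured sagittal distance and the measured area by one common factor k; the symbol k and the primed quantities (values measured in the misaligned plane) are introduced here for illustration only.

```latex
d' = k\,d, \qquad A' = k\,A = k\,a\,d^{b} = a\,k^{\,1-b}\,(d')^{b}
```

Under this assumption the exponent b is unaffected and the coefficient becomes a' = a k^(1-b). For b = 1 the misalignment has no effect, and the effect grows with |b - 1|. As a hypothetical numerical case, k = 1.015 and b = 1.5 give a' ≈ 0.993 a, a change of less than 1%.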
Thirdly, the experimenter should be aware that
information is degraded or lost at the intersection
between two different cuts. It follows that intersections in the regions of interest in the vocal
tract are to be avoided. This phenomenon is illustrated in Fig. 5, where darkened lines represent the intersections between the coronal cuts and the coronal oblique cut in the anterior part of the face.
Fig. 5. Intersections of coronal cuts (scans 10–14) with the coronal oblique cut (scan 9), shown by the darkened lines (vowel [i] pronounced by the male subject).
5.2. Transformation evaluation
The problem of the estimation of the area
function from the sagittal cut plays an important
role in many studies and models of speech production mechanisms. Indeed, most vocal tract
models appear to model the geometry of the sagittal cut (Mermelstein, 1973; Maeda, 1978). Modeling of the sagittal cut is convenient because
it allows the overall position of the different articulators involved in speech production to be
captured in one single two-dimensional representation.
Up to the late eighties, the scarce data available consisted of sagittal images obtained by X-rays (see, for example, Chiba and Kajiyama, 1941; Fant, 1960; Bothorel et al., 1986). Therefore, most representations of the vocal tract were made in the sagittal plane. Sagittal images obtained with X-rays are produced by projecting the vocal tract onto a sagittal plane. The contour that can then be traced on the resulting image does not strictly correspond to the sagittal cut of the speaker. For example, Stone (1990) and Demolin et al. (1996) have shown the importance of the depression in the tongue profile (see Fig. 5). This depression has to be taken into account if one estimates the area function from a sagittal cut measured on an X-ray image.
Fig. 6. Computation of the formant frequencies from a measured or model-based sagittal cut.
It is well known that when one deals with sagittal cuts – either measured or obtained with a
vocal tract model – and wishes to infer the
acoustical properties of the vocal tract thereby
described, the normal procedure is to rely on a
sagittal cut to area transformation to obtain the
area function, and then to compute the acoustic
result from this area function (see Fig. 6).
Therefore, depending on the transformation,
the acoustic result can be noticeably different.
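The last step of this chain, from area function to formant frequencies, can be illustrated with a deliberately simplified acoustic model. The sketch below is not the computation used in the study: it assumes a lossless chain of cylindrical sections, a closed glottis, an ideal open end at the lips (zero radiation impedance) and a sound speed of 350 m/s. The function name and the uniform-tube example are hypothetical.

```python
import numpy as np


def formants_from_area_function(areas_cm2, lengths_cm, c_cm_per_s=35000.0,
                                f_max_hz=4000.0, df_hz=1.0):
    """Resonance frequencies (Hz) of a lossless concatenated-tube model.

    Sections are listed from glottis to lips. The chain matrix of the tract
    has the form [[m11, j*m12], [j*m21, m22]] with real m-values (rho*c is
    normalized to 1, which does not move the resonances). With zero pressure
    at the lips, U_glottis = m22 * U_lips, so resonances are the zeros of m22.
    """
    freqs = np.arange(df_hz, f_max_hz + df_hz, df_hz)
    k = 2.0 * np.pi * freqs / c_cm_per_s  # wavenumber for each frequency
    m11 = np.ones_like(freqs)
    m12 = np.zeros_like(freqs)
    m21 = np.zeros_like(freqs)
    m22 = np.ones_like(freqs)
    for area, length in zip(areas_cm2, lengths_cm):
        cos_kl = np.cos(k * length)
        sin_kl = np.sin(k * length)
        # Right-multiply by the section matrix
        # [[cos, j*sin/area], [j*area*sin, cos]].
        m11, m12, m21, m22 = (m11 * cos_kl - m12 * area * sin_kl,
                              m11 * sin_kl / area + m12 * cos_kl,
                              m21 * cos_kl + m22 * area * sin_kl,
                              m22 * cos_kl - m21 * sin_kl / area)
    # Locate sign changes of m22 and refine them by linear interpolation.
    crossing = (m22[:-1] == 0.0) | (m22[:-1] * m22[1:] < 0.0)
    i = np.nonzero(crossing)[0]
    return freqs[i] - m22[i] * (freqs[i + 1] - freqs[i]) / (m22[i + 1] - m22[i])


# Hypothetical check: a uniform 17.5 cm tube split into 35 sections of 3 cm^2.
uniform = formants_from_area_function([3.0] * 35, [0.5] * 35)
print(uniform[:3])
```

For the uniform tube the model reduces to a quarter-wavelength resonator, so the resonances should fall near odd multiples of c/(4L), that is near 500, 1500 and 2500 Hz. Feeding the same sagittal distances through two different transformations and comparing the resulting resonances gives a direct measure of how much the choice of transformation matters acoustically.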
When one deals with acoustic-to-articulatory
inversion, the influence of the transformation is
even more obvious (Beautemps et al., 1995). Indeed, if the inversion is based on an acoustical
criterion, to obtain the measured acoustic cues, the
parameters of the vocal tract model will have to be
adjusted so as to obtain an area function that
produces similar acoustic cues. The sagittal cut will
then differ, depending on the transformation. The
interpretation of the parameters thus obtained has to
be done with care.
Advances in imaging techniques have allowed
improvements in the way that vocal tract geometry
is studied. It is well known that the human vocal
tract is a highly flexible structure (Fant, 1960), but
it has only recently been confirmed that cross-sections can (i) vary considerably along its length
(Stone, 1991; Demolin et al., 1996; Soquet et al.,
1996) and (ii) show a high degree of asymmetry
(Stone, 1991). Thus two-dimensional data are not
sufficient to fully understand the production of
speech sounds. Moreover, the stabilization of the
tongue against other articulators, such as the teeth
and hard palate, facilitates the production of certain tongue shapes which could otherwise seem
difficult to produce accurately, as, for example,
narrow constrictions leading to turbulent airflow
in fricatives and laterals (Stone, 1991; Narayanan
et al., 1995). It is thus obvious that the vocal tract
has to be considered as a three-dimensional structure in order to study both speech production
and the link between articulatory and acoustic
space.
6. Conclusion
The data provided by MRI proved to be of
considerable interest in this study. The availability
of information about the shape of the tract allowed us to compare the transformations on the
basis of a reliable reference. These transformations
were based on the study of a few subjects, with
imaging techniques less elaborate than the MRI
techniques used today. The study confirms that the
use of these transformations on subjects other than
the original ones is inappropriate and may lead to
errors (Sundberg et al., 1987; Fant, 1992; Perrier
et al., 1992).
Three forms of sagittal to area transformation
have been studied: linear, polynomial (of order 2)
and the classical power function. The evaluation
showed that the power transformation seems better
suited to the problem, providing parameters that
are stable with regard to the measurements used
for their estimation, with relatively small mean
errors. However, these transformations only capture general properties of the relationship.
Acknowledgements
This work has been partially supported by the "Fonds National de la Recherche Scientifique" (Didier Demolin – Crédits Chercheurs No. 8.4519.95 and No. 1.5.194.97; Alain Soquet – Collaborateur Scientifique) and by the "Communauté Française de Belgique" in the framework of the ARC 98-02 No. 226.
References
Baer, T., Gore, J.C., Gracco, L.C., Nye, P.W., 1991. Analysis of vocal tract shape and dimensions using magnetic
resonance imaging: vowels. J. Acoust. Soc. Am. 90 (2),
799–828.
Beautemps, D., Badin, P., Laboissière, R., 1995. Deriving
vocal-tract area functions from midsagittal profiles and
formant frequencies: A new model for vowels and fricative
consonants based on experimental data. Speech Communication 16, 27–47.
Bothorel, A., Simon, P., Wioland, F., Zerling, J.P., 1986.
Cinéradiographie des voyelles et consonnes du français.
Travaux de l’Institut de Phonétique de Strasbourg.
Chiba, T., Kajiyama, M., 1941. The Vowel, its Nature and
Structure. Tokyo-Kaiseikan, Tokyo.
Demolin, D., Metens, T., Soquet, A., 1996. Three-dimensional
measurements of the vocal tract by MRI. In: Proceedings
of the International Conference on Spoken Language
Processing, Philadelphia, PA, pp. 272–275.
Fant, G., 1960. Acoustic Theory of Speech Production.
Mouton, The Hague.
Fant, G., 1992. Vocal tract area functions of Swedish vowels
and a new three-parameter model. In: Proceedings of the
International Conference on Spoken Language Processing,
Banff, pp. 807–810.
Fujimura, O., Kiritani, S., Ishida, H., 1973. Computer controlled
radiography for observation of movements of articulatory
and other human organs. Comput. Biol. Med. 3, 371–384.
Greenwood, A.R., Goodyear, C.C., Martin, P.A., 1992. Measurements of vocal tract shapes using magnetic resonance
imaging. IEE Proc. – I 139 (6), 553–560.
Heinz, J.M., Stevens, K.N., 1965. On the relations between
lateral cineradiographs, area functions, and acoustic spectra
of speech. In: Proc. Fifth Int. Congr. Acoust., Liège, Paper
A44.
Johanson, C., Sundberg, J., Wilbrand, H., Ytterbergh, C., 1983.
From sagittal distance to area: a study of transverse, cross-sectional area in the pharynx by means of computer
tomography. STL-QPSR 4, 39–49.
Ladefoged, P., Anthony, J.F.K., Riley, C., 1971. Direct
measurement of the vocal tract. UCLA Working Papers in
Phonetics, pp. 4–13.
Lakshminarayanan, A.V., Lee, S., McCutcheon, M.J., 1991.
MR imaging of the vocal tract during vowel production.
J. Mag. Res. Imag. 1 (1), 71–76.
Maeda, S., 1978. Une analyse statistique sur les positions de
la langue: étude préliminaire sur les voyelles françaises.
In: Actes des IXèmes Journées d’Étude sur la Parole,
pp. 191–199.
Maeda, S., 1990. Compensatory articulation during speech:
evidence from the analysis and synthesis of vocal-tract
shapes using an articulatory model. In: Hardcastle, W.J.,
Marchal, A. (Eds.), Speech Production and Speech Modeling. Kluwer Academic Publishers, pp. 131–149.
Mermelstein, P., 1973. Articulatory model for the study of
speech production. J. Acoust. Soc. Am. 53, 1070–1082.
Narayanan, S.S., Alwan, A.A., Haker, K., 1995. An articulatory study of fricative consonants using magnetic resonance imaging. J. Acoust. Soc. Am. 98, 1325–1347.
Perrier, P., Boë, L.J., Sock, R., 1992. Vocal tract area function
estimation from midsagittal dimensions with CT scans and a
vocal tract cast: modeling the transition with two sets of
coefficients. J. Speech Hear. Res. 35, 53–67.
Schönle, P., Gräbe, K., Wenig, P., Höhne, J., Schrader, J.,
Conrad, B., 1987. Electromagnetic articulography: use of
alternating magnetic fields for tracking movements of
multiple points inside and outside the vocal tract. Brain
Lang. 31, 26–35.
Soquet, A., Lecuit, V., Metens, T., Demolin, D., 1996.
From sagittal cut to area function: an MRI investigation. In: Proceedings of the International Conference on
Spoken Language Processing, Philadelphia, PA, pp. 1205–
1208.
Stone, M., 1990. A three-dimensional model of tongue movement based on ultrasound and X-ray microbeam data.
J. Acoust. Soc. Am. 87 (5), 2207–2217.
Stone, M., 1991. Toward a model of three-dimensional tongue
movement. J. Phonetics 19, 309–320.
Sundberg, J., 1969. On the problem of obtaining area functions
from lateral X-ray pictures of the vocal tract. STL QPSR.
Stockholm, pp. 43–45.
Sundberg, J., Johansson, C., Wilbrand, H., Ytterbergh, C.,
1987. From sagittal distance to area. A study of transverse,
vocal tract cross-sectional area. Phonetica 44, 76–90.