Speech Communication 36 (2002) 169–180
www.elsevier.com/locate/specom

Mid-sagittal cut to area function transformations: Direct measurements of mid-sagittal distance and area with MRI

A. Soquet a,*, V. Lecuit b, T. Metens c, D. Demolin a

a Laboratoire de Phonologie, Université Libre de Bruxelles, 50 Av. F.D. Roosevelt, CP 175, 1050 Brussels, Belgium
b Laboratoire de Phonétique Expérimentale, Institut des Langues Vivantes et de Phonétique, Université Libre de Bruxelles, Brussels, Belgium
c Unité de Résonance Magnétique, Hôpital Erasme, Université Libre de Bruxelles, Brussels, Belgium

Received 25 June 1999; received in revised form 3 June 2000; accepted 15 September 2000

Abstract

This paper presents a comparative study of transformations used to compute the cross-sectional area of the vocal tract from mid-sagittal measurements. MRI techniques have been used to obtain both mid-sagittal distances and cross-sections of the vocal tract for French oral vowels uttered by two subjects. The measured cross-sectional areas can thus be compared to the cross-sectional areas computed by the different transformations. The evaluation is performed with a jackknife method in which the parameters of the transformation are estimated from all but one measurement of a speaker’s vocal tract region and evaluated on the remaining measurement. This procedure allows the study of both the performance of the different forms of transformation as a function of the vocal tract region and the stability of the transformation parameters for a given vocal tract region. Three different forms of transformation are compared: linear, polynomial and power function. The estimation performances are also compared with four existing transformations. © 2002 Elsevier Science B.V. All rights reserved.

Résumé

This article presents a comparative study of different transformations used to compute the cross-sectional area of the vocal tract from the sagittal distance. The sagittal distances and the cross-sectional areas of the vocal tract were measured on magnetic resonance slices for the French oral vowels pronounced by two speakers. The measured area can thus be compared with the areas computed by the different transformations. The evaluation is carried out with a jackknife technique: the parameters of the transformation are estimated for one region of the vocal tract from all of the data but one measurement, which is then used to evaluate the transformation. This procedure makes it possible to study both the performance of the transformations and the stability of their parameters for each region of the vocal tract. Three different forms of transformation were compared: linear, polynomial and power function. The performance of four existing transformations is also presented.

Keywords: Mid-sagittal profile; Area function; Articulatory data

* Corresponding author. Tel.: +32-2-650-20-18; fax: +32-2-650-20-07. E-mail address: [email protected] (A. Soquet).
0167-6393/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0167-6393(00)00084-4

1. Introduction

Until recently, most articulatory data consisted of sagittal information, either in the form of sagittal X-ray projection images (Chiba and Kajiyama, 1941; Fant, 1960) or as movements of structures in the oral cavity obtained by point-tracking methods such as the X-ray microbeam (Fujimura et al., 1973) or magnetometers (Schönle et al., 1987).
The availability of this sagittal information had two major consequences for speech research: (i) the development of so-called articulatory models describing the possible geometry of the sagittal cuts of the vocal tract, often based on sagittal X-ray projections (Mermelstein, 1973; Maeda, 1978); (ii) the search for a transformation relating the sagittal distance to the cross-sectional area, derived, for example, from plaster casts of the oral cavity (Ladefoged et al., 1971; Sundberg et al., 1987), from measurements on cadavers (Heinz and Stevens, 1965) or from X-ray computed tomography (CT) (Johanson et al., 1983; Sundberg et al., 1987).

The generation of area functions from measurements of the sagittal section is an important step in the study of the relationship between vocal tract geometry and speech acoustics. Many authors have proposed transformations aimed at performing this particular task. Imaging techniques seem to be the solution of choice for studying these transformations: if the imaging plane is adequately placed, both the mid-sagittal distance and the cross-sectional area can be measured on live speaking subjects. Sundberg et al. (1987), for example, used axial CT to study the pharynges of two subjects.

For this study, we used an MRI sequence that permits 14 scans of 4 mm thickness to be taken simultaneously. Important characteristics of this sequence are that (i) scans can be placed all along the vocal tract, and (ii) a static vocal tract configuration can be studied during a single sustained phonation, without the repeated phonations reported in numerous studies (see, for example, Baer et al., 1991; Lakshminarayanan et al., 1991; Greenwood et al., 1992). On these scans, both the mid-sagittal distance and the cross-sectional area can be measured. Hence, it is possible to study the transformations between the mid-sagittal distance and the actual cross-sectional area.
In this paper, we investigated different forms of transformation and compared their performance with that of four published transformations. The published transformations were not included in the hope that they might be adequate for all subjects: these transformations are known to be speaker-specific. The objective here is to evaluate the performance of unadapted versus speaker-specific transformations.

Most transformations going from the mid-sagittal distance to the cross-sectional area are based on the original transformation defined by Heinz and Stevens (1965), which is

$A(x) = a\,d(x)^{b}$,    (1)

where d is the mid-sagittal distance, A the cross-sectional area, x the position along the vocal tract mid-line, and a and b are the two parameters of the transformation. For the pharynx, Johanson et al. (1983) proposed a linear relationship between the square of the mid-sagittal distance and the cross-sectional area,

$A(x) = p_0 + p_2\,d(x)^{2}$.    (2)

In general, the authors adapt the value of the transformation parameters to the speaker and to the position along the vocal tract mid-line. Some extend this dependence to the mid-sagittal distance itself (Fant, 1992; Perrier et al., 1992; Beautemps et al., 1995). Thus, the transformations rely upon an accurate distinction of the different regions of the vocal tract. Hence, we divided the vocal tract into eight regions: larynx, low-pharynx, mid-pharynx, oro-pharynx, velum, hard palate, alveolar region, and labial region. The limit between the mid- and oro-pharynx is defined as half the distance from the top of the epiglottis to the velum (Perrier et al., 1992). The other boundaries are placed according to the corresponding articulators. Fig. 1 shows these different regions on a mid-sagittal contour of the vocal tract. This division into eight regions allows accurate implementation of most of the published transformations.

Fig. 1. Representation of the different regions of the vocal tract on a mid-sagittal profile for the vowel [u] uttered by the female subject.

2. Material

2.1. MRI data

The magnetic resonance images were acquired at the Magnetic Resonance Unit of the Hôpital Erasme, Université Libre de Bruxelles, on a 1.5 T MRI system with a quadrature Head–Neck coil (Philips Gyroscan NT ACS, Best, The Netherlands). We used a sequence allowing simultaneous multi-stack acquisition of up to 14 slices of 4 mm thickness in less than 14 s. Slices can be grouped into different stacks, and each stack can have a different orientation.

MR images were acquired for one female speaker (subject 1) and one male speaker (subject 2), both native French speakers living in Brussels. The task of the subjects was to sustain a vowel during the acquisition sequence. The reference was a word containing the vowel to be pronounced. This reference word was given orally by one of the experimenters a few seconds before the recording session. For both speakers, data were collected for the 10 French oral vowels [i, e, ɛ, a, o, ɔ, u, y, ø, œ].

For each vowel, the 14 MRI scans were distributed along the vocal tract in three stacks. The first stack was in the transverse plane, contained 6 slices and covered the larynx to the mid-pharynx. The second stack was in a coronal–oblique plane, contained 3 slices and covered the oro-pharynx and the velum. The third stack was in the coronal plane, contained 5 slices and covered the hard palate to the labial region. For each vowel, the three stacks were placed orthogonal to the mid-line of the vocal tract estimated on a mid-sagittal scan of the vowel pronounced by the subject. Figs. 2 and 3 show the position of the 14 scans on the mid-sagittal profile and the resulting 14 scans, respectively, for the vowel [u] uttered by the female subject and for the vowel [i] uttered by the male subject. On these scans, it is possible to measure both the mid-sagittal distance and the cross-sectional area.

Fig. 2. Position of the 14 scans on the mid-sagittal profile for the vowel [u] uttered by the female subject along with the corresponding scans.

Fig. 3. Position of the 14 scans on the mid-sagittal profile for the vowel [i] uttered by the male subject along with the corresponding scans.

2.2. Measurement of mid-sagittal distance and cross-sectional area

Until now, there has been no automatic and reliable method to determine the mid-sagittal distance and the cross-sectional area of a section of the vocal tract on an MRI scan. Measurements were carried out following a procedure first devised for the treatment of mid-sagittal profiles of the vocal tract (Soquet et al., 1996). Outlines of the sections are traced by hand on a transparent sheet. By means of a digitization tablet, the outlines are entered into the computer and each area is computed by a polygon surface computation algorithm.

The digitization process may be biased by human factors, so the reproducibility of the measurement of the outlined area was tested. Areas were computed for series of 10 repetitions of the same measurement on three different outlines corresponding to three reference sections of known areas: large, medium and small. The results are displayed in Table 1, with the mean and standard deviation for each section. The standard deviation is similar in the three cases and is lower than 0.005 cm². This measurement reproducibility can be considered satisfactory for our purposes.

Table 1
Reproducibility of the digitization process: areas (cm²)

Repetition            Section 1   Section 2   Section 3
1                     2.177       0.507       0.064
2                     2.178       0.502       0.070
3                     2.176       0.508       0.073
4                     2.175       0.508       0.072
5                     2.177       0.511       0.070
6                     2.174       0.503       0.066
7                     2.183       0.507       0.073
8                     2.188       0.507       0.067
9                     2.176       0.503       0.068
10                    2.181       0.509       0.069
Mean                  2.179       0.506       0.069
Standard deviation    0.0044      0.0030      0.0030

Similarly, the sagittal distance was measured on the outlines of the sections. An example of a measurement superimposed on the corresponding MRI slice is given in Fig. 4. In the lower pharynx region, the area and the corresponding sagittal distance were limited by the epiglottis. The contour of the teeth was approximated when necessary during the outlining of the section, using data on tooth size and location obtained from plaster casts and visual estimates.

Fig. 4. Measurement of the sagittal distance and the cross-sectional area on an MRI scan.

3. Method

3.1. Selection of sagittal to area transformations

Among the many transformations proposed in the literature, we chose four: the transformations designed by Maeda (1978, 1990), Sundberg (Sundberg, 1969; Sundberg et al., 1987), Fant (1992) and Perrier et al. (1992). This choice was motivated by the variety of techniques used by the authors to define their transformation. Maeda’s model is based on the study of 1000 sagittal profiles corresponding to 10 sentences uttered by one French female speaker. Sundberg’s model is based on the study of X-ray tomographic data from one male and one female Swedish speaker and plaster casts from three male and three female Swedish speakers; we used the data from the subject Male 2 (see Sundberg et al., 1987). Fant’s model is based on the study of X-ray lateral views supported by limited X-ray tomographic data from a Swedish male subject. Perrier’s model is based on the study of a vocal tract cast for large sagittal dimensions and on CT scans of the vocal tract constriction regions of one male speaker for the three cardinal vowels [i, a, u] of French.

For the labial region, among the four studied transformations, only Maeda (1978, 1990) and Fant (1992) provide a transformation to convert the lip height to the lip area.

3.2. Speaker-specific transformation

We have investigated three possible forms of the transformation. The first was a linear relationship,

$A(x) = l_0 + l_1\,d(x)$,    (3)

where l0 and l1 are the parameters. The second was a polynomial transformation,

$A(x) = p_0 + p_1\,d(x) + p_2\,d(x)^{2}$,    (4)

where p0, p1 and p2 are the parameters. The order of the polynomial was limited to two in order to have enough data to estimate the transformation parameters reliably. The third transformation was the classical power function, as in Eq. (1), where the parameters are a and b. For each transformation, the parameter values depend on the speaker and on the region in the vocal tract.

3.3. Evaluation

As the purpose of the transformation is the estimation of the unknown cross-sectional area from a measured mid-sagittal distance, the evaluation has to be made on measurements not used for determining the parameters of the transformation. Therefore, we evaluated each measurement using a jackknife method. The parameters of the transformation were estimated from all but one measurement of a vocal tract region of a speaker. The resulting transformation was tested on the remaining measurement. This procedure was repeated for each measurement of a particular vocal tract region, for each vocal tract region, each form of transformation and both speakers.

This procedure has two main characteristics. First, the transformations are evaluated relative to their intended purpose, the estimation of the unknown cross-sectional area from a measured mid-sagittal distance, and not on the best possible fit to a set of measurements.
Second, the stability of the parameters of each form of transformation for the different vocal tract regions can be studied to give an insight into the generality or the over-specificity of the transformation.

4. Results

4.1. Comparison of the different transformations

In order to compare the performance of the different transformations described above, we computed, for both speakers and for each region of the vocal tract, the mean and the standard deviation of the relative errors. The relative error is positive if the area is overestimated by the transformation and negative in the opposite case. The results are presented in Tables 2 and 3 for the female and the male speaker, respectively. The main observations are as follows:

• As expected, the speaker-specific transformations have, in general, a smaller mean relative error and standard deviation. It is however interesting to notice that for the alveolar region of the female subject, the Sundberg and Fant transformations give a lower mean and all four published transformations give a lower standard deviation.
• In general, the power transformation gives a lower mean relative error than the linear and the polynomial transformations. This tendency is only contradicted in the oro-pharynx region for the female subject, where the linear transformation is better, and in the lip region for the male subject, where the polynomial transformation is better.
• The standard deviations are comparable for the polynomial and the power transformations, and somewhat larger for the linear one.
• The speaker-specific transformations give a low relative estimation error in the regions between the mid-pharynx and the hard palate. The other regions are not modeled correctly.
• The four selected transformations overestimate the area for the regions between the mid-pharynx and the hard palate (especially for the male subject). Only Maeda’s transformation provides a good estimate of the areas in these regions for the female subject.

Table 2
Mean and standard deviation (in brackets), in percentage, of the relative estimation errors for the different transformations for each region of the vocal tract of the female speaker. Linear, Polynomial and Power are the speaker-specific transformations; Maeda, Sundberg, Fant and Perrier are the transformations of different studies.

Region        Linear          Polynomial      Power           Maeda           Sundberg        Fant            Perrier
Larynx        17.8 [57.3]     18.8 [69.7]     8.0 [40.1]      106.6 [83.1]    44.4 [51.5]     81.1 [52.6]     169.3 [129.3]
Low-pharynx   4.8 [29.2]      4.0 [29.1]      2.9 [28.8]      −1.5 [23.8]     −16.0 [24.7]    14.7 [25.3]     11.1 [53.8]
Mid-pharynx   −3.2 [20.2]     2.0 [12.7]      −0.0 [13.4]     3.4 [11.0]      7.9 [16.1]      23.7 [22.3]     23.5 [54.6]
Oro-pharynx   2.2 [38.6]      6.4 [27.4]      2.9 [38.5]      11.7 [37.4]     30.3 [39.1]     30.8 [51.2]     21.5 [66.9]
Velum         6.7 [34.9]      6.9 [33.9]      4.2 [33.7]      6.3 [34.4]      44.7 [45.5]     50.0 [48.6]     30.5 [73.8]
Hard palate   −17.5 [77.1]    3.7 [32.5]      3.2 [29.4]      −4.2 [26.2]     34.9 [36.6]     36.6 [36.5]     −0.3 [31.4]
Alveolar      24.8 [89.1]     8.4 [98.4]      16.3 [82.8]     −23.5 [36.0]    −3.5 [48.8]     −2.2 [48.1]     −47.8 [24.3]
Lips          9.8 [50.3]      8.5 [60.6]      6.4 [44.5]      19.5 [37.7]     −40.0 [30.4]    −40.0 [30.4]    −40.0 [30.4]

Table 3
Mean and standard deviation (in brackets), in percentage, of the relative estimation errors for the different transformations for each region of the vocal tract of the male speaker. Linear, Polynomial and Power are the speaker-specific transformations; Maeda, Sundberg, Fant and Perrier are the transformations of different studies.

Region        Linear          Polynomial      Power           Maeda           Sundberg        Fant            Perrier
Larynx        9.2 [42.6]      3.7 [54.8]      5.7 [37.3]      3.3 [43.9]      −37.6 [28.4]    −63.7 [23.7]    72.1 [65.1]
Low-pharynx   3.8 [26.7]      5.6 [26.9]      2.7 [25.3]      −2.0 [23.5]     −18.3 [23.0]    18.3 [25.6]     21.7 [43.0]
Mid-pharynx   −5.0 [42.1]     2.5 [20.5]      1.5 [21.6]      33.2 [34.3]     45.1 [46.6]     50.2 [30.5]     42.9 [32.2]
Oro-pharynx   −1.1 [17.2]     0.3 [13.3]      0.1 [13.6]      65.0 [27.7]     71.1 [39.2]     99.1 [48.2]     91.2 [54.0]
Velum         1.5 [14.5]      1.7 [15.1]      0.8 [13.9]      48.5 [24.1]     98.0 [30.3]     109.7 [34.1]    107.4 [63.9]
Hard palate   −4.8 [26.5]     6.4 [31.2]      2.3 [21.2]      28.4 [22.4]     65.1 [35.3]     73.2 [33.8]     84.3 [51.1]
Alveolar      6.9 [38.8]      14.9 [51.3]     8.4 [40.6]      9.0 [37.6]      31.4 [43.2]     35.4 [45.0]     −21.4 [34.0]
Lips          6.1 [42.2]      2.1 [50.4]      5.5 [29.6]      5.5 [43.3]      −60.9 [9.6]     −60.9 [9.6]     −60.9 [9.6]

4.2. Speaker-specific transformation parameters

Tables 4 and 5 display the mean coefficients derived from the MRI data for the female and the male subjects, respectively. The standard deviations are also given. It can be seen that the classical power transformation provides stable parameters in every vocal tract region for both speakers. The second-order polynomial transformation turns out to be the most sensitive to details in the training set, especially in the larynx, the alveolar and the labial regions. This sensitivity is not a good property, since it indicates that the transformation does not capture a general tendency but fits details peculiar to the training set. The stability of the parameters of the linear transformation is better than that of the polynomial transformation and comparable to that of the power transformation.

It can be observed that, for the power transformation, the low- and the mid-pharyngeal regions turn out to have similar parameters and could be merged. The same holds for the oro-pharyngeal and the velar regions. In contrast, the polynomial transformation does not allow such grouping. Again, this shows the higher sensitivity of the polynomial transformation to details present in the data.

Table 4
Mean and standard deviation (in brackets) in percentage of the parameter values for the three transformation forms for each region of the vocal tract of the female speaker. Linear: l0, l1; Polynomial: p0, p1, p2; Power: a, b.

Region        l0              l1              p0              p1              p2              a               b
Larynx        −0.06 [0.18]    1.01 [0.33]     0.83 [1.25]     −1.46 [3.87]    1.92 [2.36]     0.78 [0.08]     0.70 [0.20]
Low-pharynx   −0.24 [0.03]    2.19 [0.02]     −0.49 [0.08]    2.61 [0.11]     −0.15 [0.04]    1.86 [0.02]     1.22 [0.03]
Mid-pharynx   −0.74 [0.10]    2.47 [0.07]     0.58 [0.14]     0.26 [0.26]     0.78 [0.11]     1.74 [0.03]     1.23 [0.03]
Oro-pharynx   −0.02 [0.20]    1.89 [0.11]     1.52 [0.20]     −0.49 [0.31]    0.73 [0.11]     1.99 [0.11]     0.81 [0.08]
Velum         −0.10 [0.07]    1.95 [0.05]     0.16 [0.11]     1.50 [0.20]     0.16 [0.07]     1.84 [0.03]     0.93 [0.03]
Hard palate   −1.00 [0.04]    3.09 [0.05]     −0.18 [0.07]    1.39 [0.19]     0.64 [0.09]     1.82 [0.02]     1.43 [0.01]
Alveolar      −0.29 [0.30]    3.08 [0.31]     −3.34 [1.65]    10.11 [3.23]    −3.71 [1.53]    2.67 [0.11]     1.48 [0.22]
Lips          −0.88 [0.17]    3.41 [0.32]     −1.21 [1.00]    4.48 [3.44]     −0.77 [2.59]    2.42 [0.18]     1.67 [0.10]

Table 5
Mean and standard deviation (in brackets) in percentage of the parameter values for the three transformation forms for each region of the vocal tract of the male speaker. Linear: l0, l1; Polynomial: p0, p1, p2; Power: a, b.

Region        l0              l1              p0              p1              p2              a               b
Larynx        −3.48 [0.24]    4.44 [0.14]     −4.42 [1.77]    5.64 [2.34]     −0.37 [0.68]    1.11 [0.05]     2.35 [0.07]
Low-pharynx   −1.02 [0.12]    2.87 [0.07]     0.27 [0.33]     0.86 [0.48]     0.70 [0.16]     1.79 [0.05]     1.38 [0.05]
Mid-pharynx   −1.30 [0.05]    2.75 [0.05]     0.20 [0.10]     0.01 [0.20]     1.06 [0.08]     1.34 [0.01]     1.62 [0.02]
Oro-pharynx   −2.87 [0.25]    2.74 [0.10]     1.35 [0.72]     −1.00 [0.60]    0.79 [0.12]     0.73 [0.05]     1.81 [0.07]
Velum         −0.22 [0.07]    1.60 [0.05]     −0.04 [0.16]    1.36 [0.25]     0.07 [0.09]     1.39 [0.01]     1.08 [0.02]
Hard palate   −1.40 [0.11]    2.71 [0.07]     0.36 [0.15]     0.33 [0.23]     0.68 [0.08]     1.34 [0.03]     1.51 [0.03]
Alveolar      −0.46 [0.14]    2.47 [0.13]     0.58 [0.16]     1.05 [0.24]     0.36 [0.09]     1.92 [0.04]     1.20 [0.15]
Lips          −1.63 [0.09]    5.35 [0.19]     −1.89 [0.58]    6.27 [2.06]     −0.75 [1.55]    4.72 [0.28]     2.48 [0.07]

5. Discussion

5.1. MRI data

The sequence we used to obtain the MRI scans yielded 14 different images during a single production of each vowel. However, some problems remain. Firstly, there is a limitation on the number of images for a given acquisition time. This does not allow the number of images to be increased so as to obtain a continuous area function during a single vowel phonation. Secondly, it was not possible to adjust the position of each individual cut to be perfectly orthogonal to the midline of the vocal tract.
This drawback has to be considered in relation to the fact that both the area and the corresponding sagittal distance are measured in a plane that is perhaps misaligned. The imaging plane was, however, orthogonal to the sagittal plane. Thus, both the sagittal distance and the area should be scaled by the same factor, which depends on the cosine of the misalignment angle. As far as the transformations are concerned, the amplitude of the error caused by the misalignment will depend on the non-linearity of the relationship.

Thirdly, the experimenter should be aware that information is degraded or lost at the intersection between two different cuts. It follows that intersections in the regions of interest in the vocal tract are to be avoided. This phenomenon is illustrated in Fig. 5, where darkened lines represent the intersections between the coronal cuts and the coronal oblique cut in the anterior part of the face.

Fig. 5. Intersections of coronal cuts (scans 10–14) with coronal oblique cut (scan 9) shown by the darkened lines (vowel [i] pronounced by the male subject).

5.2. Transformation evaluation

The problem of the estimation of the area function from the sagittal cut plays an important role in many studies and models of speech production mechanisms. Indeed, most vocal tract models appear to model the geometry of the sagittal cut (Mermelstein, 1973; Maeda, 1978). Modeling of the sagittal cut is convenient because it allows the overall position of the different articulators involved in speech production to be captured in one single two-dimensional representation.

Up to the late eighties, the scarce data available consisted of sagittal images obtained by X-rays (see, for example, Chiba and Kajiyama, 1941; Fant, 1960; Bothorel et al., 1986). Therefore, most representations of the vocal tract were made in the sagittal plane.

Fig. 6. Computation of the formant frequencies from a measured or model-based sagittal cut.
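The misalignment argument of Section 5.1 can be made concrete with a small sketch. It assumes (this is not stated in the text) that the tract is locally a straight tube with parallel walls, so that a plane tilted by an angle θ away from the orthogonal cut, while staying orthogonal to the sagittal plane, inflates both the measured distance and the measured area by 1/cos θ. Under that assumption a measured pair still follows the power function of Eq. (1), but with the coefficient a multiplied by cos(θ)^(b−1): the bias vanishes for a linear relationship through the origin (b = 1) and grows with the non-linearity. The function name, the exponents and the 15° tilt are hypothetical choices made for illustration.

```python
import math


def misalignment_area_ratio(b: float, theta_deg: float) -> float:
    """Ratio of the area measured in a tilted plane to the area that the
    untilted power function A = a * d**b predicts from the measured distance.

    Assumption (not from the paper): locally straight tube, so the tilted
    plane yields d_m = d / cos(theta) and A_m = A / cos(theta). Then
    A_m / (a * d_m**b) = cos(theta) ** (b - 1), independent of a and d.
    """
    theta = math.radians(theta_deg)
    return math.cos(theta) ** (b - 1.0)


if __name__ == "__main__":
    # Illustrative exponents and a hypothetical 15 degree misalignment.
    for b in (1.0, 1.5, 2.0):
        ratio = misalignment_area_ratio(b, 15.0)
        print(f"b = {b:.1f}: measured / predicted area = {ratio:.4f}")
```

With these assumptions, a 15° tilt changes the ratio by a few percent at most for exponents between 1 and 2, which is small compared with the standard deviations of the relative errors reported in Tables 2 and 3.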
Sagittal images obtained with X-rays are produced by projecting the vocal tract onto a sagittal plane. The contour that can then be traced on the resulting image does not strictly correspond to the sagittal cut of the speaker. For example, Stone (1990) and Demolin et al. (1996) have shown the importance of the depression in the tongue profile (see Fig. 5). This depression has to be taken into account if one estimates the area function from a sagittal cut measured on an X-ray image.

It is well known that when one deals with sagittal cuts, either measured or obtained with a vocal tract model, and wishes to infer the acoustical properties of the vocal tract thereby described, the normal procedure is to rely on a sagittal cut to area transformation to obtain the area function, and then to compute the acoustic result from this area function (see Fig. 6). Therefore, depending on the transformation, the acoustic result can be noticeably different.

When one deals with acoustic-to-articulatory inversion, the influence of the transformation is even more obvious (Beautemps et al., 1995). Indeed, if the inversion is based on an acoustical criterion, the parameters of the vocal tract model will have to be adjusted so as to obtain an area function that produces acoustic cues similar to the measured ones. The resulting sagittal cut will then differ, depending on the transformation. The parameters thus obtained have to be interpreted with care.

Advances in imaging techniques have allowed improvements in the way that vocal tract geometry is studied. It is well known that the human vocal tract is a highly flexible structure (Fant, 1960), but it has only recently been confirmed that cross-sections can (i) vary considerably along its length (Stone, 1991; Demolin et al., 1996; Soquet et al., 1996) and (ii) show a high degree of asymmetry (Stone, 1991). Thus two-dimensional data are not sufficient to fully understand the production of speech sounds.
Moreover, the stabilization of the tongue against other articulators, such as the teeth and hard palate, facilitates the production of certain tongue shapes which could otherwise seem difficult to produce accurately, for example the narrow constrictions leading to turbulent airflow in fricatives and laterals (Stone, 1991; Narayanan et al., 1995). It is thus obvious that the vocal tract has to be considered as a three-dimensional structure in order to study both speech production and the link between articulatory and acoustic space.

6. Conclusion

The data provided by MRI proved to be of considerable interest in this study. The availability of information about the shape of the tract allowed us to compare the transformations on the basis of a reliable reference. The published transformations were based on the study of a few subjects, with imaging techniques less elaborate than the MRI techniques used today. The study confirms that the use of these transformations on subjects other than the original ones is inappropriate and may lead to errors (Sundberg et al., 1987; Fant, 1992; Perrier et al., 1992).

Three forms of sagittal to area transformation have been studied: linear, polynomial (of order 2) and the classical power function. The evaluation showed that the power transformation seems better suited to the problem, providing parameters that are stable with regard to the measurements used for their estimation, with relatively small mean errors. However, these transformations only capture general properties of the relationship.

Acknowledgements

This work has been partially supported by the “Fonds National de la Recherche Scientifique” (Didier Demolin – Crédits Chercheurs No. 8.4519.95 and No. 1.5.194.97; Alain Soquet – Collaborateur Scientifique) and by the “Communauté Française de Belgique” in the framework of the ARC 98-02 No. 226.

References

Baer, T., Gore, J.C., Gracco, L.C., Nye, P.W., 1991. Analysis of vocal tract shape and dimensions using magnetic resonance imaging: vowels. J. Acoust. Soc. Am. 90 (2), 799–828.
Beautemps, D., Badin, P., Laboissière, R., 1995. Deriving vocal-tract area functions from midsagittal profiles and formant frequencies: A new model for vowels and fricative consonants based on experimental data. Speech Communication 16, 27–47.
Bothorel, A., Simon, P., Wioland, F., Zerling, J.P., 1986. Cinéradiographie des voyelles et consonnes du français. Travaux de l’Institut de Phonétique de Strasbourg.
Chiba, T., Kajiyama, M., 1941. The Vowel, its Nature and Structure. Tokyo-Kaiseikan, Tokyo.
Demolin, D., Metens, T., Soquet, A., 1996. Three-dimensional measurements of the vocal tract by MRI. In: Proceedings of the International Conference on Spoken Language Processing, Philadelphia, PA, pp. 272–275.
Fant, G., 1960. Acoustic Theory of Speech Production. Mouton, The Hague.
Fant, G., 1992. Vocal tract area functions of Swedish vowels and a new three-parameter model. In: Proceedings of the International Conference on Spoken Language Processing, Banff, pp. 807–810.
Fujimura, O., Kiritani, S., Ishida, H., 1973. Computer controlled radiography for observation of movements of articulatory and other human organs. Comput. Biol. Med. 3, 371–384.
Greenwood, A.R., Goodyear, C.C., Martin, P.A., 1992. Measurements of vocal tract shapes using magnetic resonance imaging. IEE Proc. – I 139 (6), 553–560.
Heinz, J.M., Stevens, K.N., 1965. On the relations between lateral cineradiographs, area functions, and acoustic spectra of speech. In: Proc. Fifth Int. Congr. Acoust., Liège, Paper A44.
Johanson, C., Sundberg, J., Wilbrand, H., Ytterbergh, C., 1983. From sagittal distance to area: a study of transverse, cross-sectional area in the pharynx by means of computer tomography. STL-QPSR 4, 39–49.
Ladefoged, P., Anthony, J.F.K., Riley, C., 1971. Direct measurement of the vocal tract. UCLA Working Papers in Phonetics, pp. 4–13.
Lakshminarayanan, A.V., Lee, S., McCutcheon, M.J., 1991. MR imaging of the vocal tract during vowel production. J. Mag. Res. Imag. 1 (1), 71–76.
Maeda, S., 1978. Une analyse statistique sur les positions de la langue: étude préliminaire sur les voyelles françaises. In: Actes des IXèmes Journées d’Étude sur la Parole, pp. 191–199.
Maeda, S., 1990. Compensatory articulation during speech: evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model. In: Hardcastle, W.J., Marchal, A. (Eds.), Speech Production and Speech Modeling. Kluwer Academic Publishers, pp. 131–149.
Mermelstein, P., 1973. Articulatory model for the study of speech production. J. Acoust. Soc. Am. 53, 1070–1082.
Narayanan, S.S., Alwan, A.A., Haker, K., 1995. An articulatory study of fricative consonants using magnetic resonance imaging. J. Acoust. Soc. Am. 98, 1325–1347.
Perrier, P., Boë, L.J., Sock, R., 1992. Vocal tract area function estimation from midsagittal dimensions with CT scans and a vocal tract cast: modeling the transition with two sets of coefficients. J. Speech Hear. Res. 35, 53–67.
Schönle, P., Gräbe, K., Wenig, P., Höhne, J., Schrader, J., Conrad, B., 1987. Electromagnetic articulography: use of alternating magnetic fields for tracking movements of multiple points inside and outside the vocal tract. Brain Lang. 31, 26–35.
Soquet, A., Lecuit, V., Metens, T., Demolin, D., 1996. From sagittal cut to area function: an MRI investigation. In: Proceedings of the International Conference on Spoken Language Processing, Philadelphia, PA, pp. 1205–1208.
Stone, M., 1990. A three-dimensional model of tongue movement based on ultrasound and X-ray microbeam data. J. Acoust. Soc. Am. 87 (5), 2207–2217.
Stone, M., 1991. Toward a model of three-dimensional tongue movement. J. Phonetics 19, 309–320.
Sundberg, J., 1969. On the problem of obtaining area functions from lateral X-ray pictures of the vocal tract. STL-QPSR. Stockholm, pp. 43–45.
Sundberg, J., Johansson, C., Wilbrand, H., Ytterbergh, C., 1987. From sagittal distance to area. A study of transverse, vocal tract cross-sectional area. Phonetica 44, 76–90.