Regression Analysis on Levenshtein-Pointwise Mutual Information Segment Distance Across Languages and Acoustic Distance

Eliza Margaretha, Martijn Wieling, John Nerbonne
University of Groningen

Abstract

We compare phonetic segment distances induced by the Levenshtein algorithm with pointwise mutual information (PMI) weights among three languages, namely Dutch, German and Bulgarian. Our results show that Dutch Levenshtein-PMI segment distances correlate significantly with those of Bulgarian. While the Dutch-Bulgarian pair yields a rather low correlation, the Dutch-German pair yields a high one. Furthermore, we are interested in bridging the phonetic-linguistic and the distributional approaches. We examine how well vowel quality accounts for Levenshtein-PMI segment distances by presenting the correlation between Levenshtein-PMI distances and acoustic vowel distances derived from formant measurements.

1 Introduction

Phonetic segment distance measures how far one phonetic segment is from another, that is, how similar two segments are. It serves as fundamental information for a wide range of research, for instance speech recognition and spoken language processing. From transcribing spoken discourse (Geutner, et al., 1998) to dialectology (Nerbonne, et al., 1996), segment distances are important for predicting the different pronunciations that arise from speaker differences such as geographic location, gender, and the size and shape of the vocal tract. In addition to the classic Levenshtein algorithm, Wieling, et al. (2009) proposed a variant that uses PMI-generated weights, which performed best among the variants they compared. In this work, we compare the segment distances induced by this Levenshtein-PMI method across Dutch, German and Bulgarian.
We explore whether the segment distances of different languages correlate significantly with each other, and thus whether segment distances in one language can be used to predict those in other languages.

The Levenshtein-PMI method estimates segment distances automatically. Wieling, et al. (2009) report that the method is very closely related to the PairHMM method, which correlates highly with acoustic distances. In this work, we look for a direct relationship between the Levenshtein-PMI method and vowel quality. More generally, we want to understand the relationship between the distributional and the phonetic-linguistic approaches by determining how well Levenshtein-PMI distances can be estimated from acoustic distances.

A previous study by Ellison (1992) made a similar attempt to derive content from the distribution of word information rather than from acoustic data. Ellison proposed a methodology for constructing unsupervised, cipher-independent and language-independent machine learning systems for learning phonology. The induction model learns from surface information about words derived from a lexicon, and with this information it was able to perform a consonant-vowel classification.

This report is structured as follows. We briefly explain the Levenshtein-PMI method, formant measurement and the Mantel test in the next section. We describe our datasets and methodology in section 3 and report the results in section 4. Finally, we discuss and summarize our work in sections 5 and 6.

2 Literature

This section highlights previous work related to our research. First, we describe the Levenshtein-PMI method with which we obtain our segment distances. Second, we present the concept of vowel quality and the formant measurements used to obtain the acoustic distances. Last, we explain the Mantel test as a suitable method for establishing the significance of a correlation between distance matrices.

2.1 Levenshtein using Pointwise Mutual Information-generated segment distances

The Levenshtein algorithm (Levenshtein, 1965) is a well-known and widely applied distance measure. It can also be used to compute segment distances, based on how often segment $x$ is aligned with segment $y$. In the context of string alignment, an insertion is regarded as an alignment of a segment against a gap, a deletion is an alignment of a gap against a segment, and a substitution is an alignment of two segments (Wieling, et al., 2009).

The following is an example of a string alignment between two different pronunciations of milk in Dutch. The first string is /molke/, from a Frisian dialect, and the second is /melek/, from another dialect spoken in several regions of the Netherlands such as Limburg. In the example, we find a substitution (S) between the vowels /o/ and /e/, a deletion (D) where a gap is aligned with /e/, and an insertion (I) where /e/ is aligned with a gap. For computing segment distances, we are most interested in the substitution operation.

    m   o   l   -   k   e
    m   e   l   e   k   -
        S       D       I

PMI, proposed by Church & Hanks (1990), was originally used to measure word association norms. PMI compares the probability of observing two variables $x$ and $y$ together (the joint probability) with the probability of observing them independently (chance). Applied to the Levenshtein algorithm, PMI adjusts an alignment distance by giving it a weight that reflects the alignment frequency. It can therefore express whether an alignment is nearer or further than other similar alignments.

$$\mathrm{PMI}(x, y) = \log_2 \left( \frac{p(x, y)}{p(x)\, p(y)} \right)$$

From the distributional point of view, a segment distance can be regarded as how far the distribution of segment $x$ is from the distribution of segment $y$. Wieling, et al. (2009) defined the terms of the PMI between a segment pair $x$ and $y$ as follows for generating segment distances:

• $p(x, y)$ is the relative occurrence of the aligned segments $x$ and $y$ in the whole data set.
  Specifically, $p(x, y)$ is computed as the number of times $x$ and $y$ occur at the same position in two aligned strings $X$ and $Y$, divided by the total number of aligned segments.
• $p(x)$ or $p(y)$ is the relative occurrence of $x$ or $y$ respectively in the whole dataset, namely the number of occurrences of $x$ or $y$ divided by the total number of segment occurrences.

The PMI value increases with the number of co-occurrences of the segment pair. If segments $x$ and $y$ are likely to co-occur, $p(x, y)$ will be much larger than $p(x)\, p(y)$ and consequently $\mathrm{PMI}(x, y)$ will be much larger than 0. Conversely, negative PMI values indicate that the segments are not likely to co-occur. To place corresponding segments at a low distance, the PMI values are transformed into segment distances by subtracting the PMI value from 0 and adding the maximum PMI value.

Segment distances are trained with an iterative procedure in the following manner. First, string alignments are generated using a Levenshtein algorithm that does not allow vowels to be aligned with consonants. Second, the PMI value of every segment pair is calculated and transformed. Third, the Levenshtein algorithm is applied with these segment distances to create a new set of alignments. Steps 2 and 3 are repeated until convergence is reached, that is, until the difference between two consecutive iterations is very small (close to 0).

2.2 Vowel Quality and Formant measurement

Vowel quality is the property that makes one vowel sound different from another, for example /iː/ as in sheep from /ɪ/ as in ship (McArthur, 1998). The quality of a vowel is determined by the position of the parts of the vocal tract (the parts of the anatomy which produce vocal sounds) during pronunciation, i.e. the tongue, lips, and lower jaw, and by the resulting size and shape of the mouth and pharynx. The most common way to measure vowel quality from acoustic signals is formant measurement (Leinonen, 2010). Formants specify the positions at which energy is concentrated in the acoustic signal, i.e. the lowest resonance frequencies (Peterson & Barney, 1952).
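
The PMI step of section 2.1 (the second step of the iterative procedure) can be illustrated with a minimal Python sketch. It is not the L04 implementation: the function name `pmi_segment_distances`, the choice to treat (x, y) and (y, x) as the same pair, and the toy alignment counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_segment_distances(aligned_pairs):
    """Map each aligned segment pair to a PMI-based distance.

    aligned_pairs holds one (x, y) tuple per aligned position. Treating
    (x, y) and (y, x) as the same pair is an assumption of this sketch.
    """
    pair_counts = Counter(tuple(sorted(pair)) for pair in aligned_pairs)
    segment_counts = Counter(seg for pair in aligned_pairs for seg in pair)
    total_pairs = sum(pair_counts.values())
    total_segments = sum(segment_counts.values())

    pmi = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / total_pairs                      # relative pair frequency
        p_x = segment_counts[x] / total_segments        # relative segment frequency
        p_y = segment_counts[y] / total_segments
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))

    # Subtract each PMI value from 0 and add the maximum PMI value, so
    # the most strongly associated pair ends up at distance 0.
    max_pmi = max(pmi.values())
    return {pair: max_pmi - value for pair, value in pmi.items()}


# Hypothetical alignment counts, invented for illustration only.
toy_pairs = (
    [("a", "a")] * 6 + [("e", "e")] * 6 + [("o", "o")] * 4
    + [("a", "e")] * 3 + [("a", "o")] * 1
)
distances = pmi_segment_distances(toy_pairs)
print(distances[("a", "e")] < distances[("a", "o")])  # the more frequent pair is closer
```

In the full procedure these distances would be fed back into the Levenshtein alignment and recomputed until they stop changing; the sketch covers a single pass only.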
At a resonance frequency, the acoustic signal oscillates with a larger amplitude than at other frequencies. These vocal resonances characterize distinguishable vowel sounds. The first two formants are the most distinguishing features, and the third formant is useful when pronunciation is strongly affected by the position of the lips (Ladefoged, 2005). Figure 2.1 illustrates how vowels are distinguished by formant measurements. A formant appears as a darker band in a spectrogram. The figure shows two vowels with similar first formants whose second formants differ, the second formant of one being higher than that of the other. The third formant provides additional information for distinguishing them.

Figure 2.1 Illustration of Formant Measurements (Leinonen, 2010)

An acoustic distance between two vowels can be obtained by calculating the Euclidean distance between their formant values (Wieling, et al., 2007). To generalize the acoustic distance, non-linguistic, speaker-dependent differences in the acoustic signal, such as pitch, need to be normalized. A common way to do so is band-pass filtering with Bark filters or Mel filters. Because human pitch perception is not linear, the linear Hertz frequency is transformed to the non-linear, almost logarithmic, Bark or Mel scale. In addition to the Bark and Mel scales, a z-score transformation was suggested by Lobanov (1971) with the intention of normalizing per speaker. The z-score transformation thus helps to level out the voice differences between men and women. While the Bark and Mel scales are based on one vowel token, the z-score transformation makes use of information across vowels. More vowel normalization methods, such as Gerstman's range normalization and Miller's formant-ratio model, are discussed and compared by Adank (2003).

2.3 Mantel Test

Normally we compare independent objects when carrying out a regression analysis. In other words, we assume that the objects to be correlated are independent.
However, distance matrices are typically dependent in some ways (Manly, 1994). In the case of acoustic distances derived from the first and second formants, the distances are dependent because they obey the triangle inequality. According to this inequality, the direct distance between two objects A and C can never exceed the sum of the distances from A to B and from B to C for any other object B, so the direct distance is constrained by the other distances. This concept is illustrated in Figure 2.2. Levenshtein-PMI distances, on the other hand, can be viewed as independent, as they do not necessarily obey the triangle inequality. Moreover, since Levenshtein-PMI distances come from information distribution theory, they are not guaranteed to be distances or metrics in the mathematical sense.

Figure 2.2 Triangle Inequality

Comparing Levenshtein-PMI distances to acoustic distances therefore means comparing an independent matrix with a dependent matrix. In such a case, it is essential to test whether the relationship between the matrices is truly significant. Evaluating their correlation coefficient and testing its significance in the usual way are not sufficient. The Mantel test, introduced by Mantel (1967), is a widely used test for this purpose. It was originally proposed as a solution for identifying space and time clustering of disease.

The Mantel test is based on randomization and permutation testing. To assess the significance of the correlation between two distance matrices, their correlation is compared with the correlations obtained after permutation: one matrix is kept in its original form while the rows and columns of the other are permuted, and this is repeated over all possible permutations. The null hypothesis is that there is no relationship between the two matrices. If it holds, the correlation coefficients of the permuted matrices should be equally likely to be larger or smaller than the observed one. An observation value can be used to show a positive relationship.
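
The permutation scheme of the Mantel test, combined with random sampling of permutations instead of full enumeration, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the software used for our results: the function names, the toy matrices, and the convention of counting the observed arrangement itself (the `+ 1` terms) are choices made for this sketch.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Upper-triangle cells after reordering rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel_test(d1, d2, replicates=9999, seed=0):
    """Return (observed r, one-sided p-value) for two square distance matrices."""
    rng = random.Random(seed)
    identity = list(range(len(d1)))
    fixed = upper_triangle(d2, identity)
    observed = pearson(upper_triangle(d1, identity), fixed)

    hits = 0
    for _ in range(replicates):
        order = identity[:]
        rng.shuffle(order)  # permute rows and columns of d1 together
        if pearson(upper_triangle(d1, order), fixed) >= observed:
            hits += 1
    # Assumption of this sketch: the observed arrangement counts as one replicate.
    return observed, (hits + 1) / (replicates + 1)


# Hypothetical distance matrices, invented for illustration only.
coords = [0, 1, 3, 6, 10, 15]
d1 = [[abs(a - b) for b in coords] for a in coords]
d2 = [[2 * abs(a - b) for b in coords] for a in coords]
r, p = mantel_test(d1, d2, replicates=999)
print(round(r, 3), p)
```

Rows and columns are permuted with the same ordering so that each permuted matrix is still a valid distance matrix over relabelled objects.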
Specifically, we compute the observation value by adding 1 for every permutation in which $r(PD_1, D_2) \geq r(D_1, D_2)$, where $D_1$ and $D_2$ are the distance matrices and $PD_1$ is the permuted $D_1$, and then dividing by the number of replicates. To be precise, we would need to perform the comparison for all possible permutations. However, the number of permutations grows enormously as the matrix grows larger. Therefore, a Monte-Carlo test (Metropolis & Ulam, 1949) is a good alternative. It suggests that taking a small random sample of the possible replicates should be sufficient.

3 Dataset

Our Dutch data came from the digital Dutch dialect transcriptions of the Goeman-Taeldeman-Van Reenen-Project (GTRP) as used by Wieling, et al. (2007). It consists of dialect varieties from 424 different locations in the Netherlands. For each variety, there are transcriptions of 562 different words, which altogether comprise 82 Dutch phonetic segment types. The Bulgarian data was collected from various resources, namely students' theses at the University of Sofia, published monographs, dictionaries, and the archive of the Ideographic Dictionary of Bulgarian Dialects (Prokić, et al., 2009). It contains transcriptions of 152 words from 197 locations all over Bulgaria, with 67 different segment types including diacritics and suprasegmentals, i.e. vocal effects such as emphasis or prosody. Additionally, we made use of the German dataset (Nerbonne & Siedle, 2005), which consists of 78 segment types. The transcriptions of 196 words were collected from 186 locations in Germany for the Kleiner Deutscher Lautatlas project.

For each language, the L04 program (see footnote 1), developed by Peter Kleiweg, was employed to compute the Levenshtein-PMI segment distances. The program produces a contingency matrix in which the rows and columns designate segment types and each cell $(x, y)$ gives the distance between the two corresponding segment types $x$ and $y$.
Each pair that is never aligned receives a very high penalty, which results in a very high distance. Since the segment labels of the different languages are written in different Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA) formats, we transform the X-SAMPA labels to their corresponding International Phonetic Alphabet (IPA) symbols. Then, we map each segment label shared between two languages, namely Dutch and Bulgarian, and Dutch and German. For each language pair, we collect all alignments of shared segments that have low distances in both languages. Additionally, we count the vowel alignments and consonant alignments separately.

Footnote 1: http://www.let.rug.nl/~kleiweg/L04/Manuals/leven.html

Table 3.1 Dutch, Bulgarian, and German Segment Figures

Language Pair         Shared Types   Segment Alignments   Vowel Alignments   Consonant Alignments
Dutch and Bulgarian   43             235                  92                 143
Dutch and German      71             870                  261                609

Dutch and Bulgarian share 43 identical phonetic segments in our data. In total, there are 903 possible segment alignments, but only 235 alignments have low distances, consisting of 92 vowel alignments and 143 consonant alignments. The Dutch and German data share 71 identical phonetic segments. There are 870 alignments with low distances, including 261 vowel alignments and 609 consonant alignments. These figures are summarized in Table 3.1.

The normal Q-Q plots of the Dutch and Bulgarian low segment distances in Figure 3.2 (a) and (b), which plot observed values against expected normal values, suggest that the data are normally distributed with one outlier in Dutch. Similarly, the Q-Q plots of the Dutch and German low segment distances show that the data are normally distributed (see Appendix I.2). The segment distances vary from 0 to 5000.

Figure 3.2 Q-Q Plots of (a) Bulgarian and (b) Dutch Data

The box plot of the data in Figure 3.1 shows that the Dutch and Bulgarian medians are close to each other and that most of the data overlap. Thus, we expect the data to be fairly similar, i.e. to show no significant difference. The box plot of the Dutch and German data behaves comparably (see Appendix I.2).

Figure 3.1 Box Plot of Bulgarian and Dutch Data

Our acoustic data were obtained from Pols, et al. (1973), containing the first three formants of 50 Dutch male speakers, and from Van Nierop, et al. (1973), containing those of 25 female speakers. The formants of all speakers are averaged, and the acoustic distances are computed as the Euclidean distances between the formant values. In total, there are 36 acoustic vowel alignments in the acoustic data. All of these alignments also appear in our Dutch Levenshtein-PMI data.

Besides the raw Hertz frequencies of the formants, we use the formants transformed to the Bark and Mel scales. Since raw Hertz frequency is linear whereas our perception is not, the nonlinear Bark and Mel scales should fit the nature of our perception better. Additionally, we apply a z-score transformation to our acoustic data in the following manner. The raw Hertz values of each speaker are transformed to standardized z-scores so as to normalize the differences over all the vowels per speaker. Then, the z-scores of each vowel are averaged over all speakers.

4 Results and Analysis

In this section, we present regression analyses over different setups. First, we present our comparisons of Levenshtein-PMI distances across languages. Second, we compare Levenshtein-PMI distances with several variations of acoustic distances.

4.1 Comparing Dutch, Bulgarian and German Levenshtein-PMI Distances

We analyze Levenshtein-PMI distances across languages with the following arrangement. We compare two variable pairs, Dutch and Bulgarian, and Dutch and German, over all existing cases, namely all segment alignments occurring in both languages.
The value of each variable for each case is the corresponding segment distance. For example, the value of an alignment involving /a/ in Dutch is its segment distance, and we compare it with the distance of the same alignment in Bulgarian and in German. We carry out the task by performing a regression analysis on two sets of Levenshtein-PMI distances and computing their correlation coefficient to measure the effect size. The task is modeled in Figure 4.1.

Figure 4.1 Regression Analysis Model on Comparing Levenshtein-PMI Distances (cases: segment alignments; variables: Dutch Levenshtein-PMI distance and Bulgarian/German Levenshtein-PMI distance)

The scatter plot in Figure 4.2 (a) visualizes the relationship between the Dutch and Bulgarian data, and (b) that between the Dutch and German data. Each point $(x, y)$ in the scatter plots indicates a case, where $x$ is the segment distance in Dutch and $y$ is the segment distance in Bulgarian or German. We take Dutch as the independent variable, which to some extent determines the values of the dependent variable (Bulgarian or German).

Figure 4.2 Scatter Plots of (a) Dutch and Bulgarian, (b) Dutch and German

A straight line (the regression line) is drawn in each scatter plot, suggesting linear dependency. The line has the formula $\hat{y} = a + bx$, where $a$ is the intercept and $b$ is the slope. Each $\hat{y}$ on the regression line is the predicted segment distance in the other language (Bulgarian or German), estimated from the corresponding Dutch segment distance. The difference between the actual and predicted segment distances is the residual, $e_i = y_i - \hat{y}_i$. Least-squares regression is applied to find the line that minimizes the sum of squared residuals over all segment alignments.

The points in the Dutch and Bulgarian data scatter more than those in the Dutch and German data, which are fairly concentrated near their regression line. Although the points look moderately random, the points in the Dutch and German data show a slight trend: the residuals become larger as the distances become larger.
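
The least-squares fit and its residuals can be sketched in plain Python as below. This is only an illustration of the technique, not the SPSS procedure we used; the function names `fit_line` and `residuals` and the sample distances are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Actual minus predicted value for every case."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical segment distances, invented for illustration only.
dutch = [500.0, 1200.0, 1556.0, 2400.0, 3100.0]
other = [1500.0, 2100.0, 1675.0, 2600.0, 2300.0]
a, b = fit_line(dutch, other)
res = residuals(dutch, other, a, b)
print(abs(sum(res)) < 1e-6)  # least-squares residuals sum to (almost) zero
```

Plotting `res` against the predicted values is the kind of residual check used to judge linearity and to spot residuals that grow with the distances.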
To examine the residuals accurately, we plot the residuals against the predicted values, as depicted in Figure 4.3 (a). The residuals are consistent with linearity, since the points are moderately random and widely spread. They are also reasonably normally distributed, as shown by the P-P plot in Figure 4.3 (b). The Dutch and German data behave similarly, with extra data points (see Appendix I.2). They also show the trend mentioned before, i.e. the residuals become larger as the distances become larger.

Figure 4.3 Plots of Dutch and Bulgarian Residuals

According to our SPSS results given in Figure 4.4, the regression line for Dutch and Bulgarian is $\hat{y} = 1568.562 + 0.3x$. Using this regression line, we can calculate the predicted Bulgarian segment distance given the corresponding Dutch distance. For example, given a segment pair whose alignment distance is 1556 in Dutch, the predicted alignment distance in Bulgarian is reported as $\hat{y} = 2053.362$, which corresponds to an unrounded slope of about 0.312. If we allow 5% error, i.e. a 95% confidence interval, the mean alignment distance of this pair in Bulgarian should lie within $2053.362 \pm 1083 = (970, 3136)$, where 1083 is the standard error of $\hat{y}$ for the specific $x = 1556$. In our data, the real distance is 1675, which indeed lies in the interval.

Figure 4.4 Dutch and Bulgarian Regression Line Coefficients

The t-statistic is 5.454, indicating that the relationship between the Dutch and Bulgarian data is significant at $p < 0.001$. In other words, Dutch segment distances can be considered a good predictor of Bulgarian segment distances. The correlation coefficient $r = 0.336$ shows that Dutch and Bulgarian have a low positive correlation.

The coefficient of determination, that is, the square of the correlation coefficient ($r^2$), shows the proportion of variability in a data set accounted for by a regression model (Moore & McCabe, 2006). It compares the variation explained by an explanatory variable, i.e. the Dutch segment distances in our case, to the total variation in the whole data set.
It therefore expresses the explanatory size of the independent variable with respect to the dependent variable. It also specifies how well future outcomes can be predicted by the model. From an ANOVA point of view, as shown in Figure 4.5, the coefficient of determination is computed as the regression sum of squares divided by the total sum of squares. For Dutch and Bulgarian, the coefficient shows that Dutch segment distances account for approximately 11% of the variation in Bulgarian segment distances. Figure 4.5 also shows that Dutch distances have a significant effect on Bulgarian distances, as the F-statistic is significant ($p < 0.001$).

Figure 4.5 ANOVA Summary of Dutch and Bulgarian Data

In the case of Dutch and German (see Appendix I.2), the t-statistic also indicates a significant relationship, namely 23.925 ($p < 0.001$). The regression line is $\hat{y} = 879.010 + 0.550x$. The correlation ($r = 0.630$) is stronger than that of Dutch and Bulgarian. Almost 40% of the variation in German segment distances is accounted for by Dutch segment distances.

Furthermore, we compare vowel alignments and consonant alignments separately. Both vowel and consonant alignments yield significant correlations at $p < 0.001$. We find that vowel alignments obtain better correlations than consonant alignments. Dutch and Bulgarian vowel alignments correlate significantly at $r = 0.418$, which means that nearly 18% of the variation in Bulgarian vowel distances can be predicted by Dutch vowel distances. Their consonant alignments, on the other hand, correlate at $r = 0.339$, that is, roughly 11% of the variation in Bulgarian consonant distances is accounted for by Dutch consonant distances.
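
As a worked check of how the explanatory sizes follow from the correlation coefficients, the relation can be written out for the two all-alignment correlations reported above. The symbols $SS_{\mathrm{reg}}$ and $SS_{\mathrm{tot}}$ are shorthand introduced here for the regression and total sums of squares; they are not labels taken from the SPSS output.

```latex
r^2 = \frac{SS_{\mathrm{reg}}}{SS_{\mathrm{tot}}}, \qquad
r_{\text{Dutch-Bulgarian}}^2 = 0.336^2 \approx 0.113, \qquad
r_{\text{Dutch-German}}^2 = 0.630^2 \approx 0.397
```

These values correspond to the roughly 11% and almost 40% of explained variation quoted in the text.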

Table 4.1 Dutch, Bulgarian and German Levenshtein-PMI Distances Correlations (p < 0.001)

Language Pair         Alignment Set   Pearson Correlation (r)   Explanatory Size (r2)
Dutch and Bulgarian   All             0.336                     0.113
Dutch and Bulgarian   Vowel           0.418                     0.178
Dutch and Bulgarian   Consonant       0.339                     0.115
Dutch and German      All             0.630                     0.397
Dutch and German      Vowel           0.620                     0.384
Dutch and German      Consonant       0.587                     0.345

For Dutch and German, the vowel distances correlate at 0.620, and thus Dutch vowel distances account for over 38% of the variation in German vowel distances. The consonant distances have a slightly lower correlation of 0.587, suggesting that approximately 35% of the variation in German consonant distances is accounted for by Dutch consonant distances.

4.2 Comparing Levenshtein-PMI Distances to Acoustic Distances

Our second task is to compare the segment distances produced by the distributional approach with the common assessment of vowel quality in the phonetic-linguistic approach. Specifically, we compare Dutch Levenshtein-PMI distances to acoustic distances. Since we are interested in how well acoustic distances explain Levenshtein-PMI distances, we set the acoustic distance as the explanatory variable and the Levenshtein-PMI distance as the response variable. As in the previous task, the cases are segment alignments and the values of the variables are their distances in the corresponding approaches.

We evaluate each variation of the acoustic distances described in section 3, namely raw Hertz frequency and frequency transformed to the Bark scale, the Mel scale and z-scores. For each variation, we compute the Pearson correlation coefficient (r) and the coefficient of determination showing the explanatory size (r2) for the first two and the first three formants. The results are presented in Table 4.2.
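
The z-score variant of the acoustic distances described in section 3 can be sketched as follows: formants are standardized per speaker, averaged per vowel over speakers, and then compared by Euclidean distance. The sketch rests on assumptions: the function names are hypothetical, the sample standard deviation (n - 1) is an arbitrary choice, and the formant values are invented rather than taken from Pols, et al. (1973) or Van Nierop, et al. (1973).

```python
import math
from itertools import combinations


def lobanov(speaker):
    """Z-score each formant over all vowels of one speaker.

    speaker maps a vowel label to a tuple of formant values in Hertz.
    """
    n_formants = len(next(iter(speaker.values())))
    means, sds = [], []
    for k in range(n_formants):
        values = [formants[k] for formants in speaker.values()]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
        means.append(mean)
        sds.append(math.sqrt(var))
    return {
        vowel: tuple((f[k] - means[k]) / sds[k] for k in range(n_formants))
        for vowel, f in speaker.items()
    }


def average_per_vowel(speakers):
    """Average the normalized formants of each vowel over all speakers."""
    n_formants = len(next(iter(speakers[0].values())))
    return {
        v: tuple(sum(s[v][k] for s in speakers) / len(speakers)
                 for k in range(n_formants))
        for v in speakers[0]
    }


def acoustic_distances(vowel_formants):
    """Euclidean distance between the formant vectors of every vowel pair."""
    return {
        (a, b): math.dist(vowel_formants[a], vowel_formants[b])
        for a, b in combinations(sorted(vowel_formants), 2)
    }


# Hypothetical F1/F2 values in Hertz for two speakers, invented for illustration.
speaker_1 = {"i": (300.0, 2200.0), "a": (750.0, 1300.0), "u": (320.0, 800.0)}
speaker_2 = {"i": (360.0, 2640.0), "a": (900.0, 1560.0), "u": (384.0, 960.0)}
normalized = average_per_vowel([lobanov(speaker_1), lobanov(speaker_2)])
print(acoustic_distances(normalized))
```

Passing tuples of three values instead of two gives the three-formant variant; the resulting pairwise distances are the values that would be correlated with the Levenshtein-PMI distances.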

Table 4.2 Dutch Levenshtein-PMI and Acoustic Distances Correlations

Acoustic Variation   Number of First Formants   Pearson Correlation (r)   Explanatory Size (r2)   Significance (p-value)
Hertz                2                          0.481                     0.231                   0.003
Hertz                3                          0.426                     0.181                   0.010
Z-scores             2                          0.720                     0.518                   < 0.001
Z-scores             3                          0.640                     0.410                   < 0.001
Bark Scale           2                          0.616                     0.379                   < 0.001
Bark Scale           3                          0.517                     0.267                   0.001
Mel Scale            2                          0.603                     0.364                   < 0.001
Mel Scale            3                          0.507                     0.257                   0.002

The correlations between acoustic distances in raw Hertz and Levenshtein-PMI distances are not remarkable. Raw Hertz with the first two formants has a correlation of 0.481, which means that it accounts for 23% of the variation in Levenshtein-PMI distances. Taking the third formant into account slightly lowers the correlation coefficient to 0.426, meaning that the acoustic distance accounts for 18% of the variation in Levenshtein-PMI distances.

Normalizing the raw Hertz values does improve the results. Our acoustic z-score distances yield the best correlations, 0.720 for the first two formants and 0.640 for the first three. Both results are significant at $p < 0.001$. Using the first two formants, the z-score distances account for nearly 52% of the variation in Levenshtein-PMI distances. Including the third formant does not refine the results but makes them poorer: the explanatory size is more than 10 percentage points smaller than without it. Only 41% of the variation in Levenshtein-PMI distances is accounted for by acoustic z-score distances with the first three formants.

The Bark and Mel scales produce similar results, although the Bark scale is marginally better than the Mel scale. Almost 38% of the variation in Levenshtein-PMI distances is explained by acoustic distances in the Bark scale with the first two formants ($r = 0.616$), and over 36% is explained by the Mel scale, also with the first two formants ($r = 0.603$). The third formant again worsens the results. With the third formant, acoustic distances in the Bark scale predict nearly 27% of the variation in Levenshtein-PMI distances ($r = 0.517$) and the distances in the Mel scale predict almost 26% ($r = 0.507$).
While the results for the first two formants in the Bark and Mel scales are significant at $p < 0.001$, the significance for the first three formants falls to $p < 0.005$.

As mentioned in section 2.3, a p-value is not sufficient for validating the significance of a correlation coefficient between distance matrices. Although Levenshtein-PMI distances can be regarded as independent, acoustic distances are not. Since we are comparing an independent object with a dependent object, we need to perform a Mantel test on the significance of their correlation coefficient. Instead of testing all possible permutations, we use Monte-Carlo sampling with 10000 replicates. The outcomes of the Mantel test with Monte-Carlo sampling are given in Table 4.3.

Table 4.3 Mantel Test Results of Dutch Levenshtein-PMI and Acoustic Distances

Acoustic Variation   Observation Value   Significance (p-value)
Hertz 2              0.168               0.013
Hertz 3              0.132               0.035
Z-score 2            0.410               0.0001
Z-score 3            0.317               0.0003
Bark 2               0.303               0.0002
Bark 3               0.206               0.002
Mel 2                0.286               0.0002
Mel 3                0.195               0.004

The significance levels in the Mantel test follow those of the correlation coefficients in Table 4.2. Table 4.2 shows that z-scores using the first two and three formants, the Bark scale with two formants, and the Mel scale with two formants are significant at p < 0.001. Table 4.3, in turn, shows that these variations have very low p-values, implying that the observed correlations between the acoustic distances and the Levenshtein-PMI distances are rarely matched when the rows and columns are permuted. Thus, the two kinds of distances are genuinely related and their correlation is truly significant.

5 Discussion

Our results show that Dutch Levenshtein-PMI distances are able to predict distances in German and Bulgarian. The Dutch prediction of German, which has characteristics similar to Dutch, is much better than the prediction of Bulgarian, which has different characteristics. Dutch and German are both grouped among the Germanic languages.
Since they descend from the same parent language, they share a wide range of similarities, including types of consonants, vowels and accents (Auwera & König, 1994). Bulgarian, on the other hand, belongs to the Slavonic languages, which are mainly spoken in Eastern Europe. Bulgarian therefore has phonetic properties that differ from those of Dutch. Since the sound systems of Slavonic languages are rich in consonants, speakers of Slavonic languages are less accustomed to pronouncing vowels. They typically have difficulty pronouncing vowels, and they pronounce vowels differently from Dutch speakers.

Another issue that should be taken into account is that the notation of the International Phonetic Alphabet (IPA) does not necessarily denote exactly the same phonetic sounds in different languages. The alphabet was originally defined based on English. A sound which is similar to, but not exactly the same as, an English sound may be assigned the same symbol. For instance, the /i/ sound in Bulgarian might be pronounced slightly differently from English /i/, but it is labeled with the same symbol /i/.

Comparisons of Levenshtein-PMI distances with various transformations of acoustic distances show that the z-score transformation yields the best results. The Bark and Mel scales help to normalize the formants to match human perception, which is nonlinear. They improve the explained variation in Levenshtein-PMI distances by more than 10 percentage points. However, the z-score transformation suits our acoustic data better, since the data was collected from male and female speakers and the z-score transformation normalizes the differences over all vowels per speaker. Z-scores therefore smooth speaker differences with regard to gender. They improve the explained variation by nearly 30 percentage points.

The third formant appears not to be useful in our experiments and even leads to poorer outcomes. This phenomenon is not unusual, as it was also found by Wieling, et al. (2007).
We suspect that this is because the pronunciations in our data are not much determined by lip position, which greatly affects the third formant. Instead of helping to distinguish vowels, the third formant seems to blur the differences among them.

6 Summary

We have described two segment distance comparison tasks. First, we compared Levenshtein-PMI segment distances between two pairs of languages, namely Dutch-Bulgarian and Dutch-German. Second, we compared Levenshtein-PMI distances with several variations of acoustic distances derived from formant measurements. For both tasks, we showed significant correlations between the variables compared.

Our results reveal that the Levenshtein-PMI distances of Dutch are able to predict those of Bulgarian and German. That is to say, the Levenshtein-PMI distances of one language are able to predict distances in other languages. We also report that the prediction is better for a language whose characteristics are similar to those of the predictor than for a language whose characteristics differ. In our work, the Dutch prediction of German is better than the Dutch prediction of Bulgarian. Dutch distances account for up to 40% of the variation in German distances. In the Bulgarian case, Dutch distances account for approximately 11% of the variation.

Additionally, we show that vowel quality, as represented by acoustic distances, correlates reasonably highly with Levenshtein-PMI distances. This implies that the phonetic-linguistic approach has a significant relationship with the distributional approach and that the former can explain the latter to some extent. In our case, we evaluated how well Dutch acoustic distances are capable of predicting Dutch Levenshtein-PMI distances. We show that the former can account for up to 52% of the variation in the latter. Acoustic distances using the raw Hertz frequencies of the first two formants account for about 23% of the variation in Levenshtein-PMI distances.
Normalizing the raw Hertz frequencies improves the results. The Bark and Mel scales transform linear raw Hertz values to a nonlinear frequency scale in order to match human perception. Acoustic distances on the Bark and Mel scales produce comparable results, with Bark slightly better than Mel; they account for about 37% of the variation in Levenshtein-PMI distances. The best prediction is attained with the z-score transformation. Because our acoustic data combine male and female speakers and the z-score transformation normalizes the differences across all vowels per speaker, it helps to smooth the differences between men and women.

Since we compare independent Levenshtein-PMI distances with dependent acoustic distances, we also tested the significance of their correlations. We did so with the Mantel test, which confirmed their significance. In particular, for correlations with the normalized acoustic distances (z-score, Bark and Mel scale) based on the first two formants, the p-values are very low, indicating that there is a relationship between the two sets of distances.

Appendix I Data

I.1 Dutch and Bulgarian Data

I.2 Dutch and German Data

Appendix II Results

II.1 Results of Levenshtein-PMI Dutch and German Segment Distance Comparison

Bibliography

Adank, P. M. (2003). Vowel Normalization: a perceptual acoustic study of Dutch vowels. Wageningen: Ponsen & Looijen.
Auwera, J. v., & König, E. (1994). The Germanic Languages. London: Routledge.
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Comput. Linguist., 16(1), 22-29.
Ellison, T. M. (1992). The Machine Learning of Phonological Structure. PhD Thesis, University of Western Australia, Department of Computer Science.
Geutner, P., Finke, M., & Waibel, A. (1998). Phonetic-Distance-Based Hypothesis Driven Lexical Adaptation For Transcribing Multilingual Broadcast News. In Proceedings of the International Conference on Spoken Language Processing.
Ladefoged, P. (2005). Vowels and Consonants: An Introduction to the Sounds of Languages (2nd ed.). Malden, MA: Blackwell.
Leinonen, T. (2010). An Acoustic Analysis of Vowel Pronunciation in Swedish Dialects. PhD Thesis, Groningen.
Levenshtein, V. (1965). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 163(4), 845-848.
Lobanov, B. M. (1971). Classification of Russian Vowels Spoken by Different Speakers. J. Acoust. Soc. Am., 49, 606-608.
Manly, B. F. (1994). Multivariate Statistical Methods: A Primer (2nd ed.). USA: Chapman and Hall.
Mantel, N. (1967). The Detection of Disease Clustering and a Generalized Regression Approach. Cancer Research, 27(2), 209-220.
McArthur, T. (1998). "VOWEL QUALITY". Concise Oxford Companion to the English Language. Retrieved May 5, 2010, from Oxford Reference Online: http://www.oxfordreference.com/views/ENTRY.html?subview=Main&entry=t29.e1288
Metropolis, N., & Ulam, S. (1949). The Monte Carlo Method. Journal of the American Statistical Association, 44(247), 335-341.
Moore, D. S., & McCabe, G. P. (2006). Introduction to the Practice of Statistics (5th ed.). New York: W. H. Freeman.
Nerbonne, J., & Siedle, C. (2005). Dialektklassifikation auf der Grundlage aggregierter Ausspracheunterschiede. Zeitschrift für Dialektologie und Linguistik, 72(2), 129-147.
Nerbonne, J., Heeringa, W., van den Hout, E., van der Kooi, P., Otten, S., & van de Vis, W. (1996). Phonetic Distance between Dutch Dialects. In G. Durieux, W. Daelemans, & S. Gillis (Eds.), CLIN VI: Proc. of the Sixth CLIN Meeting (pp. 185-202). Antwerp, Centre for Dutch Language and Speech (UIA).
Peterson, G., & Barney, H. (1952). Control methods used in a study of the vowels. J. Acoust. Soc. Am., 24(2), 175-184.
Pols, L. C., Tromp, H. R., & Plomp, R. (1973). Frequency analysis of Dutch vowels from 50 male speakers. The Journal of the Acoustical Society of America, 43, 1093-1101.
Prokić, J., Nerbonne, J., Zhobov, V., Osenova, P., Simov, K., Zastrow, T., et al. (2009). The computational analysis of Bulgarian dialect pronunciation. Serdica Journal of Computing.
Statistical Consulting Group. (n.d.). How can I perform a Mantel test in R? Retrieved May 8, 2010, from UCLA: Academic Technology Services: http://www.ats.ucla.edu/stat/R/faq/mantel_test.htm
Van Nierop, D. J., Pols, L. C., & Plomp, R. (1973). Frequency analysis of Dutch vowels from 25 female speakers. Acustica, 29, 110-118.
Wieling, M., Heeringa, W., & Nerbonne, J. (2007). An Aggregate Analysis of Pronunciation in the Goeman-Taeldeman-van Reenen-Project Data. Taal en Tongval, 59(1), 84-116.
Wieling, M., Leinonen, T., & Nerbonne, J. (2007). Inducing sound segment differences using Pair Hidden Markov Models. SigMorPhon '07: Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology (pp. 48-56). Prague, Czech Republic: Association for Computational Linguistics.
Wieling, M., Prokić, J., & Nerbonne, J. (2009). Evaluating the pairwise string alignment of pronunciations. Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (pp. 26-34). Athens, Greece: Association for Computational Linguistics.
Zwicker, E. (1961). Subdivision of the audible frequency range into critical bands. The Journal of the Acoustical Society of America, 33(2), 248.