Linguistic Divergence in Austronesian on the Island of Timor

Riley Burkart

August 15, 2014

Abstract

The island of Timor, located in Southeast Asia, experienced a wave of migration from the neighboring islands several thousand years ago, resulting in its domination by the Malayo-Polynesian branch of the Austronesian languages. The spread of these languages and their relationships to each other remain poorly understood. Using statistical tools and computation, a data set composed of words from 53 languages will be analyzed in order to explore a new method for estimating linguistic divergence that has previously been applied only to the Turkic languages. Regular sound correspondences will be inferred and cognates will be identified to estimate the amount of time that has passed since the languages split. These new methods, which incorporate linguistic knowledge, will hopefully compare favorably with current computational strategies and contribute to a deeper understanding of the movement of the Austronesian people in Timor.

General Discussion

Located near the eastern end of the Lesser Sunda Islands, the small island of Timor demonstrates a remarkable degree of linguistic diversity, containing at least 16 distinct indigenous languages. First settled by Australoid peoples between 40,000 and 20,000 BC, Timor has since experienced at least two successive migrations. The first of these occurred around 3,000 BC and consisted of Melanesians who spoke Papuan languages and migrated from the nearby island of New Guinea. Around 2,500 BC, Austronesians from the north settled on the island, where they remain the predominant ethnic group (Timor, 2010).

The Austronesian language family originated in South China before spreading to Taiwan between 4,000 and 3,000 BC, around the advent of rice cultivation.
Over the next several millennia, the Austronesian people would continue to spread to other islands, and today an estimated 270 million people speak between 1,000 and 1,200 distinct Austronesian languages distributed more than halfway around the world, from Easter Island to Madagascar. The Austronesian languages spoken on the island of Timor, as well as those spoken on the nearby island of Sumba, fall under the Central Malayo-Polynesian subgroup (Bellwood et al., 1995).

Aside from Indonesian and Portuguese, the only languages commonly spoken on Timor today are either Papuan or Austronesian. Papuan languages, including Bunak, Makasai, and Fataluku, are largely confined to the mountainous regions of the island and are believed to have originated from a single common ancestor. The Austronesian languages of Timor are generally referred to as Timoric, and are divided by Geoffrey Hull (2004) into two subgroups: Fabronic and Ramelaic. According to this classification, the significant languages in the Fabronic group are Tetum and Atoni, while those in the Ramelaic group are Mambai and Kemak. Ramelaic languages are believed to differ from Fabronic languages mainly in that they were influenced by a stronger Papuan substratum and by a greater degree of creolization from trade languages such as Malay and Ambonese during the 15th century (Hull, 2004).

Because they have not been extensively studied, the relationships between the Timoric languages remain poorly understood. Cognate analysis, however, makes it possible to build maps and phylogenies that address this gap. Two words are considered to be cognates if they share a common etymological origin; in other words, they must have a similar meaning and share systematic sound correspondences. The number of cognates shared between two languages can be used to estimate the distance between the languages.
Assuming cognate replacement to be a memoryless process, the probability that two languages share a cognate for a given meaning can be modeled as $P = e^{-d}$, where $d$ is the distance between the languages (Dyen, 1973). Replacing the probability with $m/n$, where $m$ is the number of cognates two languages share and $n$ is the number of words compared between the languages, yields:

\[ d = -\log\frac{m}{n} \]

Using this equation in conjunction with a list of words in different languages with their cognates identified, a distance matrix can be created. Methods such as multidimensional scaling and neighbor-joining can then be used to transform such distance matrices into maps and trees, respectively.

A significant drawback of such distance matrix methods, however, lies in their tacit assumption that the rate of cognate replacement is the same for all meanings. It has been shown that this is not the case, as can be seen when comparing words like 'mother', which are replaced very slowly, to words like 'food', which are replaced over a much shorter period of time (Dyen et al., 1967). A Bayesian Markov Chain Monte Carlo (MCMC) model applied to the cognates can correct for this by directly estimating the rates of cognate replacement (Pagel et al., 2007).

There remains, nevertheless, an inherent problem with any method that uses cognates to determine the distance between languages. Given three cognates such as the English hound, the German Hund, and the Russian suka, a cognate-based analysis would assign all three of these words an identical cognate code and treat them the same. It is clear, however, that hound and Hund are far more similar to each other than either is to suka, a fact that the cognate-based approach fails to take into account. A new method that analyzes sound correspondences rather than cognates, and that has previously been used only on the Turkic languages, will be applied to the Timoric languages in order to produce a more accurate phylogeny for the island.
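As an illustration of the cognate-based distance d = -log(m/n) introduced earlier in this section, the following Python sketch builds a small distance matrix from cognate codes. It is only a sketch under stated assumptions: the language names, glosses, and codes are made up for the example and are not taken from the data set, and the function names are hypothetical.

```python
import math

def cognate_distance(codes_a, codes_b):
    """Distance d = -log(m/n) between two languages.

    codes_a, codes_b: dicts mapping a gloss to a cognate code.
    n counts the glosses attested in both languages, m those with equal codes.
    """
    shared = [gloss for gloss in codes_a if gloss in codes_b]
    n = len(shared)
    m = sum(1 for gloss in shared if codes_a[gloss] == codes_b[gloss])
    if n == 0 or m == 0:
        return math.inf  # no overlap: the model gives an unbounded distance
    return -math.log(m / n)

def distance_matrix(languages):
    """Return sorted language names and the symmetric matrix of distances."""
    names = sorted(languages)
    matrix = [[0.0 if a == b else cognate_distance(languages[a], languages[b])
               for b in names] for a in names]
    return names, matrix

# Hypothetical toy data: three languages, four glosses each.
toy = {
    "A": {"water": 1, "stone": 2, "eye": 3, "fire": 4},
    "B": {"water": 1, "stone": 2, "eye": 5, "fire": 4},
    "C": {"water": 6, "stone": 2, "eye": 7, "fire": 8},
}
names, D = distance_matrix(toy)
```

Here A and B share three of four cognates, so their distance is -log(3/4), while each of them shares only one cognate with C, giving -log(1/4). A matrix of this form is the input assumed by the scaling and tree-building methods discussed in this paper.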
To apply the sound-correspondence method, the phonemes of the languages must first be aligned, so that regular and irregular sound correspondences can be inferred. These sound correspondences will then be subjected to MCMC methods, yielding a more accurate phylogenetic tree.

Data Set

The data set used in the analysis is composed of 7,118 words from 52 distinct language samples (hereafter referred to as 'languages') and was collected by Stephen Lansing and his team. In addition to languages from the island of Timor, Balinese and several languages from the island of Sumba are included, but Mambai is not. Each word is written in IPA and is listed with two associated properties: a cognate code and a gloss, which is the meaning of the word.

Several difficulties are encountered in using the list. For one, many of the languages lack cognate codes, rendering them unusable for cognate analysis. Furthermore, two cognate codes, 6 and 4142, denote errors and isolates, respectively, which requires that more words be discarded. Finally, the glosses are in Indonesian and have not yet been translated into English. The glosses have therefore not yet been used in this research.

Cognate Analysis

Classical Multidimensional Scaling

One way to examine the relationships between languages is to create a map of the languages' locations, and one method of mapping is classical multidimensional scaling, or CMDS. Given a distance matrix, CMDS sets up a matrix of squared distances, to which it then applies double centering. It then extracts the two largest positive eigenvalues and their corresponding eigenvectors, which are used to create a coordinate matrix that maps the points in two dimensions (Wickelmaier, 2003).

Because the map depends on only two eigenvalues, it is important that these eigenvalues be significantly larger than all the others. If this is not the case, then the map produced will have a high degree of error.
This fact can be used to define a goodness of fit for the map. Two such measures are implemented in R and are given below, where $m = 2$ is the number of eigenvalues retained for the map and $n$ is the total number of eigenvalues:

\[ \mathrm{GOF}_1 = \frac{\sum_{i=1}^{m} |\lambda_i|}{\sum_{i=1}^{n} |\lambda_i|} = 0.5372 \]

\[ \mathrm{GOF}_2 = \frac{\sum_{i=1}^{m} \max(0, \lambda_i)}{\sum_{i=1}^{n} \max(0, \lambda_i)} = 0.6682 \]

The results of the CMDS, as applied by the cmdscale command in R, can be seen in Figure 1. As neither goodness-of-fit value is close to 1, it is not surprising that the CMDS managed to resolve only three rough groups of languages, which can be categorized as Sumba, Tetum, and Atoni; the Kemak group has been mixed in with the Tetum languages. When the resulting distances are plotted against the original distance matrix, a large, aberrant group of language comparisons, predominantly composed of Kemak languages, can be seen.

[Figure 1 image: left panel, CMDS map with clusters labeled Sumba, Tetum, and Atoni; right panel, scatter plot of "Distance Matrix" against "CMDS".]

Figure 1: On the left is the CMDS map, which displays three clusters of languages that correspond roughly to Sumba, Tetum, and Atoni. On the right is the plot of the CMDS distances against the original distance matrix. It is roughly monotonic, as it should be, but contains a major aberration.

It is clear that the CMDS method is insufficient for describing the relationships between the languages, as it cannot distinguish between certain recognized language groups. Additionally, the goodness of fit is rather poor.
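The CMDS procedure and the two goodness-of-fit ratios described in this section can be sketched in plain Python as follows. This is not the cmdscale implementation used for the results above; it is a minimal illustration that assumes a small symmetric distance matrix, uses a simple Jacobi eigenvalue routine so that it needs no external libraries, and is run on a made-up rectangle of four points rather than on the language data. All function names are hypothetical.

```python
import math

def jacobi_eigen(matrix, sweeps=50):
    """Eigenvalues and eigenvectors (columns) of a symmetric matrix via Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < 1e-20:  # off-diagonal part is numerically zero
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # accumulate the eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [a[i][i] for i in range(n)], v

def cmds(dist, k=2):
    """Classical MDS: coordinates in k dimensions plus the two GOF ratios."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]  # row means (equal to column means by symmetry)
    total = sum(row) / n            # grand mean
    # Double centering of the squared distances.
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
         for i in range(n)]
    eigvals, eigvecs = jacobi_eigen(b)
    top = sorted(range(n), key=lambda i: eigvals[i], reverse=True)[:k]
    coords = [[eigvecs[i][j] * math.sqrt(max(eigvals[j], 0.0)) for j in top]
              for i in range(n)]
    gof1 = sum(abs(eigvals[j]) for j in top) / sum(abs(e) for e in eigvals)
    gof2 = (sum(max(0.0, eigvals[j]) for j in top)
            / sum(max(0.0, e) for e in eigvals))
    return coords, gof1, gof2

# Made-up example: four corners of a 3-by-4 rectangle, which is exactly two-dimensional.
pts = [(0, 0), (3, 0), (3, 4), (0, 4)]
dm = [[math.dist(p, q) for q in pts] for p in pts]
coords, gof1, gof2 = cmds(dm, k=2)
```

Because the example points really do lie in a plane, the two retained eigenvalues carry all of the variation and both ratios come out at essentially 1. Values well below 1, as reported for the language data, indicate that two dimensions discard a substantial part of the structure in the distance matrix.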
One of the problems with CMDS is that it gives priority to representing the languages with the exact Euclidean distances that appear in the distance matrix, which is not always possible.

Ordinal Multidimensional Scaling

As opposed to CMDS, ordinal or non-metric multidimensional scaling (NMDS) disregards the actual Euclidean distances in favor of their rank order. Using a monotonically increasing function $\theta$, each original distance $\delta_{ij}$ is transformed into a disparity $\theta(\delta_{ij})$, which is an ordinal value. Using $\theta$, the method attempts to minimize a value known as stress:

\[ \mathrm{STRESS} = \frac{\sum_{i<j} \left(\theta(\delta_{ij}) - d_{ij}\right)^2}{\sum_{i<j} d_{ij}^2} \]

where $d_{ij}$ is the distance in the map that is being produced by the NMDS (Wickelmaier, 2003). When 2-dimensional NMDS is used on the data set, the stress value comes out to 9.9141%, while 6-dimensional NMDS produces a stress of 0.056%, which is considerably better. The 6-dimensional NMDS produced by the isoMDS command in the MASS package of R and mapped onto 2 dimensions can be seen in Figure 2.

[Figure 2 image: left panel, NMDS map with groups labeled Sumba, Tetum, Kemak, and Atoni and the outliers Rindi and Kolhua marked; right panel, scatter plot of the NMDS distances against the original distance matrix.]

Figure 2: On the left is the 6-dimensional NMDS map. It discerns all four major language groups (Tetum, Kemak, Atoni, and Sumba), and additionally recognizes that Rindi and Kolhua are outliers in the Sumba and Atoni groups, respectively. On the right is the comparison with the original distance matrix, which is monotonically increasing.
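To make the stress criterion concrete, the following Python sketch evaluates it for a given set of original distances and map distances. It assumes that the disparities are obtained by monotone (isotonic) regression of the map distances on the rank order of the original distances, computed with the pool-adjacent-violators procedure; this is a common choice, but it is an assumption here rather than a description of what isoMDS does internally. The function names and the toy numbers are hypothetical, and ties among the original distances are not treated specially.

```python
def isotonic_fit(values):
    """Least-squares non-decreasing fit to a sequence (pool adjacent violators)."""
    blocks = []  # each block holds [sum of values, number of values]
    for value in values:
        blocks.append([value, 1])
        # Merge blocks while their means violate the non-decreasing order.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

def stress(original, mapped):
    """Stress of a map, following the ratio of sums of squares given in the text.

    original: original distances for all pairs i < j, as a flat list.
    mapped:   map distances for the same pairs, in the same order.
    """
    order = sorted(range(len(original)), key=lambda i: original[i])
    fitted = isotonic_fit([mapped[i] for i in order])
    disparity = [0.0] * len(original)
    for rank, i in enumerate(order):
        disparity[i] = fitted[rank]
    numerator = sum((disparity[i] - mapped[i]) ** 2 for i in range(len(mapped)))
    denominator = sum(d ** 2 for d in mapped)
    return numerator / denominator

# Toy example: the map swaps the order of the two largest distances.
example = stress([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])
```

In the toy example the two out-of-order map distances are pooled to their mean of 2.5, giving a stress of 0.5/14. A map whose distances are in exactly the same rank order as the original distances has a stress of zero, regardless of the actual magnitudes. Note that Kruskal's stress-1 takes the square root of this ratio, and isoMDS reports its stress as a percentage.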
The map produced by the NMDS discerns the languages considerably better than the CMDS map. Not only did it resolve Tetum, Kemak, Atoni, and Sumba as separate groups, but it also indicated that two languages, Rindi and Kolhua, are outliers within their respective groups of Sumba and Atoni. Similarly, when compared with the original distance matrix, the points are arranged in a monotonically increasing fashion, which is what would be expected.

Unfortunately, the ordinal factor that made the NMDS so successful also makes the results less useful. All of the languages are confined to a small number of points, because the NMDS completely ignores the Euclidean distances. All the languages are either extremely far from each other or extremely close, so although we learned that Rindi and Kolhua are unique, we also lost a great deal of information on the individual languages that are not so unique.

Ultimately, the map-like topology does not seem to be an accurate model for the Timoric languages. Such a model would imply that languages are similar because they are geographically close, which would lead to mixing. Because this does not seem to be the case, the next step is to see whether the data conform to a tree-like topology.

Neighbor-Joining Tree

The most common method for producing a tree from a distance matrix is the neighbor-joining algorithm. Essentially, the algorithm selects the two languages that are closest to each other once each one's average distance to all the other languages has been taken into account, and then joins them with a node, the position of which is determined by the average of the two languages' distances to all the other languages. The two languages are then dropped and replaced by the node in the distance matrix. The algorithm repeats until the distance matrix is reduced to three nodes, at which point the remaining nodes are connected by branches (Saitou and Nei, 1987).
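The neighbor-joining procedure can be sketched in Python as follows. This is a minimal illustration of the standard algorithm, not the bionj variant used for the results in this paper; it assumes the usual formulation in which the pair to join minimizes (n - 2) d(a, b) - r(a) - r(b), where r is the sum of a node's distances to all other nodes. The function name, the node labels, and the five-taxon example matrix are hypothetical.

```python
def neighbor_joining(names, dist):
    """Return the edges (child, parent, branch length) of an unrooted NJ tree."""
    nodes = list(names)
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    edges = []
    next_id = 0
    while len(nodes) > 3:
        n = len(nodes)
        # r[a]: total distance from a to every other active node.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        pairs = ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:])
        a, b = min(pairs,
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        length_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        length_b = d[a][b] - length_a
        edges += [(a, u, length_a), (b, u, length_b)]
        # Distances from the new node to the remaining nodes.
        d[u] = {u: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    # Connect the last three nodes to a central node.
    x, y, z = nodes
    center = f"node{next_id}"
    edges.append((x, center, 0.5 * (d[x][y] + d[x][z] - d[y][z])))
    edges.append((y, center, 0.5 * (d[x][y] + d[y][z] - d[x][z])))
    edges.append((z, center, 0.5 * (d[x][z] + d[y][z] - d[x][y])))
    return edges

# Made-up additive distances between five taxa.
taxa = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
tree_edges = neighbor_joining(taxa, matrix)
```

For this example the first join is a with b, with branch lengths 2 and 3, and the finished tree has a total length of 17. Because the example distances are exactly additive, the path lengths in the resulting tree reproduce the input matrix; with real cognate distances the fit is only approximate.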
[Figure 3 image: left panel, neighbor-joining tree with tips labeled by language code and grouped as Tetum, Kemak, Atoni, and Sumba; right panel, scatter plot of the tree distances against the original distance matrix.]

Figure 3: On the left is the neighbor-joining tree. On the right, the tree's impressive monotonic correlation with the distance matrix that produced it can be seen.

As seen in Figure 3, the tree produced by the bionj command in R's ape package seems to display the languages rather accurately. All four of the major language groups are accounted for, and it can also be seen that the Rindi and Kolhua languages are outliers within their own groups, which agrees with the information gleaned from the NMDS model. Furthermore, when the tree is compared with the distance matrix, it produces an impressive approximation of a straight line. It is rather clear that the tree-like topology, which views the languages in a one-dimensional, hierarchical manner, is superior to the map-like topology for the Timoric languages.

The tree differs greatly, however, from the taxonomy proposed by Geoffrey Hull. Hull classified the Tetum and Atoni languages into one group and placed the Kemak group in an entirely separate one. The neighbor-joining tree, however, indicates that Kemak is closely related to Tetum, closer even than Tetum is to Atoni.
Bayesian Markov Chain Monte Carlo Simulation

Although the neighbor-joining tree proved to be quite successful, the distance matrix data that it used is itself fundamentally flawed, due to its assumption that the cognate replacement rate is the same for all cognates. The MCMC method can make up for this weakness. Using Bayes' Theorem, one can determine the probability of a certain tree being true given the data:

\[ p(\tau_i \mid D) = \frac{p(D \mid \tau_i)\, p(\tau_i)}{\sum_{j} p(D \mid \tau_j)\, p(\tau_j)} \]

where $p(\tau_i \mid D)$ is the posterior probability of tree $i$ given the data, $p(D \mid \tau_i)$ is the probability of the data given tree $i$, and $p(\tau_i)$ is the prior probability of tree $i$. Unfortunately, this equation is extremely difficult to evaluate, because the denominator requires summing over all possible trees. On the other hand, it is easy to determine the probability of one tree relative to another given the data, because the denominator cancels:

\[ r = \frac{p(\tau_i \mid D)}{p(\tau_j \mid D)} = \frac{p(D \mid \tau_i)\, p(\tau_i)}{p(D \mid \tau_j)\, p(\tau_j)} \]

MCMC takes advantage of this fact with an algorithm that satisfies Markov's Theorem, producing a time-average distribution that is equal to $p(\tau_i \mid D)$ (Pagel et al., 2004).

After running the MCMC for 40 million iterations with a burn-in of 2 million, 751 trees were sampled. The resulting consensus tree can be seen in Figure 4.

[Figure 4 image: consensus tree with the Tetum, Kemak, Atoni, and Sumba groups marked and a support value printed at each node.]

Figure 4: The Markov Chain Monte Carlo tree. To the right of each node is its posterior probability.

The MCMC tree clearly distinguishes between the four language groups and also suggests that the Atoni group can be split into three subgroups, including the standalone Kolhua language. It is also in agreement with the neighbor-joining tree on the relation of the four groups, with Tetum being more closely related to Kemak than to Atoni. Simple comparison shows that the topology of the MCMC tree resembles that of the neighbor-joining tree quite closely (Figure 5).
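The role of the ratio r in the MCMC method can be illustrated with a toy Metropolis sampler in Python. This is only a sketch of the general idea and not the sampler used for the results reported here: it assumes a handful of candidate "trees" whose unnormalized weights p(D|τ)p(τ) are already known, and it proposes a new candidate uniformly at random, whereas real phylogenetic samplers propose local rearrangements of the current tree. The weights and the function name are hypothetical.

```python
import random

def metropolis(weights, iterations, burn_in, seed=0):
    """Estimate normalized posterior probabilities using only weight ratios.

    weights: positive unnormalized values p(D|tree) * p(tree), one per tree.
    Returns the fraction of post-burn-in iterations spent at each tree.
    """
    rng = random.Random(seed)
    current = rng.randrange(len(weights))
    counts = [0] * len(weights)
    for step in range(iterations):
        proposal = rng.randrange(len(weights))    # symmetric proposal
        r = weights[proposal] / weights[current]  # the normalizing sum cancels
        if rng.random() < min(1.0, r):
            current = proposal                    # accept the proposed tree
        if step >= burn_in:
            counts[current] += 1
    kept = iterations - burn_in
    return [count / kept for count in counts]

# Hypothetical unnormalized weights for four candidate trees.
toy_weights = [1.0, 2.0, 3.0, 4.0]
frequencies = metropolis(toy_weights, iterations=200000, burn_in=10000, seed=1)
```

The sampler never computes the sum over all candidates, yet the fraction of time it spends at each candidate approaches that candidate's normalized posterior probability (here 0.1, 0.2, 0.3, and 0.4) as the number of iterations grows. The burn-in discards the early iterations, which still depend on the arbitrary starting point.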
Using the PH85 method in the dist.topo command in R's ape package, which counts the bipartitions (internal branches) that are not shared between two trees, it is seen that the distance from the MCMC tree to the neighbor-joining tree is 33, while the distance in the other direction is only 23. By comparison, the average distance between two random trees is roughly 43, confirming that the MCMC tree and the neighbor-joining tree are indeed similar.

[Figure 5 image: the MCMC tree and the Bionj tree drawn side by side, with tips labeled by language code and grouped as Tetum, Kemak, Atoni, and Sumba.]

Figure 5: A comparison between the MCMC tree and the neighbor-joining tree.

When the branch lengths of the two trees are compared, however, the trees differ significantly. This is not unexpected, because the $P = e^{-d}$ model does not take into account that different cognates have different replacement rates. The MCMC method is generally accepted as more accurate.

Sound Correspondence Analysis

Despite the successes of methods using cognate analysis, a noteworthy amount of information is lost when phoneme information is replaced with simple cognate codes. In order to improve the Timoric phylogenies, therefore, it is necessary to analyze the sound correspondences. To do this, one must first align the phonemes. A program written by Daniel Hruschka and Tanmoy Bhattacharya, which has previously been used only on the Turkic languages, was used to this end. By comparing the probabilities of phonemes in words that are cognates, the program produces an alignment, as seen in Figure 6.
[Figure 6 image: alignment table with columns numbered 1 through 9 and one row of phonemes for each of the languages L136, L164, L800, and L1241.]

Figure 6: An example of words from multiple languages aligned by the program.

Following the alignment, the program can be used to infer regular and irregular sound changes. Because of a bug in the code, this step has not yet been completed. Once the sound changes have been found, the MCMC model can be used to produce a more accurate phylogeny.

Conclusion

The island of Timor displays an incredible degree of linguistic diversity that is challenging to interpret and classify using conventional linguistic methods. Statistical analysis can be of great value in understanding the relationships between these languages. Some methods work better than others: the tree-like topology proved superior to the map-like topology, and the MCMC simulation to the neighbor-joining method. There is good reason to believe that our method of sound correspondence analysis, using aligned phonemes, will prove to be an even better means of determining linguistic relationships and divergence.

Interestingly, the results thus far disagree with Geoffrey Hull's classification of the Timoric languages into the Fabronic and Ramelaic groups. Rather, both the tree and map methods indicated that the Fabronic Tetum languages were closer to the Ramelaic Kemak languages than they were to other Fabronic languages, such as the Atoni group. Furthermore, both trees indicate that the Atoni languages are about as similar to the languages of the nearby island of Sumba as they are to the Tetum and Kemak groups.

Such observations contribute valuable information toward understanding the migrations that have resulted in the modern distribution of peoples and languages on Timor. Archaeological and cultural data will hopefully be used in conjunction with this linguistic data in the near future in order to form a complete picture of the Austronesian migrations to Timor.
References

Bellwood, P., Fox, J., and Tryon, D. (1995). The Austronesians: Historical and Comparative Perspectives. Comparative Austronesian Series. Department of Anthropology as part of the Comparative Austronesian Project, Research School of Pacific and Asian Studies, Australian National University.

Dyen, I. (1973). Lexicostatistics in genetic linguistics: Proceedings of the Yale Conference, Yale University, April 3-4, 1971. Janua linguarum. Mouton.

Dyen, I., James, A. T., and Cole, J. W. L. (1967). Language divergence and estimated word retention rate. Language, 43(1):150–171.

Hull, G. (2004). The languages of East Timor: Some basic facts. The National Linguistic Institute of the National University of East Timor.

Pagel, M., Atkinson, Q. D., and Meade, A. (2007). Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature, 449(7163):717–720.

Pagel, M., Meade, A., and Barker, D. (2004). Bayesian estimation of ancestral character states on phylogenies. Systematic Biology, 53(5):673–684.

Saitou, N. and Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425.

Timor, E. (2010). A brief history of Timor-Leste. East Timor Presidential Website: http://www.presidencia.tl/eng/files/Presskit-Brief

Wickelmaier, F. (2003). An introduction to MDS.