Automated reconstruction of ancient languages using probabilistic models of sound change

Alexandre Bouchard-Côté (a,1), David Hall (b), Thomas L. Griffiths (c), and Dan Klein (b)

a Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada; b Computer Science Division and c Department of Psychology, University of California, Berkeley, CA 94720

Edited by Nick Chater, University of Warwick, Coventry, United Kingdom, and accepted by the Editorial Board December 22, 2012 (received for review March 19, 2012)

One of the oldest problems in linguistics is reconstructing the words that appeared in the protolanguages from which modern languages evolved. Identifying the forms of these ancient languages makes it possible to evaluate proposals about the nature of language change and to draw inferences about human history. Protolanguages are typically reconstructed using a painstaking manual process known as the comparative method. We present a family of probabilistic models of sound change as well as algorithms for performing inference in these models. The resulting system automatically and accurately reconstructs protolanguages from modern languages. We apply this system to 637 Austronesian languages, providing an accurate, large-scale automatic reconstruction of a set of protolanguages. Over 85% of the system's reconstructions are within one character of the manual reconstruction provided by a linguist specializing in Austronesian languages. Being able to automatically reconstruct large numbers of languages provides a useful way to quantitatively explore hypotheses about the factors determining which sounds in a language are likely to change over time. We demonstrate this by showing that the reconstructed Austronesian protolanguages provide compelling support for a hypothesis about the relationship between the function of a sound and its probability of changing that was first proposed in 1955.
ancestral | computational | diachronic

Reconstruction of the protolanguages from which modern languages are descended is a difficult problem, one that has occupied historical linguists since the late 18th century. To solve this problem, linguists have developed a labor-intensive manual procedure called the comparative method (1), drawing on information about the sounds and words that appear in many modern languages to hypothesize protolanguage reconstructions even when no written records are available, opening one of the few possible windows onto prehistoric societies (2, 3). Reconstructions can help in understanding many aspects of our past, such as the technological level (2), migration patterns (4), and scripts (2, 5) of early societies. Comparing reconstructions across many languages can help reveal the nature of language change itself, identifying which aspects of language are most likely to change over time, a long-standing question in historical linguistics (6, 7). In many cases, direct evidence of the form of protolanguages is not available. Fortunately, owing to the world's considerable linguistic diversity, it is still possible to propose reconstructions by leveraging a large collection of extant languages descended from a single protolanguage. Words that appear in these modern languages can be organized into cognate sets that contain words suspected to have a shared ancestral form (Table 1). The key observation that makes reconstruction from these data possible is that languages seem to undergo a relatively limited set of regular sound changes, each applied to the entire vocabulary of a language at specific stages of its history (1). Still, several factors make reconstruction a hard problem. For example, sound changes are often context sensitive, and many are string insertions and deletions. In this paper, we present an automated system capable of large-scale reconstruction of protolanguages directly from words that appear in modern languages.
4224–4229 | PNAS | March 12, 2013 | vol. 110 | no. 11

This system is based on a probabilistic model of sound change at the level of phonemes, building on work on the reconstruction of ancestral sequences and alignment in computational biology (8–12). Several groups have recently explored how methods from computational biology can be applied to problems in historical linguistics, but such work has focused on identifying the relationships between languages (as might be expressed in a phylogeny) rather than reconstructing the languages themselves (13–18). Much of this type of work has been based on binary cognate or structural matrices (19, 20), which discard all information about the form that words take, simply indicating whether they are cognate. Such models do not aim to reconstruct protolanguages and consequently use a representation that lacks the resolution required to infer ancestral phonetic sequences. Using phonological representations allows us to perform reconstruction and does not require us to assume that cognate sets have been fully resolved as a preprocessing step. Representing the words at each point in a phylogeny, together with a model of how they change, gives us a way of comparing different hypothesized cognate sets and hence of inferring cognate sets automatically. The focus on problems other than reconstruction in previous computational approaches has meant that almost all existing protolanguage reconstructions have been done manually. However, to obtain more accurate reconstructions for older languages, large numbers of modern languages need to be analyzed. The Proto-Austronesian language, for instance, has over 1,200 descendant languages (21). All of these languages could potentially improve the quality of the reconstructions, but the number of possibilities increases considerably with each language, making it difficult to analyze a large number of languages simultaneously.
The few previous systems for automated reconstruction of protolanguages or cognate inference (22–24) were unable to handle this increase in computational complexity, as they relied on deterministic models of sound change and exact but intractable algorithms for reconstruction. Being able to reconstruct large numbers of languages also makes it possible to provide quantitative answers to questions about the factors that are involved in language change. We demonstrate the potential for automated reconstruction to lead to novel results in historical linguistics by investigating a specific hypothesized regularity in sound changes called functional load. The functional load hypothesis, introduced in 1955, asserts that sounds that play a more important role in distinguishing words are less likely to change over time (6). Our probabilistic reconstruction of hundreds of protolanguages in the Austronesian phylogeny provides a way to explore this question quantitatively, producing compelling evidence in favor of the functional load hypothesis.

Author contributions: A.B.-C., D.H., T.L.G., and D.K. designed research; A.B.-C. and D.H. performed research; A.B.-C. and D.H. contributed new reagents/analytic tools; A.B.-C., D.H., T.L.G., and D.K. analyzed data; and A.B.-C., D.H., T.L.G., and D.K. wrote the paper. The authors declare no conflict of interest. This article is a PNAS Direct Submission. N.C. is a guest editor invited by the Editorial Board. Freely available online through the PNAS open access option. See Commentary on page 4159. 1 To whom correspondence should be addressed. E-mail: [email protected]. This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1204678110/-/DCSupplemental. www.pnas.org/cgi/doi/10.1073/pnas.1204678110

Table 1. Sample of reconstructions produced by the system*
*Complete sets of reconstructions can be found in SI Appendix.
†Randomly selected by stratified sampling according to the Levenshtein edit distance Δ.
‡Levenshtein distance to a reference manual reconstruction, in this case the reconstruction of Blust (42).
§The colors encode cognate sets.
¶We use this symbol for encoding missing data.

Model

We use a probabilistic model of sound change and a Monte Carlo inference algorithm to reconstruct the lexicon and phonology of protolanguages given a collection of cognate sets from modern languages. As in other recent work in computational historical linguistics (13–18), we make the simplifying assumption that each word evolves along the branches of a tree of languages, reflecting the languages' phylogenetic relationships. The tree's internal nodes are languages whose word forms are not observed, and the leaves are modern languages. The output of our system is a posterior probability distribution over derivations. Each derivation contains, for each cognate set, a reconstructed transcription of ancestral forms, as well as a list of sound changes describing the transformation from parent word to child word. This representation is rich enough to answer a wide range of queries that would normally be answered by carrying out the comparative method manually, such as which sound changes were most prominent along each branch of the tree.

We model the evolution of discrete sequences of phonemes, using a context-dependent probabilistic string transducer (8). Probabilistic string transducers efficiently encode a distribution over the possible changes that a string might undergo as it changes through time. Transducers are sufficient to capture most types of regular sound changes (e.g., lenitions, epentheses, and elisions) and can be sensitive to the context in which a change takes place. Most types of changes not captured by transducers are not regular (1) and are therefore less informative (e.g., metatheses, reduplications, and haplologies). Unlike simple molecular InDel models used in computational biology, such as the TKF91 model (25), the parameterization of our model is very expressive: Mutation probabilities are context sensitive, depending on the neighboring characters, and each branch has its own set of parameters. This context-sensitive and branch-specific parameterization plays a central role in our system, allowing explicit modeling of sound changes.

Formally, let τ be a phylogenetic tree of languages, where each language is linked to the languages that descended from it. In such a tree, the modern languages, whose word forms will be observed, are the leaves of τ. The most recent common ancestor of these modern languages is the root of τ. Internal nodes of the tree (including the root) are protolanguages with unobserved word forms. Let L denote all languages, modern and otherwise. All word forms are assumed to be strings in the International Phonetic Alphabet (IPA).

We assume that word forms evolve along the branches of the tree τ. However, it is usually not the case that a word belonging to each cognate set exists in each modern language: words are lost or replaced over time, meaning that words that appear in the root languages may not have cognate descendants in the languages at the leaves of the tree. For the moment, we assume there is a known list of C cognate sets. For each c ∈ {1, …, C}, let L(c) denote the subset of modern languages that have a word form in the cth cognate set. For each set c ∈ {1, …, C} and each language ℓ ∈ L(c), we denote the modern word form by w_{cℓ}. For cognate set c, only the minimal subtree τ(c) containing L(c) and the root is relevant to the reconstruction inference problem for that set.

Our model of sound change is based on a generative process defined on this tree. From a high-level perspective, the generative process is quite simple. Let c be the index of the current cognate set, with topology τ(c). First, a word is generated for the root of τ(c), using an (initially unknown) root language model (i.e., a probability distribution over strings). The words that appear at other nodes of the tree are generated incrementally, using a branch-specific distribution over changes in strings to generate each word from the word in the language that is its parent in τ(c). Although this distribution differs across branches of the tree, making it possible to estimate the pattern of changes involved in the transition from one language to another, it remains the same for all cognate sets, expressing changes that apply stochastically to all words. The probabilities of substitution, insertion, and deletion also depend on the context in which the change occurs. Further details of the distributions that were used and their parameterization appear in Materials and Methods.

The flexibility of our model comes at the cost of having literally millions of parameters to set, creating challenges not found in most computational approaches to phylogenetics. Our inference algorithm learns these parameters automatically, using established principles from machine learning and statistics. Specifically, we use a variant of the expectation-maximization algorithm (26), which alternates between producing reconstructions on the basis of the current parameter estimates and updating the parameter estimates on the basis of those reconstructions. The reconstructions are inferred using an efficient Monte Carlo inference algorithm (27). The parameters are estimated by optimizing a cost function that penalizes complexity, allowing us to obtain robust estimates of large numbers of parameters. See SI Appendix, Section 1 for further details of the inference algorithm.

If cognate assignments are not available, our system can be applied directly to lists of words in different languages. In this case it automatically infers the cognate assignments as well as the reconstructions. This setting requires only two modifications to the model. First, because cognates are not available, we index the words by their semantic meaning (or gloss) g, giving G groups of words. The model is then defined as in the previous case, with words indexed as w_{gℓ}. Second, the generative process is augmented with a notion of innovation, wherein a word w_{gℓ′} in some language ℓ′ may instead be generated independently from its parent word w_{gℓ}. In this instance, the word is generated from a language model as though it were a root string. In effect, the tree is "cut" at a language when innovation happens, and the word begins anew. The probability of innovation in any given language is initially unknown and must be learned automatically along with the other branch-specific model parameters.

Fig. 1. Quantitative validation of reconstructions and identification of some important factors influencing reconstruction quality. (A) Reconstruction error rates for a baseline (which consists of picking one modern word at random), our system, and the amount of disagreement between two linguists' manual reconstructions. Reconstruction error rates are Levenshtein distances normalized by the mean word form length so that errors can be compared across languages. Agreement between linguists was computed on only Proto-Oceanic because the dataset used lacked multiple reconstructions for other protolanguages. (B) The effect of the topology on the quality of the reconstruction. On one hand, the difference between reconstruction error rates obtained from the system that ran on an uninformed topology (first and second) and rates obtained from the system that ran on an informed topology (third and fourth) is statistically significant.
On the other hand, the corresponding difference between a flat tree and a random binary tree is not statistically significant, nor is the difference between using the consensus tree of ref. 41 and the Ethnologue tree (29). This suggests that our method is somewhat robust to moderate topology variations. (C) Reconstruction error rate as a function of the number of languages used to train our automatic reconstruction system. Note that the error is not expected to go down to zero, perfect reconstruction being generally unidentifiable. The results in A and B are directly comparable: In fact, the entry labeled "Ethnologue" in B corresponds to the green Proto-Austronesian entry in A. The results in A and B and those in C are not directly comparable because the evaluation in C is restricted to those cognates with at least one reflex in the smallest evaluation set (to make the curve comparable across the horizontal axis of C).

Results

Our results address three questions about the performance of our system. First, how well does it reconstruct protolanguages? Second, how well does it identify cognate sets? Finally, how can this approach be used to address outstanding questions in historical linguistics?

Protolanguage Reconstructions. To test our system, we applied it to a large-scale database of Austronesian languages, the Austronesian Basic Vocabulary Database (ABVD) (28). We used a previously established phylogeny for these languages, the Ethnologue tree (29) (we also describe experiments with other trees in Fig. 1). For this first test of our system we also used the cognate sets provided in the database. The dataset contained 659 languages at the time of download (August 7, 2010), including a few languages outside the Austronesian family and some manually reconstructed protolanguages used for evaluation. The total data comprised 142,661 word forms and 7,708 cognate sets.
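As a concrete illustration of the generative process described in the Model section, the following is a minimal toy sketch, not the authors' actual parameterization: it uses a small made-up phoneme inventory, a geometric-length root language model, and position-independent per-branch edit probabilities, whereas the real model works over IPA strings with context-sensitive, learned parameters.

```python
import random

# Toy phoneme inventory; the actual model works over IPA strings.
PHONEMES = list("ptkbdgmnlrsvaeiou")

def sample_root(stop_prob=0.25):
    """Sample a root word: geometric length, uniform phonemes
    (a stand-in for the learned root language model)."""
    word = [random.choice(PHONEMES)]
    while random.random() >= stop_prob:
        word.append(random.choice(PHONEMES))
    return word

def evolve(word, p):
    """Apply one branch's stochastic edits to a word. `p` holds this
    branch's deletion/substitution/insertion probabilities; the real
    model additionally conditions these on neighboring characters."""
    out = []
    for ph in word:
        r = random.random()
        if r < p["delete"]:
            pass                                   # phoneme deleted
        elif r < p["delete"] + p["subst"]:
            out.append(random.choice(PHONEMES))    # substituted
        else:
            out.append(ph)                         # copied faithfully
        if random.random() < p["insert"]:
            out.append(random.choice(PHONEMES))    # insertion after ph
    return out

def generate_cognate_set(tree, branch_params, word=None):
    """Generate word forms for every language in `tree`, a nested dict
    {language: subtree} hanging below the (sampled) root word."""
    if word is None:
        word = sample_root()
    forms = {}
    for lang, subtree in tree.items():
        child = evolve(word, branch_params[lang])
        forms[lang] = "".join(child)
        forms.update(generate_cognate_set(subtree, branch_params, child))
    return forms
```

For instance, `generate_cognate_set({"ProtoOceanic": {"Hawaiian": {}, "Maori": {}}}, params)` (hypothetical language names) yields one simulated cognate set; inference then runs this process in reverse, recovering the unobserved internal words.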
The goal was to reconstruct the word in each protolanguage that corresponded to each cognate set and to infer the patterns of sound changes along each branch in the phylogeny. See SI Appendix, Section 2 for further details of our simulations.

We used the Austronesian dataset to quantitatively evaluate the performance of our system by comparing withheld words from known languages with automatic reconstructions of those words. The Levenshtein distance between the held-out and reconstructed forms provides a measure of the number of errors in these reconstructions. We used this measure to show that using more languages helped reconstruction and also to assess the overall performance of our system. Specifically, we compared the system's error rate on the ancestral reconstructions to a baseline and also to the amount of divergence between the reconstructions of two linguists (Fig. 1A). Given enough data, the system can achieve reconstruction error rates close to the level of disagreement between manual reconstructions. In particular, most reconstructions agree perfectly with the manual reconstructions, and only a few contain large errors. Refer to Table 1 for examples of reconstructions; see SI Appendix, Section 3 for the full lists.

We also present in Fig. 1B the effect of the tree topology on reconstruction quality, reiterating the importance of using informative topologies for reconstruction. In Fig. 1C, we show that the accuracy of our method increases with the number of observed Oceanic languages, confirming that large-scale inference is desirable for automatic protolanguage reconstruction: Reconstruction improved statistically significantly with each increase except from 32 to 64 languages, where the average edit distance improvement was 0.05.
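The evaluation metric used in Fig. 1, Levenshtein distance normalized by the mean word form length, is straightforward to implement. A sketch, assuming `word_pairs` is a list of (reconstructed, reference) string pairs and that normalization uses the mean reference word length:

```python
def levenshtein(a, b):
    """Standard edit distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution/match
        prev = cur
    return prev[-1]

def error_rate(word_pairs):
    """Mean Levenshtein distance between reconstructed and reference
    forms, normalized by the mean reference word length so that rates
    are comparable across languages (as in Fig. 1)."""
    total = sum(levenshtein(rec, ref) for rec, ref in word_pairs)
    mean_len = sum(len(ref) for _, ref in word_pairs) / len(word_pairs)
    return (total / len(word_pairs)) / mean_len
```

With this normalization, an error rate of 0.5 means that, on average, half the phonemes of a typical word would have to be edited to match the reference.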
For comparison, we also evaluated previous automatic reconstruction methods. Because these methods do not scale to large datasets, we performed comparisons on smaller subsets of the Austronesian dataset; we show in SI Appendix, Section 2 that our method outperforms these baselines. We analyze the output of our system in more depth in Fig. 2 A–C, which shows that the system learned a variety of realistic sound changes across the Austronesian family (30). In Fig. 2D, we show the most frequent substitution errors in the Proto-Austronesian reconstruction experiments. See SI Appendix, Section 5 for details and similar plots for the most common incorrect insertions and deletions.

Cognate Recovery. Previous reconstruction systems (22) required that cognate sets be provided to the system. However, the creation of these large cognate databases requires considerable annotation effort on the part of linguists and often requires that at least some reconstruction be done by hand. To demonstrate that our model can accurately infer cognate sets automatically, we used a version of our system that learns which words are cognate, starting only from raw word lists and their meanings. This system uses a faster but lower-fidelity model of sound change to infer correspondences; we then ran our reconstruction system on the cognate sets that the cognate recovery system found. See SI Appendix, Section 1 for details. This version of the system was run on all of the Oceanic languages in the ABVD, which comprise roughly half of the Austronesian languages. We then evaluated the pairwise precision (the fraction of cognate pairs identified by our system that are also in the set of labeled cognate pairs), pairwise recall (the fraction of labeled cognate pairs identified by our system), and pairwise F1 measure (the harmonic mean of precision and recall) for the cognates found by our system against the known cognates that are encoded in the ABVD.
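The pairwise clustering metrics just defined can be computed directly from the predicted and gold cognate groupings. A sketch, where each grouping is a list of clusters of unique word identifiers; note that `purity` below uses the standard majority-label definition, which is an assumption and may differ in detail from the metric described in SI Appendix, Section 2.3:

```python
from collections import Counter
from itertools import combinations

def same_cluster_pairs(clusters):
    """All unordered word pairs placed in the same cluster."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}

def pairwise_scores(predicted, gold):
    """Pairwise precision, recall, and F1 of a predicted clustering
    against gold cognate sets."""
    p, g = same_cluster_pairs(predicted), same_cluster_pairs(gold)
    precision = len(p & g) / len(p)
    recall = len(p & g) / len(g)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def purity(predicted, gold):
    """Fraction of words whose predicted cluster's majority gold label
    matches their own gold label (standard cluster purity)."""
    label = {w: i for i, c in enumerate(gold) for w in c}
    correct = 0
    for c in predicted:
        counts = Counter(label[w] for w in c)
        correct += counts.most_common(1)[0][1]
    return correct / sum(len(c) for c in predicted)
```

Under these definitions, overgrouping hurts precision while undergrouping hurts recall, which is how the precision/recall gap reported below diagnoses the system's conservative behavior.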
We also report cluster purity, defined as the fraction of words in a cluster whose known cognate group matches the cognate group of the cluster. See SI Appendix, Section 2.3 for a detailed description of the metrics. Using these metrics, we found that our system achieved a precision of 0.844, recall of 0.621, F1 of 0.715, and cluster purity of 0.918. Thus, over 9 in 10 words are correctly grouped, and our system errs on the side of undergrouping words rather than clustering words that are not cognates. Because the null hypothesis in historical linguistics is to deem words unrelated unless proved otherwise, slight undergrouping is the desired behavior.
[Figure residue removed: IPA symbols from the Fig. 2B chart and the Fig. 2D axis label "Fraction of substitution errors".]
Fig. 2. Detailed analysis of the output of our system. (A) An Austronesian phylogenetic tree from ref. 29 used in our analyses. Each quadrant is available in a larger format in SI Appendix, Figs. S2–S5, along with a detailed table of sound changes (SI Appendix, Table S5). The numbers in parentheses attached to each branch correspond to rows in SI Appendix, Table S5. The colors and numbers in parentheses encode the most prominent sound change along each branch, as inferred automatically by our system (SI Appendix, Section 4). (B) The most supported sound changes across the phylogeny, with the width of links proportional to the support. Note that the standard organization of the IPA chart into columns and rows according to place, manner, height, and backness is for visualization purposes only: This information was not encoded in the model in this experiment, showing that the model can recover realistic cross-linguistic sound change trends.
All of the arcs correspond to sound changes frequently used by historical linguists: sonorizations /p/ > /b/ (1) and /t/ > /d/ (2), voicing changes (3, 4), debuccalizations /f/ > /h/ (5) and /s/ > /h/ (6), spirantizations /b/ > /v/ (7) and /p/ > /f/ (8), changes of place of articulation (9, 10), and vowel changes in height (11) and backness (12) (1). Whereas this visualization depicts sound changes as undirected arcs, the sound changes are actually represented with directionality in our system. (C) Zooming in on a portion of the Oceanic languages, where the Nuclear Polynesian family (i) and the Polynesian family (ii) are visible. Several attested sound changes, such as debuccalization to Maori and the place of articulation change /t/ > /k/ to Hawaiian (30), are successfully localized by the system. (D) Most common substitution errors in the PAn reconstructions produced by our system. The first phoneme in each pair (x, y) is the reference phoneme, followed by the incorrectly hypothesized one. Most of these errors could be plausible disagreements among human experts. For example, the most dominant error, (p, v), could arise from a disagreement over the phonemic inventory of Proto-Austronesian, whereas vowels are common sources of disagreement generally.
Because we are ultimately interested in reconstruction, we then compared our reconstruction system's ability to reconstruct words given these automatically determined cognates. Specifically, we took every cognate group found by our system (run on the Oceanic subclade) with at least two words in it. Then, we automatically reconstructed the Proto-Oceanic ancestor of those words, using our system. For evaluation, we then looked at the average Levenshtein distance from our reconstructions to the known reconstructions described in the previous sections. This time, however, we average per modern word rather than per cognate group, to provide a fairer comparison. (Results were not substantially different when averaging per cognate group.)
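The evaluation metrics used here (pairwise precision, recall, F1, cluster purity, and Levenshtein distance) can be sketched compactly. The following is an illustrative Python implementation, not the paper's code; the function names and the representation of a clustering as a map from word IDs to cluster labels are our own:

```python
from collections import Counter
from itertools import combinations


def pairwise_metrics(predicted, gold):
    """Pairwise precision/recall/F1 for a cognate clustering.

    `predicted` and `gold` map each word ID to a cluster label;
    a 'pair' is any two words placed in the same cluster, so the
    metrics are insensitive to the labels themselves.
    """
    def same_cluster_pairs(assign):
        groups = {}
        for word, label in assign.items():
            groups.setdefault(label, []).append(word)
        return {frozenset(p) for ws in groups.values()
                for p in combinations(sorted(ws), 2)}

    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    hit = len(pred_pairs & gold_pairs)
    precision = hit / len(pred_pairs) if pred_pairs else 1.0
    recall = hit / len(gold_pairs) if gold_pairs else 1.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def cluster_purity(predicted, gold):
    """Purity: each predicted cluster's gold label is taken to be the
    majority gold label among its words (one common convention)."""
    groups = {}
    for word, label in predicted.items():
        groups.setdefault(label, []).append(word)
    correct = 0
    for ws in groups.values():
        majority = Counter(gold[w] for w in ws).most_common(1)[0][0]
        correct += sum(1 for w in ws if gold[w] == majority)
    return correct / len(predicted)


def levenshtein(a, b):
    """Standard edit distance between two phoneme strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]
```

Averaging `levenshtein` per modern word (rather than per cognate group) weights each cognate group by its size, as in the comparison above.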
Compared with reconstruction from manually labeled cognate sets, automatically identified cognates led to an increase in error rate of only 12.8%, with a significant reduction in the cost of curating linguistic databases. See SI Appendix, Fig. S1 for the fraction of words at each Levenshtein distance for these reconstructions.
Functional Load. To demonstrate the utility of large-scale reconstruction of protolanguages, we used the output of our system to investigate an open question in historical linguistics. The functional load hypothesis (FLH), introduced in 1955 (6), claims that the probability that a sound will change over time is related to the amount of information provided by that sound. Intuitively, if two phonemes appear only in words that are differentiated from one another by at least one other sound, then one can argue that no information is lost if those phonemes merge, because no new ambiguous forms can be created by the merger. A first step toward quantitatively testing the FLH was taken in 1967 (7). By defining a statistic that formalizes the amount of information lost when a language undergoes a certain sound change (on the basis of the proportion of words that are discriminated by each pair of phonemes), it became possible to evaluate the empirical support for the FLH. However, this initial investigation was based on just four languages and found little evidence to support the hypothesis. This conclusion was criticized by several authors (31, 32) on the basis of the small number of languages and sound changes considered, although they provided no positive counterevidence. Using the output of our system, we collected sound change statistics from our reconstruction of 637 Austronesian languages, including the probability of a particular change as estimated by our system.
These statistics provided the information needed to give a more comprehensive quantitative evaluation of the FLH, using a much larger sample than previous work (details in SI Appendix, Section 2.4). We show in Fig. 3 A and B that this analysis provides clear quantitative evidence in favor of the FLH. The revealed pattern would not have been apparent had we not been able to reconstruct large numbers of protolanguages and to supply probabilities of different kinds of change taking place for each pair of languages.
Discussion
We have developed an automated system capable of large-scale reconstruction of protolanguage word forms, cognate sets, and sound change histories. The analysis of the properties of hundreds of ancient languages performed by this system goes far beyond the capabilities of any previous automated system and would otherwise require significant amounts of manual effort by linguists. Furthermore, the system is in no way restricted to applications like assessing the effects of functional load: It can be used as a tool to investigate a wide range of questions about the structure and dynamics of languages. In developing an automated system for reconstructing ancient languages, it is by no means our goal to replace the careful reconstructions performed by linguists. It should be emphasized that the reconstruction mechanism used by our system ignores many of the phenomena normally used in manual reconstructions. We have mentioned limitations due to the transducer formalism.
PNAS | March 12, 2013 | vol. 110 | no. 11 | 4227
[Figure residue removed: language labels and branch numbers from the quadrants of the Austronesian phylogenetic tree in Fig. 2A (the branch numbers correspond to rows of SI Appendix, Table S5; the quadrants are reproduced in SI Appendix, Figs. S2–S5), together with the axis ticks of the Fig. 3 panels.]
Fig. 3. Increasing the number of languages we can reconstruct gives new ways to approach questions in historical linguistics, such as the effect of functional load on the probability of merging two sounds. The plots shown are heat maps where the color encodes the log of the number of sound changes that fall into a given two-dimensional bin. Each sound change x > y is encoded as a pair of numbers in the unit square, (l, m), as explained in Materials and Methods. (The panel axes are functional load and merger posterior.) To convey the amount of noise one could expect from a study with the number of languages that King previously used (7), we first show in A the heat map visualization for four languages. Next, we show the same plot for 637 Austronesian languages in B. Only in this latter setup is structure clearly visible: Most of the points with a high probability of merging have comparatively low functional load, providing evidence in favor of the functional load hypothesis introduced in 1955.
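To make the statistic concrete, the following is an illustrative Python sketch of a toy functional-load measure for a phoneme pair: the fraction of word pairs in a small lexicon that would collapse into homophones if the two phonemes merged. This simplified count stands in for King's information-theoretic measure (7) and is not the computation used in the paper:

```python
from itertools import combinations


def functional_load(lexicon, x, y):
    """Toy functional load of the phoneme pair (x, y): the fraction of
    distinct word pairs that become homophones when y merges into x.
    Words are tuples of phoneme symbols. Illustrative sketch only;
    a simplified stand-in for King's (1967) entropy-based measure.
    """
    merged = [tuple(x if p == y else p for p in w) for w in lexicon]
    total_pairs = len(lexicon) * (len(lexicon) - 1) // 2
    if total_pairs == 0:
        return 0.0
    # Count pairs that were distinct before the merger but identical after.
    collisions = sum(1 for a, b in combinations(range(len(lexicon)), 2)
                     if lexicon[a] != lexicon[b] and merged[a] == merged[b])
    return collisions / total_pairs
```

Under the FLH, pairs with a low value of such a statistic should merge with higher probability, which is the pattern visible in Fig. 3B.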
See SI Appendix, Section 2.4 for details.
Other limitations include the lack of explicit modeling of changes at the level of the phoneme inventories used by a language and the lack of morphological analysis. Challenges specific to the cognate inference task, for example, difficulties with polymorphisms, are also discussed in more detail in SI Appendix. Another limitation of the current approach stems from the assumption that languages form a phylogenetic tree, an assumption violated by borrowing, dialect variation, and creole languages. However, we believe our system will be useful to linguists in several ways, particularly in contexts where there are large numbers of languages to be analyzed. Examples might include using the system to propose short lists of potential sound changes and correspondences across highly divergent word forms. An exciting possible application of this work is to use the model described here to infer the phylogenetic relationships between languages jointly with reconstructions and cognate sets. This would remove a source of circularity present in most previous computational work in historical linguistics. Systems for inferring phylogenies, such as ref. 13, generally assume that cognate sets are given as a fixed input, but cognacy as determined by linguists is in turn motivated by phylogenetic considerations. The phylogenetic tree hypothesized by the linguist therefore affects the tree built by systems using only these cognates. This problem can be avoided by inferring cognates at the same time as a phylogeny, something that should be possible using an extended version of our probabilistic model. Our system is able to reconstruct the words that appear in ancient languages because it represents words as sequences of sounds and uses a rich probabilistic model of sound change. This is an important step forward from previous work applying computational ideas to historical linguistics.
By leveraging the full sequence information available in the word forms in modern languages, we hope to see in historical linguistics a breakthrough similar to the advances in evolutionary biology prompted by the transition from morphological characters to molecular sequences in phylogenetic analysis.
4228 | www.pnas.org/cgi/doi/10.1073/pnas.1204678110
Materials and Methods
This section provides a more detailed specification of our probabilistic model. See SI Appendix, Section 1.2 for additional content on the algorithm and simulations.
Distributions. The conditional distributions over pairs of evolving strings are specified using a lexicalized stochastic string transducer (33). Consider a language ℓ′ evolving to ℓ for cognate set c. Assume we have a word form x = wcℓ′. The generative process for producing y = wcℓ works as follows. First, we consider x to be composed of characters x1x2 . . . xn, with the first and last ones being a special boundary symbol x1 = xn = # ∈ Σ, which is never deleted, mutated, or created. The process generates y = y1y2 . . . yn in n chunks yi ∈ Σ*, i ∈ {1, . . . , n}, one for each xi. Each yi may be a single character, multiple characters, or even empty. To generate yi, we define a mutation Markov chain that incrementally adds zero or more characters to an initially empty yi. First, we decide whether the current phoneme in the top word, t = xi, will be deleted, in which case yi = ε (the probabilities of the decisions taken in this process depend on a context to be specified shortly). If t is not deleted, we choose a single substitution character in the bottom word. We write S = Σ ∪ {ζ} for this set of outcomes, where ζ is the special outcome indicating deletion.
Importantly, the probabilities of this multinomial can depend on both the previous character generated so far (i.e., the rightmost character p of yi−1) and the current character in the previous generation string (t), providing a way to make changes context sensitive. This multinomial decision acts as the initial distribution of the mutation Markov chain. We consider insertions only if a deletion was not selected in the first step. Here, we draw from a multinomial over S, where this time the special outcome ζ corresponds to stopping insertions, and the other elements of S correspond to symbols that are appended to yi. In this case, the conditioning environment is t = xi and the current rightmost symbol p of yi. Insertions continue until ζ is selected. We use θS,t,p,ℓ and θI,t,p,ℓ to denote the probabilities over the substitution and insertion decisions on the current branch ℓ′ → ℓ. A similar process generates the word at the root ℓ of a tree or when an innovation happens at some language ℓ, treating this word as a single string y1 generated from a dummy ancestor t = x1. In this case, only the insertion probabilities matter, and we separately parameterize these probabilities with θR,t,p,ℓ. There is no actual dependence on t at the root or in innovative languages, but this formulation allows us to unify the parameterization, with each θω,t,p,ℓ ∈ R^(|Σ|+1), where ω ∈ {R, S, I}. During cognate inference, the decision to innovate is controlled by a simple Bernoulli random variable ncℓ for each language in the tree. When known cognate groups are assumed, ncℓ is set to 0 for all nonroot languages and to 1 for the root language. These Bernoulli distributions have parameters νℓ. Mutation distributions confined to the family of transducers miss certain phylogenetic phenomena. For example, the process of reduplication (as in "bye-bye") is a well-studied mechanism for deriving morphological and lexical forms that is not explicitly captured by transducers.
The same situation arises in metatheses (e.g., Old English frist > English first). However, these changes are generally not regular and are therefore less informative (1). Moreover, because we are using a probabilistic framework, these events can still be handled by our system, even though their costs will not be discounted as much as they should be. Note also that the generative process described in this section does not allow explicit dependencies on the next character in ℓ. Relaxing this assumption is possible in principle by using weighted transducers, but at the cost of a more computationally expensive inference problem (caused by the transducer normalization computation) (34). A simpler approach is to use the next character in the parent ℓ′ as a surrogate for the next character in ℓ. Using the context in the parent word is also more closely aligned with the standard representation of sound change used in historical linguistics, where the context is defined on the parent as well. More generally, dependencies limited to a bounded context on the parent string can be incorporated into our formalism. By bounded, we mean that it should be possible to fix an integer k beforehand such that all of the modeled dependencies are within k characters of the string operation. The caveat is that the computational cost of inference grows exponentially in k. We leave open the question of handling computation in the face of unbounded dependencies such as those induced by harmony (35).
Parameterization. Instead of directly estimating the transition probabilities of the mutation Markov chain (which could be done, in principle, by taking them to be the parameters of a collection of multinomial distributions), we express them as the output of a multinomial logistic regression model (36). This model specifies a distribution over transition probabilities by assigning weights to a set of features that describe properties of the sound changes involved. These features provide a more coherent representation of the transition probabilities, capturing regularities in sound changes that reflect the underlying linguistic structure.
θω;t;p;ℓ = θω;t;p;ℓ ðξ; λÞ = expfÆλ; f ðω; t; p; ℓ; ξÞæg × μðω; t; ξÞ; Zðω; t; p; ℓ; λÞ SEE COMMENTARY [1] where ξ ∈ S , f : fS; I; Rg × Σ × Σ × L × S → Rk is the feature function (which indicates which features apply for each event), Æ · ; · æ denotes inner product, and λ ∈ Rk is a weight vector. Here, k is the dimensionality of the feature space of the logistic regression model. In the terminology of exponential families, Z and μ are the normalization function and the reference measure, respectively: Zðω; t; p; ℓ; λÞ = X exp Æλ; f ω; t; p; ℓ; ξ′ æ ξ′∈S 8 0 if ω = S; t = #; ξ ≠ # > > < 0 if ω = R; ξ = ζ μðω; t; ξÞ = > 0 if ω ≠ R; ξ = # > : 1 o:w: Here, μ is used to handle boundary conditions, ensuring that the resulting probability distribution is well defined. During cognate inference, the innovation Bernoulli random variables νgℓ are similarly parameterized, using a logistic regression model with two kinds of features: a global innovation feature κ global ∈ R and a language-specific feature κℓ ∈ R. The likelihood function for each νgℓ then takes the form νgℓ = 1 : 1 + exp −κglobal − κ ℓ [2] ACKNOWLEDGMENTS. This work was supported by Grant IIS-1018733 from the National Science Foundation. 21. Lynch J, ed (2003) Issues in Austronesian (Pacific Linguistics, Canberra, Australia). 22. Oakes M (2000) Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. J Quant Linguist 7:233–244. 23. Kondrak G (2002) Algorithms for Language Reconstruction. PhD thesis (Univ of Toronto, Toronto). 24. Ellison TM (2007) Bayesian identification of cognates and correspondences. Assoc Comput Linguist 45:15–22. 25. Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 33(2):114–124. 26. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39:1–38. 27. 
1. Hock HH (1991) Principles of Historical Linguistics (Mouton de Gruyter, The Hague, The Netherlands).
2. Ross M, Pawley A, Osmond M (1998) The Lexicon of Proto Oceanic: The Culture and Environment of Ancestral Oceanic Society (Pacific Linguistics, Canberra, Australia).
3. Diamond J (1999) Guns, Germs, and Steel: The Fates of Human Societies (WW Norton, New York).
4. Nichols J (1999) Archaeology and Language: Correlating Archaeological and Linguistic Hypotheses, eds Blench R, Spriggs M (Routledge, London).
5. Ventris M, Chadwick J (1973) Documents in Mycenaean Greek (Cambridge Univ Press, Cambridge, UK).
6. Martinet A (1955) Économie des Changements Phonétiques [Economy of Phonetic Sound Changes] (Maisonneuve & Larose, Paris).
7. King R (1967) Functional load and sound change. Language 43:831–852.
8. Holmes I, Bruno WJ (2001) Evolutionary HMMs: A Bayesian approach to multiple alignment. Bioinformatics 17(9):803–820.
9. Miklós I, Lunter GA, Holmes I (2004) A "Long Indel" model for evolutionary sequence alignment. Mol Biol Evol 21(3):529–540.
10. Suchard MA, Redelings BD (2006) BAli-Phy: Simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 22(16):2047–2048.
11. Liberles DA, ed (2007) Ancestral Sequence Reconstruction (Oxford Univ Press, Oxford, UK).
12. Paten B, et al. (2008) Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Res 18(11):1829–1843.
13. Gray RD, Jordan FM (2000) Language trees support the express-train sequence of Austronesian expansion. Nature 405(6790):1052–1055.
14. Ringe D, Warnow T, Taylor A (2002) Indo-European and computational cladistics. Trans Philol Soc 100:59–129.
15. Evans SN, Ringe D, Warnow T (2004) Inference of Divergence Times as a Statistical Inverse Problem, McDonald Institute Monographs, eds Forster P, Renfrew C (McDonald Institute, Cambridge, UK).
16. Gray RD, Atkinson QD (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426(6965):435–439.
17. Nakhleh L, Ringe D, Warnow T (2005) Perfect phylogenetic networks: A new methodology for reconstructing the evolutionary history of natural languages. Language 81:382–420.
18. Bryant D (2006) Phylogenetic Methods and the Prehistory of Languages, eds Forster P, Renfrew C (McDonald Institute for Archaeological Research, Cambridge, UK), pp 111–118.
19. Daumé H III, Campbell L (2007) A Bayesian model for discovering typological implications. Assoc Comput Linguist 45:65–72.
20. Dunn M, Levinson S, Lindstrom E, Reesink G, Terrill A (2008) Structural phylogeny in historical linguistics: Methodological explorations applied in Island Melanesia. Language 84:710–759.
27. Bouchard-Côté A, Jordan MI, Klein D (2009) Efficient inference in phylogenetic InDel trees. Adv Neural Inf Process Syst 21:177–184.
28. Greenhill SJ, Blust R, Gray RD (2008) The Austronesian basic vocabulary database: From bioinformatics to lexomics. Evol Bioinform Online 4:271–283.
29. Lewis MP, ed (2009) Ethnologue: Languages of the World (SIL International, Dallas, TX, 16th Ed).
30. Lyovin A (1997) An Introduction to the Languages of the World (Oxford Univ Press, Oxford, UK).
31. Hockett CF (1967) The quantification of functional load. Word 23:320–339.
32. Surendran D, Niyogi P (2006) Competing Models of Linguistic Change: Evolution and Beyond (Benjamins, Amsterdam).
33. Varadarajan A, Bradley RK, Holmes IH (2008) Tools for simulating evolution of aligned genomic regions with integrated parameter estimation. Genome Biol 9(10):R147.
34. Mohri M (2009) Handbook of Weighted Automata, Monographs in Theoretical Computer Science, eds Droste M, Kuich W, Vogler H (Springer, Berlin).
35. Hansson GO (2007) On the evolution of consonant harmony: The case of secondary articulation agreement. Phonology 24:77–120.
36. McCullagh P, Nelder JA (1989) Generalized Linear Models (Chapman & Hall, London).
37. Dreyer M, Smith JR, Eisner J (2008) Latent-variable modeling of string transductions with finite-state methods. Empirical Methods in Natural Language Processing 13:1080–1089.
38. Chen SF (2003) Conditional and joint models for grapheme-to-phoneme conversion. Eurospeech 8:2033–2036.
39. Goldwater S, Johnson M (2003) Learning OT constraint rankings using a maximum entropy model. Proceedings of the Workshop on Variation Within Optimality Theory, eds Spenader J, Eriksson A, Dahl Ö (Stockholm University, Stockholm), pp 113–122.
40. Wilson C (2006) Learning phonology with substantive bias: An experimental and computational study of velar palatalization. Cogn Sci 30(5):945–982.
41. Gray RD, Drummond AJ, Greenhill SJ (2009) Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science 323(5913):479–483.
42. Blust R (1999) Subgrouping, circularity and extinction: Some issues in Austronesian comparative linguistics. Inst Linguist Acad Sinica 1:31–94.

PNAS | March 12, 2013 | vol. 110 | no. 11 | 4229

The logistic regression model specifies a distribution over transition probabilities by assigning weights to a set of features that describe properties of the sound changes involved. These features provide a more coherent representation of the transition probabilities, capturing regularities in sound changes that reflect the underlying linguistic structure. Using these features and parameter sharing, the logistic regression model defines the transition probabilities of the mutation process and the root language model to be

$$\theta_C(\xi; \lambda) = \frac{\exp\langle \lambda, f(C, \xi)\rangle}{\sum_{\xi' \in S} \exp\langle \lambda, f(C, \xi')\rangle},$$

where f(C, ξ) is the feature vector extracted for transition ξ in context C.
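As an illustration, a locally normalized (softmax) parameterization of transition probabilities over sound-change features can be sketched in a few lines. The feature function below is a hypothetical stand-in for the OPERATION/FAITHFULNESS templates, with universal and language-tagged (branch-specific) versions; it is not the system's actual feature set.

```python
import math

def transition_probs(weights, feat_fn, context, candidates):
    """theta_C(xi) = exp<w, f(C, xi)> / sum_xi' exp<w, f(C, xi')>."""
    score = {xi: sum(weights.get(j, 0.0) for j in feat_fn(context, xi))
             for xi in candidates}
    m = max(score.values())                       # stabilize the exponentials
    expd = {xi: math.exp(s - m) for xi, s in score.items()}
    z = sum(expd.values())
    return {xi: e / z for xi, e in expd.items()}

# Hypothetical feature function: OPERATION- and FAITHFULNESS-style indicators,
# each emitted in a universal and a branch-specific (language-tagged) version.
def feats(context, xi):
    op, x, lang = context
    base = [f"OP:{op}", f"FAITH:{x}>{xi}"]
    return base + [f"{j}@{lang}" for j in base]

w = {"FAITH:p>p": 2.0, "FAITH:p>b": 0.5}          # toy weights
probs = transition_probs(w, feats, ("SUB", "p", "Hawaiian"), ["p", "b", "f"])
```

With these toy weights, the self-substitution p > p receives the highest probability, as intended.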
We used the following feature templates: OPERATION, which identifies whether an operation in the mutation Markov chain is an insertion, a deletion, a substitution, a self-substitution (i.e., a substitution x > y with x = y), or the end of an insertion event; MARKEDNESS, which consists of language-specific n-gram indicator functions for all symbols in Σ (during reconstruction, only unigram and bigram features are used for computational reasons; for cognate inference, only unigram features are used); and FAITHFULNESS, which consists of indicators 1[x > y] for mutation events, where x ∈ Σ, y ∈ S. Feature templates similar to these can be found, for instance, in the work of refs. 37 and 38, in the context of the string-to-string transduction models used in computational linguistics. This approach to specifying the transition probabilities produces an interesting connection to stochastic optimality theory (39, 40), in which a logistic regression model mediates markedness and faithfulness in the production of an output form from an underlying input form.

Data sparsity is a significant challenge in protolanguage reconstruction. Although the experiments we present here use an order of magnitude more languages than previous computational approaches, the increase in observed data also brings additional unknowns in the form of intermediate protolanguages. Because there is one set of parameters for each language, adding more data is not by itself sufficient to increase the quality of the reconstruction; it is important to share parameters across different branches of the tree to benefit from having observations from more languages. We address this problem with the following technique: we augment the parameterization to include the current language ℓ (i.e., the language at the bottom of the current branch) and use a single, global weight vector instead of a set of branch-specific weights.
Generalization across branches is then achieved by using features that ignore ℓ, whereas branch-specific features depend on ℓ. Accordingly, all of the features in OPERATION, MARKEDNESS, and FAITHFULNESS have universal and branch-specific versions.

Supporting Information: Automated reconstruction of ancient languages using probabilistic models of sound change

Alexandre Bouchard-Côté, David Hall, Thomas L. Griffiths, Dan Klein

This Supporting Information describes the learning and inference algorithms used by our system (Section 1), details of simulations supporting the accuracy of our system and analyzing the effects of functional load (Section 2), lists of reconstructions (Section 3), lists of sound changes (Section 4), and an analysis of frequent errors (Section 5).

1 Learning and inference

The generative model introduced in the Materials and Methods section of the main paper leaves us with two problems to solve: estimating the values of the parameters characterizing the distribution over sound changes on each branch of the tree, and inferring the optimal values of the strings representing words in the unobserved protolanguages. Section 1.1 introduces the full objective function that we optimize in order to estimate these quantities. Section 1.2 describes the Monte Carlo Expectation-Maximization algorithm we used for solving the learning problem, Section 1.3 describes the approximate Expectation-Maximization procedure used for cognate inference, and Section 1.4 presents the algorithm for inferring ancestral word forms.

1.1 Full objective function

The generative model specified in the Materials and Methods section of the main paper defines an objective function that we can optimize in order to find good protolanguage reconstructions. This objective function takes the form of a regularized log-likelihood, combining the probability of the observed languages with additional constraints intended to deal with data sparsity.
This objective function can be written concisely if we let P_λ(·) and P_λ(·|·) denote the root and branch probability models described in the Materials and Methods section of the paper (with transition probabilities given by the logistic regression model above), I(c) the set of internal (non-leaf) nodes in τ(c), pa(ℓ) the parent of language ℓ, r(c) the root of τ(c), and W(c) = (Σ*)^{|I(c)|}. The full objective function is then

$$\mathcal{L}(\lambda, \kappa) = \sum_{c=1}^{C} \log \sum_{\vec{w} \in W(c)} P_\lambda\big(w_{c,r(c)}\big) \prod_{\ell \in I(c)} P_\lambda\big(w_{c,\ell} \mid w_{c,\mathrm{pa}(\ell)}, n_{c\ell}\big)\, P_\kappa\big(n_{c\ell} \mid \nu_{c\ell}\big) \;-\; \frac{\|\lambda\|_2^2 + \|\kappa\|_2^2}{2\sigma^2}, \qquad (1)$$

where the second term is a standard L2 regularization penalty intended to reduce over-fitting due to data sparsity (we used σ² = 1) [1]. The goal of learning is to find the values of λ (the parameters of the logistic regression model for the transition probabilities) and κ that maximize this function.

1.2 A Monte Carlo Expectation-Maximization algorithm for reconstruction

Optimization of the objective function given in Equation 1 is done using a Monte Carlo variant of the Expectation-Maximization (EM) algorithm [2]. This algorithm alternates between two steps: an E step, in which the objective function is approximated, and an M step, in which this approximate objective function is optimized. The M step is convex and computed using L-BFGS [3], but the E step is intractable [4], in part because it requires solving the problem of inferring the words in the protolanguages. We approximate the solution to this inference problem using a Markov chain Monte Carlo (MCMC) algorithm [5], which repeatedly samples words from the protolanguages until it converges to the distribution implied by our generative model. Since this procedure is only guaranteed to find a local maximum of the objective, we ran it with several different random initializations of the model parameters. The next two subsections provide the details of these two parts of our system.
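The E/M alternation can be illustrated on a toy model with a single Gaussian latent variable in place of latent strings. This is purely illustrative of the Monte Carlo EM loop (including a growing per-iteration sampling budget), not of our actual model: here both the E-step sampler and the M-step maximizer have closed forms.

```python
import random

def mcem(xs, iters=25, seed=0):
    """Toy Monte Carlo EM for the model z ~ N(theta, 1), x ~ N(z, 1).
    The marginal MLE is theta = mean(xs), which MCEM should approach.
    E step: sample z_i ~ p(z | x_i, theta) = N((theta + x_i)/2, 1/2).
    M step: theta = average of all sampled z (closed form here)."""
    rng = random.Random(seed)
    theta = 0.0
    for t in range(1, iters + 1):
        n_samples = 10 * t          # growing sample budget across iterations
        zs = [rng.gauss((theta + x) / 2, 0.5 ** 0.5)
              for x in xs for _ in range(n_samples)]
        theta = sum(zs) / len(zs)   # M step on the MC-approximated objective
    return theta
```

Running `mcem([1.0, 2.0, 3.0])` converges near the marginal MLE, 2.0.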
1.2.1 E step: Inferring the posterior over string reconstructions

In the E step, the inference problem is to compute an expectation under the posterior over strings in a protolanguage given the observed word forms at the leaves of the tree.¹ The typical approach in biological InDel models [6] is Gibbs sampling, in which the entire string at each node in the tree is repeatedly resampled, conditioned on its parent and children. We will call this method Single Sequence Resampling (SSR). While conceptually simple, this approach suffers from mixing problems in large trees, since it can take a long time for information to propagate from one region of the tree to another [6]. Consequently, we use a different MCMC procedure, called Ancestry Resampling (AR), that alleviates these mixing problems. This method was originally introduced for biological applications [7], but commonalities between the biological and linguistic cases make it possible to use it in our model.

Concretely, the problem with SSR arises when the tree under consideration is large or unbalanced. In this case, it can take a long time for information from the observed languages to propagate to the root of the tree; indeed, samples at the root will initially be independent of the observations. AR addresses this problem by resampling one thin vertical slice of all sequences at a time, called an ancestry (for the precise definition of the algorithm, see [7]). Because slices condition on the observed data, they avoid these mixing problems and can propagate information rapidly across the tree. We ran the ancestry resampling algorithm for a number of iterations that increased linearly with the number of completed EM iterations, an approximation regime that allows the EM algorithm to converge to a solution [8]. To speed up the large experiments, we also used an approximation within AR.
This approximation is based on fixing the values of set-valued auxiliary variables z_g, where z_g = {w_{g,ℓ}} and ℓ ranges over the set of all languages (both internal and at the leaves). Conditioning on these variables, sampling w_{g,ℓ} | z_g can be done exactly using dynamic programming and rejection sampling.

1.2.2 M step: Convex optimization of the approximate objective

In the M step, we individually update the parameters λ and κ as specified in Equation 1. We show how λ is updated in this section; κ can be optimized similarly. Let C = (ω, t, p, ℓ) denote a local transducer context from the space 𝒞 = {S, I, R} × Σ × Σ × L of all such contexts, and let N(C, ξ) be the expected number of times the transition ξ was used in context C in the preceding E step. Given these sufficient statistics, the estimate of λ is obtained by optimizing the expected complete (regularized) log-likelihood O(λ) derived from the original objective function given in Equation [1] in the Materials and Methods section of the main paper, ignoring terms that do not involve λ:

$$O(\lambda) = \sum_{C \in \mathcal{C}} \sum_{\xi \in S} N(C, \xi) \Big[ \langle \lambda, f(C, \xi) \rangle - \log \sum_{\xi'} \exp\{\langle \lambda, f(C, \xi') \rangle\} \Big] - \frac{\|\lambda\|^2}{2\sigma^2}.$$

We use L-BFGS [3] to optimize this convex objective function. L-BFGS requires the partial derivatives

$$\frac{\partial O(\lambda)}{\partial \lambda_j} = \sum_{C \in \mathcal{C}} \sum_{\xi \in S} N(C, \xi) \Big[ f_j(C, \xi) - \sum_{\xi'} \theta_C(\xi'; \lambda)\, f_j(C, \xi') \Big] - \frac{\lambda_j}{\sigma^2} = \hat{F}_j - \sum_{C \in \mathcal{C}} \sum_{\xi' \in S} N(C, \cdot)\, \theta_C(\xi'; \lambda)\, f_j(C, \xi') - \frac{\lambda_j}{\sigma^2},$$

where $\hat{F}_j = \sum_{C \in \mathcal{C}} \sum_{\xi \in S} N(C, \xi) f_j(C, \xi)$ is the empirical feature count and $N(C, \cdot) = \sum_\xi N(C, \xi)$ is the number of times context C was used. Since $\hat{F}_j$ and $N(C, \cdot)$ do not depend on λ, they can be precomputed at the beginning of the M step, speeding up each L-BFGS iteration.

¹ To be precise, the posterior is over both the protolanguage strings and the derivations between these strings and the modern words.
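A minimal sketch of this M-step computation, assuming indicator-valued features stored as lists of firing feature names (the contexts, outputs, and feature names in any caller would be hypothetical). The returned gradient follows the expression above: empirical counts minus expected counts minus the regularizer term.

```python
import math

def objective_and_grad(lam, counts, feats, contexts, outputs, sigma2=1.0):
    """Expected complete regularized log-likelihood O(lambda) and its gradient,
    given E-step sufficient statistics counts[(C, xi)] = N(C, xi).
    Features are indicators: f_j(C, xi) = 1 iff j is in feats[(C, xi)]."""
    obj = -sum(v * v for v in lam.values()) / (2 * sigma2)
    grad = {j: -v / sigma2 for j, v in lam.items()}
    for C in contexts:
        s = {xi: sum(lam.get(j, 0.0) for j in feats[(C, xi)]) for xi in outputs}
        m = max(s.values())
        logZ = m + math.log(sum(math.exp(v - m) for v in s.values()))
        theta = {xi: math.exp(s[xi] - logZ) for xi in outputs}  # theta_C(xi)
        NC = sum(counts.get((C, xi), 0.0) for xi in outputs)    # N(C, .)
        for xi in outputs:
            n = counts.get((C, xi), 0.0)
            obj += n * (s[xi] - logZ)
            for j in feats[(C, xi)]:
                grad[j] = grad.get(j, 0.0) + n                  # empirical term
        for xi in outputs:
            for j in feats[(C, xi)]:
                grad[j] = grad.get(j, 0.0) - NC * theta[xi]     # expected term
    return obj, grad
```

The gradient can be verified against finite differences of the returned objective, which is the usual sanity check before handing both to an L-BFGS routine.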
1.3 Approximate Expectation-Maximization for cognate inference

The procedure for cognate inference is similar: we again operate within the Expectation-Maximization framework, and the M steps are identical. However, because the cognate inference problem has different characteristics, we make different approximations in the E step. The Monte Carlo inference algorithm described in the preceding section works exceedingly well for reconstructing words for known cognates, but a Monte Carlo approach to determining which words are cognate requires resampling the innovation variables n_{gℓ}, which is likely to lead to slow mixing of the Markov chain. We therefore take a different approach for cognate inference: rather than performing inference over all possible reconstructions for all words, we restrict the words at internal nodes to the observed modern word forms with the same meaning. This simplification is of course incorrect, but it works well in practice. Specifically, we perform inference on the tree for each gloss using message passing [9], also known as pruning in the computational biology literature [10], where each message μ(w_{gℓ}) has a nonzero score only when w_{gℓ} is one of the observed modern word forms for gloss g. Inference yields expected alignment counts for each character in each language, along with the expected number of innovations occurring at each language; these expectations can then be used in the convex M step.

1.4 Ancestral word form reconstruction

In the E step described in the preceding section, a posterior distribution π over ancestral sequences given the observed forms is approximated by a collection of samples X₁, X₂, . . . , X_S. In this section, we describe how this distribution is summarized to produce a single output string for each cognate set. The algorithm is based on a fundamental concept from Bayesian decision theory: Bayes estimators.
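To make this decision rule concrete: with Levenshtein loss, and with candidates restricted to the posterior samples themselves (the k = 1 scheme discussed below), the Bayes estimate reduces to the sample minimizing total edit distance to the other samples. A minimal sketch, with the MCMC sampler replaced by a given list of sampled strings:

```python
def lev(a, b):
    """Uniform-cost edit distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution / match
        prev = cur
    return prev[-1]

def medoid(samples):
    """k = 1 Bayes estimate: the sample minimizing total Levenshtein loss."""
    return min(samples, key=lambda x: sum(lev(x, y) for y in samples))
```

For example, `medoid(["mata", "mata", "maca", "ata"])` selects "mata", the string with the smallest summed distance to the sample set.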
Given a loss function over strings, Loss : Σ* × Σ* → [0, ∞), an estimator is a Bayes estimator if it belongs to the set

$$\operatorname*{argmin}_{x \in \Sigma^*} \mathbb{E}_\pi \operatorname{Loss}(x, X) = \operatorname*{argmin}_{x \in \Sigma^*} \sum_{y \in \Sigma^*} \operatorname{Loss}(x, y)\, \pi(y).$$

Bayes estimators are not only optimal within the Bayesian decision framework but also satisfy frequentist optimality criteria such as admissibility [11]. In our case, the loss we used is the Levenshtein distance [12], denoted Loss(x, y) = Lev(x, y) (we discuss this choice in more detail in Section 2). Since we do not have access to π itself, but rather to an approximation based on S samples, the objective function we use for reconstruction becomes

$$\operatorname*{argmin}_{x \in \Sigma^*} \sum_{y \in \Sigma^*} \operatorname{Lev}(x, y)\, \pi(y) \;\approx\; \operatorname*{argmin}_{x \in \Sigma^*} \frac{1}{S} \sum_{s=1}^{S} \operatorname{Lev}(x, X_s) \;=\; \operatorname*{argmin}_{x \in \Sigma^*} \sum_{s=1}^{S} \operatorname{Lev}(x, X_s).$$

The raw samples contain both derivations and strings for all protolanguages, whereas we are only interested in reconstructing the words of a single protolanguage. This is addressed by marginalization, which in sampling representations is done by simply discarding the irrelevant information; hence, the random variables X_s in the above equation can be viewed as string-valued random variables. Note that the optimum is unchanged if we restrict the minimization to x ∈ Σ* such that m ≤ |x| ≤ M, where m = min_s |X_s| and M = max_s |X_s|. Even with this restriction, however, exact optimization is intractable. As an approximation, we considered only strings built from at most k contiguous substrings taken from the word forms X₁, X₂, . . . , X_S. If k = 1, this is equivalent to taking the minimum over {X_s : 1 ≤ s ≤ S}; at the other end of the spectrum, if k = S, the scheme is exact. The cost of this scheme is exponential in k, but since words are relatively short, we found that k = 2 often finds the same solution as higher values of k.

1.5 Finding cognate groups

Our cognate model finds cuts in the phylogeny in order to determine which words are cognate with one another.
However, this approach cannot straightforwardly handle cases in which the evolution of the words did not follow strictly treelike behavior. This limitation applies to polymorphism, in which a language has multiple words for a given meaning, and to borrowing, in which words did not evolve according to the phylogeny. We can, however, modify the inference procedure to capture these kinds of behavior using a post hoc agglomerative "merge" procedure. Specifically, we run our procedure to find an initial set of cognate groups and then merge those cognate groups whose union produces an increase in model score. That is, we create several initial small subtrees containing some cognates and then stitch them together into one or more larger trees. Thus, non-treelike behaviors like borrowing are represented as multiple trees that "overlap." For instance, if two languages each have two words for one meaning (say A and B), then the initial stage might find that the two A's are cognate, leaving the two B's as singleton cognate groups. Merging the two B's into a single group will likely produce a gain in likelihood, however, and so we can merge them. Note that this procedure is unlikely to work well for long-distance borrowings, because the sound changes involved in such borrowings are likely to be very different from those that follow the phylogeny. Nevertheless, we found this procedure to be effective in practice.

2 Experiments

In this section, we give more details on the results in the main paper, namely those concerning validation using reconstruction error rate and those measuring the effects of the tree topology and the number of languages. We also include additional comparisons to other reconstruction methods, as well as cognate inference results. In Section 2.1, we analyze in isolation the effects of varying the set of features, the number of observed languages, the topology, and the number of iterations of EM.
In Section 2.2, we compare performance to an oracle and to two other systems.

Evaluation of all methods was done by computing the Levenshtein distance [12] (uniform-cost edit distance) between the reconstruction produced by each method and the reconstruction produced by linguists. The Levenshtein distance is the minimum number of substitutions, insertions, or deletions of a phoneme required to transform one word into another. While the Levenshtein distance misses important aspects of phonology (not all phoneme substitutions are equally plausible, for instance), it is parameter-free and still correlates to a large extent with the linguistic quality of a reconstruction. It is also superior to held-out log-likelihood, which fails to penalize errors in the modeling assumptions, and to measuring the percentage of perfect reconstructions, which ignores the degree of correctness of each reconstructed word. We averaged this distance across reconstructed words to report a single number for each method. The statistical significance of all performance differences was assessed using a paired t-test with a significance level of 0.05.

2.1 Evaluating system performance

We used the Austronesian Basic Vocabulary Database (ABVD) [13] as the basis for a series of experiments used to evaluate the performance of our system and the factors relevant to its success. The database, downloaded from http://language.psy.auckland.ac.nz/austronesian/ on August 7, 2010, includes partial cognacy judgments and IPA transcriptions,² as well as several reconstructed protolanguages. In our main experiments, we used the tree topology induced by the Ethnologue classification [14] in order to facilitate interpretability of the results (i.e., so that we can give well-known names to clades in the tree in our analyses). Our method does not require specifying branch lengths, since our unsupervised learning procedure provides a more flexible way of estimating the amount of change between points in the phylogenetic tree.
The first claim we verified experimentally is that having more observed languages aids reconstruction of protolanguages. In the results in this section, we used the subset of the languages under Proto-Oceanic (POc) to speed up the computations. To test this hypothesis, we added observed modern languages in increasing order of their distance d_c to the target reconstruction of POc, so that the languages most useful for POc reconstruction are added first. This prevents the effect of adding a close language after several distant ones from being confounded with an improvement produced by increasing the number of languages. The results are reported in Figure 1(c) of the main paper. They confirm that large-scale inference is desirable for automatic protolanguage reconstruction: reconstruction improved statistically significantly with each increase except from 32 to 64 languages, where the average edit-distance improvement was 0.05.

We then conducted a number of experiments intended to identify the contributions made by the different factors our system incorporates. We found that all of the following ablations significantly hurt reconstruction: using a flat tree (in which all languages are equidistant from the reconstructed root and from each other) instead of the consensus tree, dropping the markedness features, dropping the faithfulness features, and disabling sharing across branches. The results of these experiments are shown in Table S.1. For comparison, the same table also includes the performance of a semi-supervised system trained by K-fold validation. The system was run K = 5 times, with a fraction 1 − K⁻¹ of the POc words given to the system as observations in the graphical model for each run. It is semi-supervised in the sense that target reconstructions for many internal nodes are not available in the dataset, so those nodes remain unobserved.³

2.2 Comparisons against other methods

The first competing method, PRAGUE, was introduced in [15].
In this method, the word forms in a given protolanguage are reconstructed using a Viterbi multi-alignment between a small number of its descendant languages. The alignment is computed using hand-set parameters. Deterministic rules characterizing changes between pairs of observed languages are extracted from the alignment when their frequency exceeds a threshold, and a protophoneme inventory is built using linguistically motivated rules and parsimony. A reconstruction of each observed word is first proposed independently for each language. If at least two reconstructions agree, a majority vote is taken; otherwise, no reconstruction is proposed. This approach has several limitations. First, it is not tractable for larger trees, since the time complexity of the multi-alignment algorithm grows exponentially in the number of languages. Second, deterministic rules, while elegant in theory, are not robust to noise: even in experiments with only four daughter languages, a large fraction of the words could not be reconstructed.

Since PRAGUE does not scale to large datasets, we also built a second, more tractable baseline. This baseline system, CENTROID, computes the centroid of the observed word forms under the Levenshtein distance. Let Lev(x, y) denote the Levenshtein distance between word forms x and y.

² While most word forms in the ABVD are encoded in IPA, there are a few exceptions, for example language-family-specific conventions such as the use of the okina symbol (ʻ) for glottal stops (ʔ) in some Polynesian languages, or ad hoc conventions such as using the bigram "ng" for the agma symbol (ŋ). We preprocessed as many of these exceptions as possible, and probabilistic methods are generally robust to reasonable amounts of encoding glitches.

³ We also tried a fully supervised system in which a flat topology is used so that all of these latent internal nodes are avoided, but it did not perform as well; this is consistent with the -Topology experiment of Table S.1.
Ideally, we would like the baseline system to return

$$\operatorname*{argmin}_{x \in \Sigma^*} \sum_{y \in O} \operatorname{Lev}(x, y),$$

where O = {y₁, . . . , y_{|O|}} is the set of observed word forms. This objective function is motivated by Bayesian decision theory [11] and is similar to the more sophisticated Bayes estimator described in Section 1.4, but it replaces the samples obtained by MCMC with the set of observed words. As in the algorithm of Section 1.4, we also restrict the minimization to x ∈ Σ(O)* such that m ≤ |x| ≤ M and to strings built from at most k contiguous substrings taken from the word forms in O, where m = min_i |y_i|, M = max_i |y_i|, and Σ(O) is the set of characters occurring in O. Again, we found that k = 2 often finds the same solution as higher values of k; the difference was in all cases not statistically significant, so we report the k = 2 approximation in what follows.

We also compared against an oracle, denoted ORACLE, which returns

$$\operatorname*{argmin}_{y \in O} \operatorname{Lev}(y, x^*),$$

where x* is the target reconstruction. This is superior to picking a single closest language to be used for all word forms, but it is possible for systems to perform better than the oracle, since the oracle has to return one of the observed word forms. Of course, this scheme is only available for assessing system performance on held-out data: it cannot make new predictions.

We performed the comparison against the previous system proposed in [15] on the same dataset and under the same experimental conditions as used in [15]. The PMJ dataset was compiled by [16], who also reconstructed the corresponding protolanguage. Since PRAGUE is not guaranteed to return a reconstruction for each cognate set, only 55 word forms could be directly compared with our system, and we restricted the comparison to this subset of the data. This favors PRAGUE, since that system only proposes a reconstruction when it is certain. Still, our system outperformed PRAGUE, with an average distance of 1.60, compared with 2.02 for PRAGUE.
The difference is marginally significant (p = 0.06), partly because of the small number of word forms involved. To obtain a more extensive comparison, we considered a hybrid system that returns PRAGUE's reconstruction when one is available and otherwise backs off to the Sundanese (Snd.) modern form, then Madurese (Mad.), Malay (Mal.), and finally Javanese (Jv.) (the optimal back-off order). In this case, we obtained an edit distance of 1.86 with our system against 2.33 for PRAGUE, a statistically significant difference.

We also compared against ORACLE and CENTROID in a large-scale setting. Specifically, we used the experimental setup, described above, in which 64 modern languages are used to reconstruct POc. Encouragingly, while our system's average distance (1.49) does not attain that of ORACLE (1.13), it significantly outperforms the CENTROID baseline (1.79).

2.3 Cognate recovery

To test the effectiveness of our cognate recovery system, we ran it on all of the Oceanic languages in the ABVD, which comprise roughly half of the Austronesian languages. We then evaluated pairwise precision, recall, F1, and purity, defined as follows. Let G = {G₁, G₂, . . . , G_g} denote the known partition of the forms into cognate groups, and let F = {F₁, F₂, . . . , F_f} denote the inferred partition. Let pairs(F) denote the set of unordered pairs of indices placed in the same block of F,

pairs(F) = {{i, j} : ∃k s.t. i, j ∈ F_k, i ≠ j},

and similarly for pairs(G). The metrics are then defined as follows:

$$\text{precision} = \frac{|\text{pairs}(G) \cap \text{pairs}(F)|}{|\text{pairs}(F)|}, \qquad \text{recall} = \frac{|\text{pairs}(G) \cap \text{pairs}(F)|}{|\text{pairs}(G)|},$$

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \qquad \text{purity} = \frac{1}{N} \sum_{f} \max_{g} |G_g \cap F_f|.$$

Using these metrics, we found that our system achieved a precision of 84.4, a recall of 62.1, an F1 of 71.5, and a cluster purity of 91.8. Thus, more than 9 out of 10 words are correctly grouped, and our system errs on the side of under-grouping words rather than clustering words that are not cognate.
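These four scores can be computed directly from the gold and inferred partitions; a small self-contained sketch (the example partitions in any caller would be hypothetical):

```python
from itertools import combinations

def pair_set(partition):
    """All unordered pairs of items that share a block of the partition."""
    return {frozenset(p) for block in partition
            for p in combinations(sorted(block), 2)}

def cognate_metrics(gold, pred):
    """Pairwise precision, recall, F1, and cluster purity for two partitions
    of the same N items (each partition a list of sets)."""
    G, F = pair_set(gold), pair_set(pred)
    n = sum(len(block) for block in pred)  # N, assuming pred covers all items
    precision = len(G & F) / len(F)
    recall = len(G & F) / len(G)
    f1 = 2 * precision * recall / (precision + recall)
    # purity: each inferred block is matched to its best-overlapping gold block
    purity = sum(max(len(fb & gb) for gb in gold) for fb in pred) / n
    return precision, recall, f1, purity
```

For instance, with gold partition {a,b,c},{d} and inferred partition {a,b},{c,d}, the only shared pair is {a,b}, giving precision 1/2, recall 1/3, F1 0.4, and purity 0.75.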
Since the null hypothesis in historical linguistics is to deem words unrelated unless proven otherwise, slight under-grouping is the desired behavior.

Since we are ultimately interested in reconstruction, we then measured our system's ability to reconstruct words given these automatically determined cognates. Specifically, we took every cognate group found by our system (run on the Oceanic subclade) with at least two words in it; that is, we excluded words that our system found to be isolates. We then automatically reconstructed the Proto-Oceanic ancestor of those words using our system (using the auxiliary variables z described in Section 1.2.1). For evaluation, we computed the average edit distance from our reconstructions to the known reconstructions described in the previous sections. This time, however, we averaged per modern word rather than per cognate group, to provide a fairer comparison (results were not substantially different when averaging per cognate group). Using the known cognates from the ABVD, the average reconstruction error was 2.19, versus 2.47 for the automatically inferred cognates, an increase in error rate of 12.8%. The fraction of words at each Levenshtein distance for these reconstructions is shown in Figure S.1. While the plots are similar, the automatic cognates exhibit a slightly longer tail. Thus, even with automatic cognates, the reconstruction system can reconstruct words faithfully in many cases, failing in only a few instances.

2.4 Computation of functional loads

To measure functional load quantitatively, we used the same estimator as the one used in [17]. This definition is based on associating a context vector c_{x,ℓ} with each phoneme x and language ℓ.
For a given language ℓ with N(ℓ) phoneme tokens, these context vectors are defined as follows. First, fix an enumeration of all the contexts found in the corpus, where a context is a pair of phonemes, one to the left and one to the right of a position; element i of this enumeration corresponds to component i of the context vectors. The value of component i of the context vector c_{x,ℓ} is then the number of times phoneme x occurred in context i in language ℓ. Finally, King's definition of the functional load FL_ℓ(x, y) is the normalized dot product of the two induced context vectors:

$$\mathrm{FL}_\ell(x, y) = \frac{1}{N(\ell)^2} \langle c_{x,\ell}, c_{y,\ell} \rangle = \frac{1}{N(\ell)^2} \sum_i c_{x,\ell}(i) \times c_{y,\ell}(i),$$

where the denominator is simply a normalization that ensures FL_ℓ(x, y) ≤ 1. Note that if x and y are in complementary distribution in language ℓ, then the two vectors c_{x,ℓ} and c_{y,ℓ} are orthogonal, and the functional load is indeed zero in this case.

In Figure 3 of the main paper, we show heat maps in which the color encodes the log of the number of sound changes that fall into a given two-dimensional bin. Each sound change x > y is encoded as a pair of numbers in the unit interval, (l̂, m̂), where l̂ is an estimate of the functional load of the pair and m̂ is the posterior fraction of the instances of phoneme x that undergo a change to y. We now describe how l̂ and m̂ were estimated. The posterior fraction m̂ for the merger x > y along the branch pa(ℓ) → ℓ is easily computed from the same expected sufficient statistics used for parameter estimation:

$$\hat{m}_\ell(x > y) = \frac{\sum_{p \in \Sigma} N(S, x, p, \ell, y)}{\sum_{p' \in \Sigma} \sum_{y' \in \Sigma} N(S, x, p', \ell, y')}.$$

The estimate of the functional load requires additional statistics, namely the expected context vectors ĉ_{x,ℓ} and expected phoneme token counts N̂(ℓ), but these can be readily extracted from the output of the MCMC sampler. The estimate is then

$$\hat{l}_\ell(x, y) = \frac{1}{\hat{N}(\ell)^2} \langle \hat{c}_{x,\ell}, \hat{c}_{y,\ell} \rangle.$$
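King's context-vector estimator can be computed directly from a phonemically transcribed word list; a sketch, in which the "#" word-boundary padding is our own assumption about how contexts at word edges are handled:

```python
from collections import Counter

def context_vectors(corpus):
    """Count phoneme occurrences by (left, right) context, padding each word
    with '#' boundaries (an assumption of this sketch). corpus: list of words,
    each word a sequence of phoneme symbols. Returns (vectors, token count)."""
    ctx, total = {}, 0
    for word in corpus:
        padded = ["#"] + list(word) + ["#"]
        for i in range(1, len(padded) - 1):
            ctx.setdefault(padded[i], Counter())[(padded[i - 1], padded[i + 1])] += 1
            total += 1
    return ctx, total

def functional_load(corpus, x, y):
    """King's estimator: dot product of the context vectors of x and y,
    normalized by the squared number of phoneme tokens N(l)."""
    ctx, n = context_vectors(corpus)
    cx, cy = ctx.get(x, Counter()), ctx.get(y, Counter())
    return sum(cx[c] * cy[c] for c in cx) / (n * n)
```

As the text notes, phonemes in complementary distribution have orthogonal context vectors, so their functional load is exactly zero (e.g. "p" and "a" in the toy corpus ["pa", "ap"]).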
Finally, the set of points used to construct the heat map is

$$\Big\{ \big(\hat{l}_{\mathrm{pa}(\ell)}(x, y),\; \hat{m}_\ell(x > y)\big) : \ell \in L - \{\text{root}\},\; x \in \Sigma,\; y \in \Sigma,\; x \neq y \Big\}.$$

3 Reconstruction lists

In Tables S.2–S.4, we show the lists of consensus reconstructions produced by our system (the "Automatic" column). For comparison, we also include a baseline (randomly picking one modern word) and the edit distances (when greater than zero). For Proto-Oceanic, since two manual reconstructions are available, we include the distances of the automatic reconstruction to both manual reconstructions ("P-A" and "B-A"), as well as the distance between the two manual reconstructions ("B-P").

4 Sound changes

In Figures S.2–S.5, we zoom in on and rotate each quadrant of the tree shown in Figure 2 of the main paper. For information on the most frequent change on each branch, refer to the row in Table S.5 corresponding to the code in parentheses attached to that branch. By most frequent, we mean the change with the highest expected count in the last EM iteration, collapsing contexts for simplicity. In cases where no change is observed with an expected count of more than 1.0, we skip the corresponding entry; this can be caused, for example, by languages for which cognacy information was too sparse in the dataset. The functional load, normalized to lie in the interval [0, 1], is also shown for reference.

5 Frequent errors

In Figure S.6, we analyze the frequent discrepancies between the PAn reconstructions from our system and those from [18]. The most frequent problematic substitutions, insertions, and deletions are shown. These frequencies were obtained by aligning, after running the system, the reconstructions with the references. The aligner is a pair HMM trained via EM, using only phoneme identity and gap information [19]. The frequencies were extracted from the posterior alignments.

References and Notes

[1] Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning (Springer).
[2] Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39:1–38.
[3] Liu DC, Nocedal J (1989) On the limited memory BFGS method for large scale optimization. Mathematical Programming 45:503–528.
[4] Lunter GA, Miklós I, Song YS, Hein J (2003) An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. J. Comp. Biol. 10:869–889.
[5] Tierney L (1994) Markov chains for exploring posterior distributions. The Annals of Statistics 22:1701–1728.
[6] Holmes I, Bruno WJ (2001) Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics 17:803–820.
[7] Bouchard-Côté A, Jordan MI, Klein D (2009) Efficient inference in phylogenetic InDel trees. Advances in Neural Information Processing Systems 21:177–184.
[8] Caffo B, Jank W, Jones G (2005) Ascent-based Monte Carlo EM. Journal of the Royal Statistical Society, Series B 67:235–252.
[9] Pearl J (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann).
[10] Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17:368–376.
[11] Robert CP (2001) The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation (Springer).
[12] Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10.
[13] Greenhill S, Blust R, Gray R (2008) The Austronesian basic vocabulary database: From bioinformatics to lexomics. Evolutionary Bioinformatics 4:271–283.
[14] Lewis MP, ed (2009) Ethnologue: Languages of the World, Sixteenth edition (SIL International).
[15] Oakes M (2000) Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. Journal of Quantitative Linguistics 7:233–244.
[16] Nothofer B (1975) The Reconstruction of Proto-Malayo-Javanic (M. Nijhoff).
[17] King R (1967) Functional load and sound change. Language 43:831–852.
[18] Blust R (1999) Subgrouping, circularity and extinction: Some issues in Austronesian comparative linguistics. Inst. Linguist. Acad. Sinica 1:31–94.
[19] Berg-Kirkpatrick T, Bouchard-Côté A, DeNero J, Klein D (2010) Painless unsupervised learning with features. Proceedings of the North American Conference on Computational Linguistics, pp 582–590.

Condition                  Edit dist.
Unsupervised full system   1.87
-FAITHFULNESS              2.02
-MARKEDNESS                2.18
-Sharing                   1.99
-Topology                  2.06
Semi-supervised system     1.75

Table S.1: Effects of ablation of various aspects of our unsupervised system on mean edit distance to POc. -Sharing corresponds to the restriction to the subset of the features in OPERATION, FAITHFULNESS, and MARKEDNESS that are branch-specific; -Topology corresponds to using a flat topology in which the only edges in the tree connect modern languages to POc. The semi-supervised system is described in the text. All differences (compared with the unsupervised full system) are statistically significant.

Table S.2.
Proto-Austronesian reconstructions. For each cognate set, the table lists: Gloss; Reference (the manual reconstruction); Baseline (a randomly picked modern word) with its edit distance to the reference; and Automatic (our system's consensus reconstruction) with its edit distance to the reference. Distances equal to zero are left blank.

Table S.3.
Proto-Oceanic reconstructions. For each cognate set, the table lists: Gloss; the two manual reconstructions, Blust (B) and Pawley (P); the Automatic reconstruction (A); and the pairwise edit distances B-P, P-A, and B-A. Distances equal to zero are left blank.

Table S.4.
Proto-Polynesian reconstructions. For each cognate set, the table lists: Gloss; Reference; Baseline with its edit distance to the reference; and Automatic with its edit distance to the reference. Distances equal to zero are left blank.

Table S.5.
Sound changes. For each branch code in parentheses attached to the branches of Figures S.2–S.5, the table lists: Code; the most frequent sound change on that branch, written as the parent phoneme (with the parent language in parentheses) > the child phoneme; the Child language; the Functional load of the changed pair, normalized to [0, 1]; and the expected Number of occurrences of the change. Codes for which not enough data was available for reliable sound change estimates are marked as such.
(Nauna) (Nauru) 0.07198 0.00191 0.06991 0.10821 0.04447 0.01870 3.176e-05 0.04575 2.731e-04 0.00181 0.16171 0.01111 0.02129 0.05885 0.00412 0.00211 0.18127 0.13582 0.01049 0.10938 0.13661 4.50934 7.90698 15.10951 17.95820 4.78867 13.39891 16.89672 3.73293 6.02727 5.95393 3.95147 27.83527 1.96583 9.63004 1.02690 12.96772 21.52549 33.25250 19.54611 5.60587 5.67814 (Nehan) (NehanHape) (NehanNorthBougainville) (Nelemwa) (Nembao) (Nengone) (Nese) (Neveei) (NewCaledonian) (NewGeorgia) (NewIreland) (Ngadha) (NgaiborSAr) (NgeroVitiaz) 0.05608 0.11803 0.16106 0.13215 0.01621 0.20861 0.35703 0.13582 0.20861 0.08381 0.06281 4.053e-05 0.02299 0.01667 13.16549 18.93949 4.34745 10.43158 4.98208 5.30862 18.91500 34.11813 1.71479 5.00348 3.31247 14.09575 6.77784 3.53470 (Nggela) (Nguna) (Nias) (Nila) (NilaSerua) (Nissan) (Niue) (NorthBabar) (NorthBomberai) (NorthNewGuinea) (NorthPapuanMainlandDEntrecasteaux) (NorthSarawakan) (North Babar) (North Minahasan) (NortheastVanuatuBanksIslands) (NorthernCordilleran) (NorthernLuzon) (NorthernPhilippine) (Northern Dumagat) (Northern EastFormosan) (Northern Malaita) (Northern NewCaledonian) 0.05008 0.06113 0.00205 0.01125 0.13648 0.02937 0.18204 0.16784 0.02692 0.12924 0.27192 0.05269 0.03978 0.05030 0.06513 0.01688 0.15955 0.10751 0.01648 0.14981 0.00727 0.17212 17.07009 6.26520 8.90278 4.39065 1.88492 8.95963 4.31952 6.84401 10.12083 8.71798 3.13375 5.44048 1.27682 3.03447 4.83269 4.17784 5.87812 18.96525 1.25949 5.83985 4.99257 3.00312 continued on next page 20 continued from previous page Code (475) (476) (477) (478) (479) (480) (481) (482) (483) (484) (485) (486) (487) (488) (489) (490) (491) (492) (493) (494) (495) (496) (497) (498) (499) (500) (501) (502) (503) (504) (505) (506) (507) (508) (509) (510) (511) (512) (513) (514) (515) (516) (517) (518) (519) (520) (521) (522) (523) (524) (525) (526) (527) (528) (529) (530) (531) (532) (533) (534) (535) (536) (537) (538) (539) (540) (541) (542) (543) (544) (545) (546) (547) (548) (549) 
(550) (551) (552) (553) (554) (555) (556) Parent b (Bel) > p N (Sangiric) > n q (ProtoMalay) > P l (CentralCordilleran) > k b (Timor) > f l (PapuanTip) > n N (Polynesian) > n r (WestCentralPapuan) > l t (Ellicean) > d f (HuonGulf) > w t (CenderawasihBay) > k f (Seram) > h a (LocalMalay) > @ N (Javanese) > n t (CentralVanuatu) > r O (Southern Malaita) > o r (EastVanuatu) > l s (Formosan) > t a (ProtoMalay) > e 1 (Palawano) > @ 1 (MesoPhilippine) > u Not enough data available for reliable sound change estimates u (Pangasinic) > o Not enough data available for reliable sound change estimates a (WesternOceanic) > o N (SouthwestNewBritain) > n v (PatpatarTolai) > h a (SouthNewIrelandNorthwestSolomonic) > e e (SeramStraits) > i u (Formosan) > o f (Tahitic) > h n (Wetar) > N u (Central Bisayan) > o e (PapuanTip) > a q (ProtoMalay) > h v (EastVanuatu) > f a (ChamChru) > 1 Not enough data available for reliable sound change estimates v (EastFijianPolynesian) > f a (Ponapeic) > e a (PonapeicTrukic) > e t (MicronesianProper) > d e (BimaSumba) > a B (TukangbesiBonerate) > F e (CentralEastern) > @ o (PonapeicTrukic) > a N (ProtoAustr) > n v (RemoteOceanic) > f a (EasternMalayoPolynesian) > o f (SamoicOutlier) > w a (NorthBomberai) > e u (ProtoChuuk) > U d (ProtoChuuk) > t d (ProtoChuuk) > t u (KayanMurik) > o b (Formosan) > v s (EastVanuatu) > h r (CenderawasihBay) > l f (East Nuclear Polynesian) > h a (Tahitic) > e a (ProtoMalay) > @ i (CentralEasternOceanic) > a r (Futunic) > g O (Choiseul) > o r (NorthNewGuinea) > z r (KisarRoma) > R k (Nuclear WestCentralPapuan) > h f (West NuclearTimor) > b t (WestFijianRotuman) > f O (West NewGeorgia) > o u (Formosan) > o k (Tahitic) > P u (Palawano) > o f (Southern Malaita) > h Not enough data available for reliable sound change estimates O (Southern Malaita) > o O (Southern Malaita) > o Not enough data available for reliable sound change estimates u (Tsouic) > o a (Northwest) > o d (ProtoChuuk) > t u (Formosan) > o Child Functional load 
Number of occurrences (Northern NuclearBel) (Northern Sangiric) (Northwest) (NuclearCordilleran) (NuclearTimor) (Nuclear PapuanTip) (Nuclear Polynesian) (Nuclear WestCentralPapuan) (Nukuoro) (NumbamiSib) (Numfor) (Nunusaku) (Ogan) (OldJavanes) (Orkon) (Oroha) (PaameseSou) (Paiwan) (Palauan) (PalawanBat) (Palawano) 0.08564 0.10504 0.04087 0.14865 0.08459 0.10099 0.01742 0.13055 3.974e-05 0.03605 0.18043 0.02050 0.00113 0.12968 0.18595 0.01151 0.17094 0.10317 0.04499 4.146e-04 0.02599 2.04429 18.92173 30.80231 1.01516 2.80832 4.22343 5.34206 1.52585 48.84748 5.99986 10.67064 9.64666 22.90283 22.81888 20.59925 1.99952 18.23342 11.06529 11.70163 36.88428 5.66511 (Pangasinan) 0.03899 25.70031 (PapuanTip) (Pasismanua) (Patpatar) (PatpatarTolai) (Paulohi) (Pazeh) (Penrhyn) (Perai) (Peripheral Central Bisayan) (Peripheral PapuanTip) (Pesisir) (PeteraraMa) (PhanRangCh) 0.15356 0.09977 0.00182 0.15474 0.18832 2.254e-04 0.02693 0.04788 0.02827 0.19334 0.07129 0.01242 0.00636 3.56136 3.11319 8.75307 6.20403 3.98409 6.81642 5.16875 4.13934 1.93754 3.54732 14.80076 17.18743 12.19166 (Polynesian) (Ponapean) (Ponapeic) (PonapeicTrukic) (Pondok) (Popalia) (ProtoCentr) (ProtoChuuk) (ProtoMalay) (ProtoMicro) (ProtoOcean) (Pukapuka) (PulauArgun) (PuloAnna) (PuloAnnan) (Puluwatese) (PunanKelai) (Puyuma) (Raga) (RajaAmpat) (RapanuiEas) (Rarotongan) (RejangReja) (RemoteOceanic) (Rennellese) (Ririo) (Riwo) (Roma) (Roro) (RotiTerman) (Rotuman) (Roviana) (Rukai) (Rurutuan) (SWPalawano) (Saa) 0.05871 0.16942 0.13099 0.04291 0.13262 0.00989 0.00248 0.09341 0.13485 0.04980 0.09981 0.00204 0.13318 8.595e-06 0.10821 0.10821 0.00695 0.02080 0.03550 0.10632 0.05254 0.23947 0.00259 0.30671 0.01560 0.01953 0.01094 0.00324 0.04267 0.05510 0.07237 0.00573 2.254e-04 0.00737 7.014e-04 0.00833 17.52595 10.52294 19.78210 17.55398 6.53158 10.98695 3.97165 2.83945 12.99231 13.02087 7.99270 8.90305 12.93777 28.37331 26.92081 32.33646 11.47865 5.97573 19.03806 10.89494 6.51440 2.99725 33.94479 1.87121 
46.78448 8.10758 1.23968 9.86839 7.79458 4.89359 18.85436 6.86288 9.26163 41.17228 10.93716 23.36029 (SaaSaaVill) (SaaUkiNiMa) 0.01151 0.01151 2.00000 1.99961 (Saaroa) (Sabahan) (SaipanCaro) (Saisiat) 0.00593 0.01254 0.10821 2.254e-04 29.89765 7.65235 30.24687 32.49754 continued on next page 21 continued from previous page Code (557) (558) (559) (560) (561) (562) (563) (564) (565) (566) (567) (568) (569) (570) (571) (572) (573) (574) (575) (576) (577) (578) (579) (580) (581) (582) (583) (584) (585) (586) (587) (588) (589) (590) (591) (592) (593) (594) (595) (596) (597) (598) (599) (600) (601) (602) (603) (604) (605) (606) (607) (608) (609) (610) (611) (612) (613) (614) (615) (616) (617) (618) (619) (620) (621) (622) (623) (624) (625) (626) (627) (628) (629) (630) (631) (632) (633) (634) (635) (636) (637) (638) Parent a (Vanuatu) > E v (Suauic) > h e (ProtoMalay) > a a (SuluBorneo) > 1 u (CentralLuzon) > o k (SamoicOutlier) > P r (Nuclear Polynesian) > l Not enough data available for reliable sound change estimates h (Northern Sangiric) > r q (Northern Sangiric) > P P (Northern Sangiric) > q a (Sulawesi) > e l (SanCristobal) > r O (SanCristobal) > o s (SouthNewIrelandNorthwestSolomonic) > h l (NehanNorthBougainville) > n a (Blaan) > u o (SarmiJayapuraBay) > a l (NorthNewGuinea) > r a (BaliSasak) > @ a (ProtoChuuk) > e a (BimaSumba) > e n (NorthNewGuinea) > N a (Atayalic) > u f (Western AdmiraltyIslands) > h u (NorthBomberai) > i b (SoutheastMaluku) > h Not enough data available for reliable sound change estimates E (Pasismanua) > e r (East CentralMaluku) > l l (Nunusaku) > r e (MaselaSouthBabar) > E f (NilaSerua) > w N (PatpatarTolai) > n n (FloresLembata) > N f (Ellicean) > h O (West NewGeorgia) > o a (CentralPapuan) > o a (LandDayak) > o u (EastFormosan) > o O (Choiseul) > o Not enough data available for reliable sound change estimates r (BimaSumba) > z a (Sarmi) > e k (CentralMaluku) > P a (NehanNorthBougainville) > e Not enough data available for reliable sound 
change estimates l (Ambon) > r r (NorthernLuzon) > l n (MaselaSouthBabar) > l a (CentralVanuatu) > e v (SouthHalmaheraWestNewGuinea) > p u (EasternMalayoPolynesian) > i i (NewIreland) > u u (Sulawesi) > o t (Babar) > k l (Bisayan) > y a (Manobo) > 1 y (West Barito) > i o (Eastern AdmiraltyIslands) > u R (ProtoCentr) > r r (CentralEasternOceanic) > l t (SouthCentralCordilleran) > b e (ProtoMalay) > 1 l (Malaita) > n o (South Babar) > u b (Timor) > w i (Vitiaz) > u u (Atayalic) > o k (Suauic) > P r (Nuclear PapuanTip) > l 1 (Subanun) > o a (SouthernPhilippine) > 1 g (Subanun) > d e (ProtoMalay) > a u (SamaBajaw) > o e (ProtoMalay) > o N (ProtoMalay) > n u (South Bisayan) > o a (Erromanga) > e t (SouthSulawesi) > P N (Tboli) > n Child Functional load Number of occurrences (SakaoPortO) (Saliba) (SamaBajaw) (SamalSiasi) (SambalBoto) (Samoan) (SamoicOutlier) 1.415e-05 0.01548 0.04499 0.03121 0.03570 7.913e-05 0.04761 7.37912 5.59439 11.37100 4.55207 51.44942 40.57180 2.52276 (SangilSara) (Sangir) (SangirTabu) (Sangiric) (SantaAna) (SantaCatal) (SantaIsabel) (SaposaTinputz) (SaranganiB) (Sarmi) (SarmiJayapuraBay) (Sasak) (Satawalese) (Savu) (Schouten) (Sediq) (Seimat) (Sekar) (Selaru) 0.01859 0.17663 0.17663 0.06360 0.20803 0.00562 0.01828 0.13005 0.10239 0.30507 0.09033 0.07763 0.11108 0.13262 0.12461 0.15623 0.03838 0.07251 0.00154 10.57112 33.66747 5.11841 9.41251 20.89020 1.99971 6.35726 8.23827 1.65720 3.74591 9.81639 7.22450 19.48892 10.66529 3.55253 4.94540 5.52135 14.57543 5.27609 (Sengseng) (Seram) (SeramStraits) (Serili) (Serua) (Siar) (Sika) (Sikaiana) (Simbo) (SinagoroKeapara) (Singhi) (Siraya) (Sisingga) 0.00635 0.17834 0.08036 0.03391 0.09929 0.11079 0.12345 0.01500 0.00573 0.25340 0.01654 4.190e-04 0.01953 6.51904 9.26050 17.65607 27.97054 7.71214 7.91352 10.33270 18.73830 3.68466 1.00322 17.32791 8.28489 13.70360 (Soa) (Sobei) (Soboyo) (Solos) 0.00401 0.23073 0.00549 0.10574 6.96601 10.06460 7.46123 4.14147 (SouAmanaTe) (SouthCentralCordilleran) 
(SouthEastB) (SouthEfate) (SouthHalmahera) (SouthHalmaheraWestNewGuinea) (SouthNewIrelandNorthwestSolomonic) (SouthSulawesi) (South Babar) (South Bisayan) (South Manobo) (South West Barito) (SoutheastIslands) (SoutheastMaluku) (SoutheastSolomonic) (SouthernCordilleran) (SouthernPhilippine) (Southern Malaita) (SouthwestBabar) (SouthwestMaluku) (SouthwestNewBritain) (SquliqAtay) (Suau) (Suauic) (SubanonSio) (Subanun) (SubanunSin) (Sulawesi) (SuluBorneo) (Sumatra) (Sunda) (Surigaonon) (SyeErroman) (TaeSToraja) (Tagabili) 0.03275 0.03231 0.25859 0.06242 0.04413 0.15907 0.17058 0.04575 0.08233 0.06356 0.09378 0.00602 0.01939 0.02692 0.18698 0.17112 0.00192 0.09412 0.05720 0.05085 0.23570 0.00681 0.02509 0.04206 0.03138 0.05979 0.15626 0.04499 0.02973 0.00512 0.10751 0.03947 0.11628 0.06897 0.10214 3.58352 7.74272 6.79647 9.08441 6.86021 2.91215 3.74428 8.90368 5.96986 5.93967 14.76928 4.16231 2.52762 16.51716 6.93482 1.97141 17.69275 3.02851 5.18676 6.39151 3.84871 13.92923 20.06806 7.42557 42.58704 13.92604 6.97087 11.63780 3.89781 23.81901 21.82474 24.45484 12.15031 3.10286 23.38929 continued on next page 22 continued from previous page Code (639) (640) (641) (642) (643) (644) (645) (646) (647) (648) (649) (650) (651) (652) (653) (654) (655) (656) (657) (658) (659) (660) (661) (662) (663) (664) (665) (666) (667) (668) (669) (670) (671) (672) (673) (674) (675) (676) (677) (678) (679) (680) (681) (682) (683) (684) (685) (686) (687) (688) (689) (690) (691) (692) (693) (694) (695) (696) (697) (698) (699) (700) (701) (702) (703) (704) (705) (706) (707) (708) (709) (710) (711) (712) (713) (714) (715) (716) (717) (718) (719) (720) Parent u (CentralPhilippine) > o k (Kalamian) > q q (Kalamian) > k u (Tahitic) > o e (Tahitic) > a a (Tahitic) > e f (Central East Nuclear Polynesian) > h v (SaposaTinputz) > f l (Ellicean) > r Not enough data available for reliable sound change estimates Not enough data available for reliable sound change estimates Not enough data available for 
reliable sound change estimates Not enough data available for reliable sound change estimates h (Guadalcanal) > G f (Wetar) > h e (Vanikoro) > a a (PatpatarTolai) > e e (Utupua) > u a (Vanuatu) > @ r (Tanna) > l a (Vanuatu) > e Not enough data available for reliable sound change estimates f (Sarmi) > p o (ButuanTausug) > u Not enough data available for reliable sound change estimates a (Bilic) > O O (Tboli) > o O (Vanikoro) > o l (SouthwestBabar) > n s (SaposaTinputz) > h N (East NuclearTimor) > n t (TeunNilaSerua) > P e (SouthwestMaluku) > E b (WesternPlains) > f u (LavongaiNalik) > a a (LavongaiNalik) > o r (Futunic) > l R (ProtoCentr) > r e (Dayic) > o s (Northern Malaita) > 8 k (Sumatra) > h s (SamoicOutlier) > h g (Guadalcanal) > h a (Tongic) > o s (Polynesian) > h l (North Minahasan) > d b (North Minahasan) > w n (MonoUruava) > l a (Tsouic) > o d (Formosan) > c n (Tahitic) > N a (Wetar) > i a (MunaButon) > o a (LavongaiNalik) > e u (Barito) > o s (Ellicean) > h p (Are) > f Not enough data available for reliable sound change estimates p (Aru) > f h (Nunusaku) > b o (Erromanga) > e l (MonoUruava) > r a (EasternOuterIslands) > o s (SamoicOutlier) > h r (Futunic) > l r (Futunic) > l u (Choiseul) > @ Not enough data available for reliable sound change estimates a (EasternOuterIslands) > e o (Vanikoro) > e o (RemoteOceanic) > u o (Choiseul) > O Not enough data available for reliable sound change estimates r (SinagoroKeapara) > l a (NgeroVitiaz) > e s (MesoMelanesian) > d t (HuonGulf) > r s (BimaSumba) > h Not enough data available for reliable sound change estimates v (CenderawasihBay) > w r (GeserGorom) > l l (AreTaupota) > r Child Functional load Number of occurrences (TagalogAnt) (TagbanwaAb) (TagbanwaKa) (Tahiti) (TahitianMo) (Tahitianth) (Tahitic) (Taiof) (Takuu) 0.01883 0.42879 0.42879 0.19238 0.23947 0.23947 0.05235 0.04497 0.01827 10.16369 5.41452 17.73591 6.98869 3.68949 2.71372 6.11830 6.79178 30.88407 (TalisePole) (Talur) (Tanema) (Tanga) (Tanimbili) 
(Tanna) (TannaSouth) (Tape) 5.670e-05 0.03346 0.28248 0.10990 0.10880 0.01056 0.05929 0.08559 1.97576 6.74401 10.81075 12.52910 6.71919 13.80984 26.23479 36.62961 (Tarpia) (TausugJolo) 0.09285 0.03983 4.76042 38.80244 (Tboli) (TboliTagab) (Teanu) (TelaMasbua) (Teop) (TetunTerik) (Teun) (TeunNilaSerua) (Thao) (Tiang) (Tigak) (Tikopia) (Timor) (TimugonMur) (Toambaita) (TobaBatak) (Tokelau) (Tolo) (Tongan) (Tongic) (Tonsea) (Tontemboan) (Torau) (Tsou) (Tsouic) (Tuamotu) (Tugun) (TukangbesiBonerate) (TungagTung) (Tunjung) (Tuvalu) (Ubir) 0.01846 0.02446 0.13019 0.17076 0.06456 0.02632 0.01477 0.00128 0.01723 0.23417 0.10194 0.05835 0.02692 0.00467 0.01236 0.01209 0.02620 0.04461 0.28401 0.04938 0.05604 0.12446 0.17242 0.00357 0.02777 0.00257 0.34504 0.19256 0.07781 0.00125 0.01029 0.00167 18.79298 2.78683 24.94671 14.35748 11.64051 3.88184 8.79107 6.36734 9.66201 7.41175 2.85222 3.75175 17.70656 20.83722 5.00367 10.18636 9.71242 9.43634 6.49634 24.85227 2.89546 9.61430 4.80739 26.52983 7.43830 17.01166 3.34154 6.52913 8.41577 6.26453 12.42513 1.99937 (UjirNAru) (UlatInai) (Ura) (Uruava) (Utupua) (UveaEast) (UveaWest) (VaeakauTau) (Vaghua) 0.01141 0.07007 0.05761 0.18018 0.19429 0.02620 0.05835 0.05835 3.654e-05 10.59153 6.57134 9.45115 8.71608 6.30089 28.16233 27.69390 41.42212 19.13794 (Vanikoro) (Vano) (Vanuatu) (Varisi) 0.33088 0.10677 0.11711 0.01953 4.43031 4.13392 14.34763 6.43868 (Vilirupu) (Vitiaz) (Vitu) (Wampar) (Wanukaka) 0.17037 0.13267 0.02619 0.06174 0.03743 8.15153 4.47748 6.27238 5.81284 14.17816 (Waropen) (Watubela) (Wedau) 0.08865 0.17521 0.04316 4.74200 13.68799 3.56623 continued on next page 23 continued from previous page Code (721) (722) (723) (724) (725) (726) (727) (728) (729) (730) (731) (732) (733) (734) (735) (736) (737) (738) (739) (740) (741) (742) (743) (744) (745) (746) (747) (748) (749) (750) (751) (752) (753) (754) (755) Parent Child i (BimaSumba) > e (WejewaTana) u (Seram) > v (Werinama) t (CentralPapuan) > k (WestCentralPapuan) a 
(ProtoCentr) > o (WestDamar) f (CentralPacific) > h (WestFijianRotuman) k (NortheastVanuatuBanksIslands) > h (WestSanto) a (Barito) > e (West Barito) a (Central Manobo) > 1 (West Central Manobo) a (Manus) > e (West Manus) O (NewGeorgia) > o (West NewGeorgia) r (NuclearTimor) > l (West NuclearTimor) Not enough data available for reliable sound change estimates 1 (West Central Manobo) > e (WesternBuk) s (WestFijianRotuman) > c (WesternFij) Not enough data available for reliable sound change estimates a (ProtoOcean) > e (WesternOceanic) d (Formosan) > s (WesternPlains) s (AdmiraltyIslands) > h (Western AdmiraltyIslands) a (MunaButon) > o (Western Munic) t (SouthwestMaluku) > k (Wetar) r (MesoMelanesian) > l (Willaumez) b (CentralWestern) > v (WindesiWan) P (Manam) > k (Wogeo) k (ProtoChuuk) > g (Woleai) a (ProtoChuuk) > e (Woleaian) a (Sulawesi) > o (Wolio) w (Western Munic) > v (Wuna) t (Western AdmiraltyIslands) > P (Wuvulu) a (HuonGulf) > e (Yabem) a (SamaBajaw) > e (Yakan) a (KeiTanimbar) > e (Yamdena) o (Bashiic) > u (Yami) a (ProtoOcean) > i (Yapese) a (Javanese) > e (Yogya) o (West SantaIsabel) > O (ZabanaKia) 24 Functional load Number of occurrences 0.04706 3.739e-05 0.13745 0.02891 0.00402 0.01801 0.06510 0.12386 0.16506 0.00631 0.14621 11.31057 7.78342 14.10036 9.89554 4.95883 2.59859 2.67923 21.47034 7.10396 2.83443 3.15074 4.647e-04 0.00926 76.71920 4.99871 0.12139 0.04913 0.01688 0.19256 0.09657 0.14060 0.16193 0.07162 0.00179 0.11108 0.06991 0.00508 0.00203 0.13370 0.04299 0.18822 0.00368 0.27352 0.08208 0.02351 2.58873 4.48311 4.34207 11.48592 3.25493 7.57978 3.04357 8.86144 35.44518 60.39379 8.71005 2.73532 11.89389 6.51629 27.73964 17.50706 5.79971 4.64937 4.09979 13.69682 Figure S.1: Percentage of words with varying levels of Levenshtein distance. Known Cognates (gold) were handannotated by linguists, while Automatic Cognates were found by our system. 
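The evaluation metric behind Figure S.1, Levenshtein (edit) distance, is computed by standard dynamic programming. The sketch below is illustrative only: the `auto` and `gold` word lists are hypothetical examples, not data from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))              # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                               # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute (free on match)
        prev = cur
    return prev[-1]

# Hypothetical reconstructed vs. reference forms, for illustration only.
auto = ["taNis", "lima", "susu"]
gold = ["taNis", "lima", "susuP"]
within_one = sum(levenshtein(x, y) <= 1 for x, y in zip(auto, gold)) / len(gold)
```

The headline figure of the paper, the share of reconstructions within one character of the manual reconstruction, corresponds to `within_one` computed over the full evaluation set.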
Figure S.2: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross-referenced with the code in parentheses attached to each branch.
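Table S.5 pairs each frequent change with the functional load of the affected sound, the quantity at the heart of the hypothesis, discussed in the main text, that sounds carrying higher functional load are less likely to change. A minimal sketch of how such paired values could be rank-correlated; the helper functions and the sample values are illustrative assumptions, not the paper's analysis code.

```python
def rankdata(xs):
    """Ranks 1..n for distinct values (no tie handling, for brevity)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for r, i in enumerate(order):
        ranks[i] = r + 1.0
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Made-up (functional load, occurrence count) pairs for illustration;
# real values would be taken row by row from Table S.5.
load   = [0.01, 0.05, 0.10, 0.20, 0.40]
counts = [30.0, 22.0, 15.0, 9.0, 4.0]
rho = spearman(load, counts)
```

A negative `rho` over the real table rows would be the direction predicted by the functional-load hypothesis.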
Figure S.3: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross-referenced with the code in parentheses attached to each branch.

Figure S.4: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross-referenced with the code in parentheses attached to each branch.
[Figure S.5 image (branch-annotated phylogeny of Austronesian languages, continued) not reproduced.]

Figure S.5: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross-referenced with the code in parentheses attached to each branch.
[Figure S.6 image not reproduced: three bar charts (A–C) with y-axes "Fraction of substitution errors", "Fraction of insertion errors", and "Fraction of deletion errors". The most frequent substitution errors in (A) include (p, v), (o, ə), (r, d), (ʀ, r), (p, b), (w, o), (i, e), (c, s), (u, i), (e, i), (b, p), (o, w), (i, u), (k, g), and (j, s).]

Figure S.6: Most common substitution errors, insertion errors, and deletion errors in the PAn reconstructions produced by our system. In (A), the first phoneme in each pair (x, y) represents the reference phoneme, followed by the incorrectly hypothesized one. In (B), each phoneme corresponds to a phoneme present in the automatic reconstruction but not in the reference, and vice versa in (C).