
Automated reconstruction of ancient languages using probabilistic models of sound change

Alexandre Bouchard-Côté^a,1, David Hall^b, Thomas L. Griffiths^c, and Dan Klein^b

^a Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada; ^b Computer Science Division and ^c Department of Psychology, University of California, Berkeley, CA 94720

Edited by Nick Chater, University of Warwick, Coventry, United Kingdom, and accepted by the Editorial Board December 22, 2012 (received for review March 19, 2012)
One of the oldest problems in linguistics is reconstructing the words that appeared in the protolanguages from which modern languages evolved. Identifying the forms of these ancient languages makes it possible to evaluate proposals about the nature of language change and to draw inferences about human history. Protolanguages are typically reconstructed using a painstaking manual process known as the comparative method. We present a family of probabilistic models of sound change as well as algorithms for performing inference in these models. The resulting system automatically and accurately reconstructs protolanguages from modern languages. We apply this system to 637 Austronesian languages, providing an accurate, large-scale automatic reconstruction of a set of protolanguages. Over 85% of the system's reconstructions are within one character of the manual reconstruction provided by a linguist specializing in Austronesian languages. Being able to automatically reconstruct large numbers of languages provides a useful way to quantitatively explore hypotheses about the factors determining which sounds in a language are likely to change over time. We demonstrate this by showing that the reconstructed Austronesian protolanguages provide compelling support for a hypothesis about the relationship between the function of a sound and its probability of changing that was first proposed in 1955.

ancestral | computational | diachronic
Reconstruction of the protolanguages from which modern languages are descended is a difficult problem, occupying historical linguists since the late 18th century. To solve this problem,
linguists have developed a labor-intensive manual procedure called
the comparative method (1), drawing on information about the
sounds and words that appear in many modern languages to hypothesize protolanguage reconstructions even when no written
records are available, opening one of the few possible windows
to prehistoric societies (2, 3). Reconstructions can help in understanding many aspects of our past, such as the technological
level (2), migration patterns (4), and scripts (2, 5) of early societies.
Comparing reconstructions across many languages can help reveal
the nature of language change itself, identifying which aspects
of language are most likely to change over time, a long-standing
question in historical linguistics (6, 7).
In many cases, direct evidence of the form of protolanguages is
not available. Fortunately, owing to the world’s considerable linguistic diversity, it is still possible to propose reconstructions by
leveraging a large collection of extant languages descended from a
single protolanguage. Words that appear in these modern languages can be organized into cognate sets that contain words suspected to have a shared ancestral form (Table 1). The key
observation that makes reconstruction from these data possible is
that languages seem to undergo a relatively limited set of regular
sound changes, each applied to the entire vocabulary of a language
at specific stages of its history (1). Still, several factors make reconstruction a hard problem. For example, sound changes are often
context sensitive, and many are string insertions and deletions.
In this paper, we present an automated system capable of
large-scale reconstruction of protolanguages directly from words
that appear in modern languages. This system is based on a
probabilistic model of sound change at the level of phonemes,
4224–4229 | PNAS | March 12, 2013 | vol. 110 | no. 11
building on work on the reconstruction of ancestral sequences
and alignment in computational biology (8–12). Several groups
have recently explored how methods from computational biology
can be applied to problems in historical linguistics, but such work
has focused on identifying the relationships between languages
(as might be expressed in a phylogeny) rather than reconstructing
the languages themselves (13–18). Much of this type of work has
been based on binary cognate or structural matrices (19, 20), which
discard all information about the form that words take, simply indicating whether they are cognate. Such models do not have the goal of reconstructing protolanguages and consequently use a representation that lacks the resolution required to infer ancestral
phonetic sequences. Using phonological representations allows us
to perform reconstruction and does not require us to assume that
cognate sets have been fully resolved as a preprocessing step. Representing the words at each point in a phylogeny and having a model
of how they change give a way of comparing different hypothesized
cognate sets and hence inferring cognate sets automatically.
The focus on problems other than reconstruction in previous
computational approaches has meant that almost all existing
protolanguage reconstructions have been done manually. However, to obtain more accurate reconstructions for older languages,
large numbers of modern languages need to be analyzed. The
Proto-Austronesian language, for instance, has over 1,200 descendant languages (21). All of these languages could potentially
increase the quality of the reconstructions, but the number of
possibilities increases considerably with each language, making it
difficult to analyze a large number of languages simultaneously.
The few previous systems for automated reconstruction of protolanguages or cognate inference (22–24) were unable to handle
this increase in computational complexity, as they relied on deterministic models of sound change and exact but intractable
algorithms for reconstruction.
Being able to reconstruct large numbers of languages also
makes it possible to provide quantitative answers to questions
about the factors that are involved in language change. We demonstrate the potential for automated reconstruction to lead to
novel results in historical linguistics by investigating a specific
hypothesized regularity in sound changes called functional load.
The functional load hypothesis, introduced in 1955, asserts that
sounds that play a more important role in distinguishing words are
less likely to change over time (6). Our probabilistic reconstruction
of hundreds of protolanguages in the Austronesian phylogeny
provides a way to explore this question quantitatively, producing
compelling evidence in favor of the functional load hypothesis.
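The functional load of a sound contrast can be operationalized in several ways. As a purely illustrative proxy (not necessarily the statistic used in this paper), one can count the minimal pairs that a contrast distinguishes in a word list: the more word pairs that differ only in the two sounds, the higher the load carried by that contrast.

```python
from itertools import combinations

def minimal_pairs(lexicon, x, y):
    """Count word pairs distinguished only by the contrast x/y.

    lexicon: iterable of words, each a tuple of phoneme symbols.
    A pair (w1, w2) counts if the words have equal length and differ
    in exactly one position, where one has x and the other has y.
    """
    count = 0
    for w1, w2 in combinations(set(lexicon), 2):
        if len(w1) != len(w2):
            continue
        diffs = [(a, b) for a, b in zip(w1, w2) if a != b]
        if len(diffs) == 1 and set(diffs[0]) == {x, y}:
            count += 1
    return count

# Toy lexicon: the /p/-/b/ contrast distinguishes two word pairs,
# while /m/-/n/ distinguishes none.
lex = [("p","a","t"), ("b","a","t"), ("p","i","t"), ("b","i","t"), ("m","a","t")]
minimal_pairs(lex, "p", "b")   # 2
minimal_pairs(lex, "m", "n")   # 0
```

Under the functional load hypothesis, contrasts with higher counts would be predicted to change less often.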
Author contributions: A.B.-C., D.H., T.L.G., and D.K. designed research; A.B.-C. and D.H.
performed research; A.B.-C. and D.H. contributed new reagents/analytic tools; A.B.-C.,
D.H., T.L.G., and D.K. analyzed data; and A.B.-C., D.H., T.L.G., and D.K. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission. N.C. is a guest editor invited by the Editorial
Board.
Freely available online through the PNAS open access option.
See Commentary on page 4159.
1To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1204678110/-/DCSupplemental.
www.pnas.org/cgi/doi/10.1073/pnas.1204678110
Table 1. Sample of reconstructions produced by the system

*Complete sets of reconstructions can be found in SI Appendix.
†Randomly selected by stratified sampling according to the Levenshtein edit distance Δ.
‡Levenshtein distance to a reference manual reconstruction, in this case the reconstruction of Blust (42).
§The colors encode cognate sets.
¶We use this symbol for encoding missing data.

Model

We use a probabilistic model of sound change and a Monte Carlo inference algorithm to reconstruct the lexicon and phonology of protolanguages given a collection of cognate sets from modern languages. As in other recent work in computational historical linguistics (13–18), we make the simplifying assumption that each word evolves along the branches of a tree of languages, reflecting the languages' phylogenetic relationships. The tree's internal nodes are languages whose word forms are not observed, and the leaves are modern languages. The output of our system is a posterior probability distribution over derivations. Each derivation contains, for each cognate set, a reconstructed transcription of ancestral forms, as well as a list of sound changes describing the transformation from parent word to child word. This representation is rich enough to answer a wide range of queries that would normally be answered by carrying out the comparative method manually, such as which sound changes were most prominent along each branch of the tree.

We model the evolution of discrete sequences of phonemes, using a context-dependent probabilistic string transducer (8). Probabilistic string transducers efficiently encode a distribution over the possible changes a string might undergo as it changes through time. Transducers are sufficient to capture most types of regular sound changes (e.g., lenitions, epentheses, and elisions) and can be sensitive to the context in which a change takes place. Most types of changes not captured by transducers are not regular (1) and are therefore less informative (e.g., metatheses, reduplications, and haplologies). Unlike the simple molecular InDel models used in computational biology, such as the TKF91 model (25), the parameterization of our model is very expressive: Mutation probabilities are context sensitive, depending on the neighboring characters, and each branch has its own set of parameters. This context-sensitive and branch-specific parameterization plays a central role in our system, allowing explicit modeling of sound changes.

Formally, let τ be a phylogenetic tree of languages, where each language is linked to the languages that descended from it. In such a tree, the modern languages, whose word forms will be observed, are the leaves of τ. The most recent common ancestor of these modern languages is the root of τ. Internal nodes of the tree (including the root) are protolanguages with unobserved word forms. Let L denote all languages, modern and otherwise. All word forms are assumed to be strings in the International Phonetic Alphabet (IPA).

We assume that word forms evolve along the branches of the tree τ. However, it is usually not the case that a word belonging to each cognate set exists in each modern language—words are lost or replaced over time, meaning that words that appear in the root languages may not have cognate descendants in the languages at the leaves of the tree. For the moment, we assume there is a known list of C cognate sets. For each c ∈ {1, ..., C}, let L(c) denote the subset of modern languages that have a word form in the cth cognate set. For each set c ∈ {1, ..., C} and each language ℓ ∈ L(c), we denote the modern word form by w_{cℓ}. For cognate set c, only the minimal subtree τ(c) containing L(c) and the root is relevant to the reconstruction inference problem for that set.

Our model of sound change is based on a generative process defined on this tree. From a high-level perspective, the generative process is quite simple. Let c be the index of the current cognate set, with topology τ(c). First, a word is generated for the root of τ(c), using an (initially unknown) root language model (i.e., a probability distribution over strings). The words that appear at other nodes of the tree are generated incrementally, using a branch-specific distribution over changes in strings to generate each word from the word in the language that is its parent in τ(c). Although this distribution differs across branches of the tree, making it possible to estimate the pattern of changes involved in the transition from one language to another, it remains the same for all cognate sets, expressing changes that apply stochastically to all words. The probabilities of substitution, insertion, and deletion also depend on the context in which the change occurs. Further details of the distributions that were used and their parameterization appear in Materials and Methods.

The flexibility of our model comes at the cost of having literally millions of parameters to set, creating challenges not found in most computational approaches to phylogenetics. Our inference algorithm learns these parameters automatically, using established principles from machine learning and statistics. Specifically, we use a variant of the expectation-maximization algorithm (26), which alternates between producing reconstructions on the basis of the current parameter estimates and updating the parameter estimates on the basis of those reconstructions. The reconstructions are inferred using an efficient Monte Carlo inference algorithm (27). The parameters are estimated by optimizing a cost function that penalizes complexity, allowing us to obtain robust estimates of large numbers of parameters. See SI Appendix, Section 1 for further details of the inference algorithm.

If cognate assignments are not available, our system can be applied directly to lists of words in different languages. In this case it automatically infers the cognate assignments as well as the reconstructions. This setting requires only two modifications to the model. First, because cognates are not available, we index the words by their semantic meaning (or gloss) g, and there are thus G groups of words. The model is then defined as in the previous case, with words indexed as w_{gℓ}. Second, the generative process is augmented with a notion of innovation, wherein a word w_{gℓ′} in some language ℓ′ may instead be generated independently from its parent word w_{gℓ}. In this instance, the word is generated from a language model as though it were a root string. In effect, the tree is "cut" at a language when innovation happens, and so the word begins anew. The probability of innovation in any given language is initially unknown and must be learned automatically along with the other branch-specific model parameters.
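The generative process described in the Model section is simple enough to sketch as a toy simulation. The sketch below is illustrative only: the uniform root language model, the small phoneme inventory, the (left-context, phoneme) parameterization, and all probability values are simplified stand-ins for the learned distributions used by the actual system (and insertions are omitted for brevity).

```python
import random

random.seed(0)

PHONEMES = list("ptkbdgaeiou")

def sample_root(max_len=5):
    """Root language model: here just a uniform distribution over
    short phoneme strings (a stand-in for the learned model)."""
    return [random.choice(PHONEMES) for _ in range(random.randint(2, max_len))]

def evolve(word, params):
    """Apply branch-specific stochastic edits to each phoneme.
    params maps a (left-context, phoneme) pair to substitution and
    deletion probabilities plus a substitution target; unseen
    contexts leave the phoneme unchanged."""
    out, left = [], "#"            # '#' marks the word boundary
    for ph in word:
        p_sub, p_del, sub = params.get((left, ph), (0.0, 0.0, ph))
        r = random.random()
        if r < p_del:
            pass                    # deletion
        elif r < p_del + p_sub:
            out.append(sub)         # context-dependent substitution
        else:
            out.append(ph)          # no change
        left = ph
    return out

def generate(tree, branch_params, p_innovate=0.05):
    """Sample one cognate set down a tree given as {language: parent}.
    With probability p_innovate a word is generated afresh from the
    language model, 'cutting' the tree at that language."""
    words = {}
    for lang, parent in tree.items():   # parents listed before children
        if parent is None or random.random() < p_innovate:
            words[lang] = sample_root()
        else:
            words[lang] = evolve(words[parent], branch_params[lang])
    return words

# Toy phylogeny: proto > (A, B); both branches tend to voice /p/ after /a/.
tree = {"proto": None, "A": "proto", "B": "proto"}
params = {lang: {("a", "p"): (0.9, 0.0, "b")} for lang in ("A", "B")}
print(generate(tree, params))
```

Because each branch carries its own parameters, patterns of change can differ between branches while still applying stochastically to every word in the language.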
Fig. 1. Quantitative validation of reconstructions and identification of some important factors influencing reconstruction quality. (A) Reconstruction error rates for a baseline (which consists of picking one modern word at random), our system, and the amount of disagreement between two linguists' manual reconstructions. Reconstruction error rates are Levenshtein distances normalized by the mean word form length so that errors can be compared across languages. Agreement between linguists was computed on only Proto-Oceanic because the dataset used lacked multiple reconstructions for other protolanguages. (B) The effect of the topology on the quality of the reconstruction. On one hand, the difference between reconstruction error rates obtained from the system that ran on an uninformed topology (first and second) and rates obtained from the system that ran on an informed topology (third and fourth) is statistically significant. On the other hand, the corresponding difference between a flat tree and a random binary tree is not statistically significant, nor is the difference between using the consensus tree of ref. 41 and the Ethnologue tree (29). This suggests that our method has a certain robustness to moderate topology variations. (C) Reconstruction error rate as a function of the number of modern languages used to train our automatic reconstruction system. Note that the error is not expected to go down to zero, perfect reconstruction being generally unidentifiable. The results in A and B are directly comparable: In fact, the entry labeled "Ethnologue" in B corresponds to the green Proto-Austronesian entry in A. The results in A and B and those in C are not directly comparable because the evaluation in C is restricted to those cognates with at least one reflex in the smallest evaluation set (to make the curve comparable across the horizontal axis of C).
Results
Our results address three questions about the performance of our
system. First, how well does it reconstruct protolanguages? Second,
how well does it identify cognate sets? Finally, how can this approach
be used to address outstanding questions in historical linguistics?
Protolanguage Reconstructions. To test our system, we applied it to
a large-scale database of Austronesian languages, the Austronesian Basic Vocabulary Database (ABVD) (28). We used a previously established phylogeny for these languages, the Ethnologue
tree (29) (we also describe experiments with other trees in Fig. 1).
For this first test of our system we also used the cognate sets
provided in the database. The dataset contained 659 languages at
the time of download (August 7, 2010), including a few languages
outside the Austronesian family and some manually reconstructed
protolanguages used for evaluation. The total data comprised
142,661 word forms and 7,708 cognate sets. The goal was to reconstruct the word in each protolanguage that corresponded to
each cognate set and to infer the patterns of sound changes along
each branch in the phylogeny. See SI Appendix, Section 2 for
further details of our simulations.
We used the Austronesian dataset to quantitatively evaluate the
performance of our system by comparing withheld words from
known languages with automatic reconstructions of those words.
The Levenshtein distance between the held-out and reconstructed
forms provides a measure of the number of errors in these
reconstructions. We used this measure to show that using more
languages helped reconstruction and also to assess the overall
performance of our system. Specifically, we compared the system’s
error rate on the ancestral reconstructions to a baseline and also to
the amount of divergence between the reconstructions of two
linguists (Fig. 1A). Given enough data, the system can achieve
reconstruction error rates close to the level of disagreement between manual reconstructions. In particular, most reconstructions
perfectly agree with manual reconstructions, and only a few contain big errors. Refer to Table 1 for examples of reconstructions.
See SI Appendix, Section 3 for the full lists.
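This evaluation can be reproduced with a standard dynamic-programming edit distance. The sketch below follows the normalization described in the Fig. 1 caption (mean distance divided by mean word form length); the exact normalization used in the experiments may differ in detail.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_error(pairs):
    """Mean Levenshtein distance over (reconstructed, reference) pairs,
    normalized by the mean reference word form length so that error
    rates are comparable across languages."""
    total = sum(levenshtein(r, g) for r, g in pairs)
    mean_len = sum(len(g) for _, g in pairs) / len(pairs)
    return (total / len(pairs)) / mean_len

levenshtein("anak", "anax")                              # 1
normalized_error([("anak", "anax"), ("lima", "lima")])   # 0.125
```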
We also present in Fig. 1B the effect of the tree topology on
reconstruction quality, reiterating the importance of using informative topologies for reconstruction. In Fig. 1C, we show that
the accuracy of our method increases with the number of observed Oceanic languages, confirming that large-scale inference
is desirable for automatic protolanguage reconstruction: Reconstruction improved statistically significantly with each increase
except from 32 to 64 languages, where the average edit distance
improvement was 0.05.
For comparison, we also evaluated previous automatic reconstruction methods. Because these methods do not scale to large datasets, we performed comparisons on smaller subsets
of the Austronesian dataset. We show in SI Appendix, Section 2
that our method outperforms these baselines.
We analyze the output of our system in more depth in Fig. 2 A–C, which show that the system learned a variety of realistic sound
changes across the Austronesian family (30). In Fig. 2D, we show
the most frequent substitution errors in the Proto-Austronesian
reconstruction experiments. See SI Appendix, Section 5 for
details and similar plots for the most common incorrect insertions and deletions.
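Substitution-error counts of the kind shown in Fig. 2D can be tallied by aligning each predicted reconstruction against its reference form and counting mismatched phonemes. The sketch below uses a plain Levenshtein alignment with a backtrace; the alignment actually used to produce the figure may differ.

```python
from collections import Counter

def align_ops(a, b):
    """Backtrace a Levenshtein alignment of sequences a and b, returning
    edit operations as ('sub', x, y), ('del', x), and ('ins', y) tuples."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            if a[i - 1] != b[j - 1]:
                ops.append(("sub", a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", a[i - 1]))
            i -= 1
        else:
            ops.append(("ins", b[j - 1]))
            j -= 1
    return ops

def substitution_errors(pairs):
    """Tally (predicted, reference) phoneme confusions over word pairs."""
    counts = Counter()
    for predicted, reference in pairs:
        for op in align_ops(predicted, reference):
            if op[0] == "sub":
                counts[(op[1], op[2])] += 1
    return counts

substitution_errors([("batu", "batu"), ("dua", "rua"), ("dara", "raru")])
```

Sorting the resulting counter by frequency gives the most common confusions, analogous to the entries plotted in Fig. 2D.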
Cognate Recovery. Previous reconstruction systems (22) required
that cognate sets be provided to the system. However, the creation of these large cognate databases requires considerable annotation effort on the part of linguists and often requires that at
least some reconstruction be done by hand. To demonstrate that
our model can accurately infer cognate sets automatically, we
used a version of our system that learns which words are cognate,
starting only from raw word lists and their meanings. This system
uses a faster but lower-fidelity model of sound change to infer
correspondences. We then ran our reconstruction system on
cognate sets that our cognate recovery system found. See SI Appendix, Section 1 for details.
This version of the system was run on all of the Oceanic languages in the ABVD, which comprise roughly half of the Austronesian languages. We then evaluated the pairwise precision
(the fraction of cognate pairs identified by our system that are also
in the set of labeled cognate pairs), pairwise recall (the fraction of
labeled cognate pairs identified by our system), and pairwise F1
measure (defined as the harmonic mean of precision and recall)
for the cognates found by our system against the known cognates
that are encoded in the ABVD. We also report cluster purity,
which is the fraction of words that are in a cluster whose known
cognate group matches the cognate group of the cluster. See SI
Appendix, Section 2.3 for a detailed description of the metrics.
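These pairwise metrics are standard clustering measures. As an illustrative sketch (the helper names are ours, not the authors' code), they can be computed from predicted and gold clusterings as follows:

```python
from itertools import combinations

def pairs(clusters):
    """All unordered within-cluster word pairs for a clustering (list of sets)."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

def pairwise_scores(predicted, gold):
    """Pairwise precision, recall, and F1 over cognate-pair decisions."""
    pred, ref = pairs(predicted), pairs(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def purity(predicted, gold):
    """Fraction of words whose predicted cluster's majority gold cognate
    group matches their own (one common definition of cluster purity)."""
    label = {w: i for i, c in enumerate(gold) for w in c}
    correct = 0
    for c in predicted:
        counts = {}
        for w in c:
            counts[label[w]] = counts.get(label[w], 0) + 1
        correct += max(counts.values())
    return correct / sum(len(c) for c in predicted)
```

With these definitions, a system that splits a gold cluster gets perfect precision on the pairs it does propose but reduced recall, matching the undergrouping behavior described below.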
Using these metrics, we found that our system achieved a precision of 0.844, recall of 0.621, F1 of 0.715, and cluster purity of
0.918. Thus, over 9 of 10 words are correctly grouped, and our
system errs on the side of undergrouping words rather than clustering words that are not cognates. Because the null hypothesis in
historical linguistics is to deem words to be unrelated unless
proved otherwise, a slight undergrouping is the desired behavior.
Bouchard-Côté et al.
Fig. 2. A more detailed analysis of the output of our system. (A) An Austronesian phylogenetic tree from ref. 29 used in our analyses. Each quadrant is
available in a larger format in SI Appendix, Figs. S2–S5, along with a detailed table of sound changes (SI Appendix, Table S5). The numbers in parentheses
attached to each branch correspond to rows in SI Appendix, Table S5. The colors and numbers in parentheses encode the most prominent sound change along
each branch, as inferred automatically by our system in SI Appendix, Section 4. (B) The most supported sound changes across the phylogeny, with the width of
links proportional to the support. Note that the standard organization of the IPA chart into columns and rows according to place, manner, height, and
backness is only for visualization purposes: This information was not encoded in the model in this experiment, showing that the model can recover realistic
cross-linguistic sound change trends. All of the arcs correspond to sound changes frequently used by historical linguists: sonorizations /p/ > /b/ (1) and /t/ > /d/
(2), voicing changes (3, 4), debuccalizations /f/ > /h/ (5) and /s/ > /h/ (6), spirantizations /b/ > /v/ (7) and /p/ > /f/ (8), changes of place of articulation (9, 10), and
vowel changes in height (11) and backness (12) (1). Whereas this visualization depicts sound changes as undirected arcs, the sound changes are actually
represented with directionality in our system. (C) Zooming in on a portion of the Oceanic languages, where the Nuclear Polynesian family (i) and Polynesian
family (ii) are visible. Several attested sound changes, such as debuccalization to Maori and the place of articulation change /t/ > /k/ to Hawaiian (30), are
successfully localized by the system. (D) Most common substitution errors in the PAn reconstructions produced by our system. The first phoneme in each pair
(x, y) represents the reference phoneme, followed by the incorrectly hypothesized one. Most of these errors could be plausible disagreements among human
experts. For example, the most dominant error (p, v) could arise from a disagreement over the phonemic inventory of Proto-Austronesian, whereas vowels are
common sources of disagreement.
Because we are ultimately interested in reconstruction, we then
compared our reconstruction system’s ability to reconstruct words
given these automatically determined cognates. Specifically, we
took every cognate group found by our system (run on the Oceanic
subclade) with at least two words in it. Then, we automatically
reconstructed the Proto-Oceanic ancestor of those words, using
our system. For evaluation, we then looked at the average Levenshtein distance from our reconstructions to the known reconstructions described in the previous sections. This time, however,
we average per modern word rather than per cognate group, to
provide a fairer comparison. (Results were not substantially different when averaging per cognate group.) Compared with reconstruction from manually labeled cognate sets, automatically
identified cognates led to an increase in error rate of only 12.8%, while significantly reducing the cost of curating linguistic
databases. See SI Appendix, Fig. S1 for the fraction of words with
each Levenshtein distance for these reconstructions.
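The Levenshtein (edit) distance used throughout this evaluation is standard dynamic programming; this sketch (our own code, not the authors') also illustrates the per-modern-word weighting described above:

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def per_word_error(reconstructions):
    """Average distance weighted per modern word rather than per cognate
    group. `reconstructions` holds (hypothesis, reference, group_size)."""
    total = sum(levenshtein(h, r) * n for h, r, n in reconstructions)
    return total / sum(n for _, _, n in reconstructions)
```

Weighting by group size means a poor reconstruction for a large cognate set counts more than one for a set with only two reflexes, which is the fairer comparison mentioned above.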
Functional Load. To demonstrate the utility of large-scale recon-
struction of protolanguages, we used the output of our system to
investigate an open question in historical linguistics. The functional load hypothesis (FLH), introduced in 1955 (6), claims that
the probability that a sound will change over time is related to
the amount of information provided by a sound. Intuitively, if
two phonemes appear only in words that are differentiated from
one another by at least one other sound, then one can argue that
no information is lost if those phonemes merge together, because no new ambiguous forms can be created by the merger.
A first step toward quantitatively testing the FLH was taken in
1967 (7). By defining a statistic that formalizes the amount of
information lost when a language undergoes a certain sound
change—on the basis of the proportion of words that are discriminated by each pair of phonemes—it became possible to
evaluate the empirical support for the FLH. However, this initial
investigation was based on just four languages and found little
evidence to support the hypothesis. This conclusion was criticized by several authors (31, 32) on the basis of the small number
of languages and sound changes considered, although they provided no positive counterevidence.
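A minimal sketch of a functional-load statistic in the spirit of King's word-discrimination measure (7) follows; the statistic used in the paper may differ in normalization and weighting, so this is illustrative only:

```python
def functional_load(lexicon, x, y):
    """Fraction of word pairs distinguished ONLY by the contrast x vs. y,
    i.e., pairs that become homophones if x and y merge. A simplified
    version of King's (1967) word-discrimination statistic."""
    merged = [tuple(y if p == x else p for p in w) for w in lexicon]
    n = len(lexicon)
    collisions = sum(
        merged[i] == merged[j] and lexicon[i] != lexicon[j]
        for i in range(n) for j in range(i + 1, n)
    )
    total_pairs = n * (n - 1) // 2
    return collisions / total_pairs if total_pairs else 0.0
```

Under this definition, a phoneme pair that distinguishes no minimal pairs has zero functional load, so merging it loses no lexical information, which is exactly the intuition stated above.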
Using the output of our system, we collected sound change
statistics from our reconstruction of 637 Austronesian languages,
including the probability of a particular change as estimated by our
system. These statistics provided the information needed to give
a more comprehensive quantitative evaluation of the FLH, using
a much larger sample than previous work (details in SI Appendix,
Section 2.4). We show in Fig. 3 A and B that this analysis provides
clear quantitative evidence in favor of the FLH. The revealed
pattern would not be apparent had we not been able to reconstruct
large numbers of protolanguages and supply probabilities of different kinds of change taking place for each pair of languages.
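The aggregation behind this analysis (visualized in Fig. 3) can be sketched as a two-dimensional log-count histogram; bin counts and axis ranges here are our own illustrative choices, not the paper's:

```python
import math

def heatmap_bins(points, nx=20, ny=20, xmax=1e-4, ymax=1.0):
    """Accumulate (functional load, merger posterior) points into a 2-D
    grid of log-counts, as in a heat-map visualization."""
    grid = [[0] * nx for _ in range(ny)]
    for x, y in points:
        i = min(int(x / xmax * nx), nx - 1)  # functional-load bin
        j = min(int(y / ymax * ny), ny - 1)  # merger-posterior bin
        grid[j][i] += 1
    return [[math.log1p(c) for c in row] for row in grid]
```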
Discussion
We have developed an automated system capable of large-scale
reconstruction of protolanguage word forms, cognate sets, and
sound change histories. The analysis of the properties of hundreds of ancient languages performed by this system goes far
beyond the capabilities of any previous automated system and
would require significant amounts of manual effort by linguists.
Furthermore, the system is in no way restricted to applications
like assessing the effects of functional load: It can be used as
a tool to investigate a wide range of questions about the structure
and dynamics of languages.
In developing an automated system for reconstructing ancient
languages, it is by no means our goal to replace the careful
reconstructions performed by linguists. It should be emphasized
that the reconstruction mechanism used by our system ignores
many of the phenomena normally used in manual reconstructions. We have mentioned limitations due to the transducer
PNAS | March 12, 2013 | vol. 110 | no. 11 | 4227
COMPUTER SCIENCES
Fig. 3. Increasing the number of languages we can reconstruct gives new ways to approach questions in historical linguistics, such as the effect of functional
load on the probability of merging two sounds. The plots shown are heat maps where the color encodes the log of the number of sound changes that fall
into a given two-dimensional bin (horizontal axis: functional load; vertical axis: merger posterior). Each sound change x > y is encoded as a pair of numbers
in the unit square, (l, m), as explained in Materials and Methods. To convey the amount of noise one could expect from a study with the number of languages
that King previously used (7), we first show in A the heat map visualization for four languages. Next, we show the same plot for 637 Austronesian languages
in B. Only in this latter setup is structure clearly visible: Most of the points with high probability of merging can be seen to have comparatively low
functional load, providing evidence in favor of the functional load hypothesis introduced in 1955. See SI Appendix, Section 2.4 for details.
formalism but other limitations include the lack of explicit modeling of changes at the level of the phoneme inventories used by
a language and the lack of morphological analysis. Challenges
specific to the cognate inference task, for example, difficulties with
polymorphisms, are also discussed in more detail in SI Appendix.
Another limitation of the current approach stems from the assumption that languages form a phylogenetic tree, an assumption
violated by borrowing, dialect variation, and creole languages.
However, we believe our system will be useful to linguists in
several ways, particularly in contexts where there are large numbers of languages to be analyzed. Examples might include using
the system to propose short lists of potential sound changes and
correspondences across highly divergent word forms.
An exciting possible application of this work is to use the
model described here to infer the phylogenetic relationships
between languages jointly with reconstructions and cognate sets.
This will remove a source of circularity present in most previous
computational work in historical linguistics. Systems for inferring
phylogenies such as ref. 13 generally assume that cognate sets are
given as a fixed input, but cognacy as determined by linguists is in
turn motivated by phylogenetic considerations. The phylogenetic
tree hypothesized by the linguist is therefore affecting the tree
built by systems using only these cognates. This problem can be
avoided by inferring cognates at the same time as a phylogeny,
something that should be possible using an extended version of
our probabilistic model.
Our system is able to reconstruct the words that appear in
ancient languages because it represents words as sequences of
sounds and uses a rich probabilistic model of sound change. This
is an important step forward from previous work applying computational ideas to historical linguistics. By leveraging the full
sequence information available in the word forms in modern
languages, we hope to see in historical linguistics a breakthrough
similar to the advances in evolutionary biology prompted by the
transition from morphological characters to molecular sequences
in phylogenetic analysis.
Materials and Methods
This section provides a more detailed specification of our probabilistic
model. See SI Appendix, Section 1.2 for additional content on the algorithm and simulations.
Distributions. The conditional distributions over pairs of evolving strings are
specified using a lexicalized stochastic string transducer (33).
Consider a language ℓ′ evolving to ℓ for cognate set c. Assume we have a word form x = w_{cℓ′}. The generative process for producing y = w_{cℓ} works as
follows. First, we consider x to be composed of characters x_1 x_2 . . . x_n, with the first and last ones being a special boundary symbol x_1 = x_n = # ∈ Σ, which is
never deleted, mutated, or created. The process generates y = y_1 y_2 . . . y_n in n chunks y_i ∈ Σ*, i ∈ {1, . . . , n}, one for each x_i. The y_i s may be a single
character, multiple characters, or even empty. To generate y_i, we define a mutation Markov chain that incrementally adds zero or more characters to an
initially empty y_i. First, we decide whether the current phoneme in the top word t = x_i will be deleted, in which case y_i = ε (the probabilities of the
4228 | www.pnas.org/cgi/doi/10.1073/pnas.1204678110
decisions taken in this process depend on a context to be specified shortly).
If t is not deleted, we choose a single substitution character in the bottom
word. We write S = Σ ∪ {ζ} for this set of outcomes, where ζ is the special outcome indicating deletion. Importantly, the probabilities of this multinomial can depend on both the previous character generated so far (i.e., the
rightmost character p of y_{i−1}) and the current character in the previous
generation string (t), providing a way to make changes context sensitive.
This multinomial decision acts as the initial distribution of the mutation
Markov chain. We consider insertions only if a deletion was not selected in
the first step. Here, we draw from a multinomial over S, where this time the
special outcome ζ corresponds to stopping insertions, and the other elements of S correspond to symbols that are appended to y_i. In this case, the
conditioning environment is t = x_i and the current rightmost symbol p in y_i.
Insertions continue until ζ is selected. We use θ_{S,t,p,ℓ} and θ_{I,t,p,ℓ} to denote the
probabilities over the substitution and insertion decisions in the current
branch ℓ′ → ℓ. A similar process generates the word at the root ℓ of a tree or
when an innovation happens at some language ℓ, treating this word as
a single string y_1 generated from a dummy ancestor t = x_1. In this case, only
the insertion probabilities matter, and we separately parameterize these
probabilities with θ_{R,t,p,ℓ}. There is no actual dependence on t at the root or
innovative languages, but this formulation allows us to unify the parameterization, with each θ_{ω,t,p,ℓ} ∈ R^{|Σ|+1}, where ω ∈ {R, S, I}. During cognate inference, the decision to innovate is controlled by a simple Bernoulli random
variable n_{cℓ} for each language in the tree. When known cognate groups are
assumed, n_{cℓ} is set to 0 for all nonroot languages and to 1 for the root
language. These Bernoulli distributions have parameters ν_ℓ.
Mutation distributions confined to the family of transducers miss certain
phylogenetic phenomena. For example, the process of reduplication (as in
"bye-bye") is a well-studied mechanism for deriving morphological and lexical forms that is not explicitly captured by transducers. The same
situation arises with metathesis (e.g., Old English frist > English first). However, these changes are generally not regular and therefore less informative
(1). Moreover, because we are using a probabilistic framework, these events
can still be handled in our system, even though their costs will not be discounted as much as they should be.
Note also that the generative process described in this section does not
allow explicit dependencies on the next character in ℓ. Relaxing this assumption is possible in principle by using weighted transducers, but at the
cost of a more computationally expensive inference problem (caused by the
transducer normalization computation) (34). A simpler approach is to use
the next character in the parent ℓ′ as a surrogate for the next character in ℓ.
Using the context in the parent word is also more closely aligned with the standard
representation of sound change used in historical linguistics, where the
context is defined on the parent as well.
More generally, dependencies limited to a bounded context on the parent
string can be incorporated in our formalism. By bounded, we mean that it
should be possible to fix an integer k beforehand such that all of the
modeled dependencies are within k characters of the string operation. The
caveat is that the computational cost of inference grows exponentially in k.
We leave open the question of handling computation in the face of unbounded dependencies such as those induced by harmony (35).
Parameterization. Instead of directly estimating the transition probabilities of
the mutation Markov chain (which could be done, in principle, by taking them
to be the parameters of a collection of multinomial distributions), we express
them as the output of a multinomial logistic regression model (36). This parameterization takes the form
$$\theta_{\omega,t,p,\ell} = \theta_{\omega,t,p,\ell}(\xi;\lambda) = \frac{\exp\{\langle \lambda,\, f(\omega,t,p,\ell,\xi)\rangle\}}{Z(\omega,t,p,\ell,\lambda)} \times \mu(\omega,t,\xi), \tag{1}$$
where ξ ∈ S, f : {S, I, R} × Σ × Σ × L × S → R^k is the feature function (which
indicates which features apply for each event), ⟨·, ·⟩ denotes inner product,
and λ ∈ R^k is a weight vector. Here, k is the dimensionality of the feature space
of the logistic regression model. In the terminology of exponential families, Z
and μ are the normalization function and the reference measure, respectively:
$$Z(\omega,t,p,\ell,\lambda) = \sum_{\xi' \in S} \exp\{\langle \lambda,\, f(\omega,t,p,\ell,\xi')\rangle\}$$

$$\mu(\omega,t,\xi) = \begin{cases} 0 & \text{if } \omega = S,\; t = \#,\; \xi \neq \# \\ 0 & \text{if } \omega = R,\; \xi = \zeta \\ 0 & \text{if } \omega \neq R,\; \xi = \# \\ 1 & \text{otherwise.} \end{cases}$$
Here, μ is used to handle boundary conditions, ensuring that the resulting
probability distribution is well defined.
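Eq. 1 can be sketched in code. For simplicity we fold the reference measure μ into the normalization so the returned values form a proper distribution over outcomes; the feature encoding here is our own illustrative choice:

```python
import math

def theta(omega, t, p, lam, features, mu, outcomes):
    """Sketch of Eq. 1. `features(omega, t, p, xi)` returns a sparse feature
    dict, `lam` is the weight vector (a dict), and `mu(omega, t, xi)` is the
    0/1 reference measure masking illegal boundary outcomes. We normalize
    after applying mu so the result is a proper distribution."""
    def score(xi):
        f = features(omega, t, p, xi)
        return math.exp(sum(lam.get(k, 0.0) * v for k, v in f.items()))
    weights = {xi: score(xi) * mu(omega, t, xi) for xi in outcomes}
    z = sum(weights.values())
    return {xi: w / z for xi, w in weights.items()}
```

A sparse feature dictionary keyed by tuples mirrors the role of the feature function f: weights are shared across contexts that activate the same features, which is the point of the log-linear parameterization.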
During cognate inference, the innovation Bernoulli random variables n_{cℓ}
are similarly parameterized, using a logistic regression model with two kinds
of features: a global innovation feature κ_global ∈ R and a language-specific
feature κ_ℓ ∈ R. The success probability ν_ℓ of each n_{cℓ} then takes the form

$$\nu_\ell = \frac{1}{1 + \exp\{-\kappa_{\text{global}} - \kappa_\ell\}}. \tag{2}$$
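Eq. 2 is a standard logistic (sigmoid) function; a direct transcription:

```python
import math

def innovation_prob(kappa_global, kappa_lang):
    """Eq. 2: per-language innovation probability as a logistic function of
    a global feature weight and a language-specific feature weight."""
    return 1.0 / (1.0 + math.exp(-kappa_global - kappa_lang))
```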
ACKNOWLEDGMENTS. This work was supported by Grant IIS-1018733 from
the National Science Foundation.
21. Lynch J, ed (2003) Issues in Austronesian (Pacific Linguistics, Canberra, Australia).
22. Oakes M (2000) Computer estimation of vocabulary in a protolanguage from word
lists in four daughter languages. J Quant Linguist 7:233–244.
23. Kondrak G (2002) Algorithms for Language Reconstruction. PhD thesis (Univ of Toronto, Toronto).
24. Ellison TM (2007) Bayesian identification of cognates and correspondences. Assoc
Comput Linguist 45:15–22.
25. Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 33(2):114–124.
26. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data
via the EM algorithm. J R Stat Soc B 39:1–38.
27. Bouchard-Côté A, Jordan MI, Klein D (2009) Efficient inference in phylogenetic InDel
trees. Adv Neural Inf Process Syst 21:177–184.
28. Greenhill SJ, Blust R, Gray RD (2008) The Austronesian basic vocabulary database:
From bioinformatics to lexomics. Evol Bioinform Online 4:271–283.
29. Lewis MP, ed (2009) Ethnologue: Languages of the World (SIL International, Dallas,
TX, 16th Ed).
30. Lyovin A (1997) An Introduction to the Languages of the World (Oxford Univ Press,
Oxford, UK).
31. Hockett CF (1967) The quantification of functional load. Word 23:320–339.
32. Surendran D, Niyogi P (2006) Competing Models of Linguistic Change. Evolution and
Beyond (Benjamins, Amsterdam).
33. Varadarajan A, Bradley RK, Holmes IH (2008) Tools for simulating evolution of aligned
genomic regions with integrated parameter estimation. Genome Biol 9(10):R147.
34. Mohri M (2009) Handbook of Weighted Automata, Monographs in Theoretical
Computer Science, eds Droste M, Kuich W, Vogler H (Springer, Berlin).
35. Hansson GO (2007) On the evolution of consonant harmony: The case of secondary
articulation agreement. Phonology 24:77–120.
36. McCullagh P, Nelder JA (1989) Generalized Linear Models (Chapman & Hall, London).
37. Dreyer M, Smith JR, Eisner J (2008) Latent-variable modeling of string transductions
with finite-state methods. Empirical Methods on Natural Language Processing 13:
1080–1089.
38. Chen SF (2003) Conditional and joint models for grapheme-to-phoneme conversion.
Eurospeech 8:2033–2036.
39. Goldwater S, Johnson M (2003) Learning OT constraint rankings using a maximum
entropy model. Proceedings of the Workshop on Variation Within Optimality Theory
eds Spenader J, Eriksson A, Dahl Ö (Stockholm University, Stockholm) pp 113–122.
40. Wilson C (2006) Learning phonology with substantive bias: An experimental and
computational study of velar palatalization. Cogn Sci 30(5):945–982.
41. Gray RD, Drummond AJ, Greenhill SJ (2009) Language phylogenies reveal expansion
pulses and pauses in Pacific settlement. Science 323(5913):479–483.
42. Blust R (1999) Subgrouping, circularity and extinction: Some issues in Austronesian
comparative linguistics. Inst Linguist Acad Sinica 1:31–94.
PNAS | March 12, 2013 | vol. 110 | no. 11 | 4229
1. Hock HH (1991) Principles of Historical Linguistics (Mouton de Gruyter, The Hague,
Netherlands).
2. Ross M, Pawley A, Osmond M (1998) The Lexicon of Proto Oceanic: The Culture and
Environment of Ancestral Oceanic Society (Pacific Linguistics, Canberra, Australia).
3. Diamond J (1999) Guns, Germs, and Steel: The Fates of Human Societies (WW Norton,
New York).
4. Nichols J (1999) Archaeology and Language: Correlating Archaeological and Linguistic
Hypotheses, eds Blench R, Spriggs M (Routledge, London).
5. Ventris M, Chadwick J (1973) Documents in Mycenaean Greek (Cambridge Univ Press,
Cambridge, UK).
6. Martinet A (1955) Économie des Changements Phonétiques [Economy of phonetic
sound changes] (Maisonneuve & Larose, Paris).
7. King R (1967) Functional load and sound change. Language 43:831–852.
8. Holmes I, Bruno WJ (2001) Evolutionary HMMs: A Bayesian approach to multiple
alignment. Bioinformatics 17(9):803–820.
9. Miklós I, Lunter GA, Holmes I (2004) A “Long Indel” model for evolutionary sequence
alignment. Mol Biol Evol 21(3):529–540.
10. Suchard MA, Redelings BD (2006) BAli-Phy: Simultaneous Bayesian inference of
alignment and phylogeny. Bioinformatics 22(16):2047–2048.
11. Liberles DA, ed (2007) Ancestral Sequence Reconstruction (Oxford Univ Press, Oxford, UK).
12. Paten B, et al. (2008) Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Res 18(11):1829–1843.
13. Gray RD, Jordan FM (2000) Language trees support the express-train sequence of
Austronesian expansion. Nature 405(6790):1052–1055.
14. Ringe D, Warnow T, Taylor A (2002) Indo-European and computational cladistics.
Trans Philol Soc 100:59–129.
15. Evans SN, Ringe D, Warnow T (2004) Inference of Divergence Times as a Statistical Inverse Problem, McDonald Institute Monographs, eds Forster P, Renfrew C (McDonald
Institute, Cambridge, UK).
16. Gray RD, Atkinson QD (2003) Language-tree divergence times support the Anatolian
theory of Indo-European origin. Nature 426(6965):435–439.
17. Nakhleh L, Ringe D, Warnow T (2005) Perfect phylogenetic networks: A new methodology for reconstructing the evolutionary history of natural languages. Language
81:382–420.
18. Bryant D (2006) Phylogenetic Methods and the Prehistory of Languages, eds Forster P,
Renfrew C (McDonald Institute for Archaeological Research, Cambridge, UK), pp
111–118.
19. Daumé H III, Campbell L (2007) A Bayesian model for discovering typological implications. Assoc Comput Linguist 45:65–72.
20. Dunn M, Levinson S, Lindstrom E, Reesink G, Terrill A (2008) Structural phylogeny
in historical linguistics: Methodological explorations applied in Island Melanesia.
Language 84:710–759.
model specifies a distribution over transition probabilities by assigning
weights to a set of features that describe properties of the sound changes
involved. These features provide a more coherent representation of the
transition probabilities, capturing regularities in sound changes that reflect
the underlying linguistic structure.
We used the following feature templates: OPERATION, which identifies
whether an operation in the mutation Markov chain is an insertion, a deletion, a substitution, a self-substitution (i.e., of the form x > y, x = y), or the
end of an insertion event; MARKEDNESS, which consists of language-specific
n-gram indicator functions for all symbols in Σ (during reconstruction, only
unigram and bigram features are used for computational reasons; for cognate
inference, only unigram features are used); and FAITHFULNESS, which consists of
indicators for mutation events of the form 1[x > y], where x ∈ Σ, y ∈ S.
Feature templates similar to these can be found, for instance, in the work of
refs. 37 and 38, in the context of string-to-string transduction models used in
computational linguistics. Using these features and parameter sharing, the
logistic regression model defines the transition probabilities of the mutation
process and the root language model to be

\theta_C(\xi; \lambda) = \frac{\exp\{\langle \lambda, f(C, \xi) \rangle\}}{\sum_{\xi'} \exp\{\langle \lambda, f(C, \xi') \rangle\}},

where f(C, ξ) is the feature vector for outcome ξ in context C and λ is the
weight vector. This approach to specifying the transition probabilities produces an interesting connection to stochastic optimality theory (39,
40), where a logistic regression model mediates markedness and faithfulness
of the production of an output form from an underlying input form.
Data sparsity is a significant challenge in protolanguage reconstruction.
Although the experiments we present here use an order of magnitude more
languages than previous computational approaches, the increase in observed
data also brings with it additional unknowns in the form of intermediate
protolanguages. Because there is one set of parameters for each language,
adding more data is not sufficient to increase the quality of the reconstruction; it is important to share parameters across different branches in the
tree to benefit from having observations from more languages. We used the
following technique to address this problem: we augment the parameterization to include the current language ℓ (i.e., the language at the bottom of the
current branch) and use a single, global weight vector instead of a set of
branch-specific weights. Generalization across branches is then achieved
by using features that ignore ℓ, whereas branch-specific features depend
on ℓ. Similarly, all of the features in OPERATION, MARKEDNESS, and
FAITHFULNESS have universal and branch-specific versions.
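The sharing scheme above can be sketched concretely. The following is a minimal illustration (not the paper's actual implementation; the feature names, the toy operation label, and the two-way sharing are simplifications) of emitting each feature in both a universal and a branch-specific version, with a softmax turning weights into transition probabilities:

```python
import math

def features(lang, op, x, y):
    # Each base feature is emitted twice: once keyed "universal"
    # (shared across all branches) and once keyed by the language,
    # mirroring the universal + branch-specific versions described above.
    base = [("OP", op), ("FAITH", x, y)]
    feats = {}
    for f in base:
        feats[("universal",) + f] = 1.0
        feats[(lang,) + f] = 1.0
    return feats

def transition_probs(weights, lang, x, candidates):
    # Logistic-regression (softmax) distribution over candidate outcomes y.
    scores = {y: sum(weights.get(k, 0.0) for k in features(lang, "sub", x, y))
              for y in candidates}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}
```

With all weights at zero the distribution is uniform; raising the weight of a universal feature shifts every branch at once, while a language-keyed weight affects only that branch.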
Supporting Information:
Automated reconstruction of ancient languages using
probabilistic models of sound change
Alexandre Bouchard-Côté
David Hall
Thomas L. Griffiths
Dan Klein
This Supporting Information describes the learning and inference algorithms used by our system (Section 1),
details of simulations supporting the accuracy of our system and analyzing the effects of functional load (Section 2),
lists of reconstructions (Section 3), lists of sound changes (Section 4), and analysis of frequent errors (Section 5).
1
Learning and inference
The generative model introduced in the Materials and Methods section of the main paper sets us up with two
problems to solve: estimating the values of the parameters characterizing the distribution on sound changes on
each branch of the tree, and inferring the optimal values of the strings representing words in the unobserved
protolanguages. Section 1.1 introduces the full objective function that we need to optimize in order to estimate
these quantities. Section 1.2 describes the Monte Carlo Expectation-Maximization algorithm we used for solving
the learning problem. We present the algorithm for inferring ancestral word forms in Section 1.4.
1.1
Full objective function
The generative model specified in the Materials and Methods section of the main paper defines an objective function that we can optimize in order to find good protolanguage reconstructions. This objective function takes the
form of a regularized log-likelihood, combining the probability of the observed languages with additional constraints intended to deal with data sparsity. This objective function can be written concisely if we let Pλ(·) and Pλ(·|·)
denote the root and branch probability models described in the Materials and Methods section of the paper (with
transition probabilities given by the above logistic regression model), I(c) the set of internal (non-leaf) nodes in
τ(c), pa(ℓ) the parent of language ℓ, r(c) the root of τ(c), and W(c) = (Σ*)^{|I(c)|}. The full objective function is
then
\mathcal{L}(\lambda, \kappa) = \sum_{c=1}^{C} \log \sum_{\vec{w} \in W(c)} P_\lambda(w_{c,r(c)}) \prod_{\ell \in I(c)} P_\lambda(w_{c,\ell} \mid w_{c,\mathrm{pa}(\ell)}, n_{c\ell}) \, P_\kappa(n_{c\ell} \mid \nu_{c\ell}) \;-\; \frac{\|\lambda\|_2^2 + \|\kappa\|_2^2}{2\sigma^2} \qquad (1)
where the second term is a standard L2 regularization penalty intended to reduce over-fitting due to data sparsity
(we used σ 2 = 1) [1]. The goal of learning is to find the value of λ, the parameters of the logistic regression model
for the transition probabilities, that maximizes this function.
1.2
A Monte Carlo Expectation-Maximization algorithm for reconstruction
Optimization of the objective function given in Equation 1 is done using a Monte Carlo variant of the Expectation-Maximization (EM) algorithm [2]. This algorithm breaks down into two steps: an E step in which the objective
function is approximated and an M step in which this approximate objective function is optimized. The M step
is convex and computed using L-BFGS [3] but the E step is intractable [4], in part because it requires solving
the problem of inferring the words in the protolanguages. We approximate the solution to this inference problem
using a Markov chain Monte Carlo (MCMC) algorithm [5]. This algorithm repeatedly samples words from the
protolanguages until it converges on the distribution implied by our generative model. Since this procedure is
only guaranteed to find a local maximum of the objective, we ran it with several different random initializations of the
model parameters. The next two subsections provide the details of these two parts of our system.
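The overall alternation can be sketched as follows. This is a generic Monte Carlo EM skeleton, not the paper's implementation: `sample_posterior` stands in for the MCMC E step over protolanguage words, `maximize` for the convex L-BFGS-based M step, and the growing sample budget is one possible increasing-precision schedule of the kind described below.

```python
import random

def mcem(init_params, sample_posterior, maximize, n_iters=5, seed=0):
    """Monte Carlo EM: alternate a sample-based E step with a convex M step."""
    rng = random.Random(seed)
    params = init_params
    for it in range(n_iters):
        # E step: draw posterior samples of the latent quantities; the
        # sample budget grows with the EM iteration so later iterations
        # use a tighter approximation of the objective.
        samples = sample_posterior(params, n_samples=10 * (it + 1), rng=rng)
        # M step: maximize the approximate expected complete log-likelihood.
        params = maximize(samples)
    return params
```

Because the procedure only reaches a local optimum, in practice it is restarted from several random initializations and the best run is kept.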
1.2.1
E step: Inferring the posterior over string reconstructions
In the E step, the inference problem is to compute an expectation under the posterior over strings in a protolanguage
given observed word forms at the leaves of the tree.1 The typical approach in biological InDel models [6] is to
use Gibbs sampling, where the entire string at each node in the tree is repeatedly resampled, conditioned on its
parent and children. We will call this method Single Sequence Resampling (SSR). While conceptually simple, this
approach suffers from mixing problems in large trees, since it can take a long time for information to propagate
from one region of the tree to another [6]. Consequently, we use a different MCMC procedure, called Ancestry
Resampling (AR) that alleviates these mixing problems. This method was originally introduced for biological
applications [7], but commonalities between the biological and linguistic cases make it possible to use it in our
model.
Concretely, the problem with SSR arises when the tree under consideration is large or unbalanced. In this case,
it can take a long time for information from the observed languages to propagate to the root of the tree. Indeed,
samples at the root will initially be independent of the observations. AR addresses this problem by resampling one
thin vertical slice of all sequences at a time, called an ancestry (for the precise definition of the algorithm, see [7]).
Slices condition on observed data, avoiding mixing problems, and can propagate information rapidly across the
tree. We ran the ancestry resampling algorithm for a number of iterations that increased linearly with the number of
iterations of the EM algorithm that had been completed, resulting in an approximation regime that could allow the
EM algorithm to converge to a solution [8]. To speed up the large experiments, we also used an approximation in
AR. This approximation is based on fixing the values of set-valued auxiliary variables zg, where zg = {wg,ℓ} and
ℓ ranges over the set of all languages (both internal and at the leaves). Conditioning on these variables, sampling
wg,ℓ | zg can be done exactly using dynamic programming and rejection sampling.
1.2.2
M step: Convex optimization of the approximate objective
In the M step, we individually update the parameters λ and κ as specified in Equation 1. We show how λ is updated
in this section; κ can be optimized similarly. Let C = (ω, t, p, ℓ) denote local transducer contexts from the space
C = {S, I, R} × Σ × Σ × L of all such contexts. Let N (C, ξ) be the expected number of times the transition
ξ was used in context C in the preceding E-step. Given these sufficient statistics, the estimate of λ is given by
optimizing the expected complete (regularized) log-likelihood O(λ) derived from the original objective function
given in Equation [1] in the Materials and Methods section of the main paper (ignoring terms that do not involve λ),
O(\lambda) = \sum_{C \in \mathcal{C}} \sum_{\xi \in S} N(C, \xi) \left[ \langle \lambda, f(C, \xi) \rangle - \log \sum_{\xi'} \exp\{\langle \lambda, f(C, \xi') \rangle\} \right] - \frac{\|\lambda\|^2}{2\sigma^2}.

1 To be precise: the posterior is over both protolanguage strings and the derivations between these strings and the modern words.
We use L-BFGS [3] to optimize this convex objective function. L-BFGS requires the partial derivatives

\frac{\partial O(\lambda)}{\partial \lambda_j} = \sum_{C \in \mathcal{C}} \sum_{\xi \in S} N(C, \xi) \left[ f_j(C, \xi) - \sum_{\xi'} \theta_C(\xi'; \lambda) f_j(C, \xi') \right] - \frac{\lambda_j}{\sigma^2}
= \hat{F}_j - \sum_{C \in \mathcal{C}} \sum_{\xi' \in S} N(C, \cdot) \, \theta_C(\xi'; \lambda) f_j(C, \xi') - \frac{\lambda_j}{\sigma^2},

where F̂j = Σ_{C∈C} Σ_{ξ∈S} N(C, ξ) fj(C, ξ) is the empirical feature vector and N(C, ·) = Σ_ξ N(C, ξ) is the
number of times context C was used. F̂j and N(C, ·) do not depend on λ and thus can be precomputed at the
beginning of the M step, thereby speeding up each L-BFGS iteration.
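As a sanity check on this algebra, the gradient can be computed with the statistics precomputed once per M step. The sketch below uses dict-based sparse features; the names (`N`, `features`, `outcomes`) are illustrative stand-ins for the expected counts, feature function, and transition set of the paper.

```python
import math

def precompute_stats(N, features):
    # Empirical feature expectations F̂ and context totals N(C, ·);
    # neither depends on λ, so both are computed once per M step.
    Fhat, Ntot = {}, {}
    for (C, xi), n in N.items():
        Ntot[C] = Ntot.get(C, 0.0) + n
        for j, v in features(C, xi).items():
            Fhat[j] = Fhat.get(j, 0.0) + n * v
    return Fhat, Ntot

def gradient(lmbda, N, features, outcomes, sigma2=1.0):
    Fhat, Ntot = precompute_stats(N, features)
    grad = dict(Fhat)  # first term of the derivative
    for C, total in Ntot.items():
        # softmax θ_C(ξ'; λ) over the outcomes available in context C
        scores = {xi: sum(lmbda.get(j, 0.0) * v
                          for j, v in features(C, xi).items())
                  for xi in outcomes}
        z = sum(math.exp(s) for s in scores.values())
        for xi, s in scores.items():
            p = math.exp(s) / z
            for j, v in features(C, xi).items():
                grad[j] = grad.get(j, 0.0) - total * p * v
    for j in lmbda:  # L2 regularization term
        grad[j] = grad.get(j, 0.0) - lmbda[j] / sigma2
    return grad
```

At λ = 0 the softmax is uniform, so each component reduces to the observed count minus half the context total, which is easy to verify by hand.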
1.3
Approximate Expectation-Maximization for cognate inference
The procedure for cognate inference is similar: we again operate within the Expectation-Maximization framework,
and the M-steps are identical. However, because of different characteristics of the cognate inference problem, we
make a few different choices for approximations for the E-step. The Monte Carlo inference algorithm described
in the preceding section works exceedingly well for reconstructing words for known cognates. However, a Monte
Carlo approach to determining which words are cognate requires resampling the innovation variables ng` , which
is likely to lead to slow mixing of the Markov chain.
We therefore take a different approach when doing cognate inference. Here, rather than performing inference
over all possible reconstructions for all words, we restrict our system to reconstructions that correspond to some
observed modern word with the same meaning. This simplification is of course incorrect in general, but it works well in practice.
Specifically, we perform inference on the tree for each gloss using message passing [9], also known as pruning
in the computational biology literature [10], where each message µ(wg` ) has a non-zero score only when wg` is
one of the observed modern word forms in gloss g. Inference yields expected alignment counts for
each character in each language, along with the expected number of innovations that occur at each language. These
expectations can then be used in the convex M-step.
1.4
Ancestral word form reconstruction
In the E step described in the preceding section, a posterior distribution π over ancestral sequences given observed
forms is approximated by a collection of samples X1 , X2 , . . . XS . In this section, we describe how this distribution
is summarized to produce a single output string for each cognate set.
This algorithm is based on a fundamental Bayesian decision theoretic concept: Bayes estimators. Given a loss
function over strings Loss : Σ∗ × Σ∗ → [0, ∞), an estimator is a Bayes estimator if it belongs to the random set:
\operatorname*{argmin}_{x \in \Sigma^*} \mathbb{E}_\pi \, \mathrm{Loss}(x, X) = \operatorname*{argmin}_{x \in \Sigma^*} \sum_{y \in \Sigma^*} \mathrm{Loss}(x, y) \, \pi(y).
Bayes estimators are not only optimal within the Bayesian decision framework, but also satisfy frequentist optimality criteria such as admissibility [11]. In our case, the loss we used is the Levenshtein [12] distance, denoted
Loss(x, y) = Lev(x, y) (we discuss this choice in more detail in Section 2).
Since we do not have access to π, but only to an approximation based on S samples, the objective function
we use for reconstruction can be rewritten as follows:
\operatorname*{argmin}_{x \in \Sigma^*} \sum_{y \in \Sigma^*} \mathrm{Lev}(x, y) \, \pi(y) \approx \operatorname*{argmin}_{x \in \Sigma^*} \frac{1}{S} \sum_{s=1}^{S} \mathrm{Lev}(x, X_s) = \operatorname*{argmin}_{x \in \Sigma^*} \sum_{s=1}^{S} \mathrm{Lev}(x, X_s).
The raw samples contain both derivations and strings for all protolanguages, whereas we are only interested in
reconstructing words in a single protolanguage. This is addressed by marginalization, which is done in sampling
representations by simply discarding the irrelevant information. Hence, the random variables Xs in the above
equation can be viewed as being string-valued random variables.
Note that the optimum is not changed if we restrict the minimization to be taken on x ∈ Σ∗ such that m ≤ |x| ≤
M where m = mins |Xs |, M = maxs |Xs |. However, even with this simplification, optimization is intractable.
As an approximation, we considered only strings built by at most k contiguous substrings taken from the word
forms in X1 , X2 , . . . , XS . If k = 1, then it is equivalent to taking the min over {Xs : 1 ≤ s ≤ S}. At the other
end of the spectrum, if k = S, it is exact. This scheme is exponential in k, but since words are relatively short, we
found that k = 2 often finds the same solution as higher values of k.
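The approximation can be sketched as follows. This is a toy version: for clarity it enumerates the k = 2 candidate set naively (all pairs of substrings), rather than with the pruned search one would use at scale.

```python
def levenshtein(a, b):
    # standard dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def bayes_estimate(samples, k=2):
    # Minimize the summed Levenshtein loss over candidates built from at
    # most k contiguous substrings of the samples; k = 1 reduces to
    # picking the best of the samples themselves.
    subs = {s[i:j] for s in samples
            for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    candidates = set(samples)
    if k >= 2:
        candidates |= {u + v for u in subs for v in subs}
    # restrict to the length window [min |X_s|, max |X_s|]
    m, M = min(map(len, samples)), max(map(len, samples))
    candidates = [x for x in candidates if m <= len(x) <= M]
    return min(candidates, key=lambda x: sum(levenshtein(x, s) for s in samples))
```

For instance, given samples "kata", "kato", and "gata", the summed-loss minimizer is "kata" (total distance 2), which matches the intuition that the estimator behaves like a string median.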
1.5
Finding cognate groups
Our cognate model finds cuts in the phylogeny in order to determine which words are cognate with one another.
However, this approach cannot straightforwardly handle instances where the evolution of the words did not follow
strictly treelike behavior. This limitation applies to polymorphisms—where multiple words for a given meaning
are available in a language—and also to borrowing—where the words did not evolve according to the phylogeny.
However, we can modify the inference procedure to capture these kinds of behaviors using a post hoc agglomerative “merge” procedure. Specifically, we can run our procedure to find an initial set of cognate groups, and
then merge those cognate groups that produce an increase in model score. That is, we create several initial small
subtrees containing some cognates, and then stitch them together into one or more larger trees. Thus, non-treelike
behaviors like borrowing will be represented as multiple trees that “overlap.” For instance, if two languages each
have two words for one meaning (say A and B), then the initial stage might find that the two A’s are cognate,
leaving the two B’s as singleton cognate groups. However, merging these two words into a single group will likely
produce a gain in likelihood, and so we can merge them. Note that this procedure is unlikely to work well for long-distance borrowings, as the sound changes involved in long-distance borrowing are likely to be very different from
those that occur along the phylogeny. Nevertheless, we found this procedure to be effective in practice.
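The post hoc merge can be sketched as a greedy loop. Here `score` is a stand-in for the model's log-likelihood of a cognate group, which is not reproduced; the greedy pair-at-a-time strategy is an assumption about one reasonable way to realize the procedure.

```python
def merge_groups(groups, score):
    # Greedy agglomerative merging: merge any two cognate groups whose
    # union improves the total model score; repeat until no pair gains.
    groups = [frozenset(g) for g in groups]
    improved = True
    while improved:
        improved = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                a, b = groups[i], groups[j]
                if score(a | b) > score(a) + score(b):
                    # replace the pair with its union and restart the scan
                    groups = [g for k, g in enumerate(groups) if k not in (i, j)]
                    groups.append(a | b)
                    improved = True
                    break
            if improved:
                break
    return groups
```

With a superadditive score every group is eventually merged; with a subadditive one nothing merges, so the loop terminates with the initial partition intact.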
2
Experiments
In this section, we give more details on the results in the main paper, namely those concerning validation using
reconstruction error rate, and those measuring the effect of the tree topology and the number of languages. We
also include additional comparisons to other reconstruction methods, as well as cognate inference results. In
Section 2.1, we analyze in isolation the effects of varying the set of features, the number of observed languages,
the topology, and the number of iterations of EM. In Section 2.2 we compare performance to an oracle and to two
other systems.
Evaluation of all methods was done by computing the Levenshtein distance [12] (uniform-cost edit distance)
between the reconstruction produced by each method and the reconstruction produced by linguists. The Levenshtein distance is the minimum number of substitutions, insertions, or deletions of a phoneme required to transform
one word to another. While the Levenshtein distance misses important aspects of phonology (not all phoneme substitutions are equal, for instance), it is parameter-free and still correlates to a large extent with the linguistic quality
of a reconstruction. It is also superior to held-out log-likelihood, which fails to penalize errors in the modeling
assumptions, and to measuring the percentage of perfect reconstructions, which ignores the degree of correctness
of each reconstructed word. We averaged this distance across reconstructed words to report a single number for
each method. The statistical significance of all performance differences is assessed using a paired t-test with a
significance level of 0.05.
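This evaluation takes only a few lines to reproduce. The paired t statistic below is the standard formula (significance would then be read from a t table with n − 1 degrees of freedom); the helper names are illustrative.

```python
from math import sqrt

def levenshtein(a, b):
    # uniform-cost edit distance between two phoneme strings
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def mean_distance(system, reference):
    # average edit distance between system and reference reconstructions
    return sum(levenshtein(s, r) for s, r in zip(system, reference)) / len(system)

def paired_t(d1, d2):
    # paired t statistic on per-word distances of two competing systems
    diffs = [a - b for a, b in zip(d1, d2)]
    n = len(diffs)
    m = sum(diffs) / n
    var = sum((x - m) ** 2 for x in diffs) / (n - 1)
    return m / sqrt(var / n)
```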
2.1
Evaluating system performance
We used the Austronesian Basic Vocabulary Database (ABVD) [13] as the basis for a series of experiments used
to evaluate the performance of our system and the factors relevant to its success. The database, downloaded from
http://language.psy.auckland.ac.nz/austronesian/
on August 7, 2010, includes partial cognacy judgments and IPA transcriptions,2 as well as several reconstructed
protolanguages.
In our main experiments, we used the tree topology induced by the Ethnologue classification [14] in order
to facilitate interpretability of the results (i.e. so that we can give well-known names to clades in the tree in
our analyses). Our method does not require specifying branch lengths since our unsupervised learning procedure
provides a more flexible way of estimating the amount of change between points of the phylogenetic tree.
The first claim we verified experimentally is that having more observed languages aids reconstruction of protolanguages. In the results in this section, we used the subset of the languages under Proto-Oceanic (POc) to
speed up the computations. To test this hypothesis, we added observed modern languages in increasing order of
distance dc to the target reconstruction of POc, so that the languages that are most useful for POc reconstruction
are added first. This prevents the effect of adding a close language after several distant ones from being confused
with an improvement produced by increasing the number of languages.
The results are reported in Figure 1(c) of the main paper. They confirm that large-scale inference is desirable
for automatic protolanguage reconstruction: reconstruction improved statistically significantly with each increase
except from 32 to 64 languages, where the average edit distance improvement was 0.05.
We then conducted a number of experiments intended to identify the contributions made by the different factors
our system incorporates. We found that all of the following ablations significantly hurt reconstruction: using a flat tree (in
which all languages are equidistant from the reconstructed root and from each other) instead of the consensus tree,
dropping the markedness features, dropping the faithfulness features, and disabling sharing across branches. The
results of these experiments are shown in Table S.1.
For comparison, we also included in the same table the performance of a semi-supervised system trained by
K-fold validation. The system was run K = 5 times, with a fraction 1 − K⁻¹ of the POc words given to the system as
observations in the graphical model for each run. It is semi-supervised in the sense that target reconstructions for
many internal nodes are not available in the dataset, so those nodes remain unobserved.3
2.2
Comparisons against other methods
The first competing method, PRAGUE, was introduced in [15]. In this method, the word forms in a given protolanguage are reconstructed using a Viterbi multi-alignment between a small number of its descendant languages. The
alignment is computed using hand-set parameters. Deterministic rules characterizing changes between pairs of
observed languages are extracted from the alignment when their frequency is higher than a threshold, and a protophoneme inventory is built using linguistically motivated rules and parsimony. A reconstruction of each observed
word is first proposed independently for each language. If at least two reconstructions agree, a majority vote is
taken, otherwise no reconstruction is proposed. This approach has several limitations. First, it is not tractable for
larger trees, since the time complexity of their multi-alignment algorithm grows exponentially in the number of
languages. Second, deterministic rules, while elegant in theory, are not robust to noise: even in experiments with
only four daughter languages, a large fraction of the words could not be reconstructed.
Since PRAGUE does not scale to large datasets, we also built a second, more tractable baseline. This new
baseline system, CENTROID, computes the centroid of the observed word forms in Levenshtein distance. Let
2 While most word forms in ABVD are encoded using IPA, there are a few exceptions: for example, language-family-specific conventions
such as the usage of the okina symbol (’) for glottal stops (ʔ) in some Polynesian languages, or ad hoc conventions such as using the bigram
‘ng’ for the agma symbol (ŋ). We have preprocessed as many of these exceptions as possible, and probabilistic methods are generally robust to
reasonable amounts of encoding glitches.
3 We also tried a fully supervised system where a flat topology is used so that all of these latent internal nodes are avoided; but it did not
perform as well—this is consistent with the -Topology experiment of Table S.1.
Lev(x, y) denote the Levenshtein distance between word forms x and y. Ideally, we would like the baseline
system to return:
\operatorname*{argmin}_{x \in \Sigma^*} \sum_{y \in O} \mathrm{Lev}(x, y),
where O = {y1, . . . , y|O|} is the set of observed word forms. This objective function is motivated by Bayesian
decision theory [11] and is similar to the more sophisticated Bayes estimator described in Section 1.4; however, it
replaces the samples obtained by MCMC with the set of observed words. As in Section 1.4, we also restrict the
minimization to x ∈ Σ(O)* such that m ≤ |x| ≤ M and to strings built from at most k contiguous substrings taken
from the word forms in O, where m = mini |yi|, M = maxi |yi|, and Σ(O) is the set of characters occurring in O.
Again, we found that k = 2 often finds the same solution as higher values of k. The difference was in all cases not
statistically significant, so we report the approximation k = 2 in what follows.
We also compared against an oracle, denoted ORACLE, which returns

\operatorname*{argmin}_{y \in O} \mathrm{Lev}(y, x^*),

where x* is the target reconstruction. This is superior to picking a single closest language to be used for all word
forms, but it is possible for systems to perform better than the oracle, since the oracle has to return one of the
observed word forms. Of course, this scheme is only available to assess system performance on held-out data: it
cannot make new predictions.
We compared against the system proposed in [15] on the same dataset and under the same experimental
conditions as in that work. The PMJ dataset was compiled by [16], who also reconstructed the corresponding
protolanguage. Since PRAGUE is not guaranteed to return a reconstruction for each cognate set, only 55 word forms
could be directly compared to our system. We restricted comparison to this subset of the data. This favors PRAGUE
since the system only proposes a reconstruction when it is certain. Still, our system outperformed PRAGUE, with
an average distance of 1.60 compared to 2.02 for PRAGUE. The difference is marginally significant, p = 0.06,
partly due to the small number of word forms involved.
To get a more extensive comparison, we considered a hybrid system that returns PRAGUE’s reconstruction
when possible and otherwise backs off to the Sundanese (Snd.) modern form, then Madurese (Mad.), Malay (Mal.)
and finally Javanic (Jv.) (the optimal back-off order). In this case, we obtained an edit distance of 1.86 using our
system against 2.33 for PRAGUE, a statistically significant difference.
We also compared against ORACLE and CENTROID in a large-scale setting. Specifically, we used the
experimental setup described above, in which 64 modern languages are used to reconstruct POc. Encouragingly, while
the system’s average distance (1.49) does not attain that of the ORACLE (1.13), we significantly outperform the
CENTROID baseline (1.79).
2.3
Cognate recovery
To test the effectiveness of our cognate recovery system, we ran our system on all of the Oceanic languages in the
ABVD, which comprises roughly half of the Austronesian languages. We then evaluated the pairwise precision,
recall, F1, and purity scores, defined as follows. Let G = {G1 , G2 , . . . , Gg } denote the known partitions of the
forms into cognates, and let F = {F1 , F2 , . . . , Ff } denote the inferred partitions. Let pairs(F ) denote the set of
unordered pairs of indices in the same partition in F: pairs(F) = {{i, j} : ∃k s.t. i, j ∈ Fk, i ≠ j}, and similarly
for pairs(G). The metrics are defined as follows:

\mathrm{precision} = \frac{|\mathrm{pairs}(G) \cap \mathrm{pairs}(F)|}{|\mathrm{pairs}(F)|}, \qquad
\mathrm{recall} = \frac{|\mathrm{pairs}(G) \cap \mathrm{pairs}(F)|}{|\mathrm{pairs}(G)|},

\mathrm{F1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \qquad
\mathrm{purity} = \frac{1}{N} \sum_{f} \max_{g} |G_g \cap F_f|.
Using these metrics, we found that our system achieved a precision of 84.4, recall of 62.1, F1 of 71.5, and
cluster purity of 91.8. Thus, over 9 out of 10 words are correctly grouped, and our system errs on the side of under-grouping words rather than clustering words that are not cognates. Since the null hypothesis in historical linguistics
is to deem words to be unrelated unless proven otherwise, a slight under-grouping is the desired behavior.
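All four metrics can be computed directly from the two partitions. A minimal sketch (set-based, with hashable word identifiers assumed):

```python
from itertools import combinations

def pair_set(partition):
    # unordered pairs of items placed in the same cluster
    return {frozenset(p) for part in partition for p in combinations(part, 2)}

def clustering_metrics(gold, found):
    pg, pf = pair_set(gold), pair_set(found)
    precision = len(pg & pf) / len(pf)
    recall = len(pg & pf) / len(pg)
    f1 = 2 * precision * recall / (precision + recall)
    # purity: each found cluster votes for its best-matching gold cluster
    n = sum(len(part) for part in found)
    purity = sum(max(len(set(f) & set(g)) for g in gold) for f in found) / n
    return precision, recall, f1, purity
```

Note that singleton clusters contribute no pairs, so pairwise precision and recall ignore isolates while purity still counts them.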
Since we are ultimately interested in reconstruction, we then compared our reconstruction system’s ability to
reconstruct words given these automatically determined cognates. Specifically, we took every cognate group found
by our system (run on the Oceanic subclade) with at least two words in it. That is, we excluded words that our
system found to be isolates. Then, we automatically reconstructed the Proto-Oceanic ancestor of those words using
our system (using the auxiliary variables z described in Section 1.2.1).
For evaluation, we then looked at the average edit distance from our reconstructions to the known reconstructions described in the previous sections. This time, however, we average per modern word rather than per cognate
group, to provide a fairer comparison. (Results were not substantially different averaging per cognate group.)
Using known cognates from the ABVD, there was an average reconstruction error of 2.19, versus 2.47 for the
automatically reconstructed cognates, or an increase in error rate of 12.8%. The fraction of words with each Levenshtein distance for these reconstructions is shown in Figure S.1. While the plots are similar, the automatic cognates
exhibit a slightly longer tail. Thus, even with automatic cognates, the reconstruction system can reconstruct words
faithfully in many cases, only failing in a few instances.
2.4
Computation of functional loads
To measure functional load quantitatively, we used the same estimator as the one used in [17]. This definition
is based on associating a context vector cx,` to each phoneme and language. For a given language with N (`)
phoneme tokens, these context vectors are defined as follows: first, fix an enumeration order of all the contexts
found in the corpus, where the context is defined as a pair of phonemes, one at the left and one at the right of a
position. Element i in this enumeration will correspond to component i of the context vectors. Then, the value of
component i in context vector cx,ℓ is set to be the number of times phoneme x occurred in context i in language
ℓ. Finally, King’s definition of functional load FLℓ(x, y) is the normalized dot product of the two induced context vectors:
\mathrm{FL}_\ell(x, y) = \frac{1}{N(\ell)^2} \langle c_{x,\ell}, c_{y,\ell} \rangle = \frac{1}{N(\ell)^2} \sum_i c_{x,\ell}(i) \times c_{y,\ell}(i),
where the denominator is simply a normalization that ensures FLℓ(x, y) ≤ 1. Note that if x and y are in complementary distribution in language ℓ, then the two vectors cx,ℓ and cy,ℓ are orthogonal, and the functional load is
zero in this case.
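The estimator can be sketched from raw token counts. Two assumptions are made here for illustration: word boundaries are padded with '#' so edge phonemes get a context, and counts come from observed strings, whereas the paper uses expected counts extracted from the sampler output.

```python
from collections import Counter

def context_vectors(words, pad="#"):
    # c[x][(left, right)] = number of times phoneme x occurs between
    # phonemes left and right; n = total number of phoneme tokens
    c, n = {}, 0
    for w in words:
        s = pad + w + pad  # boundary padding (an assumption, see above)
        for i in range(1, len(s) - 1):
            c.setdefault(s[i], Counter())[(s[i - 1], s[i + 1])] += 1
            n += 1
    return c, n

def functional_load(x, y, words):
    # King's estimator: dot product of context vectors, normalized by n^2
    c, n = context_vectors(words)
    cx, cy = c.get(x, Counter()), c.get(y, Counter())
    return sum(cx[k] * cy[k] for k in cx) / n ** 2
```

As the text notes, phonemes in complementary distribution have orthogonal context vectors, so their functional load is exactly zero.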
In Figure 3 of the main paper, we show heat maps where the color encodes the log of the number of sound
changes that fall into a given 2-dimensional bin. Each sound change x > y is encoded as a pair of numbers in the
unit interval, (l̂, m̂), where l̂ is an estimate of the functional load of the pair and m̂ is the posterior fraction of the
instances of the phoneme x that undergo a change to y. We now describe how l̂ and m̂ were estimated. The posterior
fraction m̂ for the merger x > y between languages pa(`) → ` is easily computed from the same expected
sufficient statistics used for parameter estimation:
\hat{m}_\ell(x > y) = \frac{\sum_{p \in \Sigma} N(S, x, p, \ell, y)}{\sum_{p' \in \Sigma} \sum_{y' \in \Sigma} N(S, x, p', \ell, y')}.
The estimate of the functional load requires additional statistics, namely the expected context vectors $\hat c_{x,\ell}$ and expected phoneme token counts $\hat N(\ell)$, but these can be readily extracted from the output of the MCMC sampler. The estimate is then:
\[
\hat l_\ell(x, y) = \frac{1}{\hat N(\ell)^2}\,\big\langle \hat c_{x,\ell},\, \hat c_{y,\ell}\big\rangle.
\]
Finally, the set of points used to construct the heat map is:
\[
\Big\{ \big(\hat l_{\mathrm{pa}(\ell)}(x, y),\; \hat m_\ell(x > y)\big) : \ell \in L \setminus \{\mathrm{root}\},\; x \in \Sigma,\; y \in \Sigma,\; x \neq y \Big\}.
\]
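A toy sketch of the $\hat m$ computation, with hypothetical expected sufficient statistics for a single fixed branch keyed as $(x, p, y)$; pairing each such $\hat m$ with the corresponding $\hat l$ from the parent language yields the heat-map points:

```python
from collections import defaultdict

# Hypothetical expected sufficient statistics N(S, x, p, l, y) for one fixed
# branch: expected number of times phoneme x in context p changed to y.
N = defaultdict(float)
N[("p", "a", "f")] = 8.0    # p > f after a
N[("p", "u", "f")] = 4.0    # p > f after u
N[("p", "a", "p")] = 20.0   # p left unchanged
N[("p", "u", "p")] = 8.0

def merger_fraction(N, x, y):
    """Posterior fraction of x tokens changing to y, contexts summed out."""
    num = sum(v for (x_, _, y_), v in N.items() if x_ == x and y_ == y)
    den = sum(v for (x_, _, _), v in N.items() if x_ == x)
    return num / den

m_hat = merger_fraction(N, "p", "f")   # 12 / 40 = 0.3
```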
3 Reconstruction lists
In Tables S.2–S.4, we show the lists of consensus reconstructions produced by our system (the 'Automatic' column). For comparison, we also include a baseline (a randomly picked modern word) and the edit distances (shown only when greater than zero). For Proto-Oceanic, since two manual reconstructions are available, we include the distances of the automatic reconstruction to both manual reconstructions ('P-A' and 'B-A') as well as the distance between the two manual reconstructions ('B-P').
4 Sound changes
In Figures S.2–S.5, we zoom in on and rotate each quadrant of the tree shown in Figure 2 of the main paper. For information on the most frequent change along each branch, refer to the row in Table S.5 corresponding to the code in parentheses attached to that branch. By most frequent, we mean the change with the highest expected count in the last EM iteration, collapsing contexts for simplicity. In cases where no change is observed with an expected count of more than 1.0, we skip the corresponding entry; this can happen, for example, for languages whose cognacy information was too sparse in the dataset. The functional load, normalized to lie in the interval [0, 1], is also shown for reference.
5 Frequent errors
In Figure S.6, we analyze the frequent discrepancies between the PAn reconstructions from our system and those from [18]. The most frequent problematic substitutions, insertions, and deletions are shown. These frequencies were obtained by aligning, after running the system, the reconstructions with the references. The aligner is a pair-HMM trained via EM, using only phoneme identity and gap information [19]. The frequencies were extracted from the posterior alignments.
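The trained pair-HMM itself is not reproduced here; as a deterministic stand-in, the same kind of error tally can be illustrated with a minimum-edit alignment (unit costs) and a backtrace that counts substitutions, insertions, and deletions:

```python
from collections import Counter

def align_errors(ref, hyp):
    """Substitution/insertion/deletion counts along one minimum-edit
    alignment of a reference and an automatic reconstruction."""
    n, m = len(ref), len(hyp)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    errs, i, j = Counter(), n, m
    while i > 0 or j > 0:   # backtrace, preferring matches/substitutions
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] != hyp[j - 1]:
                errs[("sub", ref[i - 1], hyp[j - 1])] += 1
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            errs[("del", ref[i - 1])] += 1
            i -= 1
        else:
            errs[("ins", hyp[j - 1])] += 1
            j -= 1
    return errs

# Aggregate over a (hypothetical) list of reconstruction pairs:
total = Counter()
for ref, hyp in [("qulu", "qulu"), ("pulan", "vula"), ("tolu", "tolu")]:
    total += align_errors(ref, hyp)
```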
References and Notes
[1] Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning (Springer).
[2] Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal
Statistical Society. Series B (Methodological) 39:1–38.
[3] Liu DC, Nocedal J (1989) On the limited memory BFGS method for large scale optimization. Mathematical Programming 45:503–528.
[4] Lunter GA, Miklós I, Song YS, Hein J (2003) An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. J.
Comp. Biol. 10:869–889.
[5] Tierney L (1994) Markov chains for exploring posterior distributions. The Annals of Statistics 22:1701–1728.
[6] Holmes I, Bruno WJ (2001) Evolutionary HMM: a Bayesian approach to multiple alignment. Bioinformatics 17:803–820.
[7] Bouchard-Côté A, Jordan MI, Klein D (2009) Efficient inference in phylogenetic InDel trees. Advances in Neural Information Processing
Systems 21:177–184.
[8] Caffo B, Jank W, Jones G (2005) Ascent-based Monte Carlo EM. Journal of the Royal Statistical Society - Series B 67:235–252.
[9] Pearl J (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann).
[10] Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17:368–
376.
[11] Robert CP (2001) The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation (Springer).
[12] Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10.
[13] Greenhill S, Blust R, Gray R (2008) The Austronesian basic vocabulary database: From bioinformatics to lexomics. Evolutionary
Bioinformatics 4:271–283.
[14] Lewis MP, ed (2009) Ethnologue: Languages of the World, Sixteenth edition. (SIL International).
[15] Oakes M (2000) Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. Journal of Quantitative Linguistics 7:233–244.
[16] Nothofer B (1975) The reconstruction of Proto-Malayo-Javanic (M. Nijhoff).
[17] King R (1967) Functional load and sound change. Language 43:831–852.
[18] Blust R (1999) Subgrouping, circularity and extinction: Some issues in Austronesian comparative linguistics. Inst. Linguist. Acad. Sinica
1:31–94.
[19] Berg-Kirkpatrick T, Bouchard-Côté A, DeNero J, Klein D (2010) Painless unsupervised learning with features. Proceedings of the North
American Conference on Computational Linguistics pp 582–590.
Condition                   Edit dist.
Unsupervised full system    1.87
-FAITHFULNESS               2.02
-MARKEDNESS                 2.18
-Sharing                    1.99
-Topology                   2.06
Semi-supervised system      1.75
Table S.1: Effects of ablating various aspects of our unsupervised system on mean edit distance to POc. -Sharing corresponds to restricting to the subset of the features in OPERATION, FAITHFULNESS, and MARKEDNESS that are branch-specific; -Topology corresponds to using a flat topology where the only edges in the tree connect modern languages to POc. The semi-supervised system is described in the text. All differences (compared to the unsupervised full system) are statistically significant.
Table S.2. Proto-Austronesian reconstructions
Gloss
Reference
Baseline
Distance
Automatic
tohold(1)
smoke(1)
toscratch(1)
and(2)
leg/foot(1)
shoulder(1)
woman/female(1)
left(1)
day(1)
mother(1)
tosuck(1)
small(2)
night(1)
tosqueeze(1)
tohit(1)
rat(1)
you(1)
toplant(1)
toswell(1)
tosee(1)
one(1)
tosleep(1)
dog(1)
topound,beat(20)
stone(1)
green(1)
father(1)
this(1)
tooth(1)
tochoose(1)
star(1)
tobuy(23)
tovomit(1)
towork(1)
wide(1)
tocut,hack(1)
tofear(1)
tolive,bealive(1)
thunder(3)
tofly(2)
toshoot(1)
name(1)
tobuy(1)
and(1)
when?(1)
todig(1)
ash(1)
big(1)
tocut,hack(3)
road/path(1)
tostand(1)
sand(1)
*gemgem
*qebel
*ka Raw
*mah
*qaqay
*qaba Ra
*bahi
*kawi Ri
*qalejaw
*tina
*sepsep
*kedi
*be RNi
*pe Req
*palu
*labaw
*ikamu
*mula
*ba Req
*kita
*isa
*tudu R
*wasu
*tutuh
*batu
*mataq
*tama
*ini
*nipen
*piliq
*bituqen
*baliw
*utaq
*qumah
*malawas
*ta Raq
*matakut
*maqudip
*de RuN
*layap
*panaq
*Najan
*beli
*ka
*ijan
*kalih
*qabu
*ma Raya
*tektek
*zalan
*di Ri
*qenay
*higem
*q1v1l
*kagaw
*ma
*ai
*vala
*babinay
*kayli
*andew
*qina
*sipsip
*kedhi
*beyi
*perah
*mipalo
*lapo
*kamo
*h1mula
*ba Req
*kita
*sa
*maturug
*kahu
*nutu
*batu
*mataq
*qama
*eni
*lipon
*piliP
*bitun
*taiw
*mutjaq
*quma
*Gawig
*ta Raq
*takuP
*maPuNi
*zuN
*layap
*fanak
*ngaza
*poli
*kae
*pirang
*kali
*abu
*ma Raya
*tutek
*dalan
*diri
*one
3
3
1
1
4
4
4
3
5
1
2
1
2
3
3
3
2
2
*gemgem
*qebel
*ka Raw
*ma
*qaqay
*qaba Ra
*vavaian
*kawi Ri
*qalejaw
*tina
*sepsep
*kedi
*be RNi
*pereq
*palu
*kulavaw
*kamu
*mula
*ba Req
*kita
*isa
*tudu R
*vatu
*tutu
*batu
*mataq
*tama
*ani
*nipen
*piliq
*bituqen
*taiw
*utaq
*quma
*malabe R
*ta Raq
*matakut
*maqudip
*deruN
*layap
*panaq
*Najan
*beli
*ka
*pijan
*kali
*qabu
*ma Raya
*tektek
*zalan
*di Ri
*qenay
1
4
2
2
1
1
2
1
2
2
2
1
5
3
3
3
2
4
2
1
3
1
1
2
1
1
4
Distance
1
5
1
3
1
2
1
1
2
1
3
1
1
1
Gloss
Reference
Baseline
meat/flesh(31)
what?(2)
toswell(26)
bird(2)
hand(1)
in,inside(1)
if(2)
toknow,beknowledgeable(2)
at(20)
wind(2)
new(1)
blood(1)
breast(1)
i(1)
salt(1)
toflow(1)
five(1)
at(1)
other(1)
leaf(2)
all(1)
rotten(1)
tocook(1)
head(1)
mouth(2)
house(1)
if(1)
neck(1)
needle(1)
he/she(1)
fruit(1)
back(1)
tochew(2)
salt(2)
we(2)
long(1)
we(1)
three(1)
lake(1)
toeat(1)
no,not(3)
where?(1)
how?(1)
tothink(34)
who?(2)
tobite(1)
tail(1)
*isi
*nanu
* Ribawa
*qayam
*lima
*idalem
*nu
*bajaq
*di
*bali
*mabaqe Ru
*da Raq
*susu
*iaku
*qasi Ra
*qalu R
*lima
*i
*duma
*bi Raq
*amin
*mabu Raq
*tanek
*qulu
*Nusu
* Rumaq
*ka
*liqe R
*za Rum
*siia
*buaq
*likud
*qelqel
*timus
*kami
*inaduq
*ikita
*telu
*danaw
*kaen
*ini
*inu
*kuja
*nemnem
*siima
*ka Rat
*iku R
*ici
*anu
*malifawa
*ayam
*lime
*dalale
*no
*mafanaP
*ri
*feli
*baPru
*da Raq
*soso
*ako
*sie
*ilir
*lima
*i
*duma
*bela
*kemon
*mavuk
*tan@k
*PuGuh
*Nutu
*uma
*ke
*liq1g
*dag1m
*ia
*buaq
*likudeP
*qmelqel
*timus
*sikami
*nandu
*itam
*t1lu
*ranu
*kman
*ini
*sadinno
*gagua
*n1mn1m
*cima
*kagat
*ikog
Distance
Automatic
1
1
4
1
1
4
1
5
1
2
5
*isi
*anu
*abeh
*qayam
*lima
*idalem
*nu
*mafanaP
*di
*beliu
*vaquan
*da Raq
*susu
*iaku
*qasi Ra
*qalu R
*lima
*i
*duma
*bela
*amin
*mabu Ruk
*tanek
*qulu
*Nuju
* Rumaq
*ka
*liqe R
*za Rum
*siia
*buaq
*likud
*qmelqel
*timus
*kami
*anaduq
*kita
*telu
*danaw
*kman
*ini
*ainu
*kua
*kinemnem
*tima
*ka Rat
*iku R
2
2
4
4
3
3
4
1
3
1
2
1
2
3
2
2
1
2
3
3
1
3
2
5
4
2
2
1
2
Distance
1
5
5
2
6
3
2
1
1
1
1
2
1
1
2
2
Table S.3. Proto-Oceanic reconstructions
Reconstructions
Pairwise distances
B-P
P-A
B-A
1
1
1
1
3
4
2
4
1
1
1
3
3
2
1
2
3
1
2
2
3
2
1
1
1
1
1
1
1
1
3
1
3
3
2
2
3
2
1
1
Gloss
Blust (B)
Pawley (P)
Automatic (A)
fish(1)
five(1)
what?(1)
meat/flesh(1)
star(1)
fog(1)
toscratch(44)
shoulder(1)
where?(3)
toclimb(2)
toeat(1)
two(1)
dry(11)
narrow(1)
todig(1)
bone(2)
stone(1)
left(1)
they(1)
toliedown(1)
tohide(1)
rope(1)
smoke(2)
when?(1)
we(2)
this(1)
egg(1)
stick/wood(1)
tosit(16)
toshoot(1)
liver(1)
needle(1)
feather(1)
topound,beat(2)
near(9)
heavy(1)
year(1)
old(1)
fire(1)
tochoose(1)
rain(1)
togrow(1)
tosee(1)
tohear(1)
tochew(1)
louse(1)
wind(1)
bad,evil(1)
hand(1)
toflow(1)
night(1)
*ikan
*lima
*sapa
*pisiko
*pituqun
*kaput
*karu
*pa Ra
*pai
*sake
*kani
*rua
*maca
*kopit
*keli
*su Ri
*patu
*mawi Ri
*ira
*qinop
*puni
*tali
*qasu
*Naican
*kamami
*ne
*qatolu R
*kayu
*nopo
*panaq
*qate
*sa Rum
*pulu
*tutuk
*tata
*mamat
*taqun
*matuqa
*api
*piliq
*qusan
*tubuq
*kita
*roNo R
*mamaq
*kutu
*aNin
*saqat
*lima
*tape
*boNi
*ikan
*lima
*saa
*pisako
*pituqon
*kaput
*kadru
*qapa Ra
*pea
*sake
*kani
*rua
*masa
*kopit
*keli
*su Ri
*patu
*mawi Ri
*ira
*qeno
*puni
*tali
*qasu
*Naijan
*kami
*ani
*katolu R
*kayu
*nopo
*pana
*qate
*sa Rum
*pulu
*tuki
*tata
*mapat
*taqun
*matuqa
*api
*piliq
*qusan
*tubuq
*kita
*roNo R
*mamaq
*kutu
*mataNi
*saqat
*lima
*tape
*boNi
*ikan
*lima
*sava
*kiko
*vetuqu
*kabu
*kadru
*vara
*vea
*cake
*kani
*rua
*mamasa
*kapi
*keli
*sui
*patu
*mawii
*sira
*eno
*vuni
*tali
*qasu
*Nisa
*kami
*eni
*tolu
*kai
*nofo
*pana
*qate
*sau
*vulu
*tutuk
*tata
*mamava
*taqu
*matuqa
*avi
*vili
*usa
*tubu
*kita
*roNo
*mama
*kutu
*mataNi
*saqa
*lima
*tave
*boNi
1
2
2
1
2
1
2
2
1
1
3
2
1
1
3
1
2
1
3
2
1
3
1
2
1
1
2
2
1
1
2
2
1
1
1
1
1
1
4
1
1
1
4
Reconstructions
Pairwise distances
Gloss
Blust (B)
Pawley (P)
Automatic (A)
day(5)
tospit(14)
person/humanbeing(1)
tovomit(1)
name(1)
snake(12)
man/male(1)
tobreathe(1)
far(1)
tobuy(1)
tovomit(8)
tocook(9)
thick(3)
leg/foot(1)
tobite(1)
leaf(1)
sky(1)
todrink(1)
tostand(2)
i(1)
warm(1)
moon(1)
how?(1)
three(1)
toplant(2)
mosquito(1)
bird(1)
four(1)
water(2)
one(1)
skin(1)
toyawn(1)
he/she(1)
nose(1)
thatch/roof(1)
towalk(2)
flower(1)
dust(1)
neck(18)
eye(1)
father(1)
tofear(1)
root(2)
tostab,pierce(8)
breast(1)
tolive,bealive(1)
head(1)
thou(1)
fruit(1)
*qaco
*qanusi
*taumataq
*mumutaq
*Najan
*mwata
*mwa Ruqane
*manawa
*sauq
*poli
*luaq
*tunu
*matolu
*waqe
*ka Rat
*raun
*laNit
*inum
*tuqur
*au
*mapanas
*pulan
*kua
*tolu
*tanum
*namuk
*manuk
*pani
*wai R
*sakai
*kulit
*mawap
*ia
*isuN
*qatop
*pano
*puNa
*qapuk
* Ruqa
*mata
*tama
*matakut
*waka Ra
*soka
*susu
*maqurip
*qulu
*ko
*puaq
*qaco
*qanusi
*tamwata
*mumuta
*qajan
*mwata
*taumwaqane
*manawa
*sauq
*poli
*luaq
*tunu
*matolu
*waqe
*ka Rati
*rau
*laNit
*inum
*taqur
*au
*mapanas
*pulan
*kuya
*tolu
*tanom
*namuk
*manuk
*pat
*wai R
*tasa
*kulit
*mawap
*ia
*ijuN
*qatop
*pano
*puNa
*qapuk
* Ruqa
*mata
*tamana
*matakut
*waka R
*soka
*susu
*maqurip
*qulu
*iko
*puaq
*qaso
*aNusu
*tamata
*muta
*qasa
*mwata
*mwane
*manawa
*sau
*voli
*lua
*tunu
*matolu
*waqe
*karat
*dau
*laNi
*inum
*tuqu
*yau
*mavana
*vula
*kua
*tolu
*tan@m
*namu
*manu
*vati
*wai
*sa
*kulit
*mawa
*ia
*isu
*qato
*vano
*vuNa
*avu
*ua
*mata
*tama
*matakut
*waka
*soka
*susu
*maquri
*qulu
*kou
*vua
B-P
P-A
B-A
3
1
1
1
3
1
2
2
1
3
2
3
3
5
5
4
1
1
1
1
1
1
1
1
2
1
1
1
2
1
1
2
1
2
2
1
1
1
2
2
1
1
1
2
1
2
1
1
1
2
1
3
1
1
1
2
1
1
1
3
2
1
1
1
1
3
2
2
2
1
1
2
1
1
2
2
1
2
1
1
2
3
1
Table S.4. Proto-Polynesian reconstructions
Gloss
Reference
Baseline
rope(9)
short(11)
strikewithfist
coral
cook
painful,sick(1)
downwards
dive
toface
grasp
bay
tail
toblow(6)
channel
pandanus
urinate
navel
gall
wave
sleep
warm(1)
beak
tooth
dance
leg
drown
water
nose
taro
tohide(1)
upwards
overripe
tosit(16)
red
tochew(1)
three(1)
tosleep(10)
nine
dawn
feather(1)
canoe
left(11)
ashes
dew
small
house
voice
octopus
toclimb(2)
sea
day
branch
*taula
*pukupuku
*moto
*puNa
*tafu
*masaki
*hifo
*ruku
*haNa
*kapo
*faNa
*siku
*pupusi
*awa
*fara
*mimi
*pito
*Pahu
*Nalu
*mohe
*mafanafana
*Nutu
*nifo
*saka
*waPe
*lemo
*wai
*isu
*talo
*funi
*hake
*pePe
*nofo
*kula
*mama
*tolu
*mohe
*hiwa
*ata
*fulu
*waka
*sema
*refu
*sau
*riki
*fale
*lePo
*feke
*kake
*tahi
*Paho
*maNa
*taura
*puPupuPu
*moko
*puNa
*tahu
*mahaki
*ifo
*uku
*haNa
*Papo
*hana
*siPu
*pupuhi
*ava
*fala
*mimi
*pito
*au
*Nalu
*moe
*mafanafana
*Nutu
*niho
*haka
*wae
*lemo
*vai
*isu
*kalo
*huna
*aPe
*pee
*noho
*Pula
*mama
*tolu
*moe
*iwa
*ata
*fulu
*vaPa
*hema
*lehu
*sau
*riki
*hale
*leo
*hePe
*aPe
*kai
*ao
*maNa
Distance
1
2
1
1
1
1
1
1
2
1
1
1
1
2
1
1
1
1
1
1
2
2
1
1
1
1
1
2
1
2
1
1
2
2
2
2
Automatic
*taula
*pukupuku
*moto
*puNa
*tafu
*masaki
*hifo
*ruku
*haNa
*kapo
*faNa
*siku
*pupusi
*awa
*fara
*mimi
*pito
*Pahu
*Nalu
*mohe
*mafanafana
*Nutu
*nifo
*saka
*waPe
*lemo
*wai
*isu
*talo
*suna
*hake
*pePe
*nofo
*kula
*mama
*tolu
*mohe
*hiwa
*ata
*fulu
*waka
*sema
*refu
*sau
*riki
*fale
*lePo
*feke
*ake
*tahi
*Paho
*maNa
Distance
2
1
Gloss
Reference
Baseline
Distance
lovepity
flyrun
thick(3)
tostrippeel
sitdwell
black
torch
*PaloPofa
*lele
*matolutolu
*hisi
*nofo
*kele
*rama
*aroha
*rere
*matolutolu
*hihi
*noho
*Pele
*ama
5
2
1
1
1
1
Automatic
Distance
*PaloPofa
*lele
*matolutoru
*hisi
*nofo
*kele
*rama
1
Table S.5. Sound changes
Code
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)
(18)
(19)
(20)
(21)
(22)
(23)
(24)
(25)
(26)
(27)
(28)
(29)
(30)
(31)
(32)
(33)
(34)
(35)
(36)
(37)
(38)
(39)
(40)
(41)
(42)
(43)
(44)
(45)
(46)
(47)
(48)
(49)
(50)
(51)
(52)
(53)
(54)
(55)
(56)
(57)
(58)
(59)
(60)
(61)
(62)
(63)
(64)
Parent
v
(ProtoOcean)
>
p
a
(Northern Dumagat)
>
2
q
(Bisayan)
>
P
a
(Schouten)
>
@
e
(UlatInai)
>
a
e
(SeramStraits)
>
o
a
(SouthwestNewBritain)
>
e
a
(CentralWestern)
>
e
i
(SeramStraits)
>
a
a
(EastVanuatu)
>
e
s
(BimaSumba)
>
h
a
(Vanuatu)
>
e
f
(Futunic)
>
p
n
(Wetar)
>
N
t
(WestSanto)
>
r
Not enough data available for reliable sound change estimates
d
(NorthPapuanMainlandDEntrecasteaux)
>
t
O
(Southern Malaita)
>
o
O
(Southern Malaita)
>
o
a
(Bibling)
>
e
Not enough data available for reliable sound change estimates
O
(SanCristobal)
>
o
Not enough data available for reliable sound change estimates
b
(ProtoCentr)
>
p
a
(RajaAmpat)
>
E
o
(Utupua)
>
u
a
(Central Manobo)
>
o
s
(Formosan)
>
h
l
(West NuclearTimor)
>
n
t
(Ibanagic)
>
q
a
(Suauic)
>
e
N
(MalekulaCentral)
>
n
a
(ProtoCentr)
>
e
O
(Choiseul)
>
o
O
(Choiseul)
>
o
O
(Choiseul)
>
o
Not enough data available for reliable sound change estimates
O
(Choiseul)
>
o
v
(Ivatan)
>
b
a
(BorneoCoastBajaw)
>
e
a
(NuclearCordilleran)
>
2
N
(BaliSasak)
>
n
e
(ProtoMalay)
>
@
h
(BimaSumba)
>
s
p
(East CentralMaluku)
>
f
a
(Sulawesi)
>
o
a
(Palawano)
>
o
e
(LocalMalay)
>
a
a
(SouthNewIrelandNorthwestSolomonic)
>
o
r
(Sangiric)
>
d
b
(Sulawesi)
>
w
q
(ProtoMalay)
>
P
e
(Madak)
>
o
u
(Northern EastFormosan)
>
o
u
(BashiicCentralLuzonNorthernMindoro)
>
o
r
(NorthernPhilippine)
>
y
N
(Palawano)
>
n
O
(SanCristobal)
>
o
Not enough data available for reliable sound change estimates
Not enough data available for reliable sound change estimates
p
(Vitiaz)
>
f
u
(BerawanLowerBaram)
>
o
f
(Futunic)
>
h
t
(Pangasinic)
>
s
Child
Functional load
Number of occurrences
(AdmiraltyIslands)
(Agta)
(AklanonBis)
(Ali)
(Alune)
(Amahai)
(Amara)
(AmbaiYapen)
(Ambon)
(AmbrymSout)
(Anakalang)
(AnejomAnei)
(Anuta)
(Aputai)
(ArakiSouth)
0.09884
5.132e-05
0.03289
7.158e-05
1.00000
0.05802
0.19090
0.21003
0.27139
0.09085
0.03743
0.08559
0.09282
0.04788
0.21996
5.67800
110.26986
46.31015
5.45255
1.31832
6.91108
15.52212
5.74372
3.04082
14.42495
10.01598
14.43646
19.96156
7.10435
20.80080
(AreTaupota)
(AreareMaas)
(AreareWaia)
(Aria)
0.06738
0.01151
0.01151
0.29446
2.92590
1.99913
1.99971
4.99706
(ArosiOneib)
0.00562
1.99971
(Aru)
(As)
(Asumboa)
(AtaTigwa)
(Atayalic)
(Atoni)
(AttaPamplo)
(Auhelawa)
(Avava)
(Babar)
(Babatana)
(BabatanaAv)
(BabatanaKa)
0.13711
0.05156
0.07562
0.00291
0.04098
0.23175
0.47297
0.18692
0.10747
0.04891
0.01953
0.01953
0.01953
9.35892
4.74243
3.70741
7.29063
4.31225
7.02772
7.87084
7.98362
15.49566
5.61192
8.94227
1.99952
2.99952
(BabatanaTu)
(Babuyan)
(Bajo)
(Balangaw)
(Bali)
(BaliSasak)
(Baliledo)
(BandaGeser)
(BanggaiWdi)
(Banggi)
(BanjareseM)
(Banoni)
(Bantik)
(Baree)
(Barito)
(Barok)
(Basai)
(Bashiic)
(BashiicCentralLuzonNorthernMindoro)
(BatakPalaw)
(BauroBaroo)
0.01953
0.01574
0.04774
0.00387
0.19166
6.064e-04
0.03743
0.05559
0.06991
6.735e-04
0.10667
0.16106
0.01606
0.05030
0.04087
0.04576
0.00130
0.02423
0.01665
0.04752
0.00562
2.99961
16.97554
21.59109
37.89097
13.91349
26.22332
7.64100
3.52479
8.61615
7.69053
36.24790
5.69268
6.99267
10.71565
15.54090
1.91240
5.88492
63.00884
3.54936
4.94491
1.99981
(Bel)
(Belait)
(Bellona)
(Benguet)
0.01771
0.03125
0.01009
0.09759
1.06548
12.34195
15.97130
1.35377
Code
(65)
(66)
(67)
(68)
(69)
(70)
(71)
(72)
(73)
(74)
(75)
(76)
(77)
(78)
(79)
(80)
(81)
(82)
(83)
(84)
(85)
(86)
(87)
(88)
(89)
(90)
(91)
(92)
(93)
(94)
(95)
(96)
(97)
(98)
(99)
(100)
(101)
(102)
(103)
(104)
(105)
(106)
(107)
(108)
(109)
(110)
(111)
(112)
(113)
(114)
(115)
(116)
(117)
(118)
(119)
(120)
(121)
(122)
(123)
(124)
(125)
(126)
(127)
(128)
(129)
(130)
(131)
(132)
(133)
(134)
(135)
(136)
(137)
(138)
(139)
(140)
(141)
(142)
(143)
(144)
(145)
(146)
Parent
u
(BerawanLowerBaram)
>
o
k
(NorthSarawakan)
>
P
e
(SouthwestNewBritain)
>
i
O
(RajaAmpat)
>
o
k
(CentralPhilippine)
>
c
a
(Blaan)
>
O
N
(Blaan)
>
n
Not enough data available for reliable sound change estimates
u
(Northern NuclearBel)
>
o
a
(ProtoMalay)
>
1
O
(SouthNewIrelandNorthwestSolomonic)
>
o
t
(BimaSumba)
>
d
b
(ProtoCentr)
>
w
a
(NorthSarawakan)
>
e
1
(Manobo)
>
u
1
(CentralPhilippine)
>
u
1
(Bilic)
>
a
Not enough data available for reliable sound change estimates
Not enough data available for reliable sound change estimates
t
(GorontaloMongondow)
>
s
o
(TukangbesiBonerate)
>
O
u
(Seram)
>
i
u
(Bontok)
>
o
a
(BontokKankanay)
>
1
q
(Bontok)
>
P
u
(NuclearCordilleran)
>
o
u
(SuluBorneo)
>
o
v
(GelaGuadalcanal)
>
p
a
(Bugis)
>
@
N
(SouthSulawesi)
>
k
s
(Bughotu)
>
h
e
(KayanMurik)
>
a
a
(SouthHalmahera)
>
e
o
(Vanikoro)
>
O
a
(Sabahan)
>
o
u
(Formosan)
>
o
u
(CentralMaluku)
>
o
o
(South Bisayan)
>
u
u
(ButuanTausug)
>
o
r
(NorthPapuanMainlandDEntrecasteaux)
>
l
a
(NewCaledonian)
>
E
N
(ProtoChuuk)
>
n
u
(Bisayan)
>
o
l
(SouthHalmaheraWestNewGuinea)
>
r
u
(EastFormosan)
>
o
u
(SouthCentralCordilleran)
>
o
e
(ProtoMalay)
>
@
j
(ProtoOcean)
>
s
e
(BashiicCentralLuzonNorthernMindoro)
>
i
R
(ProtoCentr)
>
r
u
(MaselaSouthBabar)
>
E
N
(RemoteOceanic)
>
g
k
(Peripheral PapuanTip)
>
g
u
(MesoPhilippine)
>
o
x
(NortheastVanuatuBanksIslands)
>
k
i
(CenderawasihBay)
>
u
u
(Bisayan)
>
o
l
(East Nuclear Polynesian)
>
r
a
(Manobo)
>
1
a
(SantaIsabel)
>
u
t
(Chamic)
>
k
y
(Malayic)
>
i
b
(ProtoMalay)
>
p
7
(East SantaIsabel)
>
g
o
(SouthNewIrelandNorthwestSolomonic)
>
O
a
(ChamChru)
>
@
a
(ProtoChuuk)
>
e
a
(ProtoChuuk)
>
e
@
(Atayalic)
>
a
e
(North Babar)
>
o
n
(North Babar)
>
l
N
(LandDayak)
>
n
N
(South West Barito)
>
n
a
(NorthSarawakan)
>
e
a
(LoyaltyIslands)
>
e
g
(Bwaidoga)
>
y
l
(NorthPapuanMainlandDEntrecasteaux)
>
r
p
(Southern Malaita)
>
b
l
(Nuclear WestCentralPapuan)
>
r
u
(Northern Dumagat)
>
o
s
(SouthwestMaluku)
>
h
a
(CentralPacific)
>
o
Child
Functional load
Number of occurrences
(BerawanLon)
(BerawanLowerBaram)
(Bibling)
(BigaMisool)
(BikolNagaC)
(BilaanKoro)
(BilaanSara)
0.03125
0.15797
0.03719
0.03843
9.303e-05
0.01132
0.08320
13.55252
10.19426
2.29657
5.00642
12.87548
23.88863
18.08416
(Bilibil)
(Bilic)
(Bilur)
(Bima)
(BimaSumba)
(Bintulu)
(Binukid)
(Bisayan)
(Blaan)
0.03687
0.00777
0.00242
0.08028
0.08312
0.07029
0.03862
0.01651
0.08598
2.95970
19.43024
1.97890
9.97797
11.67021
9.74568
2.90829
18.53014
17.06499
(BolaangMon)
(Bonerate)
(Bonfia)
(BontocGuin)
(Bontok)
(BontokGuin)
(BontokKankanay)
(BorneoCoastBajaw)
(Bughotu)
(BugineseSo)
(Bugis)
(Bugotu)
(Bukat)
(Buli)
(Buma)
(BunduDusun)
(Bunun)
(BuruNamrol)
(ButuanTausug)
(Butuanon)
(Bwaidoga)
(Canala)
(Carolinian)
(Cebuano)
(CenderawasihBay)
(CentralAmi)
(CentralCordilleran)
(CentralEastern)
(CentralEasternOceanic)
(CentralLuzon)
(CentralMaluku)
(CentralMas)
(CentralPacific)
(CentralPapuan)
(CentralPhilippine)
(CentralVanuatu)
(CentralWestern)
(Central Bisayan)
(Central East Nuclear Polynesian)
(Central Manobo)
(Central SantaIsabel)
(ChamChru)
(Chamic)
(Chamorro)
(ChekeHolo)
(Choiseul)
(Chru)
(Chuukese)
(ChuukeseAK)
(CiuliAtaya)
(Dai)
(DaweraDawe)
(DayakBakat)
(DayakNgaju)
(Dayic)
(Dehu)
(Diodio)
(Dobuan)
(Dorio)
(Doura)
(DumagatCas)
(EastDamar)
(EastFijianPolynesian)
0.03159
0.02521
0.14937
0.04335
0.10466
8.294e-04
0.03806
0.05043
0.09530
0.00654
0.10991
0.03395
0.06056
0.05583
0.13019
0.05640
2.254e-04
0.01570
0.03947
0.03983
0.10631
0.05974
0.04042
0.03656
0.11907
4.190e-04
0.02941
6.064e-04
0.00410
0.01063
0.02692
0.04503
0.00869
0.11515
0.01374
0.02455
0.12684
0.03656
0.02709
0.09378
0.27278
0.34269
0.00237
0.15451
0.03879
0.00242
0.02139
0.11108
0.11108
0.02870
0.02409
0.16784
0.26785
0.16918
0.07029
0.20861
0.08008
0.10631
8.555e-04
0.13489
0.01544
0.02479
0.23158
4.79938
18.76480
11.37310
58.88829
3.41602
49.64130
9.39059
3.53556
1.99015
17.32980
2.98167
8.55895
4.02644
3.50307
23.27248
1.58162
11.98634
19.92112
3.20776
30.32254
8.92246
4.64394
23.82083
8.68282
11.14761
12.82795
3.43974
32.49321
2.00058
2.14101
12.80932
4.21108
11.59225
5.29705
38.98219
3.46857
1.99006
7.64073
2.35911
18.50241
1.22175
7.73602
1.73963
10.03113
10.10942
8.56855
31.97181
36.12356
59.55135
6.00504
5.70633
23.27929
7.13031
29.67016
5.15706
4.46889
12.32717
12.30865
1.99973
5.63596
18.63984
6.75499
1.54042
Code
(147)
(148)
(149)
(150)
(151)
(152)
(153)
(154)
(155)
(156)
(157)
(158)
(159)
(160)
(161)
(162)
(163)
(164)
(165)
(166)
(167)
(168)
(169)
(170)
(171)
(172)
(173)
(174)
(175)
(176)
(177)
(178)
(179)
(180)
(181)
(182)
(183)
(184)
(185)
(186)
(187)
(188)
(189)
(190)
(191)
(192)
(193)
(194)
(195)
(196)
(197)
(198)
(199)
(200)
(201)
(202)
(203)
(204)
(205)
(206)
(207)
(208)
(209)
(210)
(211)
(212)
(213)
(214)
(215)
(216)
(217)
(218)
(219)
(220)
(221)
(222)
(223)
(224)
(225)
(226)
(227)
(228)
Parent
Child
c
(Formosan)
>
t
(EastFormosan)
b
(MaselaSouthBabar)
>
v
(EastMasela)
i
(BimaSumba)
>
u
(EastSumban)
b
(NortheastVanuatuBanksIslands)
>
v
(EastVanuatu)
e
(Barito)
>
i
(East Barito)
p
(CentralMaluku)
>
f
(East CentralMaluku)
u
(Manus)
>
o
(East Manus)
g
(NewGeorgia)
>
h
(East NewGeorgia)
a
(NuclearTimor)
>
e
(East NuclearTimor)
l
(Nuclear Polynesian)
>
r
(East Nuclear Polynesian)
O
(SantaIsabel)
>
o
(East SantaIsabel)
@
(CentralEastern)
>
o
(EasternMalayoPolynesian)
a
(RemoteOceanic)
>
e
(EasternOuterIslands)
a
(AdmiraltyIslands)
>
e
(Eastern AdmiraltyIslands)
a
(BandaGeser)
>
o
(ElatKeiBes)
r
(SamoicOutlier)
>
l
(Ellicean)
a
(Futunic)
>
e
(Emae)
e
(SouthwestBabar)
>
E
(Emplawas)
Not enough data available for reliable sound change estimates
i
(BimaSumba)
>
e
(EndeLio)
a
(Sumatra)
>
@
(Enggano)
Not enough data available for reliable sound change estimates
Not enough data available for reliable sound change estimates
f
(Wetar)
>
h
(Erai)
a
(Vanuatu)
>
e
(Erromanga)
O
(SanCristobal)
>
o
(Fagani)
O
(SanCristobal)
>
o
(FaganiAguf)
O
(SanCristobal)
>
o
(FaganiRihu)
Not enough data available for reliable sound change estimates
u
(WesternPlains)
>
o
(Favorlang)
s
(EastFijianPolynesian)
>
c
(FijianBau)
f
(Timor)
>
w
(FloresLembata)
l
(ProtoAustr)
>
r
(Formosan)
r
(Futunic)
>
l
(FutunaAniw)
r
(Futunic)
>
l
(FutunaEast)
l
(SamoicOutlier)
>
r
(Futunic)
l
(WestCentralPapuan)
>
r
(Gabadi)
u
(Ibanagic)
>
o
(Gaddang)
e
(Are)
>
i
(Gapapaiwa)
a
(BimaSumba)
>
o
(GauraNggau)
a
(ProtoMalay)
>
@
(Gayo)
r
(Northern NuclearBel)
>
z
(Gedaged)
c
(GelaGuadalcanal)
>
s
(Gela)
k
(SoutheastSolomonic)
>
g
(GelaGuadalcanal)
b
(GeserGorom)
>
w
(Geser)
f
(BandaGeser)
>
w
(GeserGorom)
Not enough data available for reliable sound change estimates
g
(Guadalcanal)
>
G
(Ghari)
Not enough data available for reliable sound change estimates
h
(Guadalcanal)
>
G
(GhariNggae)
O
(Guadalcanal)
>
o
(GhariNgger)
Not enough data available for reliable sound change estimates
Not enough data available for reliable sound change estimates
a
(SouthHalmahera)
>
o
(Giman)
a
(GorontaloMongondow)
>
o
(Gorontalic)
k
(Gorontalic)
>
P
(GorontaloH)
a
(Sulawesi)
>
o
(GorontaloMongondow)
s
(GelaGuadalcanal)
>
c
(Guadalcanal)
Not enough data available for reliable sound change estimates
Not enough data available for reliable sound change estimates
N
(NehanNorthBougainville)
>
n
(Haku)
1
(MesoPhilippine)
>
u
(Hanunoo)
t
(Marquesic)
>
k
(Hawaiian)
u
(Peripheral Central Bisayan)
>
o
(Hiligaynon)
r
(Ambon)
>
l
(HituAmbon)
g
(West NewGeorgia)
>
G
(Hoava)
a
(NorthNewGuinea)
>
e
(HuonGulf)
a
(LoyaltyIslands)
>
e
(Iaai)
u
(Malayic)
>
o
(Iban)
k
(NorthernCordilleran)
>
q
(Ibanagic)
N
(Sabahan)
>
n
(Idaan)
e
(Futunic)
>
a
(IfiraMeleM)
1
(NuclearCordilleran)
>
o
(Ifugao)
k
(Ifugao)
>
q
(IfugaoAmga)
k
(Ifugao)
>
q
(IfugaoBata)
1
(Kallahan)
>
o
(IfugaoBayn)
n
(Wetar)
>
N
(Iliun)
u
(NorthernLuzon)
>
o
(Ilokano)
a
(Peripheral Central Bisayan)
>
u
(Ilonggo)
a
(SouthernCordilleran)
>
1
(Ilongot)
N
(Ilongot)
>
n
(IlongotKak)
a
(Ivatan)
>
e
(Imorod)
Functional load
Number of occurrences
0.04224
0.00417
0.17438
0.16115
0.02495
0.09093
0.01231
0.02220
0.14170
0.04761
0.02383
0.00303
0.14981
0.16811
0.05446
0.03331
0.20786
0.18331
6.87583
5.01447
28.36485
2.78838
4.21938
7.47879
4.11061
5.62094
3.51835
79.14382
4.65059
18.76994
13.43554
7.58136
15.50349
5.08687
3.97652
11.05576
0.04706
0.00149
5.22061
1.61296
0.03346
0.08559
0.00562
0.00562
0.00562
7.93007
16.54778
1.99961
1.99952
1.99690
0.01234
0.04333
0.01025
0.01978
0.05835
0.05835
0.03331
0.13055
0.01362
0.06186
0.13543
0.00259
0.02021
0.01050
0.01749
0.06536
0.05946
20.09559
9.93879
7.38989
16.25315
2.94909
48.57376
74.90293
5.24097
18.75014
4.23744
29.26141
24.49089
7.64916
2.08324
19.76693
3.91631
4.12917
4.995e-04
9.42542
5.670e-05
0.00601
2.00000
2.99863
0.11966
0.12679
0.00973
0.06991
0.01050
11.83686
8.78982
18.65349
26.41042
3.50693
0.10208
0.02599
0.31942
0.02028
0.03275
0.00152
0.13866
0.20861
0.00621
0.40890
0.13834
0.20786
0.01254
0.34057
0.34057
0.01101
0.04788
0.02660
0.12642
0.07296
0.06631
0.08734
6.26107
24.61094
56.48048
1.95625
16.21990
7.69893
4.84173
8.25767
12.93856
11.09862
5.84917
4.95522
19.55025
4.76075
17.99754
17.75466
10.14036
29.66945
4.07172
10.35510
9.13131
6.03794
Code
(229)
(230)
(231)
(232)
(233)
(234)
(235)
(236)
(237)
(238)
(239)
(240)
(241)
(242)
(243)
(244)
(245)
(246)
(247)
(248)
(249)
(250)
(251)
(252)
(253)
(254)
(255)
(256)
(257)
(258)
(259)
(260)
(261)
(262)
(263)
(264)
(265)
(266)
(267)
(268)
(269)
(270)
(271)
(272)
(273)
(274)
(275)
(276)
(277)
(278)
(279)
(280)
(281)
(282)
(283)
(284)
(285)
(286)
(287)
(288)
(289)
(290)
(291)
(292)
(293)
(294)
(295)
(296)
(297)
(298)
(299)
(300)
(301)
(302)
(303)
(304)
(305)
(306)
(307)
(308)
(309)
(310)
Parent
Child
n
(SouthwestBabar)
>
m
(Imroing)
u
(SamaBajaw)
>
o
(Inabaknon)
a
(LocalMalay)
>
e
(Indonesian)
u
(Benguet)
>
o
(Inibaloi)
y
(MaranaoIranon)
>
i
(Iranun)
a
(Ivatan)
>
e
(Iraralay)
l
(Ivatan)
>
d
(Isamorong)
t
(Ibanagic)
>
s
(IsnegDibag)
h
(Ivatan)
>
x
(Itbayat)
o
(Ivatan)
>
u
(Itbayaten)
u
(KalingaItneg)
>
o
(ItnegBinon)
l
(Ivatan)
>
d
(Ivasay)
g
(Bashiic)
>
y
(Ivatan)
e
(Ivatan)
>
1
(IvatanBasc)
q
(ProtoMalay)
>
h
(Javanese)
a
(Northern NewCaledonian)
>
e
(Jawe)
e
(West Barito)
>
o
(Kadorih)
O
(SanCristobal)
>
o
(Kahua)
O
(SanCristobal)
>
o
(KahuaMami)
l
(Gorontalic)
>
r
(Kaidipang)
k
(KairiruManam)
>
q
(Kairiru)
u
(Schouten)
>
i
(KairiruManam)
n
(Ilongot)
>
N
(KakidugenI)
r
(Mansakan)
>
l
(Kalagan)
i
(MesoPhilippine)
>
1
(Kalamian)
1
(KalingaItneg)
>
o
(KalingaGui)
a
(CentralCordilleran)
>
i
(KalingaItneg)
s
(Benguet)
>
h
(Kallahan)
s
(Kallahan)
>
h
(KallahanKa)
1
(Kallahan)
>
e
(KallahanKe)
n
(BimaSumba)
>
N
(Kambera)
b
(Tsouic)
>
v
(Kanakanabu)
v
(PatpatarTolai)
>
w
(Kandas)
u
(BontokKankanay)
>
o
(KankanayNo)
n
(CentralLuzon)
>
N
(Kapampanga)
t
(Ellicean)
>
d
(Kapingamar)
v
(LavongaiNalik)
>
f
(KaraWest)
a
(SouthHalmahera)
>
e
(KasiraIrah)
e
(South West Barito)
>
E
(Katingan)
I
(Pasismanua)
>
i
(KaulongAuV)
a
(Northern EastFormosan)
>
i
(Kavalan)
b
(ProtoMalay)
>
v
(KayanMurik)
a
(KayanMurik)
>
e
(KayanUmaJu)
a
(SarmiJayapuraBay)
>
e
(KayupulauK)
a
(FloresLembata)
>
e
(Kedang)
b
(KeiTanimbar)
>
B
(KeiTanimba)
a
(SoutheastMaluku)
>
e
(KeiTanimbar)
a
(Dayic)
>
e
(KelabitBar)
e
(East NuclearTimor)
>
E
(Kemak)
i
(NorthSarawakan)
>
e
(KenyahLong)
a
(LocalMalay)
>
o
(Kerinci)
a
(KilivilaLouisiades)
>
e
(Kilivila)
o
(Peripheral PapuanTip)
>
a
(KilivilaLouisiades)
o
(Central SantaIsabel)
>
O
(KilokakaYs)
N
(MicronesianProper)
>
n
(Kiribati)
P
(Manam)
>
k
(Kis)
t
(KisarRoma)
>
k
(Kisar)
f
(SouthwestMaluku)
>
w
(KisarRoma)
a
(BimaSumba)
>
o
(Kodi)
l
(ProtoCentr)
>
r
(KoiwaiIria)
O
(Central SantaIsabel)
>
o
(Kokota)
e
(Pesisir)
>
o
(Komering)
a
(Blaan)
>
O
(KoronadalB)
s
(NgeroVitiaz)
>
r
(Kove)
u
(PatpatarTolai)
>
a
(Kuanua)
O
(West NewGeorgia)
>
o
(Kubokota)
v
(Nuclear WestCentralPapuan)
>
b
(Kuni)
Not enough data available for reliable sound change estimates
a
(MicronesianProper)
>
e
(Kusaie)
O
(Northern Malaita)
>
o
(Kwai)
a
(Northern Malaita)
>
o
(Kwaio)
l
(Tanna)
>
r
(Kwamera)
N
(Northern Malaita)
>
n
(KwaraaeSol)
Not enough data available for reliable sound change estimates
e
(MelanauKajang)
>
@
(Lahanan)
i
(Willaumez)
>
e
(Lakalai)
r
(Nuclear WestCentralPapuan)
>
l
(Lala)
u
(FloresLembata)
>
o
(LamaholotI)
Not enough data available for reliable sound change estimates
s
(BimaSumba)
>
h
(Lamboya)
o
(Bibling)
>
u
(LamogaiMul)
a
(Pesisir)
>
@
(Lampung)
Functional load
Number of occurrences
0.23281
0.02973
0.10667
0.03849
0.00959
0.08734
0.01340
0.05600
0.00445
5.820e-04
0.02352
0.01340
0.02574
0.00723
0.07129
0.13215
7.673e-05
0.00562
0.00562
0.01053
0.00897
0.17315
0.06631
0.04611
0.02195
0.01076
0.13957
0.02667
0.00566
0.00403
0.00600
0.01715
0.03857
0.04380
0.01456
3.974e-05
0.02487
0.05583
0.00426
0.01894
0.08596
1.449e-07
0.06056
0.32087
0.10748
4.198e-05
0.17157
0.07691
2.255e-04
0.03423
0.01029
0.14778
0.17286
0.02707
0.04469
0.07162
0.05336
0.03321
0.13543
0.06793
0.02707
0.00325
0.01132
0.06561
0.16803
0.00573
0.11533
10.35889
19.36631
3.03924
64.30170
4.31410
11.08683
14.95905
2.91874
31.26009
67.10087
60.52923
14.96335
5.02873
32.77267
15.44980
8.47484
3.80522
1.99971
1.99952
21.87631
10.55125
1.37804
9.38577
3.61546
3.10459
22.89422
2.25237
15.50239
1.92678
38.49811
25.28750
4.07361
3.91962
49.14539
6.07915
35.94977
4.21517
5.86174
27.03064
6.17672
4.99960
6.64852
5.93656
3.93441
14.89342
4.13306
1.74318
10.95832
5.94197
6.39511
37.28182
5.83519
2.23572
5.60509
10.23912
3.84299
24.56857
7.75331
22.04728
11.95360
31.07342
8.58447
30.81323
7.13972
1.98361
2.00145
6.21112
0.12251
0.00348
0.18875
0.05929
0.03728
14.33579
1.99952
8.31055
25.71039
9.80996
0.00231
0.11589
0.13489
0.02895
14.95868
1.41234
4.70309
10.53359
0.03743
0.12203
0.00929
9.36744
5.67124
12.80317
Code
(311)
(312)
(313)
(314)
(315)
(316)
(317)
(318)
(319)
(320)
(321)
(322)
(323)
(324)
(325)
(326)
(327)
(328)
(329)
(330)
(331)
(332)
(333)
(334)
(335)
(336)
(337)
(338)
(339)
(340)
(341)
(342)
(343)
(344)
(345)
(346)
(347)
(348)
(349)
(350)
(351)
(352)
(353)
(354)
(355)
(356)
(357)
(358)
(359)
(360)
(361)
(362)
(363)
(364)
(365)
(366)
(367)
(368)
(369)
(370)
(371)
(372)
(373)
(374)
(375)
(376)
(377)
(378)
(379)
(380)
(381)
(382)
(383)
(384)
(385)
(386)
(387)
(388)
(389)
(390)
(391)
(392)
Parent
e
(ProtoMalay)
>
a
Not enough data available for reliable sound change estimates
k
(Northern Malaita)
>
g
Not enough data available for reliable sound change estimates
Not enough data available for reliable sound change estimates
r
(NewIreland)
>
l
a
(East Manus)
>
e
a
(Tanna)
>
e
Not enough data available for reliable sound change estimates
h
(Gela)
>
G
Not enough data available for reliable sound change estimates
f
(SouthwestMaluku)
>
w
n
(West Manus)
>
N
a
(LavongaiNalik)
>
e
i
(West Manus)
>
e
e
(EndeLio)
>
@
a
(Malayan)
>
e
Not enough data available for reliable sound change estimates
r
(Manus)
>
P
a
(SoutheastIslands)
>
e
i
(LocalMalay)
>
e
a
(RemoteOceanic)
>
e
t
(Ellicean)
>
k
e
(MonoUruava)
>
a
O
(West NewGeorgia)
>
o
O
(West NewGeorgia)
>
o
b
(East Barito)
>
w
a
(NewIreland)
>
e
k
(Madak)
>
g
l
(LavongaiNalik)
>
r
u
(ProtoMalay)
>
o
l
(CentralPapuan)
>
r
a
(Nuclear PapuanTip)
>
i
n
(SouthSulawesi)
>
N
k
(MalaitaSanCristobal)
>
P
v
(SoutheastSolomonic)
>
f
Not enough data available for reliable sound change estimates
N
(LocalMalay)
>
n
p
(Malayic)
>
m
q
(ProtoMalay)
>
h
a
(Vanuatu)
>
e
a
(NortheastVanuatuBanksIslands)
>
e
a
(Vitiaz)
>
o
e
(Bugis)
>
a
u
(CentralPhilippine)
>
o
u
(East NuclearTimor)
>
a
e
(BimaSumba)
>
a
k
(KairiruManam)
>
P
a
(Marquesic)
>
e
s
(BimaSumba)
>
c
r
(Tahitic)
>
l
N
(SouthernPhilippine)
>
n
1
(AtaTigwa)
>
o
1
(AtaTigwa)
>
o
1
(Central Manobo)
>
a
g
(West Central Manobo)
>
h
a
(South Manobo)
>
1
1
(South Manobo)
>
2
o
(AtaTigwa)
>
1
a
(West Central Manobo)
>
1
l
(Mansakan)
>
r
o
(CentralPhilippine)
>
u
a
(Eastern AdmiraltyIslands)
>
e
v
(Tahitic)
>
w
l
(BorneoCoastBajaw)
>
w
u
(MaranaoIranon)
>
o
1
(SouthernPhilippine)
>
e
Not enough data available for reliable sound change estimates
Not enough data available for reliable sound change estimates
Not enough data available for reliable sound change estimates
Not enough data available for reliable sound change estimates
s
(East NewGeorgia)
>
c
a
(Marquesic)
>
e
f
(Central East Nuclear Polynesian)
>
h
a
(MicronesianProper)
>
e
i
(South Babar)
>
E
e
(Northern NuclearBel)
>
i
t
(Willaumez)
>
f
Not enough data available for reliable sound change estimates
Not enough data available for reliable sound change estimates
Not enough data available for reliable sound change estimates
Not enough data available for reliable sound change estimates
Child
Functional load
(LandDayak)
0.04499
Number of occurrences
11.18206
(Lau)
0.07399
10.37078
(LavongaiNalik)
(Leipon)
(Lenakel)
0.13377
0.13133
0.05393
4.12434
15.77294
9.10010
(LengoGhaim)
5.895e-05
1.98182
(Letinese)
(Levei)
(LihirSungl)
(Likum)
(LioFloresT)
(LocalMalay)
0.03321
0.18931
0.07781
0.08687
0.00290
0.09530
8.50360
43.18724
10.60106
4.08584
10.17585
18.43396
(Loniu)
(Lou)
(LowMalay)
(LoyaltyIslands)
(Luangiua)
(LungaLunga)
(Lungga)
(Luqa)
(Maanyan)
(Madak)
(MadakLamas)
(Madara)
(Madurese)
(MagoriSout)
(Maisin)
(Makassar)
(Malaita)
(MalaitaSanCristobal)
0.00619
0.09902
0.03770
0.14981
0.47404
0.13896
0.00573
0.00573
0.07537
0.16211
0.05218
0.10006
0.01585
0.12605
0.22933
0.18370
0.09152
0.06368
8.93647
12.46015
2.99404
15.79651
39.73227
3.85273
2.00155
2.00145
6.13215
8.90472
9.51948
10.95189
39.16274
3.95178
2.17692
10.70075
5.02310
25.62607
(MalayBahas)
(Malayan)
(Malayic)
(MalekulaCentral)
(MalekulaCoastal)
(Maleu)
(Maloh)
(Mamanwa)
(Mambai)
(Mamboru)
(Manam)
(Mangareva)
(Manggarai)
(Manihiki)
(Manobo)
(ManoboAtad)
(ManoboAtau)
(ManoboDiba)
(ManoboIlia)
(ManoboKala)
(ManoboSara)
(ManoboTigw)
(ManoboWest)
(Mansaka)
(Mansakan)
(Manus)
(Maori)
(Mapun)
(Maranao)
(MaranaoIranon)
0.17169
0.17398
0.07129
0.08559
0.08702
0.11223
0.07175
0.01883
0.34369
0.13262
0.01548
0.26245
2.420e-05
9.361e-04
0.07976
0.03723
0.03723
0.12386
0.03983
0.12673
2.271e-04
0.03723
0.13743
0.04611
0.01883
0.13652
0.00255
0.03973
0.00377
0.00422
34.31949
3.33605
31.25704
25.93669
31.59187
11.90975
4.70755
38.86565
4.99285
7.46411
9.58872
3.72483
8.94154
13.67126
13.17836
66.54667
66.57068
3.76411
13.28241
7.79055
58.20252
11.93633
4.54113
18.44830
21.46241
4.75105
12.29480
7.82813
54.21909
7.67944
(Marovo)
(Marquesan)
(Marquesic)
(Marshalles)
(MaselaSouthBabar)
(Matukar)
(Maututu)
0.00144
0.26245
0.05235
0.12251
0.04148
0.04103
0.00143
4.87889
10.48977
2.06222
42.07442
3.02753
3.86380
5.97137
continued on next page
19
continued from previous page
Code
(393)
(394)
(395)
(396)
(397)
(398)
(399)
(400)
(401)
(402)
(403)
(404)
(405)
(406)
(407)
(408)
(409)
(410)
(411)
(412)
(413)
(414)
(415)
(416)
(417)
(418)
(419)
(420)
(421)
(422)
(423)
(424)
(425)
(426)
(427)
(428)
(429)
(430)
(431)
(432)
(433)
(434)
(435)
(436)
(437)
(438)
(439)
(440)
(441)
(442)
(443)
(444)
(445)
(446)
(447)
(448)
(449)
(450)
(451)
(452)
(453)
(454)
(455)
(456)
(457)
(458)
(459)
(460)
(461)
(462)
(463)
(464)
(465)
(466)
(467)
(468)
(469)
(470)
(471)
(472)
(473)
(474)
Parent
Not enough data available for reliable sound change estimates
e
(Northern NuclearBel)
>
i
b
(Nuclear WestCentralPapuan)
>
p
e
(Northwest)
>
a
a
(MelanauKajang)
>
e
N
(LocalMalay)
>
n
e
(LocalMalay)
>
a
Not enough data available for reliable sound change estimates
r
(Vitiaz)
>
l
a
(WestSanto)
>
e
u
(East Barito)
>
o
p
(WesternOceanic)
>
b
e
(ProtoMalay)
>
1
s
(ProtoMicro)
>
t
e
(Malayan)
>
a
b
(RajaAmpat)
>
p
r
(KilivilaLouisiades)
>
l
u
(Malayic)
>
o
a
(Ponapeic)
>
O
k
(Bwaidoga)
>
P
r
(MonoUruava)
>
l
Not enough data available for reliable sound change estimates
Not enough data available for reliable sound change estimates
N
(SouthNewIrelandNorthwestSolomonic)
>
n
t
(CenderawasihBay)
>
P
a
(Sulawesi)
>
o
d
(ProtoChuuk)
>
t
v
(EastVanuatu)
>
w
v
(SinagoroKeapara)
>
h
r
(Bibling)
>
x
u
(Sulawesi)
>
o
N
(Western Munic)
>
n
d
(UlatInai)
>
r
r
(ProtoOcean)
>
l
a
(EastVanuatu)
>
E
r
(EndeLio)
>
z
i
(MalekulaCoastal)
>
e
n
(Willaumez)
>
l
a
(LavongaiNalik)
>
@
u
(CentralVanuatu)
>
i
a
(MalekulaCentral)
>
e
b
(MalekulaCoastal)
>
p
r
(SoutheastIslands)
>
l
a
(ProtoMicro)
>
e
Not enough data available for reliable sound change estimates
s
(NehanNorthBougainville)
>
h
N
(Nehan)
>
n
a
(SouthNewIrelandNorthwestSolomonic)
>
o
e
(Northern NewCaledonian)
>
a
o
(Utupua)
>
œ
a
(LoyaltyIslands)
>
e
m
(Vanuatu)
>
n
a
(MalekulaCentral)
>
e
e
(LoyaltyIslands)
>
a
k
(SouthNewIrelandNorthwestSolomonic)
>
g
e
(MesoMelanesian)
>
i
w
(BimaSumba)
>
v
i
(Aru)
>
e
f
(NorthNewGuinea)
>
w
Not enough data available for reliable sound change estimates
s
(Gela)
>
h
b
(CentralVanuatu)
>
p
p
(Sumatra)
>
f
f
(NilaSerua)
>
h
t
(TeunNilaSerua)
>
l
u
(Nehan)
>
w
a
(Tongic)
>
e
l
(North Babar)
>
n
R
(ProtoCentr)
>
r
v
(WesternOceanic)
>
p
a
(Nuclear PapuanTip)
>
e
a
(Northwest)
>
e
e
(Babar)
>
E
b
(Sulawesi)
>
w
u
(Vanuatu)
>
o
r
(NorthernLuzon)
>
g
m
(NorthernPhilippine)
>
n
N
(ProtoMalay)
>
n
o
(NorthernCordilleran)
>
u
l
(EastFormosan)
>
n
p
(Malaita)
>
b
a
(NewCaledonian)
>
o
Child
Functional load
Number of occurrences
(Megiar)
(Mekeo)
(MelanauKajang)
(MelanauMuk)
(Melayu)
(MelayuBrun)
0.04103
0.01057
0.05269
0.05542
0.17169
0.10667
3.96950
10.53926
4.10012
8.30760
15.57376
57.81798
(Mengen)
(Merei)
(MerinaMala)
(MesoMelanesian)
(MesoPhilippine)
(MicronesianProper)
(Minangkaba)
(Minyaifuin)
(Misima)
(Moken)
(Mokilese)
(Molima)
(Mono)
0.09236
0.12570
0.00538
0.06671
0.00192
0.14436
0.09530
0.08190
0.09569
0.00621
0.00863
0.02226
0.18018
7.76084
4.40581
45.42570
3.25436
27.69856
10.32542
36.65853
5.44290
2.85853
18.67524
20.21510
6.39306
7.90990
(MonoUruava)
(Mor)
(Mori)
(Mortlockes)
(Mota)
(Motu)
(Mouk)
(MunaButon)
(MunaKatobu)
(MurnatenAl)
(Mussau)
(Mwotlap)
(Nage)
(Nahavaq)
(NakanaiBil)
(Nalik)
(Namakir)
(Naman)
(Nati)
(Nauna)
(Nauru)
0.07198
0.00191
0.06991
0.10821
0.04447
0.01870
3.176e-05
0.04575
2.731e-04
0.00181
0.16171
0.01111
0.02129
0.05885
0.00412
0.00211
0.18127
0.13582
0.01049
0.10938
0.13661
4.50934
7.90698
15.10951
17.95820
4.78867
13.39891
16.89672
3.73293
6.02727
5.95393
3.95147
27.83527
1.96583
9.63004
1.02690
12.96772
21.52549
33.25250
19.54611
5.60587
5.67814
(Nehan)
(NehanHape)
(NehanNorthBougainville)
(Nelemwa)
(Nembao)
(Nengone)
(Nese)
(Neveei)
(NewCaledonian)
(NewGeorgia)
(NewIreland)
(Ngadha)
(NgaiborSAr)
(NgeroVitiaz)
0.05608
0.11803
0.16106
0.13215
0.01621
0.20861
0.35703
0.13582
0.20861
0.08381
0.06281
4.053e-05
0.02299
0.01667
13.16549
18.93949
4.34745
10.43158
4.98208
5.30862
18.91500
34.11813
1.71479
5.00348
3.31247
14.09575
6.77784
3.53470
(Nggela)
(Nguna)
(Nias)
(Nila)
(NilaSerua)
(Nissan)
(Niue)
(NorthBabar)
(NorthBomberai)
(NorthNewGuinea)
(NorthPapuanMainlandDEntrecasteaux)
(NorthSarawakan)
(North Babar)
(North Minahasan)
(NortheastVanuatuBanksIslands)
(NorthernCordilleran)
(NorthernLuzon)
(NorthernPhilippine)
(Northern Dumagat)
(Northern EastFormosan)
(Northern Malaita)
(Northern NewCaledonian)
0.05008
0.06113
0.00205
0.01125
0.13648
0.02937
0.18204
0.16784
0.02692
0.12924
0.27192
0.05269
0.03978
0.05030
0.06513
0.01688
0.15955
0.10751
0.01648
0.14981
0.00727
0.17212
17.07009
6.26520
8.90278
4.39065
1.88492
8.95963
4.31952
6.84401
10.12083
8.71798
3.13375
5.44048
1.27682
3.03447
4.83269
4.17784
5.87812
18.96525
1.25949
5.83985
4.99257
3.00312
continued on next page
20
continued from previous page
Code
(475)
(476)
(477)
(478)
(479)
(480)
(481)
(482)
(483)
(484)
(485)
(486)
(487)
(488)
(489)
(490)
(491)
(492)
(493)
(494)
(495)
(496)
(497)
(498)
(499)
(500)
(501)
(502)
(503)
(504)
(505)
(506)
(507)
(508)
(509)
(510)
(511)
(512)
(513)
(514)
(515)
(516)
(517)
(518)
(519)
(520)
(521)
(522)
(523)
(524)
(525)
(526)
(527)
(528)
(529)
(530)
(531)
(532)
(533)
(534)
(535)
(536)
(537)
(538)
(539)
(540)
(541)
(542)
(543)
(544)
(545)
(546)
(547)
(548)
(549)
(550)
(551)
(552)
(553)
(554)
(555)
(556)
Parent
b
(Bel)
>
p
N
(Sangiric)
>
n
q
(ProtoMalay)
>
P
l
(CentralCordilleran)
>
k
b
(Timor)
>
f
l
(PapuanTip)
>
n
N
(Polynesian)
>
n
r
(WestCentralPapuan)
>
l
t
(Ellicean)
>
d
f
(HuonGulf)
>
w
t
(CenderawasihBay)
>
k
f
(Seram)
>
h
a
(LocalMalay)
>
@
N
(Javanese)
>
n
t
(CentralVanuatu)
>
r
O
(Southern Malaita)
>
o
r
(EastVanuatu)
>
l
s
(Formosan)
>
t
a
(ProtoMalay)
>
e
1
(Palawano)
>
@
1
(MesoPhilippine)
>
u
Not enough data available for reliable sound change estimates
u
(Pangasinic)
>
o
Not enough data available for reliable sound change estimates
a
(WesternOceanic)
>
o
N
(SouthwestNewBritain)
>
n
v
(PatpatarTolai)
>
h
a
(SouthNewIrelandNorthwestSolomonic)
>
e
e
(SeramStraits)
>
i
u
(Formosan)
>
o
f
(Tahitic)
>
h
n
(Wetar)
>
N
u
(Central Bisayan)
>
o
e
(PapuanTip)
>
a
q
(ProtoMalay)
>
h
v
(EastVanuatu)
>
f
a
(ChamChru)
>
1
Not enough data available for reliable sound change estimates
v
(EastFijianPolynesian)
>
f
a
(Ponapeic)
>
e
a
(PonapeicTrukic)
>
e
t
(MicronesianProper)
>
d
e
(BimaSumba)
>
a
B
(TukangbesiBonerate)
>
F
e
(CentralEastern)
>
@
o
(PonapeicTrukic)
>
a
N
(ProtoAustr)
>
n
v
(RemoteOceanic)
>
f
a
(EasternMalayoPolynesian)
>
o
f
(SamoicOutlier)
>
w
a
(NorthBomberai)
>
e
u
(ProtoChuuk)
>
U
d
(ProtoChuuk)
>
t
d
(ProtoChuuk)
>
t
u
(KayanMurik)
>
o
b
(Formosan)
>
v
s
(EastVanuatu)
>
h
r
(CenderawasihBay)
>
l
f
(East Nuclear Polynesian)
>
h
a
(Tahitic)
>
e
a
(ProtoMalay)
>
@
i
(CentralEasternOceanic)
>
a
r
(Futunic)
>
g
O
(Choiseul)
>
o
r
(NorthNewGuinea)
>
z
r
(KisarRoma)
>
R
k
(Nuclear WestCentralPapuan)
>
h
f
(West NuclearTimor)
>
b
t
(WestFijianRotuman)
>
f
O
(West NewGeorgia)
>
o
u
(Formosan)
>
o
k
(Tahitic)
>
P
u
(Palawano)
>
o
f
(Southern Malaita)
>
h
Not enough data available for reliable sound change estimates
O
(Southern Malaita)
>
o
O
(Southern Malaita)
>
o
Not enough data available for reliable sound change estimates
u
(Tsouic)
>
o
a
(Northwest)
>
o
d
(ProtoChuuk)
>
t
u
(Formosan)
>
o
Child
Functional load
Number of occurrences
(Northern NuclearBel)
(Northern Sangiric)
(Northwest)
(NuclearCordilleran)
(NuclearTimor)
(Nuclear PapuanTip)
(Nuclear Polynesian)
(Nuclear WestCentralPapuan)
(Nukuoro)
(NumbamiSib)
(Numfor)
(Nunusaku)
(Ogan)
(OldJavanes)
(Orkon)
(Oroha)
(PaameseSou)
(Paiwan)
(Palauan)
(PalawanBat)
(Palawano)
0.08564
0.10504
0.04087
0.14865
0.08459
0.10099
0.01742
0.13055
3.974e-05
0.03605
0.18043
0.02050
0.00113
0.12968
0.18595
0.01151
0.17094
0.10317
0.04499
4.146e-04
0.02599
2.04429
18.92173
30.80231
1.01516
2.80832
4.22343
5.34206
1.52585
48.84748
5.99986
10.67064
9.64666
22.90283
22.81888
20.59925
1.99952
18.23342
11.06529
11.70163
36.88428
5.66511
(Pangasinan)
0.03899
25.70031
(PapuanTip)
(Pasismanua)
(Patpatar)
(PatpatarTolai)
(Paulohi)
(Pazeh)
(Penrhyn)
(Perai)
(Peripheral Central Bisayan)
(Peripheral PapuanTip)
(Pesisir)
(PeteraraMa)
(PhanRangCh)
0.15356
0.09977
0.00182
0.15474
0.18832
2.254e-04
0.02693
0.04788
0.02827
0.19334
0.07129
0.01242
0.00636
3.56136
3.11319
8.75307
6.20403
3.98409
6.81642
5.16875
4.13934
1.93754
3.54732
14.80076
17.18743
12.19166
(Polynesian)
(Ponapean)
(Ponapeic)
(PonapeicTrukic)
(Pondok)
(Popalia)
(ProtoCentr)
(ProtoChuuk)
(ProtoMalay)
(ProtoMicro)
(ProtoOcean)
(Pukapuka)
(PulauArgun)
(PuloAnna)
(PuloAnnan)
(Puluwatese)
(PunanKelai)
(Puyuma)
(Raga)
(RajaAmpat)
(RapanuiEas)
(Rarotongan)
(RejangReja)
(RemoteOceanic)
(Rennellese)
(Ririo)
(Riwo)
(Roma)
(Roro)
(RotiTerman)
(Rotuman)
(Roviana)
(Rukai)
(Rurutuan)
(SWPalawano)
(Saa)
0.05871
0.16942
0.13099
0.04291
0.13262
0.00989
0.00248
0.09341
0.13485
0.04980
0.09981
0.00204
0.13318
8.595e-06
0.10821
0.10821
0.00695
0.02080
0.03550
0.10632
0.05254
0.23947
0.00259
0.30671
0.01560
0.01953
0.01094
0.00324
0.04267
0.05510
0.07237
0.00573
2.254e-04
0.00737
7.014e-04
0.00833
17.52595
10.52294
19.78210
17.55398
6.53158
10.98695
3.97165
2.83945
12.99231
13.02087
7.99270
8.90305
12.93777
28.37331
26.92081
32.33646
11.47865
5.97573
19.03806
10.89494
6.51440
2.99725
33.94479
1.87121
46.78448
8.10758
1.23968
9.86839
7.79458
4.89359
18.85436
6.86288
9.26163
41.17228
10.93716
23.36029
(SaaSaaVill)
(SaaUkiNiMa)
0.01151
0.01151
2.00000
1.99961
(Saaroa)
(Sabahan)
(SaipanCaro)
(Saisiat)
0.00593
0.01254
0.10821
2.254e-04
29.89765
7.65235
30.24687
32.49754
continued on next page
21
continued from previous page
Code
(557)
(558)
(559)
(560)
(561)
(562)
(563)
(564)
(565)
(566)
(567)
(568)
(569)
(570)
(571)
(572)
(573)
(574)
(575)
(576)
(577)
(578)
(579)
(580)
(581)
(582)
(583)
(584)
(585)
(586)
(587)
(588)
(589)
(590)
(591)
(592)
(593)
(594)
(595)
(596)
(597)
(598)
(599)
(600)
(601)
(602)
(603)
(604)
(605)
(606)
(607)
(608)
(609)
(610)
(611)
(612)
(613)
(614)
(615)
(616)
(617)
(618)
(619)
(620)
(621)
(622)
(623)
(624)
(625)
(626)
(627)
(628)
(629)
(630)
(631)
(632)
(633)
(634)
(635)
(636)
(637)
(638)
Parent
a
(Vanuatu)
>
E
v
(Suauic)
>
h
e
(ProtoMalay)
>
a
a
(SuluBorneo)
>
1
u
(CentralLuzon)
>
o
k
(SamoicOutlier)
>
P
r
(Nuclear Polynesian)
>
l
Not enough data available for reliable sound change estimates
h
(Northern Sangiric)
>
r
q
(Northern Sangiric)
>
P
P
(Northern Sangiric)
>
q
a
(Sulawesi)
>
e
l
(SanCristobal)
>
r
O
(SanCristobal)
>
o
s
(SouthNewIrelandNorthwestSolomonic)
>
h
l
(NehanNorthBougainville)
>
n
a
(Blaan)
>
u
o
(SarmiJayapuraBay)
>
a
l
(NorthNewGuinea)
>
r
a
(BaliSasak)
>
@
a
(ProtoChuuk)
>
e
a
(BimaSumba)
>
e
n
(NorthNewGuinea)
>
N
a
(Atayalic)
>
u
f
(Western AdmiraltyIslands)
>
h
u
(NorthBomberai)
>
i
b
(SoutheastMaluku)
>
h
Not enough data available for reliable sound change estimates
E
(Pasismanua)
>
e
r
(East CentralMaluku)
>
l
l
(Nunusaku)
>
r
e
(MaselaSouthBabar)
>
E
f
(NilaSerua)
>
w
N
(PatpatarTolai)
>
n
n
(FloresLembata)
>
N
f
(Ellicean)
>
h
O
(West NewGeorgia)
>
o
a
(CentralPapuan)
>
o
a
(LandDayak)
>
o
u
(EastFormosan)
>
o
O
(Choiseul)
>
o
Not enough data available for reliable sound change estimates
r
(BimaSumba)
>
z
a
(Sarmi)
>
e
k
(CentralMaluku)
>
P
a
(NehanNorthBougainville)
>
e
Not enough data available for reliable sound change estimates
l
(Ambon)
>
r
r
(NorthernLuzon)
>
l
n
(MaselaSouthBabar)
>
l
a
(CentralVanuatu)
>
e
v
(SouthHalmaheraWestNewGuinea)
>
p
u
(EasternMalayoPolynesian)
>
i
i
(NewIreland)
>
u
u
(Sulawesi)
>
o
t
(Babar)
>
k
l
(Bisayan)
>
y
a
(Manobo)
>
1
y
(West Barito)
>
i
o
(Eastern AdmiraltyIslands)
>
u
R
(ProtoCentr)
>
r
r
(CentralEasternOceanic)
>
l
t
(SouthCentralCordilleran)
>
b
e
(ProtoMalay)
>
1
l
(Malaita)
>
n
o
(South Babar)
>
u
b
(Timor)
>
w
i
(Vitiaz)
>
u
u
(Atayalic)
>
o
k
(Suauic)
>
P
r
(Nuclear PapuanTip)
>
l
1
(Subanun)
>
o
a
(SouthernPhilippine)
>
1
g
(Subanun)
>
d
e
(ProtoMalay)
>
a
u
(SamaBajaw)
>
o
e
(ProtoMalay)
>
o
N
(ProtoMalay)
>
n
u
(South Bisayan)
>
o
a
(Erromanga)
>
e
t
(SouthSulawesi)
>
P
N
(Tboli)
>
n
Child
Functional load
Number of occurrences
(SakaoPortO)
(Saliba)
(SamaBajaw)
(SamalSiasi)
(SambalBoto)
(Samoan)
(SamoicOutlier)
1.415e-05
0.01548
0.04499
0.03121
0.03570
7.913e-05
0.04761
7.37912
5.59439
11.37100
4.55207
51.44942
40.57180
2.52276
(SangilSara)
(Sangir)
(SangirTabu)
(Sangiric)
(SantaAna)
(SantaCatal)
(SantaIsabel)
(SaposaTinputz)
(SaranganiB)
(Sarmi)
(SarmiJayapuraBay)
(Sasak)
(Satawalese)
(Savu)
(Schouten)
(Sediq)
(Seimat)
(Sekar)
(Selaru)
0.01859
0.17663
0.17663
0.06360
0.20803
0.00562
0.01828
0.13005
0.10239
0.30507
0.09033
0.07763
0.11108
0.13262
0.12461
0.15623
0.03838
0.07251
0.00154
10.57112
33.66747
5.11841
9.41251
20.89020
1.99971
6.35726
8.23827
1.65720
3.74591
9.81639
7.22450
19.48892
10.66529
3.55253
4.94540
5.52135
14.57543
5.27609
(Sengseng)
(Seram)
(SeramStraits)
(Serili)
(Serua)
(Siar)
(Sika)
(Sikaiana)
(Simbo)
(SinagoroKeapara)
(Singhi)
(Siraya)
(Sisingga)
0.00635
0.17834
0.08036
0.03391
0.09929
0.11079
0.12345
0.01500
0.00573
0.25340
0.01654
4.190e-04
0.01953
6.51904
9.26050
17.65607
27.97054
7.71214
7.91352
10.33270
18.73830
3.68466
1.00322
17.32791
8.28489
13.70360
(Soa)
(Sobei)
(Soboyo)
(Solos)
0.00401
0.23073
0.00549
0.10574
6.96601
10.06460
7.46123
4.14147
(SouAmanaTe)
(SouthCentralCordilleran)
(SouthEastB)
(SouthEfate)
(SouthHalmahera)
(SouthHalmaheraWestNewGuinea)
(SouthNewIrelandNorthwestSolomonic)
(SouthSulawesi)
(South Babar)
(South Bisayan)
(South Manobo)
(South West Barito)
(SoutheastIslands)
(SoutheastMaluku)
(SoutheastSolomonic)
(SouthernCordilleran)
(SouthernPhilippine)
(Southern Malaita)
(SouthwestBabar)
(SouthwestMaluku)
(SouthwestNewBritain)
(SquliqAtay)
(Suau)
(Suauic)
(SubanonSio)
(Subanun)
(SubanunSin)
(Sulawesi)
(SuluBorneo)
(Sumatra)
(Sunda)
(Surigaonon)
(SyeErroman)
(TaeSToraja)
(Tagabili)
0.03275
0.03231
0.25859
0.06242
0.04413
0.15907
0.17058
0.04575
0.08233
0.06356
0.09378
0.00602
0.01939
0.02692
0.18698
0.17112
0.00192
0.09412
0.05720
0.05085
0.23570
0.00681
0.02509
0.04206
0.03138
0.05979
0.15626
0.04499
0.02973
0.00512
0.10751
0.03947
0.11628
0.06897
0.10214
3.58352
7.74272
6.79647
9.08441
6.86021
2.91215
3.74428
8.90368
5.96986
5.93967
14.76928
4.16231
2.52762
16.51716
6.93482
1.97141
17.69275
3.02851
5.18676
6.39151
3.84871
13.92923
20.06806
7.42557
42.58704
13.92604
6.97087
11.63780
3.89781
23.81901
21.82474
24.45484
12.15031
3.10286
23.38929
continued on next page
22
continued from previous page
Code
(639)
(640)
(641)
(642)
(643)
(644)
(645)
(646)
(647)
(648)
(649)
(650)
(651)
(652)
(653)
(654)
(655)
(656)
(657)
(658)
(659)
(660)
(661)
(662)
(663)
(664)
(665)
(666)
(667)
(668)
(669)
(670)
(671)
(672)
(673)
(674)
(675)
(676)
(677)
(678)
(679)
(680)
(681)
(682)
(683)
(684)
(685)
(686)
(687)
(688)
(689)
(690)
(691)
(692)
(693)
(694)
(695)
(696)
(697)
(698)
(699)
(700)
(701)
(702)
(703)
(704)
(705)
(706)
(707)
(708)
(709)
(710)
(711)
(712)
(713)
(714)
(715)
(716)
(717)
(718)
(719)
(720)
Parent
u
(CentralPhilippine)
>
o
k
(Kalamian)
>
q
q
(Kalamian)
>
k
u
(Tahitic)
>
o
e
(Tahitic)
>
a
a
(Tahitic)
>
e
f
(Central East Nuclear Polynesian)
>
h
v
(SaposaTinputz)
>
f
l
(Ellicean)
>
r
Not enough data available for reliable sound change estimates
Not enough data available for reliable sound change estimates
Not enough data available for reliable sound change estimates
Not enough data available for reliable sound change estimates
h
(Guadalcanal)
>
G
f
(Wetar)
>
h
e
(Vanikoro)
>
a
a
(PatpatarTolai)
>
e
e
(Utupua)
>
u
a
(Vanuatu)
>
@
r
(Tanna)
>
l
a
(Vanuatu)
>
e
Not enough data available for reliable sound change estimates
f
(Sarmi)
>
p
o
(ButuanTausug)
>
u
Not enough data available for reliable sound change estimates
a
(Bilic)
>
O
O
(Tboli)
>
o
O
(Vanikoro)
>
o
l
(SouthwestBabar)
>
n
s
(SaposaTinputz)
>
h
N
(East NuclearTimor)
>
n
t
(TeunNilaSerua)
>
P
e
(SouthwestMaluku)
>
E
b
(WesternPlains)
>
f
u
(LavongaiNalik)
>
a
a
(LavongaiNalik)
>
o
r
(Futunic)
>
l
R
(ProtoCentr)
>
r
e
(Dayic)
>
o
s
(Northern Malaita)
>
8
k
(Sumatra)
>
h
s
(SamoicOutlier)
>
h
g
(Guadalcanal)
>
h
a
(Tongic)
>
o
s
(Polynesian)
>
h
l
(North Minahasan)
>
d
b
(North Minahasan)
>
w
n
(MonoUruava)
>
l
a
(Tsouic)
>
o
d
(Formosan)
>
c
n
(Tahitic)
>
N
a
(Wetar)
>
i
a
(MunaButon)
>
o
a
(LavongaiNalik)
>
e
u
(Barito)
>
o
s
(Ellicean)
>
h
p
(Are)
>
f
Not enough data available for reliable sound change estimates
p
(Aru)
>
f
h
(Nunusaku)
>
b
o
(Erromanga)
>
e
l
(MonoUruava)
>
r
a
(EasternOuterIslands)
>
o
s
(SamoicOutlier)
>
h
r
(Futunic)
>
l
r
(Futunic)
>
l
u
(Choiseul)
>
@
Not enough data available for reliable sound change estimates
a
(EasternOuterIslands)
>
e
o
(Vanikoro)
>
e
o
(RemoteOceanic)
>
u
o
(Choiseul)
>
O
Not enough data available for reliable sound change estimates
r
(SinagoroKeapara)
>
l
a
(NgeroVitiaz)
>
e
s
(MesoMelanesian)
>
d
t
(HuonGulf)
>
r
s
(BimaSumba)
>
h
Not enough data available for reliable sound change estimates
v
(CenderawasihBay)
>
w
r
(GeserGorom)
>
l
l
(AreTaupota)
>
r
Child
Functional load
Number of occurrences
(TagalogAnt)
(TagbanwaAb)
(TagbanwaKa)
(Tahiti)
(TahitianMo)
(Tahitianth)
(Tahitic)
(Taiof)
(Takuu)
0.01883
0.42879
0.42879
0.19238
0.23947
0.23947
0.05235
0.04497
0.01827
10.16369
5.41452
17.73591
6.98869
3.68949
2.71372
6.11830
6.79178
30.88407
(TalisePole)
(Talur)
(Tanema)
(Tanga)
(Tanimbili)
(Tanna)
(TannaSouth)
(Tape)
5.670e-05
0.03346
0.28248
0.10990
0.10880
0.01056
0.05929
0.08559
1.97576
6.74401
10.81075
12.52910
6.71919
13.80984
26.23479
36.62961
(Tarpia)
(TausugJolo)
0.09285
0.03983
4.76042
38.80244
(Tboli)
(TboliTagab)
(Teanu)
(TelaMasbua)
(Teop)
(TetunTerik)
(Teun)
(TeunNilaSerua)
(Thao)
(Tiang)
(Tigak)
(Tikopia)
(Timor)
(TimugonMur)
(Toambaita)
(TobaBatak)
(Tokelau)
(Tolo)
(Tongan)
(Tongic)
(Tonsea)
(Tontemboan)
(Torau)
(Tsou)
(Tsouic)
(Tuamotu)
(Tugun)
(TukangbesiBonerate)
(TungagTung)
(Tunjung)
(Tuvalu)
(Ubir)
0.01846
0.02446
0.13019
0.17076
0.06456
0.02632
0.01477
0.00128
0.01723
0.23417
0.10194
0.05835
0.02692
0.00467
0.01236
0.01209
0.02620
0.04461
0.28401
0.04938
0.05604
0.12446
0.17242
0.00357
0.02777
0.00257
0.34504
0.19256
0.07781
0.00125
0.01029
0.00167
18.79298
2.78683
24.94671
14.35748
11.64051
3.88184
8.79107
6.36734
9.66201
7.41175
2.85222
3.75175
17.70656
20.83722
5.00367
10.18636
9.71242
9.43634
6.49634
24.85227
2.89546
9.61430
4.80739
26.52983
7.43830
17.01166
3.34154
6.52913
8.41577
6.26453
12.42513
1.99937
(UjirNAru)
(UlatInai)
(Ura)
(Uruava)
(Utupua)
(UveaEast)
(UveaWest)
(VaeakauTau)
(Vaghua)
0.01141
0.07007
0.05761
0.18018
0.19429
0.02620
0.05835
0.05835
3.654e-05
10.59153
6.57134
9.45115
8.71608
6.30089
28.16233
27.69390
41.42212
19.13794
(Vanikoro)
(Vano)
(Vanuatu)
(Varisi)
0.33088
0.10677
0.11711
0.01953
4.43031
4.13392
14.34763
6.43868
(Vilirupu)
(Vitiaz)
(Vitu)
(Wampar)
(Wanukaka)
0.17037
0.13267
0.02619
0.06174
0.03743
8.15153
4.47748
6.27238
5.81284
14.17816
(Waropen)
(Watubela)
(Wedau)
0.08865
0.17521
0.04316
4.74200
13.68799
3.56623
continued on next page
23
continued from previous page
Code
(721)
(722)
(723)
(724)
(725)
(726)
(727)
(728)
(729)
(730)
(731)
(732)
(733)
(734)
(735)
(736)
(737)
(738)
(739)
(740)
(741)
(742)
(743)
(744)
(745)
(746)
(747)
(748)
(749)
(750)
(751)
(752)
(753)
(754)
(755)
Parent
Child
i
(BimaSumba)
>
e
(WejewaTana)
u
(Seram)
>
v
(Werinama)
t
(CentralPapuan)
>
k
(WestCentralPapuan)
a
(ProtoCentr)
>
o
(WestDamar)
f
(CentralPacific)
>
h
(WestFijianRotuman)
k
(NortheastVanuatuBanksIslands)
>
h
(WestSanto)
a
(Barito)
>
e
(West Barito)
a
(Central Manobo)
>
1
(West Central Manobo)
a
(Manus)
>
e
(West Manus)
O
(NewGeorgia)
>
o
(West NewGeorgia)
r
(NuclearTimor)
>
l
(West NuclearTimor)
Not enough data available for reliable sound change estimates
1
(West Central Manobo)
>
e
(WesternBuk)
s
(WestFijianRotuman)
>
c
(WesternFij)
Not enough data available for reliable sound change estimates
a
(ProtoOcean)
>
e
(WesternOceanic)
d
(Formosan)
>
s
(WesternPlains)
s
(AdmiraltyIslands)
>
h
(Western AdmiraltyIslands)
a
(MunaButon)
>
o
(Western Munic)
t
(SouthwestMaluku)
>
k
(Wetar)
r
(MesoMelanesian)
>
l
(Willaumez)
b
(CentralWestern)
>
v
(WindesiWan)
P
(Manam)
>
k
(Wogeo)
k
(ProtoChuuk)
>
g
(Woleai)
a
(ProtoChuuk)
>
e
(Woleaian)
a
(Sulawesi)
>
o
(Wolio)
w
(Western Munic)
>
v
(Wuna)
t
(Western AdmiraltyIslands)
>
P
(Wuvulu)
a
(HuonGulf)
>
e
(Yabem)
a
(SamaBajaw)
>
e
(Yakan)
a
(KeiTanimbar)
>
e
(Yamdena)
o
(Bashiic)
>
u
(Yami)
a
(ProtoOcean)
>
i
(Yapese)
a
(Javanese)
>
e
(Yogya)
o
(West SantaIsabel)
>
O
(ZabanaKia)
24
Functional load
Number of occurrences
0.04706
3.739e-05
0.13745
0.02891
0.00402
0.01801
0.06510
0.12386
0.16506
0.00631
0.14621
11.31057
7.78342
14.10036
9.89554
4.95883
2.59859
2.67923
21.47034
7.10396
2.83443
3.15074
4.647e-04
0.00926
76.71920
4.99871
0.12139
0.04913
0.01688
0.19256
0.09657
0.14060
0.16193
0.07162
0.00179
0.11108
0.06991
0.00508
0.00203
0.13370
0.04299
0.18822
0.00368
0.27352
0.08208
0.02351
2.58873
4.48311
4.34207
11.48592
3.25493
7.57978
3.04357
8.86144
35.44518
60.39379
8.71005
2.73532
11.89389
6.51629
27.73964
17.50706
5.79971
4.64937
4.09979
13.69682
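The "Functional load" column in Table S.5 quantifies how much work a sound contrast does in distinguishing words. As a rough illustration, one common entropy-based definition (due to Hockett; whether it matches the exact variant used in this paper is an assumption here) measures the relative drop in the entropy of a wordlist when two sounds are merged. A minimal sketch on a toy wordlist:

```python
from collections import Counter
from math import log2


def entropy(words):
    """Shannon entropy (bits) of the empirical distribution over word forms."""
    counts = Counter(words)
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())


def functional_load(words, x, y):
    """Relative entropy drop when sounds x and y are merged (Hockett-style sketch).

    Returns 0 when the pair never distinguishes word forms in the list.
    """
    h = entropy(words)
    merged = [w.replace(y, x) for w in words]  # collapse the x/y contrast
    return (h - entropy(merged)) / h if h else 0.0


words = ["pat", "bat", "pin", "bin", "top"]
print(functional_load(words, "p", "b"))  # positive: merging collapses minimal pairs
print(functional_load(words, "t", "k"))  # 0.0: no word contains k, nothing merges
```

Merging a contrastive pair such as p/b collapses the minimal pairs pat/bat and pin/bin, yielding a positive load; merging sounds that never distinguish words yields zero.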
Figure S.1: Percentage of words with varying levels of Levenshtein distance. Known Cognates (gold) were hand-annotated by linguists, while Automatic Cognates were found by our system.
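The distance reported in Figure S.1 is the standard Levenshtein (edit) distance between a reconstruction and the reference form. A minimal character-level dynamic-programming sketch (the paper's evaluation may operate on phoneme tokens or normalize by length, which is not assumed here):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions, cost 1 each."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # delete ca
                            curr[j - 1] + 1,              # insert cb
                            prev[j - 1] + (ca != cb)))    # substitute (free if equal)
        prev = curr
    return prev[-1]


print(levenshtein("kata", "kato"))  # 1: one substitution
```

A distance of at most one character, the criterion used in the main text for the 85% figure, corresponds to `levenshtein(reconstruction, reference) <= 1`.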
Figure S.2: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross-referenced with the code in parentheses attached to each branch.
Figure S.3: Branch-specific, most frequent estimated changes. See Table S.5 for more information, crossreferenced with the code in parenthesis attached to each branch.
[Figure S.4: phylogenetic tree not reproducible in text form.]
Figure S.4: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross-referenced with the code in parentheses attached to each branch.
[Figure S.5: phylogenetic tree not reproducible in text form.]
Figure S.5: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross-referenced with the code in parentheses attached to each branch.
[Figure S.6: three bar charts, recoverable data summarized below.]
(A) Substitution errors as (reference, hypothesized) pairs, ranked by fraction of substitution errors (axis 0–0.4): (p, v), (o, ə), (r, d), (ʀ, r), (p, b), (w, o), (i, e), (c, s), (u, i), (e, i), (b, p), (o, w), (i, u), (k, g), (j, s), others.
(B, C) Individual phonemes ranked by fraction of insertion and deletion errors, respectively (axes 0–0.300, ticks at 0.075, 0.150, 0.225); the two extracted phoneme rankings are: a, i, m, e, t, r, s, k, u, l, ŋ, p, q, v, n, others; and q, a, ʀ, n, k, s, t, p, u, m, ŋ, r, w, i, o, others.
Figure S.6: Most common substitution errors, insertion errors, and deletion errors in the PAn reconstructions produced by our system. In (A), the first phoneme in each pair (x, y) represents the reference phoneme, followed by the incorrectly hypothesized one. In (B), each phoneme corresponds to a phoneme present in the automatic reconstruction but not in the reference, and vice versa in (C).
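The error breakdown in Figure S.6 can be computed from pairs of automatic and reference reconstructions. As an illustrative sketch only (the alignment procedure is not specified here), a standard Levenshtein alignment classifies each mismatch as a substitution, insertion, or deletion; the function name `align_errors` and the example word forms are hypothetical.

```python
# Sketch (not the paper's exact procedure): classify reconstruction errors as
# substitutions, insertions, and deletions via a Levenshtein alignment between
# an automatic reconstruction (hypothesis) and a reference form.
from collections import Counter

def align_errors(reference, hypothesis):
    """Return Counters of substitution pairs, inserted, and deleted phonemes."""
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    subs, ins, dels = Counter(), Counter(), Counter()
    i, j = n, m
    while i > 0 or j > 0:  # trace back one optimal alignment
        if (i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1]
                and reference[i - 1] == hypothesis[j - 1]):
            i, j = i - 1, j - 1                       # exact match
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            subs[(reference[i - 1], hypothesis[j - 1])] += 1
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            ins[hypothesis[j - 1]] += 1  # in hypothesis, absent from reference
            j -= 1
        else:
            dels[reference[i - 1]] += 1  # in reference, absent from hypothesis
            i -= 1
    return subs, ins, dels

# Hypothetical example forms: one substitution (t -> z).
subs, ins, dels = align_errors(list("patay"), list("pazay"))
```

Aggregating these Counters over all cognate sets and normalizing by their totals yields the fractions plotted in the three panels of Figure S.6.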