Linguistic Divergence in Austronesian on the Island of Timor

Riley Burkart
August 15, 2014
Abstract
The island of Timor, located in Southeast Asia, experienced a wave of
migration from the neighboring islands several thousand years ago, resulting in its domination by the Malayo-Polynesian branch of the Austronesian languages. The spread of these languages and their relationships to
each other remain poorly understood. Using statistical tools and computation, a data set composed of words from 53 languages will be analyzed
in order to explore a new method for estimating linguistic divergence that
has only previously been applied to the Turkic languages. Regular sound
correspondences will be inferred and cognates will be identified to estimate the amount of time that has passed since the languages split. These
new methods, incorporating linguistic knowledge, will hopefully compare
favorably with current computational strategies and will contribute to
a deeper understanding of the movement of the Austronesian people in
Timor.
General Discussion
Located near the eastern end of the Lesser Sunda Islands, the small island
of Timor demonstrates a remarkable degree of linguistic diversity, containing
at least 16 distinct indigenous languages. First settled by Australoid peoples
between 40,000 and 20,000 BC, Timor has since experienced at least two successive migrations. The first of these occurred around 3,000 BC and consisted of
Melanesians who spoke Papuan languages and migrated from the nearby island
of New Guinea. Around 2,500 BC, Austronesians from the north settled on the
island, where they remain the predominant ethnic group (Timor, 2010).
The Austronesian language family originated in South China before spreading to Taiwan between 3,000 and 4,000 BC, around the advent of rice cultivation. Over the next several millennia, the Austronesian people would continue
to spread to other islands, and today, an estimated 270 million people speak
between 1,000 and 1,200 distinct Austronesian languages distributed more than
halfway around the world, from Easter Island to Madagascar. The Austronesian
languages spoken on the island of Timor, as well as those spoken on the nearby
island of Sumba, fall under the Central Malayo-Polynesian subgroup (Bellwood
et al., 1995).
Aside from Indonesian and Portuguese, the only languages commonly spoken
on Timor today are either Papuan or Austronesian. Papuan languages, including Bunak, Makasai, and Fataluku, are largely confined to the mountainous
regions of the island, and are believed to have originated from a single common ancestor. The Austronesian languages of Timor are generally referred to as
Timoric, and are divided by Geoffrey Hull (2004) into two subgroups: Fabronic
and Ramelaic. According to this classification, the significant languages in the
Fabronic group are Tetum and Atoni, while those in the Ramelaic group are
Mambai and Kemak. Ramelaic languages are believed to differ from Fabronic
languages mainly in that they were influenced by a stronger Papuan substratum and a greater degree of creolization from trade languages such as Malay
and Ambonese during the 15th century (Hull, 2004).
Because they have not been extensively studied, the relationships between
the Timoric languages remain rather poorly understood. Using cognate analysis,
however, maps and phylogenies can be created to correct this. Two words
are considered to be cognates if they share a common etymological origin; in
other words, they must have a similar meaning and share systematic sound
correspondences. The number of cognates shared between two languages can be
used to estimate the distance between the languages. Assuming cognate
replacement to be a memoryless process, the probability that two languages
share a cognate for a given meaning can be modeled as $P = e^{-d}$, where $d$ is the distance between the
languages (Dyen, 1973). Replacing the probability with $m/n$,
where $m$ is the number of cognates two languages share and $n$ is the number of
words compared between the languages, and solving for $d$ yields:
$$d = -\log\frac{m}{n}$$
Using this equation in conjunction with a list of words in different languages
with their cognates identified, a distance matrix can be created. Methods such as
multidimensional scaling and neighbor-joining can then be utilized to transform
such distance matrices into maps and trees, respectively.
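As a minimal sketch of this step, the following Python snippet converts shared-cognate counts into distances with d = -log(m/n) and assembles a symmetric distance matrix. The language labels and counts are invented for illustration, and the helper names are hypothetical; the analysis in this paper was carried out in R, not with this code.

```python
import math

def cognate_distance(shared, compared):
    """Distance d = -log(m/n) for m shared cognates out of n compared words."""
    if compared <= 0 or shared <= 0:
        # With no shared cognates the model places the pair infinitely far apart.
        return math.inf
    return -math.log(shared / compared)

def distance_matrix(names, counts):
    """Build a symmetric matrix from {(a, b): (shared, compared)} counts."""
    index = {name: k for k, name in enumerate(names)}
    matrix = [[0.0] * len(names) for _ in names]
    for (a, b), (shared, compared) in counts.items():
        d = cognate_distance(shared, compared)
        matrix[index[a]][index[b]] = d
        matrix[index[b]][index[a]] = d
    return matrix

# Invented counts, for illustration only.
names = ["lang_a", "lang_b", "lang_c"]
counts = {
    ("lang_a", "lang_b"): (80, 100),
    ("lang_a", "lang_c"): (50, 100),
    ("lang_b", "lang_c"): (40, 100),
}
matrix = distance_matrix(names, counts)
```

Pairs that share fewer cognates receive larger distances, and a pair sharing every compared word would receive a distance of zero.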
A significant drawback to the use of such distance matrix methods, however,
lies in their tacit assumption that the rate of cognate replacement is the same
for all meanings. It has been shown that this is not the case, as can be seen
when comparing words like 'mother', which are replaced very slowly, to words
like 'food', which are replaced over a much shorter period of time (Dyen
et al., 1967). A Bayesian Markov Chain Monte Carlo model applied to the cognates can correct for this by directly estimating the rates of cognate replacement
(Pagel et al., 2007).
There remains, nevertheless, an inherent problem with any method that uses
cognates to determine the distance between languages. Given three cognates
such as the English hound, the German hund, and the Russian suka, a cognate-based analysis would assign all three of these words an identical
cognate code and treat them the same. It is clear, however, that hound and
hund are far more similar to each other than either is to suka, a fact that the
cognate-based approach fails to take into account.
A new method that analyzes the sound correspondences rather than cognates and has only been used before in the Turkic languages will be applied
to the Timoric languages in order to produce a more accurate phylogeny for
the island. To do so, the phonemes in the languages must first be aligned, so
that regular and irregular sound correspondences can be inferred. Finally, these
sound correspondences will be subjected to MCMC methods, yielding a more
accurate phylogenetic tree.
Data Set
The data set used in the analysis is composed of 7,118 words from 52 distinct
language samples (hereafter referred to as 'languages') and was collected by
Stephen Lansing and his team. In addition to languages from the island of
Timor, Balinese and several languages from the island of Sumba are also included, but Mambai is not. Each word is written in IPA and is listed with two
other associated properties: a cognate code and a gloss, which is the meaning
of the word.
Several difficulties are encountered in using the list. For one, many of the
languages lack cognate codes, rendering them unusable for cognate analysis.
Furthermore, the cognate codes 6 and 4142 denote errors and
isolates, respectively, so words carrying these codes must also be discarded. Finally, the glosses are
in Indonesian and have not yet been translated into English; for this reason, they
have thus far not been used in this research.
Cognate Analysis
Classical Multidimensional Scaling
One way to examine the relationships between languages is through the creation
of a map of the languages’ locations, and one method of mapping is to use
classical multidimensional scaling, or CMDS. Given a distance matrix, CMDS
sets up a matrix of squared distances, to which it then applies double centering.
It then extracts the two largest positive eigenvalues and their corresponding
eigenvectors, which are used to create a coordinate matrix that maps the points
in two dimensions (Wickelmaier, 2003).
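The double-centering step can be illustrated with a short Python sketch. It is not the cmdscale implementation; the three points and the function name are hypothetical, and the eigendecomposition that follows this step in CMDS is omitted. The sketch only checks that double centering a matrix of squared distances yields the inner products of the mean-centered coordinates.

```python
def double_center(sq):
    """Return B = -1/2 * J * sq * J, where J = I - (1/n) * ones."""
    n = len(sq)
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - col_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

# Three hypothetical points on a line at positions 0, 1, and 3.
positions = [0.0, 1.0, 3.0]
squared = [[(p - q) ** 2 for q in positions] for p in positions]
B = double_center(squared)

# B should equal the inner products of the mean-centered positions,
# which is why its leading eigenvectors recover the coordinates.
center = sum(positions) / len(positions)
gram = [[(p - center) * (q - center) for q in positions] for p in positions]
```

Because B is a matrix of inner products, scaling its leading eigenvectors by the square roots of their eigenvalues gives coordinates whose Euclidean distances approximate the input distances.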
Because the map is dependent on only two eigenvalues, it is important that
these eigenvalues are significantly larger than all the others. If this is not the
case, then the map produced will have a high degree of error. This fact can
be used to determine a goodness of fit for the map. Two such methods are
implemented in R, and can be seen below:
$$\mathrm{GOF}_1 = \frac{\sum_{i=1}^{m} |\lambda_i|}{\sum_{i=1}^{n} |\lambda_i|} = 0.5372$$
$$\mathrm{GOF}_2 = \frac{\sum_{i=1}^{m} \max(0, \lambda_i)}{\sum_{i=1}^{n} \max(0, \lambda_i)} = 0.6682$$
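The two goodness-of-fit ratios can be computed from a list of eigenvalues as in the sketch below, which takes m to be the number of retained dimensions and n the total number of eigenvalues (an interpretation assumed here). The eigenvalues are invented for illustration and are not those of the Timor data set; the function name is hypothetical.

```python
def goodness_of_fit(eigenvalues, m=2):
    """Return (GOF1, GOF2) for the m largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    # GOF1 compares absolute values, so negative eigenvalues count against the fit.
    gof1 = sum(abs(v) for v in ordered[:m]) / sum(abs(v) for v in ordered)
    # GOF2 ignores negative eigenvalues entirely.
    gof2 = (sum(max(0.0, v) for v in ordered[:m])
            / sum(max(0.0, v) for v in ordered))
    return gof1, gof2

# Invented eigenvalues, for illustration only.
eigenvalues = [4.0, 2.0, 1.0, 0.5, -0.5]
gof1, gof2 = goodness_of_fit(eigenvalues)
```

Both values approach 1 when the first m eigenvalues dominate the spectrum, and GOF2 is never smaller than GOF1 when the retained eigenvalues are positive.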
The results of the CMDS, as applied by the cmdscale command in R, can
be seen in Figure 1. Because neither goodness-of-fit value is close to 1, it
is not surprising to see that the CMDS only managed to resolve three rough
groups of languages, which can be categorized as Sumba, Tetum, and Atoni; the
Kemak group has been mixed in with the Tetum languages. When the resulting
distances are plotted against the original distance matrix, a large, aberrant
group of language comparisons, predominantly composed of Kemak languages,
can be seen.
[Figure 1. Left panel: CMDS map with clusters labeled Sumba, Tetum, and Atoni. Right panel: scatter plot with axes labeled CMDS and Distance Matrix.]
Figure 1: On the left is the CMDS map, which displays three clusters of languages
that correspond roughly to Sumba, Tetum, and Atoni. On the right is the plot
of the CMDS distances by the original distance matrix. It is roughly monotonic,
as it should be, but contains a major aberration.
It is clear that the CMDS method is insufficient for describing the relationships between the languages, as it cannot distinguish between certain recognized
language groups. Additionally, the goodness of fit is rather poor. One of the
problems with CMDS is that it gives priority to representing the languages with
the exact Euclidean distances that appear in the distance matrix, which is not
always possible.
Ordinal Multidimensional Scaling
As opposed to CMDS, ordinal or non-metric multidimensional scaling (NMDS)
disregards the actual magnitudes of the distances in favor of their rank order. Using a monotonically
increasing function $\theta$, each original distance $\delta_{ij}$ is transformed into a disparity $\theta(\delta_{ij})$, which
is an ordinal value. Using $\theta$, the method attempts to minimize a value known
as stress:
$$\mathrm{STRESS} = \frac{\sum_{i<j} \left(\theta(\delta_{ij}) - d_{ij}\right)^2}{\sum_{i<j} d_{ij}^2}$$
where $d_{ij}$ is the distance between languages $i$ and $j$ in the map that is being produced by the NMDS
(Wickelmaier, 2003). When 2-dimensional NMDS is used on the data set, the
stress value comes out to 9.9141%, while 6-dimensional NMDS produces a stress
of 0.056%, which is substantially better. The 6-dimensional NMDS, produced by
the isoMDS command in the MASS package of R and mapped onto 2 dimensions,
is shown in Figure 2.
[Figure 2. Left panel: NMDS map with groups labeled Sumba, Tetum, Kemak, and Atoni, and with Rindi and Kolhua marked separately. Right panel: NMDS distances plotted against the original distance matrix.]
Figure 2: On the left is the 6-dimensional NMDS map. It discerns all four
major language groups (Tetum, Kemak, Atoni, and Sumba), and additionally
recognizes that Rindi and Kolhua are outliers in the Sumba and Atoni groups,
respectively. On the right is the comparison with the original distance matrix,
which is monotonically increasing.
The map that is produced by the NMDS did a significantly better job at
discerning languages than the CMDS method. Not only did it resolve that
Tetum, Kemak, Atoni, and Sumba were separate groups, but it also indicated
that two languages, Rindi and Kolhua, were outliers within their respective
groups of Sumba and Atoni. Similarly, when compared with the original distance
matrix, the points arrange in a monotonically increasing fashion, which is what
would be expected.
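For reference, the stress criterion used by the NMDS, the sum of squared differences between disparities and map distances divided by the sum of squared map distances, can be computed as in the sketch below. The values are hypothetical, the function name is an assumption, and the ratio form follows the formula as stated in this paper; Kruskal's stress-1, which to the best of our understanding is what isoMDS reports as a percentage, additionally takes the square root of this ratio.

```python
def stress(disparities, map_distances):
    """Ratio form of stress over all language pairs i < j."""
    numerator = sum((t - d) ** 2 for t, d in zip(disparities, map_distances))
    denominator = sum(d ** 2 for d in map_distances)
    return numerator / denominator

# Hypothetical disparities and map distances for three language pairs.
disparities = [1.0, 2.0, 3.0]
map_distances = [1.0, 2.0, 4.0]
value = stress(disparities, map_distances)

# A map that reproduces the disparities exactly has zero stress.
perfect = stress(disparities, disparities)
```

Adding dimensions gives the map more freedom to match the disparities, which is why the 6-dimensional solution reaches a much lower stress than the 2-dimensional one.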
Unfortunately, the ordinal factor that made the NMDS so successful also
makes the results less useful. All of the languages are confined to a small number
of points, because the NMDS completely ignores the Euclidean distances. All
the languages are either extremely far from each other or extremely close, so
although we learned that Rindi and Kolhua are unique, we also lost a great deal
of information on the individual languages that are not so unique.
Ultimately, the map-like topology does not seem to be an accurate model
for the Timoric languages. Such a model would imply that languages are
similar mainly because geographic proximity leads to mixing.
Because this does not seem to be the case, the next step is to see if the data
conform to a tree-like topology.
Neighbor-Joining Tree
The most common method for producing a tree from a distance matrix is the
neighbor-joining algorithm. Essentially, the algorithm chooses the two languages that are close to each other but far, on average, from all the other languages, and then joins them with a node, the position
of which is determined from the two languages' distances to each other and to all
the other languages. The two languages are then dropped and replaced by the
node in the distance matrix. The algorithm then repeats itself until the distance
matrix is reduced to three nodes, at which point the remaining nodes are
connected by branches (Saitou and Nei, 1987).
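The procedure described above can be sketched in plain Python as follows. This is an illustrative implementation of the standard neighbor-joining recurrences, not the bionj routine used in this paper (BIONJ is a refinement of neighbor-joining); the five taxa, their distances, and the node naming scheme are hypothetical.

```python
def neighbor_joining(names, dist):
    """Return (child, parent, branch_length) edges of an unrooted NJ tree."""
    d = {a: dict(dist[a]) for a in names}   # working copy of the distances
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 3:
        n = len(nodes)
        # Total distance from each node to all the others.
        total = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Pick the pair minimizing the Q criterion (close together, far from the rest).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = "node%d" % next_id
        next_id += 1
        # Branch lengths from the joined pair to the new node.
        len_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))
        # Distances from the new node to every remaining node.
        d[u] = {}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    # Connect the last three nodes to a central node.
    a, b, c = nodes
    u = "node%d" % next_id
    edges.append((a, u, 0.5 * (d[a][b] + d[a][c] - d[b][c])))
    edges.append((b, u, 0.5 * (d[a][b] + d[b][c] - d[a][c])))
    edges.append((c, u, 0.5 * (d[a][c] + d[b][c] - d[a][b])))
    return edges

# Hypothetical five-taxon distance matrix chosen so the arithmetic is easy to check.
names = ["a", "b", "c", "d", "e"]
rows = [[0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0]]
dist = {p: {q: rows[i][j] for j, q in enumerate(names)}
        for i, p in enumerate(names)}
edges = neighbor_joining(names, dist)
```

Note that the pair to join is chosen with the Q criterion rather than the raw minimum distance, which keeps a language with uniformly short distances from being joined prematurely.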
[Figure 3. Left panel: neighbor-joining tree with tips labeled by language sample codes and a legend listing Tetum, Kemak, Atoni, and Sumba. Right panel: scatter plot with axes labeled dist2mod and treemat.]
Figure 3: On the left is the neighbor-joining tree. On the right, the tree’s
impressive monotonic correlation with the distance matrix that produced it can
be seen.
As seen in Figure 3, the tree produced by the bionj command in R's ape package
seems to display the languages rather accurately. All four of the major
language groups are accounted for, and it can also be seen that the Rindi and
Kolhua languages are outliers within their own groups, which conforms with
the information gleaned from the NMDS model. Furthermore, when the tree
is compared with the distance matrix, it produces an impressive approximation
of a straight line. It is rather clear that the tree-like topology, which views the
languages in a one-dimensional, hierarchical manner, is superior to the map-like
topology for the Timoric languages. The tree differs greatly, however, from the
taxonomy proposed by Geoffrey Hull. Hull classified the Tetum and Atoni languages into one group and placed the Kemak group in an entirely separate
one. The neighbor-joining tree, however, indicates that Kemak is closely
related to Tetum, closer even than Tetum is to Atoni.
Bayesian Markov Chain Monte Carlo Simulation
Although the neighbor-joining tree proved to be quite successful, the distance
matrix data that it used is fundamentally flawed itself, due to its assumption
that the cognate replacement rate is the same for all cognates. The MCMC
method can make up for this weakness. Using Bayes’ Theorem, one can determine the probability of a certain tree being true given the data using:
$$p(\tau_i \mid D) = \frac{p(D \mid \tau_i)\,p(\tau_i)}{\sum_{j} p(D \mid \tau_j)\,p(\tau_j)}$$
where $p(\tau_i \mid D)$ is the posterior probability of tree $i$ given the data, $p(D \mid \tau_i)$ is the
probability of the data given tree $i$, and $p(\tau_i)$ is the prior probability of tree
$i$. Unfortunately, this equation is extremely difficult to solve, because it requires
summing over all possible trees. On the other hand, it is easy to determine the
probability of one tree relative to another given the data, because the sum cancels:
$$r = \frac{p(\tau_i \mid D)}{p(\tau_j \mid D)} = \frac{p(D \mid \tau_i)\,p(\tau_i)}{p(D \mid \tau_j)\,p(\tau_j)}$$
MCMC takes advantage of this fact with an algorithm that constructs a Markov chain whose
time-averaged distribution is equal to $p(\tau_i \mid D)$ (Pagel
et al., 2004). After running the MCMC for 40 million iterations with a burn-in
of 2 million, 751 trees were sampled. The resulting consensus tree can be seen
in Figure 4.
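The idea that only the ratio r is needed can be illustrated with a toy Metropolis sampler. The sketch below is not the phylogenetic MCMC used for the Timoric data: the three "trees" are just labels, their unnormalized weights (standing in for likelihood times prior) are invented, the function name is hypothetical, and no burn-in is discarded.

```python
import random

def metropolis(weights, steps, seed=0):
    """Sample states in proportion to unnormalized weights using only ratios."""
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    visits = {s: 0 for s in states}
    for _ in range(steps):
        # Symmetric proposal: pick one of the other states uniformly.
        proposal = rng.choice([s for s in states if s != current])
        r = weights[proposal] / weights[current]
        # Accept with probability min(1, r); the normalizing sum is never computed.
        if rng.random() < r:
            current = proposal
        visits[current] += 1
    return {s: visits[s] / steps for s in states}

# Invented unnormalized weights, for illustration only.
weights = {"tree_a": 1.0, "tree_b": 2.0, "tree_c": 3.0}
freqs = metropolis(weights, steps=100_000)
```

Over a long run the visit frequencies approach the normalized weights (here 1/6, 1/3, and 1/2), even though the sampler never evaluates the sum in the denominator of Bayes' Theorem.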
[Figure 4. MCMC consensus tree with a legend listing Tetum, Kemak, Atoni, and Sumba; the node labels range from 22 to 100.]
Figure 4: The Markov Chain Monte Carlo tree. To the right of each node is its
posterior probability.
The MCMC tree clearly distinguishes between the four language groups
and also suggests that the Atoni group can be split into three subgroups, including the standalone Kolhua language. It is also in agreement with the neighbor-joining tree on the relation of the four groups, with Tetum being more closely
related to Kemak than to Atoni.
Simple comparison shows that the topology of the MCMC tree resembles
that of the neighbor-joining tree quite closely (Figure 5). Using the PH85
method in the dist.topo command in R's ape package, which counts the
internal branches that define different groupings in the two trees, it is seen that the distance from the MCMC tree
to the neighbor-joining tree is 33, while the distance in the other direction is
only 23. On the other hand, the average distance between two random trees is
roughly 43, confirming that the MCMC tree and the neighbor-joining tree are
indeed similar.
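A topological distance of this kind can be sketched in Python by collecting the bipartitions (splits) that each internal branch induces on the tips and counting the splits found in only one of the two trees. The nested-tuple tree encoding, the function names, and the five-tip example trees are hypothetical, and whether this count matches the exact convention of dist.topo is an assumption.

```python
def leaves(tree):
    """Set of tip labels in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree, taxa):
    """Non-trivial bipartitions of the tips, one per internal branch."""
    ref = min(taxa)          # fixed reference tip used to normalize each split
    found = set()
    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        # Store the side of the split that does not contain the reference tip.
        side = clade if ref not in clade else taxa - clade
        if 1 < len(side) < len(taxa) - 1:
            found.add(side)
        return clade
    walk(tree)
    return found

def split_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    taxa = leaves(tree_a)
    if taxa != leaves(tree_b):
        raise ValueError("trees must share the same tips")
    return len(splits(tree_a, taxa) ^ splits(tree_b, taxa))

# Two hypothetical five-tip trees that differ in one grouping.
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
distance = split_distance(tree_1, tree_2)
```

Identical topologies give a distance of zero, and each grouping that appears in only one tree adds one to the count.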
[Figure 5. Two trees, titled Bionj Tree and MCMC Tree, with tips labeled by language sample codes and a legend listing Tetum, Kemak, Atoni, and Sumba.]
Figure 5: A comparison between the MCMC tree and the neighbor-joining tree.
When the branch lengths of the two trees are compared, however, the trees
differ significantly. This is not unexpected, because the
$P = e^{-d}$ model behind the distance matrix does not take into account that different cognates
have different replacement rates. The MCMC method is generally accepted as
more accurate.
Sound Correspondence Analysis
Despite the successes of methods using cognate analysis, the information lost
through replacing phoneme information with simple cognate codes is noteworthy. In order to improve the Timoric phylogenies, therefore, it is necessary
to analyze the sound correspondences. To do this, one must first align the
phonemes. A program written by Daniel Hruschka and Tanmoy Bhattacharya
that has only previously been used on the Turkic languages was used to this
end. Comparing the probabilities of phonemes in words that are cognates, the
program produces an alignment, as seen in Figure 6.
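To make the notion of alignment concrete, the sketch below aligns two words with the Needleman-Wunsch global alignment algorithm. This is a substituted, simpler technique and not the Hruschka and Bhattacharya program, which scores phoneme pairs probabilistically; the fixed match, mismatch, and gap scores are assumptions chosen only for illustration, and plain letters stand in for IPA phonemes.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Fill the dynamic-programming table.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

# The hound/hund pair discussed earlier in the paper.
aligned_a, aligned_b, total = align("hound", "hund")
```

Once words are aligned column by column in this way, the phonemes that occupy the same column across languages are the candidates for regular sound correspondences.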
[Figure 6. Table aligning the phonemes of one cognate set across the language samples L136, L164, L800, and L1241 in nine numbered columns.]
Figure 6: An example of words from multiple languages aligned by the program.
Following the alignment, the program can be used to infer regular and irregular sound changes. Because of a bug in the code, this step has not yet been
completed. Once the sound changes have been found, the MCMC model can be
used to produce a more accurate phylogeny.
Conclusion
The island of Timor displays an incredible degree of linguistic diversity that is
challenging to interpret and classify using conventional linguistic methods. Statistical analysis can be of great value in understanding the relationships between
these languages. Certain methods work better than others: the tree-like topology proved superior to the map-like topology, and the MCMC simulation superior to the neighbor-joining method. There is good reason to believe that our method of sound correspondence analysis, using aligned
phonemes, will prove to be an even better means of determining linguistic relationships and divergence.
Interestingly, the results thus far disagree with Geoffrey Hull’s classification
of the Timoric languages into the Fabronic and Ramelaic groups. Rather, both
the tree and map methods indicated that the Fabronic Tetum languages were
closer to the Ramelaic Kemak languages than they were to other Fabronic languages, such as the Atoni group. Furthermore, the trees both indicate that the
Atoni languages were about as similar to the languages of the nearby island of
Sumba as they were to the Tetum and Kemak groups.
Such observations lend valuable information to understanding the migrations
that have resulted in the modern distribution of peoples and languages on Timor.
Archaeological and cultural data will hopefully be used in conjunction with this
linguistic data in the near future in order to form a complete picture of the
Austronesian migrations to Timor.
References
Bellwood, P., Fox, J., and Tryon, D., editors (1995). The
Austronesians: Historical and Comparative Perspectives. Comparative Austronesian Series. Department of Anthropology as part of the Comparative
Austronesian Project, Research School of Pacific and Asian Studies, Australian National University.
Dyen, I., editor (1973). Lexicostatistics in Genetic Linguistics: Proceedings of the Yale Conference, Yale University, April 3-4, 1971. Janua
Linguarum. Mouton.
Dyen, I., James, A. T., and Cole, J. W. L. (1967). Language divergence
and estimated word retention rate. Language, 43(1):150–171.
Hull, G. (2004). The languages of East Timor: Some basic facts. The National
Linguistic Institute of the National University of East Timor.
Pagel, M., Atkinson, Q. D., and Meade, A. (2007). Frequency of word-use
predicts rates of lexical evolution throughout Indo-European history. Nature,
449(7163):717–720.
Pagel, M., Meade, A., and Barker, D. (2004). Bayesian estimation of ancestral
character states on phylogenies. Systematic Biology, 53(5):673–684.
Saitou, N. and Nei, M. (1987). The neighbor-joining method: a new method for
reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425.
Timor, E. (2010). A brief history of Timor-Leste. East Timor Presidential Website: http://www.presidencia.tl/eng/files/Presskit-Brief
Wickelmaier, F. (2003). An introduction to MDS.