Investigating the potential of ancestral state
reconstruction algorithms in historical linguistics
Gerhard Jäger & Johann-Mattis List
Tübingen University & CRLAO / Team AIRE, Paris
Capturing Phylogenetic Algorithms for Linguistics, Leiden
October 28, 2015
Jäger & List (Tübingen/Paris)
Ancestral state reconstruction
Leiden
1 / 42
Introduction
What is Ancestral State Reconstruction?
While tree-building methods seek to find branching diagrams which
explain how a language family has evolved, ASR methods use the
branching diagrams in order to explain what has evolved concretely.
Ancestral state reconstruction is very common in evolutionary biology
but only sporadically practiced in computational historical linguistics
(Bouchard-Côté et al. 2013).
In classical historical linguistics, on the other hand, linguistic
reconstruction of proto-forms and proto-meanings is very common and
one of the main goals of the classical comparative method (Fox 1995).
Introduction
ASR of Lexical Replacement Patterns
If we look for words corresponding to one meaning in a wordlist and
know which of the words are cognate or not, we may ask which of the
word forms was the most likely candidate to be used in the
proto-language of all descendant languages.
This question resembles the task of “semantic reconstruction”, but in
contrast to classical semantic reconstruction, we are only operating
within one concept slot here, disregarding all words with a different
meaning which may also be cognate with the words in our sample.
As a result of this restriction, it is quite likely that we cannot recover
the original form from our data.
It is nevertheless interesting to see to what degree we can propose
a good candidate word form (cognate set) for the proto-language.
Introduction
ASR of Lexical Replacement Patterns
[Figure, built up in steps: a tree whose six leaves are the words Kopf, kop,
head, tête, testa, and cap, each glossed "head". The internal nodes are first
marked "?" and then filled in step by step with the forms *kop and testa,
followed by *haubud, caput, and *kaput, all glossed "head".]
Introduction
This talk
reconstruction of cognate class at the root
[Figure: a tree whose leaves carry the cognate classes A, A, B, C, C, B. The
root is first marked "?" and then labelled with the reconstructed class B.]
Materials and Methods
Materials
Data
IELex
  153 Indo-European doculects
  207 concepts
  entries for Proto-Indo-European for 135 concepts → used as gold standard
  arbitrarily split into training set and test set:
    training set: 67 concepts, 1127 cognate classes (83 occur in PIE)
    test set: 68 concepts, 957 cognate classes (79 occur in PIE)
ABVD
  743 Austronesian doculects → 100 were selected at random
  210 concepts; for 154 of them entries for Proto-Austronesian
  split into training set and test set:
    training set: 81 concepts, 1695 cognate classes (88 occur in PAn)
    test set: 74 concepts, 1584 cognate classes (79 occur in PAn)
Materials and Methods
Methods
Prerequisites: Trees
trees were inferred with full data set (training + test data) via Bayesian inference
IELex outgroup: Anatolian
ABVD outgroup: Malayo-Polynesian
random samples of 1000 trees from posterior distributions
maximum clade credibility trees
[Figures: trees of the ABVD doculects and of the IELex doculects; tip labels omitted.]
Materials and Methods
Methods
Phylogenetic uncertainty
proper way to deal with it: work with posterior sample rather than with a single tree
poor man’s method:
  remove all short branches (shorter than some threshold)
  do ASR with resulting multifurcating tree
[Figure: tree of the IELex doculects; tip labels omitted.]
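The "poor man's method" above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Node` class and the function name are hypothetical, only internal branches are removed (leaf branches are kept), and the removed length is added to the daughters' branches, a choice the slides do not specify.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tree node; `length` is the length of the branch leading to it."""
    name: str = ""
    length: float = 0.0
    children: List["Node"] = field(default_factory=list)


def collapse_short_branches(node: Node, threshold: float) -> Node:
    """Remove internal branches shorter than `threshold`, in place.

    The daughters of a removed node are attached to its mother, and the
    removed branch length is added to theirs (an illustrative choice).
    Branches leading to leaves are never removed.
    """
    kept: List[Node] = []
    for child in node.children:
        collapse_short_branches(child, threshold)  # handle the subtree first
        if child.children and child.length < threshold:
            for grandchild in child.children:
                grandchild.length += child.length
                kept.append(grandchild)
        else:
            kept.append(child)
    node.children = kept
    return node


# Toy example: the branch above (a, b) is shorter than the threshold.
tree = Node("root", 0.0, [
    Node("", 0.01, [Node("a", 1.0), Node("b", 1.0)]),
    Node("c", 1.0),
])
collapse_short_branches(tree, threshold=0.05)
print([child.name for child in tree.children])
```

After the call, the root of the toy tree has three daughters, so the resulting tree is multifurcating and can be passed to any ASR routine that accepts polytomies.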
Materials and Methods
Methods
Coding
[Figure: the same tree coded in two ways. Multi-state: each leaf carries its
cognate class (A, B, or C). Binarized: one presence/absence character per
cognate class (A vs. non-A, B vs. non-B, C vs. non-C).]
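The relation between the two codings can be sketched as follows. The function name, the doculect labels L1 to L6, and the data layout (a mapping from doculect to the set of attested cognate classes) are assumptions made for illustration only.

```python
def binarize(multistate):
    """Turn one multi-state character into presence/absence characters.

    `multistate` maps each doculect to the set of cognate classes attested
    for the concept; the result maps each class to a 0/1 value per doculect.
    """
    classes = sorted(set().union(*multistate.values()))
    return {
        cls: {lang: int(cls in states) for lang, states in multistate.items()}
        for cls in classes
    }


# Hypothetical toy data: six leaves carrying the classes A, A, B, C, C, B.
head = {"L1": {"A"}, "L2": {"A"}, "L3": {"B"},
        "L4": {"C"}, "L5": {"C"}, "L6": {"B"}}
binary = binarize(head)
print(binary["A"])
```

Because each doculect maps to a set, a doculect with two synonyms simply receives a 1 in two of the binary characters, which is why polymorphism is unproblematic under binary coding.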
Materials and Methods
Methods
Polymorphisms (a.k.a. synonyms)
problem for multistate coding
possible representations:
  epistemic: both observations have 50% (subjective) probability
  lifted model: states in the technical sense are sets of cognate classes
[Figure: tree with the leaf words Kopf, Haupt, kop, hoofd, head, tête, testa,
and cap, each glossed "head"; Kopf and Haupt, and kop and hoofd, are synonym
pairs.]
Materials and Methods
Methods
Parsimony reconstruction
[Figure: the tree with leaves A, A, B, C, C, B, shown with three alternative
labellings of the internal nodes.]
root state    parsimony score
B             2
A             3
C             3
Materials and Methods
Methods
Weighted parsimony reconstruction
Weight matrix
     A  B  C
A    0  1  2
B    1  0  2
C    2  2  0
[Figure: the tree with leaves A, A, B, C, C, B, shown with three alternative
labellings of the internal nodes.]
root state    weighted parsimony score
B             3
A             4
C             5
Materials and Methods
Methods
Dynamic Programming (Sankoff Algorithm)
\[ wp(\text{mother}, s) = \sum_{d \in \text{daughters}} \min_{s' \in \text{states}} \bigl( w(s, s') + wp(d, s') \bigr) \]
[Figure, shown in four steps: the tree with leaves A, A, B, C, C, B.]
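The recursion above can be written as a short recursive function. The sketch below is illustrative, not the authors' implementation. The topology (((A, A), B), ((C, C), B)) is an assumption: it is one tree over the leaves A, A, B, C, C, B that reproduces the scores shown on the parsimony slides. The weight matrix is the one from the weighted-parsimony slide.

```python
STATES = ["A", "B", "C"]

# Weight matrix from the weighted-parsimony slide: W[s][t] = cost of s -> t.
W = {
    "A": {"A": 0, "B": 1, "C": 2},
    "B": {"A": 1, "B": 0, "C": 2},
    "C": {"A": 2, "B": 2, "C": 0},
}
# Unit weights: every change costs 1 (classical parsimony).
UNIT = {s: {t: int(s != t) for t in STATES} for s in STATES}

# Assumed topology: leaves are strings, internal nodes are tuples of daughters.
TREE = ((("A", "A"), "B"), (("C", "C"), "B"))


def sankoff(node, weights):
    """Return {state: weighted parsimony score of the subtree below `node`}."""
    if isinstance(node, str):
        # Leaf: the observed state costs 0, all other states are excluded.
        return {s: 0 if s == node else float("inf") for s in STATES}
    daughter_scores = [sankoff(d, weights) for d in node]
    return {
        s: sum(min(weights[s][t] + wp[t] for t in STATES) for wp in daughter_scores)
        for s in STATES
    }


print(sankoff(TREE, UNIT))  # {'A': 3, 'B': 2, 'C': 3}
print(sankoff(TREE, W))     # {'A': 4, 'B': 3, 'C': 5}
```

In both settings state B has the lowest score at the root, so B is the reconstructed cognate class.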
Materials and Methods
Methods
Weighted Parsimony reconstruction
the state with the lowest parsimony score wins
in case of ties, frequency at the leaves is tie-breaker
binary characters: w(0 → 1) = 1; w(1 → 0) = 2
multi-state characters: all weights = 1
polymorphism only admitted at tips:
  w(a → {a, b}) = 0
  w(a → {b, c}) = 1
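Two of the conventions above, the cost of a polymorphic tip and the tie-breaking rule, can be sketched as small helper functions. Both function names and the toy inputs are hypothetical.

```python
from collections import Counter


def tip_cost(state, observed):
    """Cost of assigning `state` to a tip whose observed classes are `observed`."""
    return 0 if state in observed else 1


def pick_root_state(root_scores, leaf_states):
    """Lowest score wins; ties go to the state that is most frequent at the leaves."""
    freq = Counter(s for observed in leaf_states for s in observed)
    best = min(root_scores.values())
    tied = [s for s, score in root_scores.items() if score == best]
    return max(tied, key=lambda s: freq[s])


print(tip_cost("a", {"a", "b"}), tip_cost("a", {"b", "c"}))  # 0 1
print(pick_root_state({"A": 2, "B": 2, "C": 3},
                      [{"A"}, {"A"}, {"A", "B"}, {"C"}]))     # A
```

If two tied states are also equally frequent at the leaves, this sketch returns whichever comes first in `root_scores`; the slides do not say how such cases are resolved.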
Materials and Methods
Methods
The MLN Method for ASR
The MLN method (List et al. 2014a) uses parsimony for ancestral
state reconstruction.
In contrast to classical parsimony, MLN tests different weighting
schemes for gains and losses and selects the optimal scheme with the
help of the vocabulary size criterion.
The vocabulary size criterion states that the number of synonyms per
word should be similar in the ancestral and the descendant languages.
Materials and Methods
Methods
The MLN Method for ASR
[Figure, shown in three steps with the captions "Too many synonyms in
ancestral nodes!", "Too few synonyms in ancestral nodes!", and "Optimal
amount of synonyms in ancestral nodes!".]
The vocabulary size criterion states that the number of synonyms per word
(here reflected by the size of the nodes in the tree) should be similar across
ancestral and descendant languages. With the help of this criterion, an optimal
weighting scheme for gain-loss rates is chosen for individual datasets.
Materials and Methods
Methods
Reconstruction on a posterior sample
if a sample of trees is used: a state is reconstructed if it is
reconstructed in more than a proportion θ of the trees in the sample.
θ is estimated using the training set.
values:
database   method               θ
IELex      Sankoff/binary       0.690
           Sankoff/multistate   0.056
           MLN                  0.464
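The thresholding rule can be sketched as a vote over the trees in the sample. The function name and the five-tree toy sample are hypothetical; the two θ values are those reported above for Sankoff/binary and Sankoff/multistate.

```python
from collections import Counter


def reconstruct_on_sample(per_tree_states, theta):
    """Return the states reconstructed in more than a fraction `theta` of the trees.

    `per_tree_states` holds, for each tree in the sample, the set of states
    that the chosen ASR method reconstructs at the root of that tree.
    """
    counts = Counter(s for states in per_tree_states for s in set(states))
    n_trees = len(per_tree_states)
    return {s for s, c in counts.items() if c / n_trees > theta}


# Hypothetical sample of five trees for one concept: A in 4/5, B in 2/5.
sample = [{"A"}, {"A"}, {"A", "B"}, {"B"}, {"A"}]
print(sorted(reconstruct_on_sample(sample, theta=0.690)))  # ['A']
print(sorted(reconstruct_on_sample(sample, theta=0.056)))  # ['A', 'B']
```

A high θ keeps only states supported by most trees, while a low θ admits more candidates per concept.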
Materials and Methods
Methods
Likelihood-based reconstruction
\[ \log L(\text{tips below} \mid \text{mother} = s) = \sum_{d \in \text{daughters}} \log \sum_{s' \in \text{states}} P(s \to s' \mid \text{branch length}) \, L(\text{tips below } d \mid d = s') \]
[Figure: the tree with leaves A, A, B, C, C, B.]
Materials and Methods
Methods
Likelihood-based reconstruction
note: likelihoods (unlike parsimony scores) depend on branch lengths!
likelihoods at the root give likelihood of a reconstruction, given all
observed data (for that character)
total likelihood is obtained by multiplying root state likelihoods with
equilibrium probabilities given a rate matrix
rate matrix is optimized to maximize likelihood
rates across characters are independently optimized
for multistate characters, all rates are constrained to be equal
(otherwise BayesTraits crashes…)
using equilibrium probabilities, you can derive expected state
probabilities for root states
a state is likelihood-reconstructed if its expected probability > θ2
again, threshold θ2 must be estimated from training set
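The likelihood computation and the root-state probabilities described above can be sketched for the simplest case, a multistate character with all rates equal. This is an illustration, not the BayesTraits procedure used in the talk: the rate is fixed at an arbitrary value instead of being optimized, all branch lengths are set to 1.0, the topology is assumed, and plain probabilities are used instead of logs because the tree is tiny.

```python
import math

STATES = ["A", "B", "C"]
K = len(STATES)


def transition_prob(s, t, rate, length):
    """P(s -> t | branch length) under an equal-rates model with K states."""
    decay = math.exp(-K * rate * length)
    return 1 / K + (1 - 1 / K) * decay if s == t else 1 / K - decay / K


def likelihoods(node, rate):
    """Return {s: L(tips below node | node = s)}.

    A leaf is a state name; an internal node is a list of
    (daughter, branch_length) pairs.
    """
    if isinstance(node, str):
        return {s: 1.0 if s == node else 0.0 for s in STATES}
    result = {s: 1.0 for s in STATES}
    for daughter, length in node:
        below = likelihoods(daughter, rate)
        for s in STATES:
            result[s] *= sum(transition_prob(s, t, rate, length) * below[t]
                             for t in STATES)
    return result


def root_probabilities(tree, rate):
    """Weight root likelihoods by the (uniform) equilibrium and normalise."""
    root = likelihoods(tree, rate)
    weighted = {s: root[s] / K for s in STATES}
    total = sum(weighted.values())
    return {s: w / total for s, w in weighted.items()}


# Assumed toy tree with all branch lengths 1.0 and an arbitrary rate of 0.1.
aa = [("A", 1.0), ("A", 1.0)]
cc = [("C", 1.0), ("C", 1.0)]
tree = [([(aa, 1.0), ("B", 1.0)], 1.0), ([(cc, 1.0), ("B", 1.0)], 1.0)]
probs = root_probabilities(tree, rate=0.1)
print(max(probs, key=probs.get))  # B
```

A state would then count as likelihood-reconstructed if its entry in `probs` exceeds the threshold θ2 estimated from the training set.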
Results
General Results
Evaluation
[Figure: bar plots of precision, recall, and F-score (y-axis from 0.0 to 0.8),
one panel each for algorithm (ML, MLN, Sankoff), database (ABVD, IELex),
character type (binary valued, multi-valued), tree type (bifurcating,
multifurcating), and tree sample (posterior sample, summary tree).]
Results
General Results
Evaluation
IELex
algorithm  characters  furcating       treeSample        precision  recall  F-score
ML         binary      bifurcating     summary tree      0.817      0.734   0.773
ML         binary      bifurcating     posterior sample  0.795      0.734   0.763
ML         binary      multifurcating  summary tree      0.792      0.722   0.755
ML         binary      multifurcating  posterior sample  0.756      0.747   0.752
Sankoff    binary      multifurcating  summary tree      0.716      0.734   0.725
Sankoff    binary      bifurcating     summary tree      0.704      0.722   0.712
Sankoff    binary      multifurcating  posterior sample  0.720      0.684   0.701
Sankoff    binary      bifurcating     posterior sample  0.720      0.684   0.701
ML         multi       bifurcating     posterior sample  0.642      0.772   0.701
MLN        multi       bifurcating     posterior sample  0.743      0.658   0.698
MLN        binary      multifurcating  posterior sample  0.743      0.658   0.698
MLN        binary      bifurcating     posterior sample  0.743      0.658   0.698
Sankoff    multi       bifurcating     summary tree      0.671      0.722   0.695
Sankoff    multi       multifurcating  posterior sample  0.671      0.722   0.695
Sankoff    multi       bifurcating     posterior sample  0.671      0.722   0.695
ML         multi       multifurcating  posterior sample  0.629      0.772   0.693
MLN        multi       multifurcating  posterior sample  0.758      0.633   0.690
Sankoff    multi       multifurcating  summary tree      0.735      0.633   0.680
ML         multi       multifurcating  summary tree      0.735      0.633   0.680
ML         multi       bifurcating     summary tree      0.721      0.620   0.667
MLN        multi       multifurcating  summary tree      0.584      0.658   0.619
MLN        binary      multifurcating  summary tree      0.584      0.658   0.619
MLN        multi       bifurcating     summary tree      0.742      0.291   0.418
MLN        binary      bifurcating     summary tree      0.742      0.291   0.418
Results
General Results
Evaluation
ABVD
algorithm  characters  furcating       treeSample        precision  recall  F-score
ML         multi       bifurcating     posterior sample  0.738      0.747   0.742
ML         binary      bifurcating     posterior sample  0.682      0.759   0.719
ML         multi       bifurcating     summary tree      0.740      0.684   0.711
ML         binary      bifurcating     summary tree      0.757      0.681   0.711
Sankoff    multi       bifurcating     summary tree      0.691      0.709   0.700
Sankoff    binary      multifurcating  posterior sample  0.781      0.633   0.699
ML         binary      multifurcating  posterior sample  0.761      0.646   0.699
ML         multi       multifurcating  summary tree      0.726      0.671   0.697
Sankoff    binary      bifurcating     posterior sample  0.726      0.671   0.697
ML         binary      multifurcating  summary tree      0.732      0.658   0.693
Sankoff    multi       multifurcating  summary tree      0.679      0.696   0.688
MLN        multi       bifurcating     summary tree      0.655      0.722   0.687
MLN        binary      bifurcating     summary tree      0.655      0.722   0.687
Sankoff    binary      bifurcating     summary tree      0.629      0.557   0.591
Sankoff    multi       multifurcating  posterior sample  0.542      0.570   0.556
Sankoff    multi       bifurcating     posterior sample  0.542      0.570   0.556
MLN        multi       multifurcating  posterior sample  0.414      0.848   0.556
MLN        multi       bifurcating     posterior sample  0.414      0.848   0.556
MLN        binary      multifurcating  posterior sample  0.414      0.848   0.556
MLN        binary      bifurcating     posterior sample  0.414      0.848   0.556
ML         multi       multifurcating  posterior sample  0.421      0.709   0.528
Sankoff    binary      multifurcating  summary tree      0.469      0.570   0.514
MLN        multi       multifurcating  summary tree      0.667      0.405   0.504
MLN        binary      multifurcating  summary tree      0.667      0.405   0.504
Results
Specific Results
Summary on Indo-European ASR
Error Type                 GS      ASR     Number
Missing forms              A       Ø       7
Different forms            A       B       9
Additional forms in ASR    A       A, B    5
Missing root in ASR        A, B    A       4
Summary                                    25
Results
Specific Results
Evaluating the Differences
We evaluate the differences qualitatively by checking
  the reflection of the proposed root in the branches, especially with
  semantically shifted word forms which may not occur in the wordlist
  data, using standard sources like Meier-Brügger (2002), Wodtko et al.
  (2008), Rix et al. (2002), and Pokorny (1959) for Indo-European in
  general, and specific sources like Vaan (2008) for Latin, Derksen
  (2008) and Vasmer (1986/1987) for Slavic, and Kroonen (2013) for
  Germanic,
  the likelihood of semantic shift of the given root with the help of the
  Database of Cross-Linguistic Colexifications (CLICS, List et al. 2013
  and 2014b, http://clics.lingpy.org),
  whether the cognate sets in the data are really reflexes of the
  proposed PIE root.
Based on this check, we distinguish four grades of root quality:
erroneous, problematic, possible, good.
Results
Specific Results
Indo-European ASR: Missing forms
Grades: erroneous, problematic, possible, good
Concept | Form | Meaning in Reflexes | Comment
SEE | *derḱ- | to see | Only reflected in Indo-Iranian, cognates also problematic.
SEE | *weid- | to see or to know | Safe root for Indo-European.
SING | *kan- | to sing or the rooster | Root is proposed for PIE on the basis of Germanic reflexes meaning “rooster”, which is a highly unlikely semantic change.
SMELL | *h₃ed- | to smell | Potential root for PIE, but only reflected in Greek and Romance.
SMALL | *mei- | small | Wrong cognate judgments in the database, since neither Russian malenkij nor English small goes back to this root.
THINK | *teng- | to think or to feel | Root only reflected in Germanic languages, with sporadic reflexes in semantically shifted form in other branches. A better candidate for PIE would be *men- “the mind or to think”.
WASH | *leh₂w- | to wash or to pour | Wrong cognate assignment in the source, since Romance and Albanian reflexes are not annotated.
WASH | *neigʷ- | to wash or water monster | Very unlikely PIE root, due to the extreme shift from “to wash” to “water monster” (cf. English nix) in the Germanic languages.
WET | *wed- | water or wet | Semantic change from “water” to “wet” is likely according to CLICS, but it is not clear why this should have already happened in PIE times.
Results
Specific Results
Indo-European ASR: Missing Forms in ASR
Grades: erroneous, problematic, possible, good
Concept | Form in GS | Comment
NOT | *meh₁ | This form is reflected in Ancient Greek as a prohibitive negation and also reconstructed as such. Whether it was the normal negation in PIE is less clear.
SLEEP | *drem | This form is mainly reflected in Latin and sporadically in Indic and Greek. It is much more likely that it meant something else in PIE and then shifted into this meaning.
VOMIT | *h₁rewg- | No need to reconstruct this form back to PIE, since it is only reflected in two languages of Romance.
YEAR | *ieHr- | This form has only reflexes in Germanic languages. Generally, the meaning “year” is difficult to reconstruct, due to the high potential for shift from “summer”, “winter”, “time”, etc. as shown in CLICS.
Results
Specific Results
Indo-European ASR: Different Forms
Grades: erroneous, problematic, possible, good
Concept | GS | ASR | Comment
RIVER | *h₂ekʷeh₂ | *h₂ep- | Form in GS meant “water” in PIE. Although a shift from “water” to “river” is likely according to CLICS, this meaning is an innovation in Germanic. The ASR form is reflected across multiple branches and a much better candidate.
RUB | *melh₁- | *terh₁- | Form in GS is not reflected in the standard literature (LIV and LIN); form in ASR is reflected in the meaning “to rub, to bore”.
SCRATCH | *gerbʰ- | *kes- | Form in GS is only reflected in a few Germanic languages, probably with a wrong cognate assignment. Following Derksen (2008), the ASR form is a much better candidate for the PIE word for “scratch”.
SKIN | *pel | *(s)kewH- | Form in GS is a good PIE root, but not necessarily with the meaning “skin”, as the meaning of the reflexes differs greatly. The ASR form derives from a PIE verb meaning “to cover”, but the cognate set should not contain Slavic words (Derksen 2008).
WALK | *ǵʰeh₁ | *h₁ei- | The GS form is only reflected in Germanic. The ASR form is a clear PIE root, but the meaning may also have been “to go”.
WATER | *h₂ekʷeh₂ | *wódr̥ | The ASR form is a much better candidate for “water” in PIE, due to its high number of reflexes in all branches.
WHITE | *h₂elbʰós | *h₂erǵó- | The GS form is only reflected in Romance in this meaning and as meaning “cloud” in Hittite. The ASR form is a much better candidate, with a much more plausible connection between reflexes meaning “shine” and “white”, as also confirmed by CLICS.
WORM | *wr̥mi- | *kʷr̥mis | The ASR form is reflected in more branches of PIE, while the GS form is only reflected in Germanic and Romance.
Results
Specific Results
Indo-European ASR: Additional Forms
Grades: erroneous, problematic, possible, good
Concept | Form in ASR | Comment
MOON | *lewk-s-nh₂ | This form would go back to a PIE root meaning “to shine” and is often said to have independently turned to mean “moon” in Romance and Slavic and other branches. The shift from “shine” to “moon” is however not very likely (no evidence in CLICS), so it is also possible that the word already meant “moon” in PIE as an epithet (Vaan 2008).
SNOW | *ǵʰéi-mn̥- | The form has probably independently shifted from the original meaning “frost, cold”, which is a very likely shift according to CLICS.
SUCK | *suḱ- | The root is present in this meaning in many subbranches and a good candidate for PIE in this meaning.
THIS | *so / *to | The root is a clear PIE demonstrative (Meier-Brügger 2010), but the reflexes in the daughter languages vary greatly, due to analogical levelling.
WITH | *sm̥ | A very good candidate for the meaning with reflexes in Greek, Indo-Iranian and Slavic.
Results
Specific Results
Evaluation against our manually created gold standard
precision: 0.986 (1 false positive)
recall: 0.895 (8 false negatives)
F-score: 0.938 (footnote: the IELex PIE entries have an F-score of 0.854)
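The three figures above follow from the counts of true positives, false positives, and false negatives. In the sketch below the number of true positives (68) is an assumption: it is not stated on the slide, but it is the count consistent with the reported values given 1 false positive and 8 false negatives.

```python
def precision_recall_f(tp, fp, fn):
    """Return precision, recall, and their harmonic mean (F-score)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# tp=68 is inferred from the reported scores, not taken from the slide.
p, r, f = precision_recall_f(tp=68, fp=1, fn=8)
print(f"{p:.3f} {r:.3f} {f:.3f}")  # 0.986 0.895 0.938
```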
Results
Specific Results
False positive
[Figure: tree of the IELex doculects, with tips marked for the cognate class snow:D.]
Results
Specific Results
False negatives
[Figure: tree of the languages in the sample, plotted for cognate class river:O]
Results
Specific Results
False negatives
[Figure: tree of the languages in the sample, plotted for cognate class smell:W]
Results
Specific Results
False negatives
[Figure: tree of the languages in the sample, plotted for cognate class wet:I]
Results
Specific Results
False negatives
[Figure: tree of the languages in the sample, plotted for cognate class skin:B]
Results
Specific Results
False negatives
[Figure: tree of the languages in the sample, plotted for cognate class sleep:E]
Results
Specific Results
False negatives
[Figure: tree of the languages in the sample, plotted for cognate class white:E]
Results
Specific Results
False negatives
[Figure: tree of the languages in the sample, plotted for cognate class worm:B]
Results
Specific Results
Summary on Indo-European
As the qualitative evaluation shows, the proto-forms that our best ASR
method reconstructs for PIE are mostly as good as, if not better than, the
candidates which we found in the gold standard. Given the well-known
uncertainties of semantic reconstruction in classical historical linguistics,
it seems that ASR methods could provide real help in semantic reconstruction
by supplying objective scenarios for word evolution along a given tree
which follow a specific evolutionary model.
Discussion
Benefits of ASR (?)
If the language family is well-known
ASR is of limited use in semantic reconstruction, since independent
reconstructions by the comparative method are available, but
it is quite useful for checking data quality and reference tree topology in
lexicostatistical datasets.
If the language family is less well-known
ASR is definitely useful as a preliminary analysis for semantic
reconstruction, since it gives a more objective assessment of the
consequences of a given theory of lexical replacement and external
language change (a tree topology).
Discussion
Benefits of ASR (!)
ASR may help
1. to identify loci of homoplasy, thus giving a first hint of parallel
semantic change patterns and borrowing,
2. to quantify differential rates of lexical replacement for the concepts in
a given wordlist,
3. to automatically identify sound change patterns and reconstruct
proto-forms.
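The first point can be illustrated with a parsimony count. The following is a minimal sketch, assuming Fitch parsimony on a rooted binary tree for a single binary cognate character. It is a stand-in chosen for illustration, not necessarily the ASR method used in this talk, and the tree and tip states are made up. A character that needs more than one change on the tree is a candidate locus of homoplasy.

```python
from typing import Union

Tree = Union[str, tuple]  # a leaf name, or a (left, right) pair of subtrees


def fitch(tree: Tree, tip_states: dict[str, int]) -> tuple[set[int], int]:
    """Bottom-up Fitch pass: return (candidate states at this node, changes below it)."""
    if isinstance(tree, str):
        return {tip_states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, tip_states)
    right_set, right_changes = fitch(right, tip_states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change is needed here.
        return common, left_changes + right_changes
    # The children disagree: keep both options and count one change.
    return left_set | right_set, left_changes + right_changes + 1


# Hypothetical tree ((A,B),(C,D)); 1 = cognate class present, 0 = absent.
toy_tree = (("A", "B"), ("C", "D"))
toy_states = {"A": 1, "B": 0, "C": 1, "D": 0}
root_states, n_changes = fitch(toy_tree, toy_states)
# n_changes is 2: the class must change state twice on this tree,
# which hints at homoplasy (parallel change or borrowing).
```

With the tip states `{"A": 1, "B": 1, "C": 0, "D": 0}` the same function needs only one change, the pattern expected for a class inherited without homoplasy.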
Discussion
Caveats
Our current models are still very simplistic, in so far as they
operate independently for each meaning slot,
handle only binary (yes-no) cognate relations between words.
Future research will show whether it is possible to model lexical change
across meanings and to allow for more fine-grained relations between
cognate classes.
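The two restrictions above correspond to a simple data encoding. The following is a minimal sketch, assuming a wordlist given as (language, concept, cognate class) triples: each cognate class becomes one binary presence/absence character that is analysed only within its own meaning slot. The function name, the language names, and the class assignments are hypothetical; the `concept:CLASS` labels merely mimic the style of labels such as `snow:D`.

```python
from collections import defaultdict


def binary_characters(wordlist):
    """Map 'concept:CLASS' -> {language: 0 or 1} for every cognate class."""
    present = defaultdict(set)   # (concept, class) -> languages showing that class
    attested = defaultdict(set)  # concept -> languages with any word for it
    for language, concept, cognate_class in wordlist:
        present[(concept, cognate_class)].add(language)
        attested[concept].add(language)
    characters = {}
    for concept, cognate_class in sorted(present):
        members = present[(concept, cognate_class)]
        # Only languages with a word in this meaning slot receive a yes/no value.
        characters[f"{concept}:{cognate_class}"] = {
            language: int(language in members)
            for language in sorted(attested[concept])
        }
    return characters


# Made-up toy data, not taken from IELex.
toy_wordlist = [
    ("Lang1", "snow", "A"),
    ("Lang2", "snow", "A"),
    ("Lang3", "snow", "B"),
    ("Lang1", "moon", "A"),
]
chars = binary_characters(toy_wordlist)
```

In this encoding nothing links `snow:A` to `moon:A` or to any other slot, which is exactly the independence assumption named in the first caveat.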
References
A. Bouchard-Côté, D. Hall, T. L. Griffiths, and D. Klein. Automated reconstruction of ancient
languages using probabilistic models of sound change. Proceedings of the National Academy
of Sciences of the United States of America, 110(11):4224–4229, 2013.
R. Derksen. Etymological dictionary of the Slavic inherited lexicon. Brill, Leiden and Boston,
2008.
G. Kroonen. Etymological dictionary of Proto-Germanic. Number 11 in Leiden Indo-European
Etymological Dictionary Series. Brill, Leiden and Boston, 2013.
J.-M. List, A. Terhalle, and M. Urban. Using network approaches to enhance the analysis of
cross-linguistic polysemies. In Proceedings of the 10th International Conference on
Computational Semantics – Short Papers, pages 347–353, Stroudsburg, 2013. Association for
Computational Linguistics.
J.-M. List, T. Mayer, A. Terhalle, and M. Urban. CLICS: Database of Cross-Linguistic
Colexifications. Online Resource, 2014a. URL http://clics.lingpy.org.
J.-M. List, S. Nelson-Sathi, H. Geisler, and W. Martin. Networks of lexical borrowing and lateral
gene transfer in language and genome evolution. Bioessays, 36(2):141–150, 2014b.
M. Meier-Brügger. Indogermanische Sprachwissenschaft. de Gruyter, Berlin and New York, 8th
edition, 2002.
J. Pokorny. Indogermanisches etymologisches Wörterbuch, volume 1. Francke, Bern, 1959.
M. Vaan. Etymological dictionary of Latin and the other Italic languages. Number 7 in Leiden
Indo-European Etymological Dictionary Series. Brill, Leiden and Boston, 2008.
M. Vasmer. Ėtimologičeskij slovar’ russkogo jazyka. Progress, Moscow, 1986/1987.
D. Wodtko, B. Irslinger, and C. Schneider. Nomina im Indogermanischen Lexikon. Winter,
Heidelberg, 2008.