[Map of southern Africa with country labels: Angola, Zambia, Botswana, Namibia, South Africa.]
Figure S1 - Approximate sampling locations.
Circle size is proportional to number of individuals (minimum=1, maximum=30).
[Panel A: x-axis, fraction of polymorphic sites masked (%); y-axis, fraction of sites with Ns after imputation (%); series, dataset A (the southern African data set) and dataset B (southern African sequences plus CEPH-HGDP sequences, Lippold et al. 2014).]
[Panel B: x-axis, fraction of polymorphic sites masked (%); y-axis, error rate (%); series, results from dataset A and dataset B, each with the previous method and the present method.]
Figure S2 - Imputation performance.
A Fraction of Ns left after imputation for the southern African sequences (dataset A) and the southern
African sequences plus the sequences from Lippold et al. 2014 (dataset B); B Comparison of the error
rates for the imputation method used here and the method used previously.
[Panel A: x-axis, number of missing sites prior to imputation (5%, 10%, 20%, 30%, 50%); y-axis, average number of wrongly assigned genotypes per sample; sub-panels, exponential and uniform.]
[Panel B: x-axis, number of missing sites prior to imputation (5%, 10%, 20%, 30%, 50%); y-axis, average number of lost n-tons; series, singletons, doubletons, tripletons; sub-panels, exponential and uniform.]
Figure S3 - Effect of imputation on observed genetic diversity.
A Average number of wrongly-assigned genotypes for varying amounts of missing data prior to
imputation, with sites with missing data distributed under either an exponential (left) or a uniform (right)
model. The numbers under each bar indicate the average number of wrongly-assigned genotypes and the
error bar is the 95% CI. The average number of wrongly assigned genotypes per sample was 0 for an upper
boundary ≤ 10% and increased to up to 2 (i.e. 0.1%) for higher boundaries independently of the underlying
sampling distribution.
B Number of singletons, doubletons and tripletons that were converted to invariant sites during imputation.
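The two quantities summarized in this figure can be illustrated with a minimal sketch. It assumes haploid calls coded as 0 (ancestral) and 1 (derived) in a samples-by-sites matrix with no missing data left after imputation; the function names and the toy matrices are hypothetical and are not taken from the original analysis.

```python
def derived_counts(matrix):
    """Derived-allele count per site for a samples x sites matrix of 0/1 calls."""
    return [sum(column) for column in zip(*matrix)]


def lost_ntons(before, after, max_n=3):
    """Count sites that were n-tons in the true data and invariant after imputation.

    before: true genotypes (no missing data); after: genotypes after masking
    and imputation. Returns {n: count} for n = 1..max_n.
    """
    n_samples = len(after)
    lost = {n: 0 for n in range(1, max_n + 1)}
    for b, a in zip(derived_counts(before), derived_counts(after)):
        invariant_after = a == 0 or a == n_samples
        if b in lost and invariant_after:
            lost[b] += 1
    return lost


def wrong_genotypes_per_sample(before, after):
    """Average number of calls per sample that differ between truth and imputed data."""
    mismatches = [
        sum(x != y for x, y in zip(row_b, row_a))
        for row_b, row_a in zip(before, after)
    ]
    return sum(mismatches) / len(mismatches)


# Toy example (hypothetical data): 5 samples, 4 sites.
before = [
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
after = [
    [0, 1, 0, 1],  # the singleton at site 0 was imputed away
    [0, 0, 0, 1],  # the doubleton at site 1 became a singleton (not lost)
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]

print(lost_ntons(before, after))
print(wrong_genotypes_per_sample(before, after))
```

In this toy case only the singleton at the first site becomes invariant; the doubleton that loses one carrier stays polymorphic and is therefore not counted as lost.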
[Network node labels: A00, A2, A2a, A2b, A2c, A3b1, A3b1a1, A3b1b, A3b1c, BT, B2a, B2a1a, B2b, B2b1, B2b4a, E, E1b1a, E1b1a+L485, E1b1a1, E1b1a8a, E1b1b, E2, R1b, and G,I,O,T,R1.]
Figure S4 - Network of all of the sequences.
[Network branch labels: A2, A2a, A2b, A2c, A3b1, A3b1a1, A3b1b, A3b1c. Markers shown on branches: P262, M114, M212, P28, V37, P71, P291, M51, PF1362, L439, L441, V306. Several branches are marked "previously unreported".]
Figure S5 - Network of haplogroup A2 and A3b1 sequences (zoomed in from Figure S4).
Dashed lines indicate branches shortened for graphic purposes. Mutations reported in ISOGG and covered
both by our sequences and by the additional genotyping are indicated on the corresponding branches.
[Network branch labels: B2a, B2a1a, B2b, B2b1, B2b4a. Markers shown on branches: 50f2(P), M152, P70, P6. Several branches are marked "previously unreported".]
Figure S6 - Network of haplogroup B2a and B2b sequences (zoomed in from Figure S4).
Dashed lines indicate branches shortened for graphic purposes. Mutations reported in ISOGG and covered
both by our sequences and by the additional genotyping are indicated on the corresponding branches. Note
that the branch labeled as “B2a” is not confirmed by any diagnostic SNPs from ISOGG, but is assumed as
the most plausible branch upstream from B2a1a.
[Network branch labels: E1b1a7a, E1b1a+L485, E1b1a1, E1b1a8a. Markers shown on branches: P253, U186, M191, M263.2, L485, Page27, M58, P277, P278.1, U175. One branch is marked "previously unreported".]
Figure S7 - Network of sequences from haplogroups E1b1a8a, E1b1a1, and E1b1a+L485 (zoomed in
from Figure S4).
Dashed lines indicate branches shortened for graphic purposes. Mutations reported in ISOGG and covered
both by our sequences and by the additional genotyping are indicated on the corresponding branches.
[Legend: Western Pygmies, Eastern Pygmies, Central Africa, East Africa, West Africa, Southern Africa mix, Southern Africa Bantu, Southern Africa Khoisan, Madagascar.]
Figure S8 - Median Joining Network for B2a sequences, based on 16 STR loci and including
published data (Batini et al. 2011).
The haplotypes are color coded to indicate major geographical areas and populations. The sample
“Southern Africa mix” corresponds to a mixed sample of Bantu and Khoisan speakers (Batini et al. 2011).
Haplotypes are available in Table S4.
[Tree clade labels: A (A2, A2a, A2b, A2c, A3b1, A3b1a1, A3b1b, A3b1c); B (B2a, B2a1a, B2b, B2b1, B2b4a); E (E1b1a, E1b1a+L485, E1b1a1, E1b1a8a, E1b1b, E2); G,I,O,T,R1 (G2a2b2, I1, I2, O, T, R1a, R1b). Time axis ticks run from 200000 to 0.]
Figure S9 - Bayesian tree for the 547 southern African sequences.
Haplogroups are color-coded as in Figure 1.
[Plot: x-axis, haplogroup (A2, A3b1, B2a, B2b, E1b1a+L485, E1b1a8a, E1b1b); y-axis, tMRCA [years], 0 to 150000; series, single haplogroup BEAST runs, whole data set BEAST tree, and count of mutations. Points are annotated with individual estimates ranging from 7k to 74k.]
Figure S10 - Comparison of TMRCA estimates for the major haplogroups.
Based on: a BEAST run of the entire dataset (Figure S9); a BEAST run for the specific haplogroup; and
from the count of mutations from the root.
[Panels: A2, A3b1, B2a, B2b, E1b1a8a, E1b1b, E1b1a+L485, and a summary panel with all seven haplogroups. In each panel: x-axis, time in years ago (0 to 40000); y-axis, effective size x generation time (log scale, 1.E3 to 1.E8).]
Figure S11 - BSP plots for the major haplogroups, with a relaxed exponential clock model and a
mutation rate of 0.82 × 10⁻⁹.
The thick line is the median estimate, bracketed by thin lines for the 95% HPD intervals. Bottom right, a
summary BSP with the separate haplogroup median lines.
[Panel A: fraction of sites with missing data before imputation (5%, 10%, 20%, 30%, 50%) plotted against the 95% CI interval of the deviation of the observed node heights from expected values (0.0 to 0.3); sub-panels, strict and ULN.]
[Panel B: observed trees (#1 to #10) and the expected tree for each fraction of missing data (5%, 10%, 20%, 30%, 50%); axis, mean, median and 95% HPD interval of the observed root heights compared to the expected values (4e-04 to 6e-04); sub-panels, strict and ULN.]
Figure S12 - Effect of imputation on the Bayesian phylogenetic analysis using BEAST.
A Distribution of the sum of the squared deviation of all node heights of the simulated observed trees
(based on varying amounts of missing data before imputation) compared to the expected tree (without
missing data) for strict and ULN clock models.
B The mean (point), the median (cross) and the 95% HPD interval of the observed tree root heights (blue)
of the 10 repeats per set of conditions compared to the expected tree root height (red) calculated with a
strict clock model (upper) and a ULN clock model (lower).
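The summary statistic in panel A can be sketched as follows. This is only an illustration under stated assumptions: the observed and expected trees are taken to share the same topology, so that each internal node can be matched by a common key (for example, its clade), and no normalization is applied. Whether the original analysis normalized the deviations is not stated in the caption. The function name and the toy heights are hypothetical.

```python
def sum_squared_deviation(observed_heights, expected_heights):
    """Sum over all expected nodes of (observed height - expected height)^2.

    Both arguments map a node key (e.g. a clade identifier) to a node height.
    Every node of the expected tree must be present in the observed tree.
    """
    missing = set(expected_heights) - set(observed_heights)
    if missing:
        raise ValueError(f"nodes missing from observed tree: {sorted(missing)}")
    return sum(
        (observed_heights[node] - height) ** 2
        for node, height in expected_heights.items()
    )


# Toy example (hypothetical heights in arbitrary units).
expected = {"root": 4.0, "n1": 2.0, "n2": 1.0}
observed = {"root": 4.5, "n1": 2.0, "n2": 0.5}

print(sum_squared_deviation(observed, expected))  # 0.25 + 0.0 + 0.25 = 0.5
```

Computing this value for each of the 10 repeats per condition would give the distribution that is summarized per fraction of missing data.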
[Panel A: x-axis, different subsets of the Y-chromosomal data (BEAST, total 547 samples; BEAST, 253L samples; BEAST, 253H samples; count, total 547 samples; count, 253L samples); y-axis, mean, median and 95% HPD/confidence interval of the root height [years], 100000 to 300000.]
[Panel B: x-axis, number of mutations to the A2-T node (50 to 200); y-axis, density (0.000 to 0.025).]
Figure S13 - Impact of imputation on the root height estimation in the southern African data and a
less imputed dataset.
A 95% HPD interval for the root height of BEAST MCC trees for the entire data set of 547 sequences,
the less-imputed data set of 253 sequences (253L), and a random sample of 253 sequences (253H) from
the entire data set. For the TMRCA estimates obtained by counting the mutations along the branches the
95% confidence interval is shown. The circle in each interval is the mean and the cross is the median. B
Distribution of the number of mutations to the A2-T node for the 253L (red) and 253H datasets (black).
[Panel A: fraction of sites with missing data before imputation (5%, 10%, 20%, 30%, 50%) plotted against the 95% CI interval of the deviation of the observed clock rate from expected values (0.000 to 0.020).]
[Panel B: x-axis, node height; y-axis, χ2 of the mutation rate per node from the expected mutation rate for ULN clock (0e+00 to 1e-03); sub-panels, 5%, 10%, 20%, 30%, 50%.]
Figure S14 - Impact of imputation on variation in the clock rate across the tree for simulated data.
A Distribution of the sum of the squared deviation of the clock rates of the simulated observed trees (based
on varying amounts of missing data before imputation) compared to the expected tree (without missing
data).
B Scatter plots of the squared deviation values involving clock rates plotted against node height on the
tree. Plots are shown for varying amounts of missing data prior to imputation.
[Histogram: x-axis, number of Ns per sample before imputation (0 to 1500); y-axis, frequency in the dataset (0 to 80).]
Figure S15 - Number of sites with missing information per sample before imputation.
[Panels A and B: x-axis, dataset (Total, A2, A3b1, B2a, B2b, E1b1a+L485, E1b1a1, E1b1a8a, E1b1b, E2, and G,I,O,T,R1); y-axis, number of Ns (0 to 1500 in panel A, 0 to 15 in panel B).]
Figure S16 - Impact of imputation on the major haplogroup branches.
A. Distribution of the number of sites with missing information before imputation. B. Distribution of the
number of sites with missing information after imputation.
[Panels A, B, C, D: trees with clades labeled E2, E1b1b, E1b1a, and A2; scale bar 2.0E-5 in each panel.]
Figure S17 - Impact of sequence coverage on branch heterogeneity.
Maximum clade credibility trees of BEAST runs of all lineages of haplogroup E using haplogroup A2 as
an outgroup. A. Data set based on 344 samples using 2,844 SNPs. B. Data set based on 344 samples using
only sites for which the alternative allele was supported by at least three reads in at least one sample (2,028
SNPs). C. Data set based on samples with a minimum coverage of 10x in the target region (249 samples)
and all sites (2,844 SNPs). D. Data set based on the combination of the filter criteria of B and C (249 samples;
2,028 SNPs).
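The two filter criteria described above can be sketched as follows. This is a minimal illustration, not the pipeline used for the figure: the input layout (per-sample lists of alternative-allele read counts and a per-sample coverage value), the function name, and the toy values are all hypothetical. The sketch also assumes that the site filter is evaluated across all samples rather than only those passing the coverage filter, which the caption does not specify.

```python
def filter_sites_and_samples(alt_reads, coverage, min_alt_reads=3, min_coverage=10.0):
    """Apply the site filter (criterion B) and the sample filter (criterion C).

    alt_reads: {sample: [alternative-allele read count per site]}
    coverage:  {sample: coverage in the target region}
    Returns (kept_samples, kept_site_indices).
    """
    # Criterion C: keep samples with at least min_coverage in the target region.
    kept_samples = [s for s in alt_reads if coverage[s] >= min_coverage]

    # Criterion B: keep sites where the alternative allele is supported by at
    # least min_alt_reads reads in at least one sample (assumed: any sample).
    n_sites = len(next(iter(alt_reads.values())))
    kept_sites = [
        i for i in range(n_sites)
        if any(counts[i] >= min_alt_reads for counts in alt_reads.values())
    ]
    return kept_samples, kept_sites


# Toy example (hypothetical data): 3 samples, 3 sites.
alt_reads = {"s1": [5, 1, 0], "s2": [0, 2, 0], "s3": [1, 0, 4]}
coverage = {"s1": 12.0, "s2": 8.0, "s3": 15.0}

print(filter_sites_and_samples(alt_reads, coverage))  # (['s1', 's3'], [0, 2])
```

Data set D would then correspond to the kept samples restricted to the kept sites, while B uses all samples with the kept sites and C uses the kept samples with all sites.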