Figure S1 - Approximate sampling locations. Circle size is proportional to the number of individuals (minimum=1, maximum=30).
[Map; countries labeled: Angola, Zambia, Botswana, Namibia, South Africa.]

Figure S2 - Imputation performance. A Fraction of Ns left after imputation for the southern African sequences (dataset A) and the southern African sequences plus the CEPH-HGDP sequences from Lippold et al. 2014 (dataset B); B Comparison of the error rates for the imputation method used here and the method used previously.
[Panel A: x-axis, fraction of polymorphic sites masked (%); y-axis, fraction of sites with Ns after imputation (%). Panel B: x-axis, fraction of polymorphic sites masked (%); y-axis, error rate (%); series: datasets A and B, each with the previous and the present method.]

Figure S3 - Effect of imputation on observed genetic diversity. A Average number of wrongly assigned genotypes for varying amounts of missing data prior to imputation, with sites with missing data distributed under either an exponential (left) or a uniform (right) model. The numbers under each bar indicate the average number of wrongly assigned genotypes and the error bar is the 95% CI. The average number of wrongly assigned genotypes per sample was 0 for an upper boundary ≤ 10% and increased to up to 2 (i.e. 0.1%) for higher boundaries, independently of the underlying sampling distribution. B Number of singletons, doubletons and tripletons that were converted to invariant sites during imputation.
[Panel A: x-axis, number of missing sites prior to imputation [%] (5%, 10%, 20%, 30%, 50%); y-axis, average number of wrongly assigned genotypes per sample. Panel B: same x-axis; y-axis, average number of lost n-tons; series: singletons, doubletons, tripletons.]

Figure S4 - Network of all of the sequences.
[Network; haplogroup labels include A00, A2, A2a, A2b, A2c, A3b1, A3b1a1, A3b1b, A3b1c, BT, B2a, B2a1a, B2b, B2b1, B2b4a, E, E1b1a, E1b1a1, E1b1a+L485, E1b1a8a, E1b1b, E2, R1b, and G,I,O,T,R1.]

Figure S5 - Network of haplogroup A2 and A3b1 sequences (zoomed in from Figure S4). Dashed lines indicate branches shortened for graphic purposes. Mutations reported in ISOGG and covered both by our sequences and by the additional genotyping are indicated on the corresponding branches.
[Network; haplogroup labels: A2, A2a, A2b, A2c, A3b1, A3b1a1, A3b1b, A3b1c. Mutations marked: P262, M114, M212, P28, V37, P71, P291, M51, PF1362, L439, L441, V306. Several branches are marked "previously unreported".]

Figure S6 - Network of haplogroup B2a and B2b sequences (zoomed in from Figure S4). Dashed lines indicate branches shortened for graphic purposes. Mutations reported in ISOGG and covered both by our sequences and by the additional genotyping are indicated on the corresponding branches. Note that the branch labeled as “B2a” is not confirmed by any diagnostic SNPs from ISOGG, but is assumed as the most plausible branch upstream from B2a1a.
[Network; haplogroup labels: B2a, B2a1a, B2b, B2b1, B2b4a. Mutations marked: 50f2(P), M152, P70, P6. Several branches are marked "previously unreported".]

Figure S7 - Network of sequences from haplogroups E1b1a8a, E1b1a1, and E1b1a+L485 (zoomed in from Figure S4). Dashed lines indicate branches shortened for graphic purposes. Mutations reported in ISOGG and covered both by our sequences and by the additional genotyping are indicated on the corresponding branches.
[Network; haplogroup labels: E1b1a1, E1b1a7a, E1b1a8a, E1b1a+L485. Mutations marked: P253, U186, M191, M263.2, L485, Page27, M58, P277, P278.1, U175. One branch is marked "previously unreported".]
Figure S8 - Median Joining Network for B2a sequences, based on 16 STR loci and including published data (Batini et al. 2011). The haplotypes are color coded to indicate major geographical areas and populations. The sample “Southern Africa mix” corresponds to a mixed sample of Bantu and Khoisan speakers (Batini et al. 2011). Haplotypes are available in Table S4.
[Legend: Western Pygmies, Eastern Pygmies, Central Africa, East Africa, West Africa, Southern Africa mix, Southern Africa Bantu, Southern Africa Khoisan, Madagascar.]

Figure S9 - Bayesian tree for the 547 southern African sequences. Haplogroups are color-coded as in Figure 1.
[Tree; time axis from 200000 to 0; clade labels include A (A2, A2a, A2b, A2c, A3b1, A3b1a1, A3b1b, A3b1c), B (B2a, B2a1a, B2b, B2b1, B2b4a), E (E1b1a, E1b1a1, E1b1a+L485, E1b1a8a, E1b1b, E2), and G,I,O,T,R1 (G2a2b2, I1, I2, O, T, R1a, R1b).]

Figure S10 - Comparison of TMRCA estimates for the major haplogroups. Estimates are based on: a BEAST run of the entire dataset (Figure S9); a BEAST run for the specific haplogroup; and the count of mutations from the root.
[x-axis, haplogroup (A2, A3b1, B2a, B2b, E1b1a+L485, E1b1a8a, E1b1b); y-axis, tMRCA [years]; series: single haplogroup - BEAST runs, whole data set - BEAST tree, count of mutations.]
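The third estimate in Figure S10 rests on counting mutations from the root. A minimal Python sketch of that arithmetic follows, assuming the rate of 0.82 x 10^-9 quoted for Figure S11 is per site per year. The function name, the number of sites, and the mutation counts are hypothetical placeholders, not values or code from this study.

```python
from statistics import mean


def tmrca_from_counts(mutation_counts, n_sites, rate_per_site_per_year):
    """Estimate the TMRCA in years from root-to-tip mutation counts.

    The estimate is the mean number of mutations per lineage divided by
    the expected number of mutations per year over the sequenced region.
    """
    if n_sites <= 0 or rate_per_site_per_year <= 0:
        raise ValueError("n_sites and rate_per_site_per_year must be positive")
    mutations_per_year = rate_per_site_per_year * n_sites
    return mean(mutation_counts) / mutations_per_year


# Hypothetical inputs, chosen only to illustrate the calculation.
counts = [40, 42, 38, 44]   # mutations from the root to each sampled lineage
N_SITES = 1_000_000         # assumed number of sequenced sites
RATE = 0.82e-9              # assumed to be per site per year

estimate = tmrca_from_counts(counts, N_SITES, RATE)
print(f"TMRCA estimate: {estimate:.0f} years")
```

With these made-up inputs the mean count is 41 and the estimate comes out at roughly 50,000 years. The point is the scaling: the estimate is linear in the mutation count and inversely proportional to both the rate and the number of sites.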
Figure S11 - BSP plots for the major haplogroups, with a relaxed exponential clock model and a mutation rate of 0.82 x 10^-9. The thick line is the median estimate, bracketed by thin lines for the 95% HPD intervals. Bottom right, a summary BSP with the separate haplogroup median lines.
[Panels: A2, A3b1, B2a, B2b, E1b1a8a, E1b1a+L485, E1b1b, and summary. x-axis, time in years ago (0 to 40000); y-axis, effective size x generation time (log scale, 1.E3 to 1.E8).]
Figure S12 - Effect of imputation on the Bayesian phylogenetic analysis using BEAST. A Distribution of the sum of the squared deviation of all node heights of the simulated observed trees (based on varying amounts of missing data before imputation) compared to the expected tree (without missing data) for strict and ULN clock models. B The mean (point), the median (cross) and the 95% HPD interval of the observed tree root heights (blue) of the 10 repeats per set of conditions compared to the expected tree root height (red) calculated with a strict clock model (upper) and a ULN clock model (lower).
[Panel A: x-axis, 95% CI of the deviation of the observed node heights from expected values; y-axis, fraction of sites with missing data before imputation [%] (5%, 10%, 20%, 30%, 50%); sub-panels: strict, ULN. Panel B: rows, expected tree and observed trees #1 to #10; x-axis, root height (4e-04 to 6e-04); columns, 5%, 10%, 20%, 30%, 50% missing data.]
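The statistic summarised in Figure S12A, the sum of the squared deviations of node heights between an observed tree and the expected tree, can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the two trees share a topology so that node heights can be paired one to one, and the function name and height values are hypothetical.

```python
def sum_squared_deviation(observed_heights, expected_heights):
    """Sum of squared differences between paired node heights.

    Both sequences must list the same nodes in the same order, which
    assumes the observed and expected trees have matching topologies.
    """
    if len(observed_heights) != len(expected_heights):
        raise ValueError("both trees must provide the same number of nodes")
    return sum(
        (obs - exp) ** 2
        for obs, exp in zip(observed_heights, expected_heights)
    )


# Hypothetical node heights for a three-node example.
expected = [1.0, 3.0, 2.0]
observed = [1.0, 2.0, 4.0]

deviation = sum_squared_deviation(observed, expected)
print(deviation)
```

A value of zero means the imputed data reproduce the expected node heights exactly; larger values indicate that imputation has shifted node heights away from the tree inferred without missing data.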
Figure S13 - Impact of imputation on the root height estimation in the southern African data and a less imputed dataset. A 95% HPD interval for the root height of BEAST MCC trees for the entire data set of 547 sequences, the less-imputed data set of 253 sequences (253L), and a random sample of 253 sequences (253H) from the entire data set. For the TMRCA estimates obtained by counting the mutations along the branches the 95% confidence interval is shown. The circle in each interval is the mean and the cross is the median. B Distribution of the number of mutations to the A2-T node for the 253L (red) and 253H datasets (black).
[Panel A: x-axis, different subsets of the Y-chromosomal data (total 547 samples - BEAST; 253L samples - BEAST; 253H samples - BEAST; total 547 samples - count; 253L samples - count); y-axis, mean, median and 95% HPD/confidence interval of the root height [years]. Panel B: x-axis, number of mutations to the A2-T node; y-axis, density.]

Figure S14 - Impact of imputation on variation in the clock rate across the tree for simulated data. A Distribution of the sum of the squared deviation of the clock rates of the simulated observed trees (based on varying amounts of missing data before imputation) compared to the expected tree (without missing data). B Scatter plots of the squared deviation values involving clock rates plotted against node height on the tree. Plots are shown for varying amounts of missing data prior to imputation.
[Panel A: x-axis, 95% CI of the deviation of the observed clock rate from expected values; y-axis, fraction of sites with missing data before imputation [%] (5%, 10%, 20%, 30%, 50%). Panel B: x-axis, node height; y-axis, χ2 of the mutation rate per node from the expected mutation rate for the ULN clock; one plot per fraction of missing data.]

Figure S15 - Number of sites with missing information per sample before imputation.
[Histogram: x-axis, number of Ns per sample before imputation; y-axis, frequency in the dataset.]

Figure S16 - Impact of imputation on the major haplogroup branches. A. Distribution of the number of sites with missing information before imputation. B. Distribution of the number of sites with missing information after imputation.
[Both panels: x-axis, haplogroup (A2, A3b1, B2a, B2b, E1b1a+L485, E1b1a1, E1b1a8a, E1b1b, E2, G,I,O,T,R1, total dataset); y-axis, number of Ns.]

Figure S17 - Impact of sequence coverage on branch heterogeneity. Maximum clade credibility trees of BEAST runs of all lineages of haplogroup E using haplogroup A2 as an outgroup. A. Data set based on 344 samples using 2,844 SNPs. B. Data set based on 344 samples using only sites for which the alternative allele was supported by at least three reads in at least one sample (2,028 SNPs). C. Data set based on samples with a minimum coverage of 10x in the target region (249 samples) and all sites (2,844 SNPs). D. Data set based on the combination of the filter criteria of B and C (249 samples; 2,028 SNPs).
[Each panel: clades E2, E1b1b, E1b1a and A2; scale bar 2.0E-5.]