Genetics: Early Online, published on November 9, 2016 as 10.1534/genetics.116.193243

Accuracy of genomic prediction in synthetic populations depending on the number of parents, relatedness and ancestral linkage disequilibrium

Pascal Schopp*,1, Dominik Müller*,1, Frank Technow*, Albrecht E. Melchinger*

September 29, 2016

*Institute of Plant Breeding, Seed Science and Population Genetics, University of Hohenheim, 70599 Stuttgart, Germany
1 These authors contributed equally to this work

Copyright 2016.

Running Head: Genomic prediction in synthetics

Key Words: genomic prediction, synthetic populations, GBLUP, genetic relationships, linkage disequilibrium

Corresponding Author:
A.E. Melchinger
Institute of Plant Breeding, Seed Science and Population Genetics
University of Hohenheim
Fruwirthstr. 21,
Stuttgart 70599, GERMANY
[email protected]
Tel.: 0049 0711 459-22334
Fax.: 0049 0711 459-22343

ABSTRACT

Synthetics play an important role in quantitative genetic research and plant breeding, but few studies have investigated the application of genomic prediction (GP) to these populations. Synthetics are generated by intermating a small number of parents ($N_p$) and thereby possess unique genetic properties, which make them especially suited for systematic investigations of factors contributing to the accuracy of GP. We generated synthetics in silico from $N_p = 2$ to 32 maize (Zea mays L.) lines taken from an ancestral population with either short- or long-range linkage disequilibrium (LD).
In eight scenarios differing in relatedness of the training and prediction sets and in the types of data used to calculate the relationship matrix (QTL, SNPs, tag markers, pedigree), we investigated the prediction accuracy of GBLUP and analyzed contributions from pedigree relationships captured by SNP markers as well as from co-segregation and ancestral LD between QTL and SNPs. The effects of training set size $N_{TS}$ and marker density were also studied. Sampling few parents ($2 \le N_p < 8$) generates substantial sample LD that carries over into synthetics through co-segregation of alleles at linked loci. For fixed $N_{TS}$, $N_p$ influences prediction accuracy most strongly. If the training and prediction set are related, using $N_p < 8$ parents yields high prediction accuracy regardless of ancestral LD because SNPs capture pedigree relationships and Mendelian sampling through co-segregation. As $N_p$ increases, ancestral LD contributes more information, while other factors contribute less due to lower frequencies of closely related individuals. For unrelated prediction sets, only ancestral LD contributes information and accuracies were poor and highly variable for $N_p \le 4$ due to large sample LD. For large $N_p$, achieving moderate accuracy requires large $N_{TS}$, long-range ancestral LD and high marker density. Our approach for analyzing prediction accuracy in synthetics provides new insights into the prospects of GP for many types of source populations encountered in plant breeding.

INTRODUCTION

Synthetic populations, known as synthetics, have played an important role in quantitative-genetic research on gene action in complex heterotic traits and comparison of selection methods (cf. Hallauer et al. 2010). In many crops, synthetics also serve as cultivars in agricultural production or as source population for recurrent selection programs (cf. Bradshaw 2016).
Synthetics are usually created by crossing a small number of parents ($N_p$) and subsequently cross-pollinating the F1 individuals for one or several generations (Falconer and Mackay 1996). A prominent example is the "Iowa Stiff Stalk Synthetic" (BSSS) generated from 16 parents of maize, from which numerous successful elite inbred lines such as B73 have been derived (Hagdorn et al. 2003). Further examples of synthetics include composite crosses (Suneson 1956) and multi-parent advanced generation inter-cross (MAGIC, see Table S1 for list of abbreviations) populations (Cavanagh et al. 2008) advocated for breeding purposes in crops (Bandillo et al. 2013). Importantly, two-way and four-way crosses, widely employed as source material in recycling breeding (Mikel and Dudley 2006), can be viewed as special cases of synthetics with $N_p = 2$ and 4, respectively.

Genomic prediction (GP), proposed by Meuwissen et al. (2001), led to a paradigm shift in animal breeding during the past decade (Hayes et al. 2009a; de Koning 2016) and has also been widely adopted in plant breeding (Lin et al. 2014). In cattle breeding, GP is predominantly applied within closed breeds, and training sets (TS) commonly encompass thousands of individuals. By comparison, in plant breeding the TS sizes are much smaller (e.g., hundreds of individuals or fewer) and populations are usually structured into multiple segregating families or subpopulations. Numerous studies addressed the implementation of GP in structured plant breeding populations (cf. Lorenzana and Bernardo 2009; Albrecht et al. 2011; Lehermeier et al. 2014; Technow and Totir 2015), but systematic investigations on the prospects of GP in synthetics are lacking so far, although synthetics were proposed as particularly suitable source material for recurrent genomic selection (Windhausen et al. 2012; Gorjanc et al. 2016).
Genomic best linear unbiased prediction (GBLUP), a modification of the traditional pedigree BLUP devised by Henderson (1984), is a widely used method to implement GP in animal and plant breeding (Mackay et al. 2015). Here, the pedigree relationship matrix is replaced by a marker-derived genomic relationship matrix to estimate actual relationships at QTL (Hayes et al. 2009c). The success of this approach depends on three sources of information, namely (i) pedigree relationships captured by markers, (ii) co-segregation of QTL and markers and (iii) population-wide linkage disequilibrium between QTL and markers (Habier et al. 2007, 2013; Wientjes et al. 2013).

In classical quantitative genetics, pedigree relationships between individuals are calculated as twice the probability of identity-by-descent (IBD) of alleles at a locus, conditional on their pedigree (Wright 1922; Falconer and Mackay 1996). However, actual IBD relationships at QTL deviate from pedigree relationships (which correspond to expected IBD relationships) due to Mendelian sampling (Hill and Weir 2011). In GP, pedigree relationships are captured best with a large number of stochastically independent markers (Habier et al. 2007), whereas capturing the Mendelian sampling term requires co-segregation of QTL and markers (Hayes et al. 2009c; Habier et al. 2013).

In pedigree analysis, the founders of the pedigree are by definition assumed to be unrelated (i.e., IBD equal to zero), but in reality, there usually exist latent similarities at QTL contributing to variation in identity-by-state (IBS) relationships among these individuals. Markers enable capturing these IBS relationships if they are in population-wide LD with the QTL in an ancestral population of founders. Thus, ancestral LD between QTL and markers also provides information on individuals that are unrelated by pedigree to the TS (Wientjes et al. 2013; Habier et al. 2013). Ancestral LD generally results from various population-historic processes like mutation, drift and selection (Flint-Garcia et al. 2003) and varies within species primarily due to different bottlenecks imposed by artificial selection or population admixture (Hill 1981; Hartl and Clark 2007). The influence of different levels of ancestral LD on prediction accuracy (PA) in synthetics and related types of populations has so far received little attention.

The contributions of the three sources of information to PA were demonstrated in theory and simulations by Habier et al. (2013) using half-sib families in cattle breeding and multiple biparental (full-sib) families in maize breeding, where both of these examples consisted of numerous families derived from a large number of parents. However, it is unclear whether these results generalize to other breeding situations, in particular those involving only few parents. In such situations, diverse relationship patterns are generated, new statistical associations between loci arise due to sampling, and ancestral LD might be only partially present in the progeny. These factors are expected to profoundly affect the contributions of the three sources of information to PA and thus the application of GP to related and unrelated genotypes. Synthetics represent an ideal framework for examining the influence of these factors on PA, because the number of parents used for producing them can be varied over a wide range. Here, we simulated two ancestral populations differing substantially in their LD and analyzed synthetics generated from different numbers of parents under eight scenarios that enabled dissecting the factors contributing to PA.
The objectives of this study were to (i) examine how PA in synthetics depends on the number of parents and LD in the ancestral population, (ii) assess the importance of the three sources of information for PA and how they are influenced by training set size and marker density, and (iii) analyze the relationship of LD between QTL and markers among the ancestral population, parents, and the synthetics generated from them. Finally, we discuss how our approach provides a general framework for analyzing the factors influencing PA and we draw inferences on the prospects of GP in other scenarios encountered in breeding.

METHODS

Genome properties and genetic map: We used maize (Zea mays L.) as a model species in our study. Physical map positions of the 56K Illumina maize SNP BeadChip were used to account for the markedly reduced recombination rate and lower marker density in the centromere regions (McMullen et al. 2009). These positions were converted into genetic map positions required for simulating meiosis events (File S1). In total we obtained 37,286 SNPs distributed over the 10 chromosomes of length 276, 200, 193, 188, 221, 171, 203, 173, 151 and 137 cM (1913 cM in total), corresponding to an average marker density of 24.4 SNPs cM$^{-1}$. All subsequent meiosis events were simulated using the count-location model without crossover interference, where the number of chiasmata was drawn from a Poisson distribution with parameter $\lambda$ equal to the chromosome length in Morgan, and where crossover positions were sampled from a uniform distribution across the chromosome.

Simulation of ancestral populations: Two ancestral populations that differed substantially in their level and decay of LD (LDA, Figure 1) were simulated with the software QMSim (Sargolzaei and Schenkel 2009). Ancestral population LR displayed extensive long-range LDA, whereas SR displayed only short-range LDA.
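Every generation of random mating in these simulations relies on the count-location meiosis model described under "Genome properties and genetic map". The following Python code is a minimal sketch of that model under stated assumptions and is not the simulation code used in this study: the function names, the Knuth-type Poisson sampler and the representation of haplotypes as allele lists with sorted map positions are hypothetical choices made for illustration.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def crossover_positions(length_cm, rng):
    """Count-location model: Poisson number of crossovers, uniform positions in cM."""
    n_crossovers = sample_poisson(length_cm / 100.0, rng)  # parameter = length in Morgan
    return sorted(rng.uniform(0.0, length_cm) for _ in range(n_crossovers))


def make_gamete(hap_a, hap_b, positions_cm, length_cm, rng):
    """Recombine two parental haplotypes; positions_cm must be sorted ascending."""
    breaks = crossover_positions(length_cm, rng)
    current = rng.randrange(2)  # the starting strand is chosen at random
    gamete, next_break = [], 0
    for allele_a, allele_b, pos in zip(hap_a, hap_b, positions_cm):
        # Switch strands once for every crossover located before this locus.
        while next_break < len(breaks) and breaks[next_break] < pos:
            current = 1 - current
            next_break += 1
        gamete.append(allele_a if current == 0 else allele_b)
    return gamete
```

For a chromosome of 276 cM, the expected number of crossovers per gamete under this sketch is 2.76, and no interference between crossovers is modeled.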
The simulation of LR was carried out by closely following Habier et al. (2013) and involved the following steps (Figure S1): First, we generated an initial population of 1,500 diploid individuals by sampling alleles at each (biallelic) locus independently from a Bernoulli distribution with probability 0.5. Second, 5,000 loci were randomly sampled from all SNPs and henceforth interpreted as QTL; all remaining loci were considered as SNP markers. Third, these individuals were randomly mated for 3,000 generations using a constant population size of 1,500 and a mutation rate of $2.5 \times 10^{-5}$. Fourth, a severe bottleneck was introduced by reducing population size to 30 randomly chosen individuals, followed by 15 more generations of random mating to generate extensive long-range LDA. Fifth, we conducted three more generations of random mating with a population size of 10,000 individuals to eliminate close pedigree relationships in the ancestral population LR. To produce SR, we randomly mated LR for 100 more generations at a population size of 10,000 individuals to remove long-range LDA. Thus, LR and SR differed strongly in their LDA structure, but only marginally in their allele frequencies (Table S2). In the last generation of both SR and LR, a single gamete was randomly sampled per individual and treated as a completely homozygous doubled haploid line. These 10,000 lines represented the final ancestral population used for production of the synthetics. All lines were considered unrelated when calculating pedigree relationships among their progeny.

Simulation of synthetics: We generated synthetics differing in $N_p$ by sampling $N_p \in \{2, 3, 4, 6, 8, 12, 16, 24, 32\}$ parent lines from the same ancestral population.
From these parents, we produced all possible $\binom{N_p}{2}$ combinations of single crosses (Syn-1 generation, Figure S1), where the number of Syn-1 progenies per cross was chosen to obtain at least 1,000 individuals in total. For production of the Syn-2 generation, the Syn-1 individuals were intermated at random, allowing also for selfing. Finally, a single doubled haploid line was derived from each of the 1,000 individuals of the Syn-2 generation to obtain the genotypes of the final synthetic. This approach was chosen to avoid additional full-sib relationships among doubled haploid lines that arise when deriving them from the same Syn-2 individual.

Genetic model: For simulating the polygenic target trait, we sampled a subset of 1,000 of the 5,000 QTL in each simulation replicate. Following Meuwissen et al. (2001), the corresponding QTL effects were drawn from a gamma distribution with scale and shape parameter 0.4 and 1.66, respectively. Signs of QTL effects were sampled from a Bernoulli distribution with probability parameter 0.5.

The vector $\mathbf{u}$ of true breeding values for all individuals in the synthetic was calculated as $\mathbf{u} = \mathbf{W}\mathbf{a}$, where $\mathbf{W}$ is the matrix of genotypic scores at QTL, coded {2, 0} depending on whether an individual was homozygous for the 1 or 0 allele, respectively, and centered by subtracting twice the frequency of the 1 allele in the ancestral population (cf. Figure S1), and $\mathbf{a}$ is the vector of QTL effects. The corresponding vector $\mathbf{y}$ of phenotypes was obtained as $\mathbf{y} = \mathbf{u} + \mathbf{e}$ (Goddard et al. 2011; de los Campos et al. 2013; Habier et al. 2013), i.e., assuming a null mean and adding a vector $\mathbf{e}$ of independent normally distributed environmental noise variables, where the variance $\sigma_e^2$ was chosen to be identical for the two ancestral populations and all choices of $N_p$, assuming that environmental effects affect phenotypes independently of the additive-genetic variance $\sigma_u^2$ in the synthetic.
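The genetic model described up to this point can be sketched in a few lines of Python. This is an illustrative sketch only and not the code used in this study: the function names are hypothetical, genotypes are assumed to be given as rows of {2, 0} scores, and the gamma parameters are taken exactly as stated above (shape 1.66, scale 0.4).

```python
import random


def simulate_qtl_effects(n_qtl, rng, shape=1.66, scale=0.4):
    """Gamma-distributed effect sizes with signs drawn from Bernoulli(0.5)."""
    # random.gammavariate takes (shape, scale) in that order.
    return [rng.gammavariate(shape, scale) * rng.choice((-1.0, 1.0))
            for _ in range(n_qtl)]


def breeding_values(genotypes, freqs, effects):
    """u = W a, with {2, 0} scores centered by twice the ancestral allele frequency."""
    return [sum((x - 2.0 * p) * a for x, p, a in zip(row, freqs, effects))
            for row in genotypes]


def phenotypes(u, sigma2_e, rng):
    """y = u + e with independent normal noise of variance sigma2_e (null mean)."""
    sd = sigma2_e ** 0.5
    return [value + rng.gauss(0.0, sd) for value in u]
```

Keeping the noise variance as an explicit argument mirrors the choice of a single value of the environmental variance for all settings.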
The value of $\sigma_e^2$ was therefore set equal to the additive-genetic variance $\sigma_A^2$ in ancestral population LR averaged across 1,000 simulation replicates. The heritability $h^2$ of the target trait was then on average equal to 0.5 for LR and SR due to nearly identical allele frequencies at QTL, but lower in the synthetics, because $\sigma_u^2 < \sigma_A^2$ (Table S2). We restricted our simulations to a single level of heritability, because preliminary analyses showed that changing $h^2$ resulted in a fairly constant shift of PA.

Analysis of the sources of information exploited in genomic prediction: We conceived eight scenarios to evaluate to what extent the three sources of information contribute to PA in synthetics (Figure 2), when actual relationships at QTL are estimated by marker-derived genomic relationships. The scenarios can be differentiated by three factors.

First, individuals in the TS and prediction set (PS) were either related ("Re"-scenarios) or unrelated ("Un"-scenarios), depending on whether the parents of the TS ($P_{TS}$) and of the PS ($P_{PS}$) were identical (i.e., $P_{TS} = P_{PS}$) or disjoint (i.e., $P_{TS} \cap P_{PS} = \emptyset$). For the "Re"-scenarios, we sampled individuals for the TS and PS from the same synthetic, whereas for the "Un"-scenarios, individuals were sampled from two different synthetics produced from disjoint sets $P_{TS}$ and $P_{PS}$, each of size $N_p$. Both sets of parents always originated from the same ancestral population.

Second, pairs of QTL and SNPs were either in LD ("LDA"-scenarios) as found in the ancestral population, or in linkage equilibrium ("LEA"-scenarios). To achieve the latter, we permuted complete QTL haplotypes among the $N_p$ parents (for "Un"-scenarios separately in each set $P_{TS}$ and $P_{PS}$), while keeping their SNP haplotypes unchanged (i.e., conserving their LDA).
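The permutation of QTL haplotypes can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name and the representation of each parent as a dictionary with "qtl" and "snp" allele lists are assumptions made for this example.

```python
import random


def permute_qtl_haplotypes(parents, rng):
    """Reassign complete QTL haplotypes among parents; SNP haplotypes stay in place.

    `parents` is a list of dicts with keys "qtl" and "snp" (lists of 0/1 alleles);
    this data layout is hypothetical.
    """
    qtl_haplotypes = [parent["qtl"] for parent in parents]
    rng.shuffle(qtl_haplotypes)  # shuffles the new list, not the input
    return [{"qtl": qtl, "snp": parent["snp"]}
            for parent, qtl in zip(parents, qtl_haplotypes)]
```

Because whole haplotypes are moved, the set of QTL haplotypes and hence the QTL allele frequencies among the parents are unchanged. For the "Un"-scenarios, the function would be called once for each of the two parent sets.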
This procedure eliminates any systematic association between QTL and SNP alleles originating from the ancestral population, but maintains allele frequencies and polymorphic states at QTL, as well as LDA between them. In contrast to previous approaches (cf. Habier et al. 2013), this approach avoids influencing PA by altering actual relationships at QTL. Importantly, after removal of LDA, there is still LD between QTL and SNPs in the parents, but this LD is purely due to the limited sample size and is thus subsequently referred to as sample LD.

Third, four different types of data were used to calculate the relationship matrix $\mathbf{K}$ used in BLUP: (i) For the "SNP"-scenarios, we used SNP genotypes to calculate the marker-derived genomic relationship matrix $\mathbf{K} = \mathbf{G} = (g_{ij})$ as $g_{ij} = \sum_k (x_{ik} - 2p_k)(x_{jk} - 2p_k) / [2 \sum_k p_k (1 - p_k)]$ (Habier et al. 2007; VanRaden 2008), where $x_{ik}$ is the genotype of the $i$-th individual at the $k$-th locus coded {2, 0} depending on whether this individual was homozygous for the 1 or 0 allele, respectively, and $p_k$ is the frequency of the 1 allele at the $k$-th SNP marker in the ancestral population. (ii) For the "QTL"-scenarios, the QTL genotypes were used to calculate the actual relationship matrix $\mathbf{K} = \mathbf{Q} = (q_{ij})$ using the same formula. (iii) For the "Ped"-scenario, pedigree records were used to calculate the pedigree relationship matrix $\mathbf{K} = \mathbf{A} = (a_{ij})$ with elements $a_{ij}$ being equal to expected IBD relationships (i.e., twice the coefficient of co-ancestry). (iv) For the "Tag"-scenario, tag markers labeling the origin of QTL alleles at each locus from the $N_p$ parents were used to calculate the actual IBD relationship matrix $\mathbf{K} = \mathbf{T} = (t_{ij})$ with elements $t_{ij}$ being equal to twice the proportion of identical tag marker alleles between each pair of individuals. Tag markers label each QTL allele, regardless of its state, uniquely with a number $\in \{1, \ldots, N_p\}$ in the parents and thus allow tracking the segregation process during intermating and identifying the parental origin of each QTL allele in the synthetic.

Scenario Re-LDA-SNP reflects the situation mostly encountered in practical applications of GP and used information from pedigree relationships among individuals in the TS and PS captured by SNPs, as well as from deviations from pedigree relationships due to (i) Mendelian sampling at QTL captured by co-segregation of QTL and SNPs and (ii) ancestral LD between QTL and SNPs. Scenario Re-LDA-Ped used only pedigree relationships, but ignored deviations due to Mendelian sampling, whereas Re-LDA-Tag accounted for both pedigree relationships and Mendelian sampling. Both scenarios ignored actual relationships among parents by assuming unrelated founders, and thus did not account for alleles that are IBS but not IBD in the synthetic. Scenario Re-LEA-SNP was artificial, with the goal of determining the influence of ancestral LD on PA in scenario Re-LDA-SNP. Scenario Re-LDA-QTL was employed to determine for the "Re"-scenarios the maximum PA achievable with GBLUP (cf. de los Campos et al. 2013), when assuming that each QTL explains an equal proportion of the additive-genetic variance. The purpose was thus to quantify the reduction in PA for all other "Re"-scenarios when using a different data type to estimate actual relationships.

Scenarios Un-LDA-SNP and Un-LDA-QTL ("Un"-scenarios) represent the conceptual counterparts to Re-LDA-SNP and Re-LDA-QTL (Figure 2). Un-LDA-SNP reflects the practical situation of predicting the genetic merit of individuals unrelated to the TS, whereas Un-LDA-QTL provides the corresponding upper bound of PA.
For both scenarios, alleles in the TS and PS had IBD probability equal to zero and, thus, the only remaining source of information contributing to PA in Un-LDA-SNP was ancestral LD between QTL and SNPs to track actual relationships among parents. Scenario Un-LEA-SNP was employed as negative-control scenario to validate the simulation designs. As expected, PA for Un-LEA-SNP fluctuated around zero for all investigated settings (results not shown), confirming that there are only three sources of information contributing to PA when using $\mathbf{K} = \mathbf{G}$.

Analysis of linkage disequilibrium and linkage phase similarity: We calculated LD as the squared correlation coefficient ($r^2$, Hill and Robertson 1968) between all pairs of QTL and SNPs in (i) each ancestral population (LDA), (ii) the set of $N_p$ parents sampled from the ancestral population, and (iii) the synthetic generated from the parents. Furthermore, we computed the linkage phase similarity of QTL-SNP pairs in the TS and PS. Here, we adopted a similar approach as de Roos et al. (2008), but replaced the correlation by the cosine similarity

$$\text{Linkage phase similarity} = \frac{\sum_{i=1}^{n} r_i^{TS}\, r_i^{PS}}{\sqrt{\sum_{i=1}^{n} \left(r_i^{TS}\right)^2}\; \sqrt{\sum_{i=1}^{n} \left(r_i^{PS}\right)^2}}, \tag{1}$$

where $i$ refers to the index of the QTL-SNP pair and $n$ is the number of pairs for which linkage phase similarity is calculated. The reason was to account not only for the ranking but also for the absolute size of the $r$ statistics in the two data sets (see File S2 for details). Linkage phase similarity was calculated for all QTL-SNP pairs falling into consecutive bins of 0.5 cM width. LD was first averaged within each bin and subsequently, both LD and linkage phase similarity statistics were averaged across chromosomes and simulation replicates.
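The two statistics of this section can be sketched in Python as follows. The code is a minimal illustration under stated assumptions and not the authors' implementation: function names are hypothetical, loci are represented as lists of 0/1 alleles across haplotypes (or doubled haploid lines), and binning by map distance is omitted.

```python
import math


def ld_r(hap_a, hap_b):
    """Signed correlation r between 0/1 alleles at two loci across haplotypes."""
    n = len(hap_a)
    p_a, p_b = sum(hap_a) / n, sum(hap_b) / n
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None  # r is undefined when either locus is monomorphic
    return (p_ab - p_a * p_b) / math.sqrt(denom)


def linkage_phase_similarity(r_ts, r_ps):
    """Cosine similarity (Equation 1) of r for the same QTL-SNP pairs in TS and PS."""
    num = sum(x * y for x, y in zip(r_ts, r_ps))
    den = math.sqrt(sum(x * x for x in r_ts)) * math.sqrt(sum(y * y for y in r_ps))
    return num / den
```

Squaring the value returned by `ld_r` gives the LD measure $r^2$. Unlike a correlation, the cosine similarity is not centered, so it is sensitive to the absolute size of the $r$ values and not only to their ranking.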
Genomic prediction: The statistical model used for predicting breeding values can be written as

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon}, \tag{2}$$

where $\mathbf{Z}$ is the incidence matrix linking phenotypes with breeding values, $\mathbf{u}$ is the vector of random breeding values with mean zero and variance-covariance matrix $\mathrm{var}(\mathbf{u}) = \mathbf{K}\sigma_u^2$, where $\mathbf{K}$ is a relationship matrix, calculated from different data types as described above, and $\sigma_u^2$ is the additive-genetic variance in the synthetic. Residuals $\boldsymbol{\varepsilon}$ are random with mean zero and $\mathrm{var}(\boldsymbol{\varepsilon}) = \mathbf{I}\sigma_\varepsilon^2$, where $\mathbf{I}$ is an identity matrix and $\sigma_\varepsilon^2$ is the residual variance. Estimates of variance components $\sigma_u^2$ and $\sigma_\varepsilon^2$ were obtained by restricted maximum likelihood and estimated breeding values $\hat{\mathbf{u}}$ were predicted using the mixed.solve function from R-package rrBLUP (Endelman 2011). PA was always calculated as the correlation between $\mathbf{u}$ and $\hat{\mathbf{u}}$ for the PS in each simulation replicate.

Following previous studies (Goddard et al. 2011; de los Campos et al. 2013), we also investigated how well estimated relationships $k_{ij}$ (i.e., $g_{ij}$, $a_{ij}$, $t_{ij}$) between individuals $i$ and $j$ in the TS and PS reflect the corresponding actual relationships $q_{ij}$ at QTL. We therefore calculated the coefficient of determination $R^2_{k,q}$ of the regression of $q_{ij}$ on $k_{ij}$ in each simulation replicate and all scenarios (except for Re-LDA-QTL and Un-LDA-QTL, where $R^2_{k,q} = 1.0$).

In order to assess the effect of TS size on PA, we sampled $N_{TS}$ = 125, 250, 500 or 750 individuals from the 1,000 lines of the synthetic, where 250 was used as default when another factor (e.g., marker density) was varied. For the PS, we always sampled $N_{PS} = 100$ individuals from (i) the remaining individuals that were not part of the TS in the "Re"-scenarios or (ii) the second synthetic in the "Un"-scenarios.
For all "SNP"-scenarios, the effect of marker density on PA was evaluated for two values of 5 and 0.25 SNPs cM$^{-1}$, the former being used as default. The number of randomly sampled SNPs per chromosome in each simulation replicate was proportional to the respective chromosome length. The two marker densities of 5 and 0.25 SNPs cM$^{-1}$ resulted in an average genetic map distance between each QTL and its closest SNP of 0.18 cM and 2.02 cM, respectively (Figure 1).

All reported results are arithmetic means over 1,000 simulation replicates, which were stochastically independent conditional on the ancestral populations. A simulation replicate comprises (i) random sampling of 1,000 QTL from the 5,000 initial QTL and sampling of QTL effects, (ii) sampling of the parents from the ancestral population and, in the case of the "LEA"-scenarios, additionally permuting QTL haplotypes, (iii) creation of synthetics from each set of parents, (iv) sampling of the individuals for the TS and PS, (v) sampling of the noise variable $\mathbf{e}$ and calculation of the breeding and phenotypic values, and (vi) training of the prediction equation and calculation of estimated breeding values as well as PA in the PS (Figure S1). All computations were carried out in the R statistical environment (R Core Team 2012).

RESULTS

Linkage disequilibrium in the ancestral populations: For ancestral population LR, LDA showed a steep decline extending to a genetic map distance $\Delta$ = 0.5 cM and approached an asymptote of about 0.08 for $\Delta$ > 1 cM (Figure 1), reflecting the presence of long-range LDA. By comparison, LDA in ancestral population SR started at slightly smaller values for closely linked loci and showed a similar decline for $\Delta$ < 1 cM. It levelled off at about $\Delta$ = 2 cM, where it almost reached its asymptotic value of zero due to absence of long-range LDA resulting from the 100 additional generations of random mating.
Linkage disequilibrium in the parents and the synthetics: Figure 3A shows the distribution of LD between QTL-SNP pairs in the parents, measured as $r^2$, as a function of $\Delta$. LD in the parents takes on only a limited number of values in the interval [0,1], because only a finite number of genotype configurations is possible for two biallelic loci, which depends exclusively on $N_p$. For $N_p = 2$, all LD values are equal to 1. For $N_p = 3$ and 4, possible LD values are $\{\frac{1}{4}, 1\}$ and $\{0, \frac{1}{9}, \frac{1}{3}, 1\}$, respectively, whereas for $N_p = 16$, more than 100 values are possible, resulting in a nearly continuous distribution of LD values in the parents. Under LEA (i.e., ancestral linkage equilibrium due to permutation of QTL haplotypes), the frequency of LD values in the parents was almost independent of $\Delta$, except for some small residual deviations due to similarity of ancestral allele frequencies at closely linked loci (see File S4 for details). Under LEA, the high frequencies of pairs of loci in high LD for $N_p = 3$ and 4 demonstrate the magnitude of sample LD (Figure 3A, left column). If, additionally, ancestral LD was present, large parental LD values occurred more frequently for tightly linked loci ($\Delta$ < 1 cM) for both ancestral populations. Under short-range LDA in SR, the frequencies of high parental LD values were almost identical to those found under LEA for $\Delta$ > 1 cM, regardless of $N_p$. Conversely, under long-range LDA in LR, the frequency of high parental LD values was considerably elevated also for $\Delta$ > 1 cM. Altogether, the distribution of LD values in the parents was much more strongly influenced by $N_p$ than by ancestral LD. The proportion of QTL-SNP pairs in high LD diminished as $N_p$ increased, but grew when shifting from short- to long-range LDA (Figure 3A, SR vs. LR).

Figure 3B shows the average LD between QTL-SNP pairs in synthetics as a function of $\Delta$ and $N_p$. The level of the LD curve dropped rapidly as $N_p$ increased from 2 to 8 and approached the curve of ancestral LD. Under LEA, LD in synthetics was still substantial for $N_p = 4$ due to sample LD, yet successively approached zero as $N_p$ was increased further. For $N_p > 2$, the presence of ancestral LD resulted in elevated LD in the synthetics, where the increment was large between tightly linked QTL-SNP pairs ($\Delta$ < 1 cM) for both ancestral populations and moderate between loosely linked loci ($\Delta$ > 1 cM) for LR.

Linkage phase similarity between training and prediction set: For scenario Re-LDA-SNP ($P_{TS} = P_{PS}$), linkage phase similarity between TS and PS exceeded 0.8 up to $\Delta$ = 20 cM, regardless of the ancestral population (Figure 4). By comparison, values were much lower for Un-LDA-SNP ($P_{TS} \cap P_{PS} = \emptyset$). Increasing $N_p$ reduced linkage phase similarity only marginally for Re-LDA-SNP even for $\Delta$ = 20 cM, but resulted in a substantial increase for Un-LDA-SNP. The higher ancestral LD in LR resulted only in a minor increase in linkage phase similarity in Re-LDA-SNP, but in a large increase for Un-LDA-SNP, irrespective of $N_p$. Since permuting QTL haplotypes eliminated ancestral LD in scenario Re-LEA-SNP, linkage phase similarity was identical for SR and LR and showed similar results as Re-LDA-SNP for SR (results not shown).

Influence of ancestral linkage disequilibrium and number of parents on prediction accuracy: With an increasing number of parents $N_p$, PA declined for all "Re"-scenarios (except Re-LDA-Ped), but increased for all "Un"-scenarios (Figure 5), where the strongest changes occurred between $N_p = 2$ and 8 for all scenarios. The highest PA was always achieved by scenario Re-LDA-QTL, closely followed by Re-LDA-SNP for small $N_p$, with an increasing difference for larger $N_p$. PA increased when shifting from low (SR) to high (LR) ancestral LD for scenario Re-LDA-SNP, but decreased for Re-LEA-SNP.
For scenario Re-LDA-Tag, PA was always intermediate between Re-LDA-SNP and Re-LEA-SNP. For Re-LDA-Ped, PA increased concavely from $N_p = 2$ up to its maximum value of 0.4 for $N_p = 8$, followed by a minor decrease. Re-LDA-Ped and Re-LEA-SNP approached identical PA for large $N_p$ under long-range LDA in LR, whereas Re-LEA-SNP retained superior PA under short-range LDA in SR. For Un-LDA-QTL, PA strongly increased for both ancestral populations, especially from $N_p = 2$ to 8, followed by a moderate increase. For Un-LDA-SNP, the overall level of PA was much lower, but showed a similarly increasing curvature as Un-LDA-QTL for long-range LDA, whereas under short-range LDA, PA was almost consistently < 0.2 without sizeable increase for all values of $N_p$.

Influence of training set size and marker density on prediction accuracy: Increasing TS size ($N_{TS}$) from 125 to 750 individuals was overall most beneficial for all "Re"-scenarios, except for Re-LDA-Ped (Figure S3). Conversely, Re-LDA-Ped, as well as Un-LDA-SNP under short-range LDA, showed only a minor increase in PA for larger $N_{TS}$. However, for Un-LDA-SNP under long-range LDA and for Un-LDA-QTL under both short- and long-range LDA, the increase in PA along with $N_{TS}$ was notable, especially for $N_p > 8$.

Reducing the marker density from 5 SNPs cM$^{-1}$ to 0.25 SNPs cM$^{-1}$ resulted in a substantial reduction of PA for all "SNP"-scenarios (Figure S4). This reduction was reinforced for scenarios utilizing ancestral LD (Re-LDA-SNP and Un-LDA-SNP), especially in the presence of long-range LDA and for large values of $N_p$.

DISCUSSION

In plant breeding, GP has been applied to various types of populations such as single or multiple biparental families or diversity panels of inbred lines.
These materials differ fundamentally in their pedigree structure, the number of founder individuals involved in their development, as well as the LD in the ancestral population from which they were taken. Synthetics are especially suited for systematically assessing the influence of these factors on prediction accuracy, because the variable number of parents used for generating synthetics leads to (i) different pedigree relationships among individuals and (ii) a trade-off between ancestral LD and sample LD arising in the parents. Thus, our approach provides new insights into how these factors influence the ability of molecular markers to capture actual relationships at causal loci, which determines the accuracy in various applications of GP.

Influence of the number of parents and ancestral LD on actual relationships at causal loci and prediction accuracy: The accuracy of GP relies on (i) the distribution of actual relationships $q_{ij}$ at causal loci (QTL) between individuals in the TS and PS and (ii) the quality of the approximation of $q_{ij}$ by marker-derived genomic relationships $g_{ij}$ (Goddard et al. 2011; Habier et al. 2013). We first investigated PA using the actual relationship matrix $\mathbf{Q}$, which provides an upper bound of PA given fixed values for $N_P$, $N_{TS}$ and $h^2$ (de los Campos et al. 2013). Subsequently, we estimated $\mathbf{Q}$ by the marker-derived genomic relationship matrix $\mathbf{G}$ and inferred how the three sources of information contributed to PA.

Actual relationships $q_{ij}$ between two individuals $i$ and $j$ can be factorized into

$$q_{ij} = a_{ij} + m_{ij} + e_{ij}, \tag{3}$$

where $a_{ij}$ is their expected IBD relationship at QTL, $m_{ij} = x_{ij} - a_{ij}$ is the deviation of the actual IBD relationship $x_{ij}$ from the expected IBD relationship due to Mendelian sampling, and $e_{ij} = q_{ij} - x_{ij}$ is the deviation of the actual (IBS) relationship from the actual IBD relationship.
Whereas $a_{ij}$ and $m_{ij}$ provide information solely with respect to the parents (i.e., the founders of the pedigree), $e_{ij}$ accounts also for actual relationships among the parents (Powell et al. 2010; Vela-Avitúa et al. 2015).

If TS and PS are related ("Re"-scenarios), the distribution of $a_{ij}$ depends on $N_P$ (Figure 6A) and on the mating scheme employed for production of the synthetic (Figure S1). For small $N_P$, this distribution is dominated by full-sib and half-sib relationships, whereas distantly related and unrelated individuals dominate for larger $N_P$. The closer the pedigree relationships between individuals, the longer are the chromosome segments they inherit from common ancestors and the larger is the conditional variance in actual IBD relationships, i.e., $\mathrm{var}(x_{ij} \mid a_{ij})$ (Figure 6B; Hill and Weir 2011; Goddard et al. 2011). In other words, $\mathrm{var}(x_{ij} \mid a_{ij})$ is inversely proportional to the number of independently segregating chromosome segments and, hence, the length and number of chromosomes must be taken into account when transferring our results to other species. For example, in bread wheat (2n = 42), $\mathrm{var}(x_{ij} \mid a_{ij})$, and consequently PA attributable to the Mendelian sampling term, are expected to be smaller than in maize (2n = 20).

The contribution of $e_{ij}$ to $q_{ij}$ depends on the level of ancestral LD. Elevated LDA increases $\mathrm{var}(e_{ij})$ in the ancestral population (Figure S2, LR vs. SR) and in turn increases the variation in similarity of haplotypes among parents sampled therefrom (Habier et al. 2013). Consequently, $\mathrm{var}(q_{ij} \mid a_{ij})$ in synthetics increases with ancestral LD (Figure 6B and S2), on top of the variance $\mathrm{var}(x_{ij} \mid a_{ij})$ caused by Mendelian sampling. Assuming known actual relationships and fixed TS size, PA therefore decreases if (i) $N_P$ increases and (ii) ancestral LD decreases (Figure 5).
This is because both factors reduce the absolute frequency of close actual relationships among TS and PS (Figure S5). If actual relationships among the parents were not accounted for, the decline in PA was reinforced as $N_P$ increased (Figure 5, scenario Re-LDA-Tag). The reason for this follows from the factorization (Eq. 3): the larger $N_P$, the more frequent are pairs of individuals with small or zero pedigree relationship (Figure 6A) and the more important it is to account for actual relationships among parents. Conversely, PA was consistently higher for small $N_P$ due to strong pedigree relationships and Mendelian sampling, despite the accompanying negative effects of reduced heritability in the TS and the reduced additive-genetic variance in the PS on PA (Table S2).

Restricting predictive information to pedigree relationships (scenario Re-LDA-Ped) resulted in only moderate PA for $N_P > 2$ (Figure 5). For $N_P = 2$, all individuals in the TS and PS were full-sibs (Figure 6A), which resulted in identical estimated breeding values by pedigree BLUP, so that PA could not be calculated (indicated as PA = 0 in Figure 5). For $N_P > 2$, there was variation in pedigree relationships in synthetics (Figure 6A) and thus, PA > 0. Further research is warranted on the importance of variation in pedigree relationships for GP in the presence of Mendelian sampling and ancestral LD, e.g., by considering mating schemes such as MAGIC, which reduce or even entirely avoid variation in pedigree relationships.

If the TS and PS are unrelated ("Un"-scenarios), only $e_{ij}$ contributes to variation in $q_{ij}$, because $a_{ij}$ and $m_{ij}$ are equal to zero. Moreover, if $N_P$ is small, QTL in the TS and PS can (i) be fixed for different alleles (Table S2) or (ii) differ in their LD structure due to sample LD.
This limits the occurrence of close actual relationships between TS and PS (Figure S5, Un-LDA-QTL) and reduces the upper bounds of PA compared with the corresponding "Re"-scenarios (Figure 5, Un-LDA-QTL vs. Re-LDA-QTL). As $N_P$ increases, allele frequencies and LD between loci converge towards those in the ancestral population in both "Re"- and "Un"-scenarios (Table S2). In turn, the closest actual relationships between TS and PS converge as well (Figure S5), resulting ultimately in similar PA for Re-LDA-QTL and Un-LDA-QTL (Figure 5). In conclusion, the difference in predicting related and unrelated genotypes vanishes as $N_P$ increases for a given TS size, because it is then primarily ancestral information that drives the accuracy of GP.

Sample LD and co-segregation as crucial factors for prediction accuracy in synthetics: LD in the parents represents a combination of LD carrying over from the ancestral population and LD generated anew due to limited $N_P$. The latter LD, herein referred to as sample LD, results from a bottleneck in population size similar to that used in our simulations for generating long-range LD in the ancestral population (cf. Figure S1), but can be much stronger if $N_P$ is small (e.g., 4). Co-segregation is defined as the co-inheritance of alleles at linked loci on the same gamete and thus describes the process that prevents parental LD between them from being rapidly eroded by recombination (Figure S6). Together, sample LD and co-segregation result in high LD in synthetics, which for small $N_P$ exceeds by far the level of ancestral LD (Figure 3B, see File S3 for details). The crucial property of sample LD, however, is that it is specific to a set of parents and thus provides predictive information only for their descendants. Hence, using co-segregation as "source of information" in GP relies on the presence of pedigree relationships (Habier et al. 2013).
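How slowly recombination erodes parental LD between linked loci can be illustrated with the textbook decay of the LD coefficient under random mating, $D_t = (1 - c)^t D_0$. The following Python sketch is only an illustration under stated assumptions and is not taken from the simulation pipeline of this study: it assumes Haldane's mapping function to convert map distance into the recombination frequency $c$, and the function names and the number of intermating generations are hypothetical.

```python
import math


def haldane_recombination(distance_cm):
    """Recombination frequency for a map distance in cM (Haldane's mapping function, assumed here)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def ld_retained(distance_cm, generations):
    """Expected fraction of the initial LD coefficient D remaining after random mating."""
    c = haldane_recombination(distance_cm)
    return (1.0 - c) ** generations


if __name__ == "__main__":
    # Two generations of intermating is an assumed value for illustration only.
    for distance in (1, 5, 20):
        print(f"{distance:>2} cM: fraction of D retained = {ld_retained(distance, 2):.3f}")
```

Under these assumptions, loci 20 cM apart still retain roughly 70% of their initial LD coefficient after two generations of random mating, which is consistent with the statement that co-segregation preserves parental LD even between loosely linked loci.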
Conversely, the fraction of parental LD that stems from ancestral LD is a commonality among all descendants of the ancestral population, irrespective of pedigree relationships. The particularly small number of parents used in synthetics makes sample LD and co-segregation crucial factors contributing to PA, a situation that differs greatly from previously investigated scenarios (e.g., Habier et al. 2007, 2013; Wientjes et al. 2013). Hence, knowledge of how ancestral LD and sample LD contribute to parental LD, depending on $N_P$, is essential for evaluating the applicability of training data to prediction of both related and unrelated genotypes.

The influence of sample LD on parental LD and PA in the "Re"-scenarios is illustrated best by considering different values of $N_P$: For $N_P = 2$, sample LD in the parents is maximized, because all pairs of polymorphic loci are in complete LD ($r^2 = 1.0$), irrespective of ancestral LD, linkage or genetic map distance. Co-segregation of linked QTL and SNPs during intermating largely conserves LD, even for loosely linked loci (Figure S6), so that LD in synthetics remained at high levels (Figure 3B). Therefore, replacing $\mathbf{Q}$ with $\mathbf{G}$ resulted in merely a marginal reduction of PA (Figure 5, Re-LDA-QTL vs. Re-LDA-SNP). Previous studies claimed that PA in biparental populations is the maximum obtainable for given TS size (Riedelsheimer et al. 2013; Lehermeier et al. 2014), despite absence of variation in pedigree relationships. Our results demonstrate that this is exclusively attributable to the efficient utilization of sample LD via co-segregation. For $N_P = 3$ and 4, LD can take two and four discrete values, respectively (see Results). Thus, sample LD still takes up a large share of parental LD (Figure 3A).
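The discrete LD values that arise for very small $N_P$ can be checked by brute-force enumeration. The following Python sketch is illustrative only and is not part of the original simulations; it assumes fully homozygous parental lines (one haplotype per parent) and biallelic loci, and the helper name `r2_values` is hypothetical. It lists every distinct $r^2$ value that can occur between two loci that are both polymorphic among the parents.

```python
from fractions import Fraction
from itertools import product

# The four possible two-locus haplotypes (allele at locus A, allele at locus B).
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def r2_values(n_parents):
    """Sorted distinct r^2 values between two polymorphic biallelic loci
    among n_parents fully inbred parents (assumption: one haplotype per parent)."""
    values = set()
    for haps in product(HAPLOTYPES, repeat=n_parents):
        p_a = Fraction(sum(a for a, _ in haps), n_parents)
        p_b = Fraction(sum(b for _, b in haps), n_parents)
        if p_a in (0, 1) or p_b in (0, 1):
            continue  # at least one locus is monomorphic, r^2 is undefined
        p_ab = Fraction(sum(a * b for a, b in haps), n_parents)
        d = p_ab - p_a * p_b  # LD coefficient D
        values.add(d * d / (p_a * (1 - p_a) * p_b * (1 - p_b)))
    return sorted(values)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, [str(v) for v in r2_values(n)])
```

Under these assumptions, the enumeration yields a single value ($r^2 = 1$) for $N_P = 2$, two values (1/4 and 1) for $N_P = 3$, and four values (0, 1/9, 1/3 and 1) for $N_P = 4$, in line with the counts stated above.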
However, the occurrence of different LD values (in contrast to $N_P = 2$) introduces a dependency on ancestral LD: the frequency of loci with high parental LD increases in the presence of ancestral LD compared with ancestral linkage equilibrium. This difference carries over during intermating and resulted in increased LD in the synthetics, especially under long-range ancestral LD (Figure 3B, LR). However, the increment in PA was only marginal (Figure 5, Re-LDA-SNP vs. Re-LEA-SNP) owing to the overriding contribution of sample LD to parental LD for $N_P = 3$ and 4. Nevertheless, the reduction in sample LD for $N_P = 3$ or 4, compared with $N_P = 2$, impaired co-segregation information and reinforced the decline in PA when relying on markers (Figure 5, Re-LDA-QTL vs. Re-LDA-SNP). For $N_P \geq 16$, sample LD becomes negligible (Figure 3A) so that parental LD hardly differed from ancestral LD. This led to (i) reinforced reduction in PA, when using markers rather than known QTL genotypes (Figure 5, Re-LDA-QTL vs. Re-LDA-SNP), especially for short-range ancestral LD, and (ii) convergence of PA of GBLUP and pedigree BLUP in the absence of ancestral LD (Figure 5, Re-LEA-SNP vs. Re-LDA-Ped). The reason for the latter is that under marginal contribution of co-segregation, PA stems primarily from capturing pedigree relationships by SNPs.

For the "Un"-scenarios, sample LD is manifested independently in $P_{TS}$ and $P_{PS}$, which results in different co-segregation "patterns" in TS and PS that cannot reliably be exploited in GP. Therefore, the ancestral LD that is common to both sets of parents, measured by linkage phase similarity in the synthetics (Figure 4), provides the only source of information connecting the TS and PS. This constraint resulted in a much larger drop in PA when replacing $\mathbf{Q}$ with $\mathbf{G}$ in the "Un"-scenarios (Figure 5, Un-LDA-QTL vs.
Un-LDA-SNP) compared with the corresponding "Re"-scenarios (Figure 5, Re-LDA-QTL vs. Re-LDA-SNP), especially under short-range ancestral LD. This decline in PA when predicting the genetic merit of unrelated instead of related genotypes corroborates previous findings on GP across populations in both animal and plant breeding (Hayes et al. 2009b; de Roos et al. 2009; Technow et al. 2013; Riedelsheimer et al. 2013; Albrecht et al. 2014; Heslot and Jannink 2015).

Variation in linkage phase similarity between TS and PS caused by sample LD affects GP of unrelated genotypes in an unpredictable manner: while identical and reversed QTL-SNP linkage phases manifested by sample LD cancel out on average, individual TS-PS combinations can show above- or below-average linkage phase similarity and thus, similarity of co-segregation "patterns". This translates into large variation of PA among different TS-PS combinations. Additional simulations using unequal $N_P$ to derive the TS and PS showed that variation in PA was even higher when using small $N_P$ to generate the PS than for the TS (Figure S7). A possible explanation might be that regardless of the TS composition, small $N_P$ for the PS drastically reduces the frequency of polymorphic loci (Table S2) and thereby increases the variation in linkage phase similarity with the TS for the remaining loci, which in turn increases the variability of prediction. Considering the practical relevance of such prediction scenarios, further research is needed to investigate this finding in greater detail.

Influence of LD on capturing pedigree relationships: The ability to capture pedigree relationships by SNPs increases with the effective number of independently segregating SNPs in the model (Habier et al. 2007). Higher LD between SNPs reduces this number and thus reduces the contribution of pedigree relationships captured by SNPs to PA.
Scenario Re-LEA-SNP demonstrates this fact for large values of $N_P$, where LD between QTL and SNPs in synthetics was small (Figure 3B) and hence, PA mainly relied on capturing pedigree relationships. In line with this reasoning, PA decreased from SR to LR (Figure 5, Re-LEA-SNP) as well as when marker density was reduced from 5 to only 0.25 SNPs cM$^{-1}$, because similar to increasing LD, using low marker density reduced the number of independently segregating SNPs (Figure S4, Re-LEA-SNP).

In GBLUP, the consequences of an imprecise estimation of pedigree relationships by SNPs due to strong LD are limited, because the loss in PA compared with pedigree BLUP is mostly overcompensated for by capturing either co-segregation (Figure 5; small $N_P$, Re-LEA-SNP vs. Re-LDA-Ped) or long-range ancestral LD (Figure 5; large $N_P$, Re-LEA-SNP vs. Re-LDA-Ped). An exception is the combination of large $N_P$ and short-range ancestral LD, where the comparatively small contribution of ancestral LD to PA does not necessarily compensate for that loss, so that GBLUP might not provide the desired advantage over pedigree BLUP. Alternative models employing variable selection (e.g., BayesB), which capitalize more on LD rather than pedigree relationships (Habier et al. 2007; Zhong et al. 2009; Jannink et al. 2010), might help to improve the prospects of GP in such cases.

Influence of training set size on prediction accuracy: In this study, we varied training set size $N_{TS}$ for given values of $N_P$, because resources devoted to the TS differ between breeding programs and do not necessarily depend on $N_P$. Under fixed $N_P$, the absolute frequency of individuals with close actual relationship among TS and PS increases with $N_{TS}$ (Figure S5), which led to similar benefits in PA for all "Re"-scenarios (Figure S3, except Re-LDA-Ped).
However, the general decline of PA in these scenarios with increasing $N_P$ was only slightly attenuated even when using 750 instead of 125 individuals in the TS. This is because the need for larger TS size increases rapidly as pedigree relationships with the PS decrease (Habier et al. 2010), which in turn shifts the distribution of actual relationships toward lower values (Figure 6B and S2). Thus, $N_{TS}$ must generally be increased along with $N_P$ to counteract as much as possible the expected decline in PA.

According to Habier et al. (2013), altering TS size affects the contributions of the three sources of information to PA, but this inference is based on the assumption that TS size was increased by adding new families to the TS (unrelated to the initially included families), which is comparable to increasing $N_P$ in our study. De los Campos et al. (2013) showed that the estimation of actual relationships by SNPs is sufficiently characterized by $R^2_{q,g}$ (Figure S8) and thus, largely independent of $N_{TS}$, apart from estimation error. In synthetics, the distribution of actual relationships $q_{ij}$ is defined by $N_P$ and LDA (Figure 6 and S2). Thus, increasing $N_{TS}$ increases the chances for each individual in the PS to have several individuals with close actual relationships $q_{ij}$ in the TS, which was previously found to be crucial for achieving high PA (Jannink et al. 2010; Clark et al. 2012). Therefore, the contributions to PA from co-segregation and ancestral LD increase with $N_{TS}$, because they are required to capture deviations from pedigree relationships. Conversely, using small $N_{TS}$ will tend to hamper the occurrence of high $q_{ij}$ values and hence, increase the reliance on pedigree relationships.

If TS and PS are unrelated, the absolute frequency of close actual relationships is low, even if $N_{TS}$ is large (Figure S5).
Additionally, actual relationships are rather poorly estimated by SNPs when relying solely on ancestral LD (Figure S8, Un-LDA-SNP). Consequently, huge $N_{TS}$ (>> 750) and high marker density would be required to substantially elevate PA, especially if there is only short-range ancestral LD (cf. de los Campos et al. 2013).

Influence of marker density on prediction accuracy: High marker density is especially important if LD between QTL and SNPs extends only to short map distances (Solberg et al. 2008; Zhong et al. 2009; Hickey et al. 2014). This applies in our study if either sample LD was negligible (Figure S4; large $N_P$, Re-LDA-SNP vs. Re-LEA-SNP) or if TS and PS were unrelated (Figure S4, Un-LDA-SNP), so that PA relied heavily on ancestral LD. Our results also show that in the latter case, using high marker density strongly improved PA for both ancestral populations, implying that capturing LD between tightly linked loci ($\Delta < 1$ cM) is beneficial even if long-range ancestral LD prevails. With low marker density, capturing only the "long-range part" of ancestral LD (Figure 1, LR) still provided moderate PA (Figure S4, LR), but PA dropped below 0.1 for short-range ancestral LD (Figure S4, SR). This was likely because most SNPs were no longer in LD with QTL and thus contributed mostly noise to the prediction equation. These results are in agreement with former studies (de los Campos et al. 2013; Habier et al. 2013; Hickey et al. 2014; Lorenz and Smith 2015) reporting that under insufficient marker density, adding individuals unrelated to the PS to the TS can even decrease PA.

In summary, the required marker density for $N_P \geq 16$ should be chosen in compliance with the extent of ancestral LD.
While in this case, high density is mandatory if TS and PS are unrelated, moderate PA can still be achieved under low marker density if TS and PS are related due to pedigree relationships contributing to PA. For small $N_P$, extensive LD in synthetics (due to sample LD and co-segregation) lowers the requirements on marker density. Although co-segregation is captured optimally if SNPs and QTL are as tightly linked as possible, medium marker density ($\geq 1$ SNP cM$^{-1}$, depending on $N_P$) is likely sufficient to reach PA near the optimum.

Expected impact of ancestral LD on GP in synthetics: In GP of genetic predisposition in humans or breeding values of bulls, the availability of several thousand training individuals, in conjunction with high marker densities, allows for efficient use of rather low levels of ancestral LD, as usually observed in these species (de Roos et al. 2008; Goddard and Hayes 2009; de los Campos et al. 2013). We showed that short-range ancestral LD is generally less valuable in plant breeding, where TS usually comprise only hundreds or fewer individuals. Ancestral LD can differ substantially among crops and different germplasm within crops (Flint-Garcia et al. 2003). Usually, low levels of ancestral LD are found in diversity panels that encompass lines from different breeding programs and/or geographic origin as well as in materials largely unselected by breeders, such as landraces or gene bank accessions (Hyten et al. 2007; Delourme et al. 2013; Romay et al. 2013). Recently, Gorjanc et al. (2016) proposed GP for recurrent selection of synthetics generated from doubled haploid lines derived from landraces. In the light of our findings, such an approach generally requires large TS size and high marker density to outperform pedigree BLUP, unless one chooses small $N_P$ to ensure satisfactory PA due to co-segregation.
In contrast, extensive long-range ancestral LD is usually found in elite breeding germplasm of major crops such as maize (Windhausen et al. 2012; Unterseer et al. 2014), wheat (Maccaferri et al. 2005), barley (Zhong et al. 2009), soybean (Hyten et al. 2007) or sugar beet (Würschum et al. 2013). If synthetics were derived from such germplasm, ancestral LD is expected to contribute substantially to PA, as shown by our results. However, LD determined from biallelic SNPs might overestimate ancestral LD between QTL-SNP pairs, because their allele frequencies can deviate due to ascertainment bias in discarding SNPs with low minor allele frequencies for the construction of SNP arrays (Ganal et al. 2011; Goddard et al. 2011). Such an overestimation would impair the advantage of GP approaches over pedigree BLUP.

Implications for other scenarios relevant in plant breeding: Research on GP in plant breeding has so far focused primarily on the use of single (e.g., Lorenzana and Bernardo 2009; Riedelsheimer et al. 2013) and multiple segregating biparental families (BF) (e.g., Heffner et al. 2011; Albrecht et al. 2011; Schulz-Streeck et al. 2012; Habier et al. 2013; Lehermeier et al. 2014). For $N_P = 2$, our scenarios Re-LDA-SNP and Un-LDA-SNP correspond exactly to GP within and between BF derived from unrelated parents. In practice, breeders mostly derive lines directly from F1 crosses (Mikel and Dudley 2006), whereas we applied a further generation of intermating (Figure S1). This additional meiosis slightly reduces LD in synthetics (see File S3 for details) and in turn, PA (results not shown). While GP within BF generally works well, predicting an unrelated BF can be risky and unreliable (Riedelsheimer et al. 2013) as underlined by our results for scenario Un-LDA-SNP (Figure S7, $N_P = 2$).
Similar uncertainties might be encountered if new lines from an untested BF are predicted based on pre-existing data from multiple BF (Heffner et al. 2011), diversity panels (Würschum et al. 2013) or populations of experimental hybrids (Massman et al. 2013) to obtain predicted breeding values prior to partially phenotyping the new cross (Figure S7, $N_P > 2$ in TS and $N_P = 2$ in PS). The risk of such approaches is likely attenuated in advanced breeding cycles, where putatively "unrelated" BF usually share more recent common ancestors than a TS comprising truly unrelated material, as would be the case in an "ideal" diversity panel. However, Hickey et al. (2014) showed that if two BF share only a grandparent as their most recent common ancestor, PA was not substantially higher than for unrelated BF. This underpins the need for close relatives in the TS (e.g., full-sibs or half-sibs) to warrant high and robust PA across different prediction targets. Accordingly, previous studies on GP in diversity panels concluded that the observed medium to high PAs were partially attributable to latent groups of related germplasm (e.g., Rincent et al. 2012; Schopp et al. 2015).

If a BF is too small for training the prediction equation, multiple BF can alternatively be pooled (Heffner et al. 2011; Technow and Totir 2015). Such a combined TS can be constructed by sampling lines from each BF to predict the remainder in each BF ("within") or by using some BF to predict other BF ("across") (cf. Albrecht et al. 2011). Our scenarios Re-LDA-SNP and Un-LDA-SNP are similar to these "within" and "across" situations for $N_P > 2$, but, besides the additional meiosis discussed above, show another important difference to F1-derived multiple BF: generating synthetics by random mating of the Syn-1 generation breaks up the clear pedigree structure in full-sib, half-sib and unrelated families (Figure S9).
This reduces both the mean and variance of pedigree relationships, which in turn reduces PA (results not shown). As discussed above, capturing pedigree relationships plays a major role in GP of both synthetics and multiple BF if TS and PS are related, especially if $N_P$ is large. This is because in both situations, co-segregation is barely used to obtain "accuracy within families" (cf. Habier et al. 2013). In practical breeding programs using multiple BF, the situation might be slightly different if some parents are overrepresented compared with others and introduce predominant linkage phase patterns that can be exploited in GP. Moreover, one has the opportunity to improve information from co-segregation by (i) clustering related BF into the TS to reflect the co-segregation pattern of the PS or (ii) explicit modelling of co-segregation (cf. Habier et al. 2013) or family-specific effects using hierarchical models (Technow and Totir 2015). However, neither of these strategies is easily accessible in synthetics, unless one replaces random by controlled mating in order to keep track of pedigree relationships. Since ancestral LD persists well over generations (Habier et al. 2007), its contribution to PA is expected to be only marginally affected by additional intermating generations. Thus, ancestral LD can generally be considered of great importance for GP of material related or unrelated to the TS, particularly if $N_P$ is large.

In the present study, we considered the two most extreme situations of relatedness or unrelatedness of the TS and PS, because their parents were either identical or entirely different. Further research is warranted for situations in which parents partially overlap among families, which occurs frequently in practice, e.g., when proven inbred lines contribute to multiple crosses in subsequent breeding cycles.
Moreover, we focused here exclusively on PA, but the genetic gain from genomic selection, which is of ultimate interest to breeders, depends additionally on the genetic variance in the population. Since both parameters are influenced by the choice of $N_P$, the potential of recurrent genomic selection in synthetics needs to be examined for different values of $N_P$ and different levels of ancestral LD, ideally across multiple selection cycles.

ACKNOWLEDGMENTS

We thank Chris-Carolin Schön, Tobias Würschum, José Marulanda, Willem Molenaar and three anonymous reviewers for valuable suggestions to improve the content of the manuscript. PS acknowledges Syngenta for partially funding this research by a Ph.D. fellowship and AEM the financial contribution of CIMMYT/GIZ through the CRMA Project 15.78600.8-001-00.

DATA AVAILABILITY STATEMENT

The authors state that all simulated data and results necessary for confirming the conclusions presented in the article are represented fully within the article and data supplements. Figure S1 provides a detailed overview of the entire simulation scheme and the assumptions underlying all results presented herein.

LITERATURE CITED

Albrecht, T., H.-J. Auinger, V. Wimmer, J. O. Ogutu, C. Knaak et al., 2014 Genome-based prediction of maize hybrid performance across genetic groups, testers, locations, and years. Theor. Appl. Genet. 127: 1375–1386.
Albrecht, T., V. Wimmer, H.-J. Auinger, M. Erbe, C. Knaak et al., 2011 Genome-based prediction of testcross values in maize. Theor. Appl. Genet. 123: 339–350.
Bandillo, N., C. Raghavan, and P. Muyco, 2013 Multi-parent advanced generation inter-cross (MAGIC) populations in rice: progress and potential for genetics research and breeding. Rice 6: 1–15.
Bradshaw, J. E., 2016 Plant Breeding: Past, Present and Future. Springer International Publishing.
Cavanagh, C., M. Morell, I. Mackay, and W.
Powell, 2008 From mutations to MAGIC: resources for gene discovery, validation and delivery in crop plants. Curr. Opin. Plant Biol. 11: 215–221.
Clark, S. A., J. M. Hickey, H. D. Daetwyler, and J. H. J. van der Werf, 2012 The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet. Sel. Evol. 44: 4.
Delourme, R., C. Falentin, B. F. Fomeju, M. Boillot, G. Lassalle et al., 2013 High-density SNP-based genetic map development and linkage disequilibrium assessment in Brassica napus L. BMC Genomics 14: 120.
Endelman, J. B., 2011 Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4: 250–255.
Falconer, D. S., and T. F. C. Mackay, 1996 Introduction to Quantitative Genetics. Pearson, Essex.
Flint-Garcia, S. A., J. M. Thornsberry, and E. S. Buckler, 2003 Structure of linkage disequilibrium in plants. Annu. Rev. Plant Biol. 54: 357–374.
Ganal, M. W., G. Durstewitz, A. Polley, A. Bérard, E. S. Buckler et al., 2011 A large maize (Zea mays L.) SNP genotyping array: development and germplasm genotyping, and genetic mapping to compare with the B73 reference genome. PLoS One 6: e28334.
Goddard, M. E., and B. J. Hayes, 2009 Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat. Rev. Genet. 10: 381–391.
Goddard, M. E., B. J. Hayes, and T. H. E. Meuwissen, 2011 Using the genomic relationship matrix to predict the accuracy of genomic selection. J. Anim. Breed. Genet. 128: 409–421.
Gorjanc, G., J. Jenko, S. J. Hearne, and J. M. Hickey, 2016 Initiating maize pre-breeding programs using genomic selection to harness polygenic variation from landrace populations. BMC Genomics 17: 30.
Habier, D., R. L. Fernando, and J. C. M.
Dekkers, 2007 The impact of genetic relationship information on genome-assisted breeding values. Genetics 177: 2389–2397.
Habier, D., R. L. Fernando, and D. J. Garrick, 2013 Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics 194: 597–607.
Habier, D., J. Tetens, F. Seefried, P. Lichtner, and G. Thaller, 2010 The impact of genetic relationship information on genomic breeding values in German Holstein cattle. Genet. Sel. Evol. 42: 5.
Hagdorn, S., K. R. Lamkey, M. Frisch, P. E. O. Guimarães, and A. E. Melchinger, 2003 Molecular genetic diversity among progenitors and derived elite lines of BSSS and BSCB1 maize populations. Crop Sci. 43: 474–482.
Hallauer, A. R., M. J. Carena, and J. de M. Filho, 2010 Quantitative Genetics in Maize Breeding. Springer.
Hartl, D. L., and A. G. Clark, 2007 Principles of Population Genetics. Sinauer Associates, Inc.
Hayes, B. J., P. J. Bowman, A. J. Chamberlain, and M. E. Goddard, 2009a Genomic selection in dairy cattle: progress and challenges. J. Dairy Sci. 92: 433–443.
Hayes, B. J., P. J. Bowman, A. J. Chamberlain, K. Verbyla, and M. E. Goddard, 2009b Accuracy of genomic breeding values in multi-breed dairy cattle populations. Genet. Sel. Evol. 41: 51.
Hayes, B. J., P. M. Visscher, and M. E. Goddard, 2009c Increased accuracy of artificial selection by using the realized relationship matrix. Genet. Res. Cambridge 91: 47–60.
Heffner, E. L., J.-L. Jannink, and M. E. Sorrells, 2011 Genomic selection accuracy using multifamily prediction models in a wheat breeding program. Plant Genome 4: 65–75.
Henderson, C., 1984 Applications of Linear Models in Animal Breeding. University of Guelph, ON.
Heslot, N., and J.-L. Jannink, 2015 An alternative covariance estimator to investigate genetic heterogeneity in populations. Genet. Sel. Evol. 47: 93.
Hickey, J. M., S. Dreisigacker, J. Crossa, S. Hearne, R.
Babu et al., 2014 Evaluation of genomic selection training population designs and genotyping strategies in plant breeding programs using simulation. Crop Sci. 54: 1476โ1488. 734 735 Hill, W. G., 1981 Estimation of effective population size from data on linkage disequilibrium. Genet. Res. 38: 209โ216. 736 737 Hill, W. G., and A. Robertson, 1968 Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38: 226โ231. 738 739 Hill, W. G., and B. S. Weir, 2011 Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genet. Res. Cambridge 93: 47โ64. 740 741 Hyten, D. L., I. Y. Choi, Q. Song, R. C. Shoemaker, R. L. Nelson et al., 2007 Highly variable patterns of linkage disequilibrium in multiple soybean populations. Genetics 175: 1937โ1944. 742 743 Jannink, J.-L., A. J. Lorenz, and H. Iwata, 2010 Genomic selection in plant breeding: from theory to practice. Briefings Funct. genomics proteomics 9: 166โ177. 744 de Koning, D.-J., 2016 Meuwissen et al. on Genomic Selection. Genetics 203: 5โ7. 745 746 Lehermeier, C., N. Krämer, E. Bauer, C. Bauland, C. Camisan et al., 2014 Usefulness of multi-parental populations of maize (Zea mays L.) for genome-based prediction. Genetics 198: 3โ16. 747 748 Lin, Z., B. J. Hayes, and H. D. Daetwyler, 2014 Genomic selection in crops, trees and forages: A review. Crop Pasture Sci. 65: 1177โ1191. 749 750 Lorenzana, R. E., and R. Bernardo, 2009 Accuracy of genotypic value predictions for marker-based selection in biparental plant populations. Theor. Appl. Genet. 120: 151โ161. 751 752 Lorenz, A. J., and K. P. Smith, 2015 Adding genetically distant individuals to training populations reduces genomic prediction accuracy in Barley. Crop Sci. 55: 2657โ2667. 753 754 de los Campos, G., A. I. Vazquez, R. Fernando, Y. C. Klimentidis, and D. Sorensen, 2013 Prediction of Complex Human Traits Using the Genomic Best Linear Unbiased Predictor. PLoS Genet. 9: 7. 755 756 Maccaferri, M., M. C. Sanguineti, E. Noli, and R. 
Tuberosa, 2005 Population structure and long-range linkage disequilibrium in a durum wheat elite collection. Mol. Breed. 15: 271โ289. 757 758 Mackay, I., E. Ober, and J. Hickey, 2015 GplusE: beyond genomic selection. Food Energy Secur. 4: 25โ 35. 30 759 760 Massman, J. M., A. Gordillo, R. E. Lorenzana, and R. Bernardo, 2013 Genomewide predictions from maize single-cross data. Theor. Appl. Genet. 126: 13โ22. 761 762 Mcmullen, M. D., S. Kresovich, H. S. Villeda, P. Bradbury, H. Li et al., 2009 Genetic Properties of the Maize Nested AssociationMapping Population. Science (80-. ). 325: 737โ740. 763 764 Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819โ1829. 765 766 Mikel, M. A., and J. W. Dudley, 2006 Evolution of North American dent corn from public to proprietary germplasm. Crop Sci. 46: 1193โ1205. 767 768 Powell, J. E., P. M. Visscher, and M. E. Goddard, 2010 Reconciling the analysis of IBD and IBS in complex trait studies. Nat. Rev. Genet. 11: 800โ805. 769 R Core Team, 2012 R: A language and environment for statistical computing. ISBN 3-900051-07-0. 770 771 Riedelsheimer, C., J. B. Endelman, M. Stange, M. E. Sorrells, J. L. Jannink et al., 2013 Genomic predictability of interconnected biparental maize populations. Genetics 194: 493โ503. 772 773 774 Rincent, R., D. Laloë, S. Nicolas, T. Altmann, D. Brunel et al., 2012 Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192: 715โ728. 775 776 Romay, M. C., M. J. Millard, J. C. Glaubitz, J. a Peiffer, K. L. Swarts et al., 2013 Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biol. 14: R55. 777 778 de Roos, a P. W., B. J. Hayes, and M. E. Goddard, 2009 Reliability of genomic predictions across multiple populations. Genetics 183: 1545โ1553. 
de Roos, A. P. W., B. J. Hayes, R. J. Spelman, and M. E. Goddard, 2008 Linkage disequilibrium and persistence of phase in Holstein-Friesian, Jersey and Angus cattle. Genetics 179: 1503–1512.
Sargolzaei, M., and F. S. Schenkel, 2009 QMSim: a large-scale genome simulator for livestock. Bioinformatics 25: 680–681.
Schopp, P., C. Riedelsheimer, H. F. Utz, C.-C. Schön, and A. E. Melchinger, 2015 Forecasting the accuracy of genomic prediction with different selection targets in the training and prediction set as well as truncation selection. Theor. Appl. Genet. 128: 2189–2201.
Schulz-Streeck, T., J. O. Ogutu, Z. Karaman, C. Knaak, and H. P. Piepho, 2012 Genomic Selection using Multiple Populations. Crop Sci. 52: 2453–2461.
Solberg, T. R., A. K. Sonesson, J. A. Woolliams, and T. H. E. Meuwissen, 2008 Genomic selection using different marker types and densities. J. Anim. Sci. 86: 2447–2454.
Suneson, C. A., 1956 An Evolutionary Plant Breeding Method. Agron. J. 6: 1–4.
Technow, F., A. Bürger, and A. E. Melchinger, 2013 Genomic prediction of northern corn leaf blight resistance in maize with combined or separated training sets for heterotic groups. G3 3: 197–203.
Technow, F., and L. R. Totir, 2015 Using Bayesian Multilevel Whole Genome Regression Models for Partial Pooling of Training Sets in Genomic Prediction. G3 5: 1603–1612.
Unterseer, S., E. Bauer, G. Haberer, M. Seidel, C. Knaak et al., 2014 A powerful tool for genome analysis in maize: development and evaluation of the high density 600 k SNP genotyping array. BMC Genomics 15: 823.
VanRaden, P. M., 2008 Efficient methods to compute genomic predictions. J. Dairy Sci. 91: 4414–4423.
Vela-Avitúa, S., T. H. Meuwissen, T. Luan, and J. Ødegård, 2015 Accuracy of genomic selection for a sib-evaluated trait using identity-by-state and identity-by-descent relationships. Genet. Sel. Evol. 47: 9.
Wientjes, Y. C. J., R. F. Veerkamp, and M. P. L. Calus, 2013 The Effect of Linkage Disequilibrium and Family Relationships on the Reliability of Genomic Prediction. Genetics 193: 621–631.
Windhausen, V. S., G. N. Atlin, J. M. Hickey, J. Crossa, J.-L. Jannink et al., 2012 Effectiveness of genomic prediction of maize hybrid performance in different breeding populations and environments. G3 2: 1427–1436.
Wright, S., 1922 Coefficients of Inbreeding and Relationship. Am. Nat. 56: 330–338.
Würschum, T., J. C. Reif, T. Kraft, G. Janssen, and Y. Zhao, 2013 Genomic selection in sugar beet breeding populations. BMC Genet. 14: 85.
Zhong, S., J. C. M. Dekkers, R. L. Fernando, and J.-L. Jannink, 2009 Factors Affecting Accuracy From Genomic Selection in Populations Derived From Multiple Inbred Lines: A Barley Case Study. Genetics 182: 355–364.

FIGURES

Figure 1 Linkage disequilibrium (LDA) between pairs of loci plotted against their genetic map distance Δ in centimorgans (cM), for the two ancestral populations SR (short-range LD) and LR (long-range LD). The two vertical lines represent the average distance between QTL and its closest nearby SNP for the two marker densities investigated in our study.

Figure 2 Flowchart of the eight scenarios analyzed in this study. Training set and prediction set were either related (“Re”-scenarios) or unrelated (“Un”-scenarios). The arrows represent the changes made between scenarios, e.g., removal of ancestral LD between QTL and SNPs (LDA → LEA) or replacing the relationship matrix (G → Q). The background texture indicates whether identity-by-state or identity-by-descent information was used. The green circles show for the SNP-based scenarios the sources of information that contributed to prediction accuracy (cf. Habier et al. 2013), where in addition to LDA, RS refers to pedigree relationships at QTL captured by SNPs and CS refers to co-segregation of QTL and SNPs.

Figure 3 (A) Frequency of QTL-SNP pairs falling into 8 disjoint intervals of linkage disequilibrium (LD) in the parents of synthetics, plotted against their genetic map distance Δ, for three different numbers of parents N_P. (B) Average LD between QTL-SNP pairs, plotted against their genetic map distance Δ, for synthetics generated from different N_P. The mean LD in the respective ancestral population (LDA) is shown for comparison (red graphs). The left column in A and B refers to scenarios Re-LEA-SNP and Un-LEA-SNP (independent of the ancestral population), where ancestral LD between QTL and SNPs was eliminated, whereas the other two columns correspond to all other scenarios, for the ancestral populations SR (short-range LD) and LR (long-range LD), respectively.

Figure 4 Linkage phase similarity of QTL-SNP pairs in the training set (TS) and prediction set (PS) for scenarios Re-LDA-SNP and Un-LDA-SNP, plotted against the number of parents N_P used to generate synthetics, for the two ancestral populations SR (short-range LD) and LR (long-range LD), and for different genetic map distances Δ (0.5, 5 and 20 cM ± 0.5 cM) between QTL and SNPs.

Figure 5 Prediction accuracy for seven scenarios (scenario Un-LEA-SNP not shown), plotted against the number of parents N_P used to generate synthetics, for the two ancestral populations SR (short-range LD) and LR (long-range LD). Results refer to a training set size of N_TS = 250 doubled haploid lines and a marker density of 5 SNPs cM⁻¹.

Figure 6 (A) Frequency of the seven possible values of pedigree relationships for different numbers of unrelated inbred parents N_P used to generate synthetics. (B) Conditional distributions of actual relationships conditional on their pedigree relationship between individuals i and j in the training set and prediction set, respectively, for the two ancestral populations SR (short-range LD) and LR (long-range LD).