Genomic Selection Theory

11 July 2012
Jean-Luc Jannink
Acknowledgments
• Martha T. Hamblin, Nicolas Heslot, Jeff
Endelman, Jessica Rutkoski, Delphine Ly, Julie
Dawson, Mark Sorrells
• Vanessa Weber, Gary Atlin, Albrecht Melchinger
• Kevin P. Smith, Vikas Vikram, Ahmad H. Sallam,
Aaron J. Lorenz
• Moshood A. Bakare, Melaku Gedil, Ismail Rabbi,
Peter Kulakow
Outline
• What is genomic selection?
– Minimal model
– Prediction accuracy
• Number of markers
• Size of the training population
– Should you replicate lines in the TP?
– Population structure
• Relationship between the training population
and the selection candidates
Genomic selection:
Prediction using many markers
Breeding material → Genotyping → Calculate GEBV → Make selections
Meuwissen et al. 2001 Genetics 157:1819-1829
Genomic selection principles
• Meuwissen et al. 2001 Genetics 157:1819-1829
• No distinction between “significant” and “nonsignificant”: all markers contribute to prediction
• (–) More markers than there are phenotypes
• (+) Estimated effects are unbiased
• (+) Capture small effects
Statistical modeling: The two cultures
• Nature maps observed inputs X to observed responses Y
• Can we understand Y?
– Regression
– Identify causal inputs
• Can we predict Y?
– Regression
– Decision trees
– Whatever works
Breiman 2001 Stat. Sci. 16:199-231
Baseline model
This is an allele dosage!
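The model equation on this slide was an image and did not survive extraction. A common minimal form, given here as an assumption rather than as the slide's exact notation, writes each phenotype as a mean plus marker effects weighted by allele dosage:

```latex
y_i = \mu + \sum_{k=1}^{m} z_{ik}\, u_k + e_i,
\qquad u_k \sim N(0, \sigma_u^2), \quad e_i \sim N(0, \sigma_e^2)
```

In this sketch z_ik is the allele dosage of line i at marker k (0, 1, or 2 copies in a diploid), u_k is the marker effect, and e_i is the residual. All symbols are illustrative.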
Marker effects –> additive relationship
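The link between marker effects and an additive relationship matrix can be checked numerically. The sketch below assumes ridge regression on centered dosages, with simulated data and hypothetical sizes. It computes GEBV once from estimated marker effects and once from the marker-based relationship among lines. The shrinkage parameter `lam` is treated as known.

```python
import numpy as np


def gebv_from_markers(Zc, yc, lam):
    """Estimate ridge-regression marker effects, then GEBV = Zc @ u_hat."""
    m = Zc.shape[1]
    u_hat = np.linalg.solve(Zc.T @ Zc + lam * np.eye(m), Zc.T @ yc)
    return Zc @ u_hat


def gebv_from_relationship(Zc, yc, lam):
    """Compute the same GEBV from the n x n marker-based relationship kernel."""
    n = Zc.shape[0]
    K = Zc @ Zc.T  # unscaled additive relationship among lines
    return K @ np.linalg.solve(K + lam * np.eye(n), yc)


rng = np.random.default_rng(1)
n_lines, n_markers = 30, 200  # hypothetical sizes
Z = rng.binomial(2, 0.3, size=(n_lines, n_markers)).astype(float)  # dosages
Zc = Z - Z.mean(axis=0)       # center each marker column
y = Zc @ rng.normal(0, 0.1, n_markers) + rng.normal(0, 1.0, n_lines)
yc = y - y.mean()
lam = 50.0                    # assumed ratio sigma_e^2 / sigma_u^2

a = gebv_from_markers(Zc, yc, lam)
b = gebv_from_relationship(Zc, yc, lam)
print(np.allclose(a, b))
```

The two routes agree because Z(Z'Z + λI)⁻¹Z'y = ZZ'(ZZ' + λI)⁻¹y. The relationship form solves an n × n system, which is convenient when markers outnumber lines.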
Prediction accuracy =
Correlation(predicted, true)
R = i r_A σ_A
r_A = corr(selection criterion, breeding value)
• On simulated data corr(Â, A) is easy
• On real data:
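The expression for real data did not survive extraction. One common approach, offered as an assumption about what may have been intended, correlates predictions with phenotypes of lines held out of training and divides by the square root of heritability. The simulation below uses made-up variances and shows why that rescaling approximates corr(Â, A) when prediction errors are independent of the phenotypic residuals.

```python
import numpy as np


def accuracy_from_phenotypes(pred, pheno, h2):
    """Estimate corr(prediction, breeding value) as corr(pred, y) / sqrt(h2)."""
    r_py = np.corrcoef(pred, pheno)[0, 1]
    return r_py / np.sqrt(h2)


rng = np.random.default_rng(0)
n = 20000                          # hypothetical number of validation lines
A = rng.normal(0, 1, n)            # true breeding values (known only in simulation)
y = A + rng.normal(0, 1, n)        # phenotypes, so h2 = 0.5
A_hat = A + rng.normal(0, 1.5, n)  # an imperfect predictor of A

true_acc = np.corrcoef(A_hat, A)[0, 1]
est_acc = accuracy_from_phenotypes(A_hat, y, h2=0.5)
print(round(true_acc, 2), round(est_acc, 2))
```

The rescaling is only as good as the heritability estimate, and it breaks down if validation phenotypes were used to build the predictions.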
Effective population size
• Idealized populations randomly mate (real
populations don’t)
• Increasing N weakens drift and strengthens
selection force to shift allele frequencies
• Those forces have a certain strength in a real
population
• Ne summarizes that with reference to an
idealized population
Why Ne matters
• As Ne increases “historical recombination”
increases
– Segments stay in the population longer
before being eliminated or fixed
– Recombination generates segments
that segregate independently
– Expectation: 2Ne effective segments per Morgan
E(r^2) = 1 / (1 + 4 N_e c)
E(r^2) = (5 + 2 N_e c) / (11 + 26 N_e c + 8 [N_e c]^2)
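To get a feel for the two expectations above, the following sketch evaluates them for a few effective population sizes and map distances. The function names and the example values of Ne and c are illustrative and do not come from the slides.

```python
def expected_r2_first(ne, c):
    """First expression: E(r^2) = 1 / (1 + 4 Ne c), with c in Morgans."""
    return 1.0 / (1.0 + 4.0 * ne * c)


def expected_r2_second(ne, c):
    """Second expression: (5 + 2 Ne c) / (11 + 26 Ne c + 8 (Ne c)^2)."""
    x = ne * c
    return (5.0 + 2.0 * x) / (11.0 + 26.0 * x + 8.0 * x ** 2)


for ne in (50, 500):              # hypothetical effective population sizes
    for c in (0.001, 0.01, 0.1):  # distance between loci in Morgans
        print(ne, c,
              round(expected_r2_first(ne, c), 3),
              round(expected_r2_second(ne, c), 3))
```

Both expressions fall as the product Ne·c grows, so a larger Ne needs tighter marker spacing for the same expected LD.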
Marker density requirements
Solberg et al. 2008
[Figure: accuracy (0.5 to 0.9) versus markers per Morgan; SNP densities of 1, 2, 4, and 8 Ne; SSR densities of ¼, ½, 1, and 2 Ne]
Calus & Veerkamp 2007
• Average adjacent marker LD
should be r2 = 0.20
• Implies a density of 4Ne
markers per Morgan
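The density rule turns into a marker count once Ne and the genome length are fixed. This is a small arithmetic sketch; the Ne of 50 and the 20-Morgan genome are hypothetical values chosen only for illustration.

```python
import math


def markers_needed(ne, genome_morgans, per_ne=4):
    """Markers required for a density of per_ne * Ne markers per Morgan."""
    return math.ceil(per_ne * ne * genome_morgans)


# Hypothetical example: Ne = 50 and a 20-Morgan genome
print(markers_needed(ne=50, genome_morgans=20.0))
```

The `per_ne` argument defaults to the 4Ne rule but can be changed to explore the other densities shown for SNP and SSR markers.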
Training population size requirements
• Daetwyler, H.D. et al. 2008. Accuracy of Predicting
the Genetic Risk of Disease Using a Genome-Wide
Approach. PLoS ONE 3:e3395
• Assume all loci affecting the trait are
known and are independent
• Assume marker effects are fixed
[Figure: curves for λ = 20, 10, 5, 2, 1, 0.5, 0.1, 0.02]
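For a quantitative trait, the expected accuracy under these assumptions is usually written r = sqrt(λh² / (λh² + 1)), where λ is the number of phenotyped individuals per independent locus. The sketch below evaluates that expression for the λ values listed above. The exact form should be checked against the paper, and the heritability of 0.5 and the 1000 loci are arbitrary illustrative choices.

```python
import math


def expected_accuracy(n_lines, h2, n_loci):
    """r = sqrt(lam * h2 / (lam * h2 + 1)) with lam = n_lines / n_loci."""
    lam = n_lines / n_loci
    return math.sqrt(lam * h2 / (lam * h2 + 1.0))


n_loci = 1000  # hypothetical number of independent loci
for lam in (20, 10, 5, 2, 1, 0.5, 0.1, 0.02):
    acc = expected_accuracy(lam * n_loci, h2=0.5, n_loci=n_loci)
    print(lam, round(acc, 3))
```

Accuracy depends on the product λh², so doubling the training population has the same effect as doubling heritability.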
Replicating hurts: 2000 with 1 plot is better than 1000 with 2 plots
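The replication claim can be illustrated with the same kind of accuracy expression, r = sqrt(λh² / (λh² + 1)). With a fixed number of plots, replicating raises the heritability of a line mean but halves the number of lines. The sketch assumes a line-mean heritability of h² / (h² + (1 − h²) / reps) and a hypothetical 1000 independent loci.

```python
import math


def accuracy_for_design(total_plots, reps, h2_plot, n_loci):
    """Expected accuracy when total_plots are split into lines x reps."""
    n_lines = total_plots / reps
    # Heritability of a line mean over `reps` plots
    h2_mean = h2_plot / (h2_plot + (1.0 - h2_plot) / reps)
    lam = n_lines / n_loci
    return math.sqrt(lam * h2_mean / (lam * h2_mean + 1.0))


for h2_plot in (0.2, 0.7):
    one_rep = accuracy_for_design(2000, 1, h2_plot, n_loci=1000)
    two_reps = accuracy_for_design(2000, 2, h2_plot, n_loci=1000)
    print(h2_plot, round(one_rep, 3), round(two_reps, 3))
```

Under these assumptions the unreplicated design comes out ahead at both heritabilities, because the product of line count and line-mean heritability shrinks as replication increases.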
To replicate or not to replicate
500 Lines replicated once
Ridge Regression
168 Lines replicated three times
BayesB
More lines –> higher pressure
[Figure: mean of selected population (0.80 to 1.40) versus fraction replicated (0 to 1), for h2 = 0.2 and h2 = 0.7]
The importance of structure
• Correlation between prediction and phenotype for Grain Yield (GY), Anthesis Date (AD), and Anthesis–Silking Interval (ASI) in maize
• CIMMYT diversity panel
• 50K SNPs; TP size: 200; VP size: 50
– GY: 0.44; AD: 0.45; ASI: 0.36
• Structure: 8 subpopulations
– Use the 200 TP lines to estimate subpopulation means; the prediction is then the mean of the subpopulation of origin
– GY: 0.50; AD: 0.44; ASI: 0.46
• Fraction of the variation due to subpopulation
– GY: 0.26; AD: 0.16; ASI: 0.27
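The subpopulation-mean predictor is easy to mimic in simulation. If a fraction f of the phenotypic variance lies between subpopulations, a predictor that knows only the subpopulation should correlate with phenotype at roughly sqrt(f). For GY that is sqrt(0.26) ≈ 0.51, close to the 0.50 reported. The sketch below uses simulated phenotypes with no marker data and random subpopulation labels; it is not a reanalysis of the panel.

```python
import numpy as np


def subpop_mean_accuracy(frac_between, n_sub=8, n_tp=200, n_vp=50,
                         n_rep=500, seed=0):
    """Average corr(subpopulation-mean prediction, phenotype) in the VP."""
    rng = np.random.default_rng(seed)
    n = n_tp + n_vp
    tp, vp = np.arange(n_tp), np.arange(n_tp, n)
    cors = []
    for _ in range(n_rep):
        sub = rng.integers(0, n_sub, size=n)  # subpopulation labels
        sub_effect = rng.normal(0, np.sqrt(frac_between), n_sub)
        y = sub_effect[sub] + rng.normal(0, np.sqrt(1 - frac_between), n)
        # Predict each VP line by the TP mean of its subpopulation;
        # fall back to the overall TP mean if a subpopulation is absent.
        means = np.full(n_sub, y[tp].mean())
        for s in np.unique(sub[tp]):
            means[s] = y[tp][sub[tp] == s].mean()
        cors.append(np.corrcoef(means[sub[vp]], y[vp])[0, 1])
    return float(np.mean(cors))


acc_gy = subpop_mean_accuracy(0.26)  # fraction reported for GY
print(round(acc_gy, 2))
```

A genomic prediction accuracy measured on a structured panel therefore needs to be compared against this kind of baseline before crediting the markers with within-subpopulation predictive power.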
The importance of being in relationship
[Figure: accuracy versus mean maximum relationship; labels: VP, TP]
Clark et al. GSE 2012 44:4
As it might play out in alfalfa
[Diagram: a possible alfalfa scheme. Labels: clonally propagate & intercross; 50 ½-sib families; 800 plants; 50 plants; phenotype; 800 families; genotype & select; update model]
• 1 cycle / year (?)
• 3 years to phenotype (?)
Take home messages
• This approach is totally feasible now
– Statistical and marker technologies are easily
powerful enough
– Logistics and informatics are a challenge
• Planning the training population is most
critical
• Expect some outcomes to be non-intuitive
• Don’t trust a theoretician: go out and do it
Questions?
Empirical vignettes
• Genomic selection can work for a difficult
crop: Cassava
• We have a field-validated success in barley
Is cassava a little like a forage?
• Cassava is a “young” crop (still a bit wild)
• Cassava is outcrossed and clonally propagated
Phenotypic data
• DNA from 623 clones important at IITA
• ~50,000 plots of historic data
• Very high differences in replication
• 17 traits (disease, morphology, yield)
GBS markers
• 4984 SNPs
• Average of 25.5% missing data per marker
• Calling heterozygotes is a challenge
Validation method
• Cross validation: predict subsets of the data that did not contribute to building the genomic prediction model
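A minimal k-fold version of this validation scheme might look like the sketch below. It assumes a ridge-regression prediction model on marker dosages with a fixed shrinkage parameter, run on simulated data. The model, the number of folds, and all sizes are illustrative choices, not the settings used for the cassava analysis.

```python
import numpy as np


def ridge_fit_predict(Z_train, y_train, Z_test, lam):
    """Fit ridge marker effects on the training fold and predict the test fold."""
    mu = y_train.mean()
    col_means = Z_train.mean(axis=0)
    Zc = Z_train - col_means
    K = Zc @ Zc.T
    alpha = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train - mu)
    u_hat = Zc.T @ alpha  # marker effects via the n x n dual form
    return mu + (Z_test - col_means) @ u_hat


def cross_validated_accuracy(Z, y, k=5, lam=64.0, seed=0):
    """Correlation between phenotypes and predictions made without them."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    pred = np.empty(len(y))
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        pred[test_idx] = ridge_fit_predict(Z[train_idx], y[train_idx],
                                           Z[test_idx], lam)
    return float(np.corrcoef(pred, y)[0, 1])


rng = np.random.default_rng(42)
Z = rng.binomial(2, 0.5, size=(400, 200)).astype(float)  # simulated dosages
g = (Z - 1.0) @ rng.normal(0, 0.1, 200)                  # simulated genetic values
y = g + rng.normal(0, 0.8, 400)                          # simulated phenotypes
cv_acc = cross_validated_accuracy(Z, y)
print(round(cv_acc, 2))
```

Every line is predicted exactly once, by a model that never saw its phenotype, which is what keeps the resulting correlation from being inflated by overfitting.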
Comparison to phenotypic accuracy
[Figure: accuracy of genomic prediction versus accuracy of a single phenotypic plot, both axes from 0 to 0.8; CMD is labeled]
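Higher accuracy would be good, but
• GS will reduce the breeding cycle time from 5 to 2 years [2.5 × faster]
• Any genomic accuracy above 0.4 cannot be beaten by phenotypic selection: with cycles 2.5 × faster, 0.4 × 2.5 = 1.0, the maximum possible phenotypic accuracy
• Biases make phenotypic selection look better and genomic selection look worse than they really are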
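The 0.4 threshold follows from dividing the response R = i r_A σ_A by the cycle length. The sketch below assumes selection intensity and additive genetic standard deviation are the same in both schemes, a simplification made for illustration.

```python
def gain_per_year(accuracy, cycle_years, intensity=1.0, sigma_a=1.0):
    """Response per year: i * r_A * sigma_A / cycle length."""
    return intensity * accuracy * sigma_a / cycle_years


# Best case for phenotypic selection: perfect accuracy, 5-year cycle
ps_best = gain_per_year(accuracy=1.0, cycle_years=5)
# Genomic selection at the break-even accuracy, 2-year cycle
gs_break_even = gain_per_year(accuracy=0.4, cycle_years=2)
print(ps_best, gs_break_even)
```

Any genomic accuracy above the break-even value gives more gain per year than even a perfectly accurate phenotypic scheme on the longer cycle.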
Barley Fusarium head blight
FHB resistance is polygenic
[Figure: FHB resistance across barley chromosomes 1(7H), 2(2H), 3(3H), 4(4H), 5(1H), 6(6H), and 7(5H); line labels: Frederickson, Chevron, Harbin, CI4196, Zhedar 2, Gobernadora, Russian 6]
Genomic selection set up
• Three breeding programs
• 685 barley (6-row) lines
• Three years, 2 locations
• 1500 SNPs
• Choose 384 optimized for PIC & distribution
• 1440 progeny from 60 crosses
• “Project” parental SNP onto progeny
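Polymorphism information content (PIC) for a biallelic SNP with allele frequency p is commonly computed as 1 − p² − q² − 2p²q², with q = 1 − p. The sketch below ranks markers by that quantity. It ignores the genome distribution criterion mentioned above, and the selection procedure actually used is not described in the slides, so this is an illustration of the PIC half only.

```python
def pic_biallelic(p):
    """PIC of a biallelic marker with allele frequency p."""
    q = 1.0 - p
    return 1.0 - p ** 2 - q ** 2 - 2.0 * p ** 2 * q ** 2


def top_markers_by_pic(allele_freqs, n_keep):
    """Indices of the n_keep markers with the highest PIC."""
    ranked = sorted(range(len(allele_freqs)),
                    key=lambda k: pic_biallelic(allele_freqs[k]),
                    reverse=True)
    return ranked[:n_keep]


freqs = [0.05, 0.5, 0.3]  # hypothetical allele frequencies
print(top_markers_by_pic(freqs, n_keep=2))
```

PIC peaks at p = 0.5, so ranking on it alone favors intermediate-frequency markers wherever they fall in the genome.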
Head-to-head comparison
[Diagram: two breeding timelines.
Phenotypic selection: crossing (AxB), F1, inbreeding (F2, F3), F4 winter nursery, F4:5 plant rows, evaluation; Yr 1 to Yr 3.
Genomic selection: crossing (AxB), F1, inbreeding (F2, F3), F4 winter nursery, evaluation; Yr 1 to Yr 2.]
Phenotypic vs. Genomic Selection
[Figure: FHB and Yield panels for phenotypic and genomic selection; values printed on the panels: 99.7, 86.2, 90.6, 100.7]