
Estimating heritability and predictive accuracy of genomic
prediction in plant breeding programs
Hans-Peter Piepho
Biostatistics Unit
Universität Hohenheim
Germany
Oulu, 4 November 2014
Table of contents
1. Introduction
2. Heritability for balanced data
3. Heritability for unbalanced data
4. Predictive accuracy
5. Summary
6. References
1. Introduction
Broad-sense heritability: $H^2 = \sigma_g^2 / \sigma_p^2$
Narrow-sense heritability: $h^2 = \sigma_a^2 / \sigma_p^2$
$\sigma_g^2$ = genotypic variance
$\sigma_a^2$ = additive genetic variance
$\sigma_p^2$ = phenotypic variance
Uses of heritability
• Descriptive measure of precision of trial
• Compute response to selection ($R = h^2 S$, where $S$ = selection differential)
• Compute predictive accuracy in genomic prediction
1. Introduction
What is the ‘phenotype’ here?
• Single plot observation?
• Genotype mean in a trial?
• Genotype mean in multi-environment trial (MET)?
• BLUP of genotypic value?
→ How to estimate the phenotypic variance $\sigma_p^2$?
2. Heritability for balanced data
Basic model for MET data
$y_{ijk} = \mu + g_i + e_j + (ge)_{ij} + r_{jk} + \varepsilon_{ijk}$
$y_{ijk}$ = yield of the ith genotype in the jth location and kth replicate
$\mu$ = overall mean
$g_i$ = main effect of the ith genotype; $g_i \sim N(0, \sigma_g^2)$
$e_j$ = main effect of the jth environment
$(ge)_{ij}$ = ijth genotype × environment interaction effect; $(ge)_{ij} \sim N(0, \sigma_{ge}^2)$
$r_{jk}$ = kth replicate effect in jth environment
$\varepsilon_{ijk}$ = residual comprising both genotype × location × year interaction as well as the error of a mean; $\varepsilon_{ijk} \sim N(0, \sigma^2)$
Model assumes a randomized complete block design (RCBD) per environment
2. Heritability for balanced data
The phenotype
Assume balanced data from RCBD or CRD and a complete G × E classification →
$\bar{y}_{i\cdot\cdot} = \mu + g_i + \bar{e}_{\cdot} + \overline{(ge)}_{i\cdot} + \bar{r}_{\cdot\cdot} + \bar{\varepsilon}_{i\cdot\cdot}$
$\sigma_p^2 = \sigma_g^2 + \sigma_{ge}^2 / m + \sigma^2 / (rm)$
variance of a mean = ½ variance of a difference
m = number of environments
r = number of replicates per environment
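The balanced-data formula can be checked numerically. The following is a minimal sketch; the function name and the variance component values are illustrative assumptions, not taken from the slides.

```python
def heritability_balanced(var_g, var_ge, var_e, m, r):
    """H^2 = var_g / var_p on a genotype-mean basis for balanced MET data.

    m = number of environments, r = replicates per environment.
    """
    # Phenotypic variance of a genotype mean across m environments and r replicates
    var_p = var_g + var_ge / m + var_e / (r * m)
    return var_g / var_p


# Hypothetical variance components, chosen only for illustration
h2 = heritability_balanced(var_g=1.0, var_ge=1.0, var_e=2.0, m=4, r=2)
print(round(h2, 4))  # 0.6667
```

Adding environments or replicates shrinks the two non-genetic terms, so H² moves toward 1 for a fixed genotypic variance.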
3. Heritability for unbalanced data
The phenotype: mostly from unbalanced data / designs
Crop variety trials and plant breeding trials:
• Test performance for target region
• Trials in a large number of environments (ideally a random sample from the target region)
Standard trial designs for a large number of treatments:
• Lattice designs, α-designs, row-column designs (Williams and John, 1995)
• Designs with spatial analysis in mind (Cullis et al., 2006; Williams et al., 2006)
• Unreplicated designs with checks, p-rep designs, augmented p-rep designs (Cullis et al., 2006; Williams et al., 2011, 2013)
3. Heritability for unbalanced data
A multi-location trial in a maize breeding programme (KWS)
6 locations
16 series (laid out as 10 x 10 lattices)
(T. Albrecht, TUM)
3. Heritability for unbalanced data
Holland et al. (2003, p.64)
Balanced case:
$\sigma_p^2 = \sigma_g^2 + \sigma_{ge}^2 / m + \sigma^2 / (rm)$
Divisor of $\sigma_{ge}^2$ = no. of environments m
Divisor of $\sigma^2$ = no. of plots p = rm
Expected genetic gain (EGG):
$EGG = i \, \sigma_g H$
3. Heritability for unbalanced data
Unbalanced (used with all kinds of incomplete block design):
$\sigma_p^2 = \sigma_g^2 + \sigma_{ge}^2 / m_h + \sigma^2 / p_h$
$m_h = n \Big/ \sum_{i=1}^{n} \frac{1}{m_i}$,  $p_h = n \Big/ \sum_{i=1}^{n} \frac{1}{p_i}$  (harmonic means)
$m_i$ = no. of environments for ith genotype
$p_i$ = no. of plots for ith genotype
n = no. of genotypes
(Holland et al. 2003)
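A small sketch of the harmonic-mean version follows. The function names and the three-genotype example are hypothetical; only the formulas for m_h, p_h and the phenotypic variance come from the slide.

```python
def harmonic_mean(counts):
    """n / sum(1 / x_i): the harmonic mean used for m_h and p_h."""
    return len(counts) / sum(1.0 / c for c in counts)


def heritability_harmonic(var_g, var_ge, var_e, envs_per_genotype, plots_per_genotype):
    """Ad hoc H^2 for unbalanced data with harmonic-mean divisors."""
    m_h = harmonic_mean(envs_per_genotype)
    p_h = harmonic_mean(plots_per_genotype)
    var_p = var_g + var_ge / m_h + var_e / p_h
    return var_g / var_p


# Hypothetical unbalanced series with three genotypes
m_i = [2, 4, 4]   # environments per genotype
p_i = [4, 8, 8]   # plots per genotype
print(harmonic_mean(m_i), harmonic_mean(p_i))                       # 3.0 6.0
print(round(heritability_harmonic(1.0, 1.0, 2.0, m_i, p_i), 3))     # 0.6
```

The harmonic mean is pulled toward the poorly tested genotypes, which is why it is lower than the arithmetic mean of the same counts.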
3. Heritability for unbalanced data
Problems:
• $\bar{y}_{i\cdot\cdot}$ is not the best ‘phenotype’ (BLUE of $\mu + g_i$) with unbalanced data
• $\sigma_{ge}^2 / m_h + \sigma^2 / p_h$ is not ½ variance of a difference, no matter if $\bar{y}_{i\cdot\cdot}$ or the BLUE of $\mu + g_i$ is used
→ some or many genotype-environment combinations may be missing!
3. Heritability for unbalanced data
Piepho & Möhring (2007)
$H^2 = \frac{\sigma_g^2}{\sigma_g^2 + 0.5\, v_d}$
Rationale:
• $v_d$ is the average variance of a difference between adjusted means (BLUE of $\mu + g_i$) based on an analysis with fixed genotype effects
→ proportional to effective error mean square
→ For balanced data: $0.5\, v_d = \sigma_{ge}^2 / m + \sigma^2 / (rm)$
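As a sketch of how $v_d$ could be obtained in practice, the code below averages the variance of all pairwise differences from a variance-covariance matrix of adjusted means. The function names and the example matrix are assumptions for illustration; in a real analysis the matrix would come from the mixed-model software used to fit the fixed-genotype model.

```python
import numpy as np


def mean_variance_of_difference(vcov):
    """Average of var(a_i - a_j) over all pairs i < j, given the
    variance-covariance matrix of the estimates a_1, ..., a_n."""
    vcov = np.asarray(vcov, dtype=float)
    n = vcov.shape[0]
    # sum over i < j of (c_ii + c_jj - 2 c_ij) equals n * trace(C) - sum(C)
    return 2.0 * (n * np.trace(vcov) - vcov.sum()) / (n * (n - 1))


def heritability_from_blue(var_g, vcov_blue):
    """H^2 = var_g / (var_g + 0.5 * v_d)."""
    v_d = mean_variance_of_difference(vcov_blue)
    return var_g / (var_g + 0.5 * v_d)


# Hypothetical balanced-like case: each adjusted mean has variance 0.5 plus a
# common covariance 0.3 (e.g. shared environment effects), which cancels in differences
vcov = 0.5 * np.eye(4) + 0.3 * np.ones((4, 4))
print(mean_variance_of_difference(vcov))
print(heritability_from_blue(1.0, vcov))
```

In this example half the mean variance of a difference is 0.5, so the result agrees with the balanced formula when $\sigma_{ge}^2 / m + \sigma^2 / (rm) = 0.5$.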
3. Heritability for unbalanced data
When ‘phenotype’ is a BLUP
• So far, phenotype was a “mean” (ideally the BLUE)
• When genotypes are random, BLUP of $g_i$ is usually the better estimator
$H^2 = 1 - \frac{\upsilon_{BLUP}}{2 \sigma_g^2}$
$\upsilon_{BLUP}$ = mean variance of a difference of the BLUP of $g_i$
Rationale:
• Mean of $[\mathrm{corr}(g_i, \hat{g}_i)]^2$
• Concept of effective error variance
• For balanced data: $EGG = i \, \sigma_g H$
(Cullis et al., 2006)
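A minimal sketch of the BLUP-based formula, assuming that $\upsilon_{BLUP}$ is computed from the prediction error variance-covariance matrix of the BLUPs (the matrix of var(ĝ − g)), as reported by most mixed-model packages. The function name and the numbers are hypothetical.

```python
import numpy as np


def heritability_from_blup(var_g, pev):
    """H^2 = 1 - v_BLUP / (2 var_g), with v_BLUP the mean variance of a
    difference of two BLUPs taken from the prediction error matrix `pev`."""
    pev = np.asarray(pev, dtype=float)
    n = pev.shape[0]
    # Mean over all pairs i < j of pev_ii + pev_jj - 2 pev_ij
    v_blup = 2.0 * (n * np.trace(pev) - pev.sum()) / (n * (n - 1))
    return 1.0 - v_blup / (2.0 * var_g)


# Hypothetical i.i.d. case: var_g = 1 and error variance of a mean s = 0.5 give
# a prediction error variance of var_g * s / (var_g + s) = 1/3 for every genotype
pev = np.eye(5) / 3.0
print(heritability_from_blup(1.0, pev))
```

For this i.i.d. example the value coincides with $\sigma_g^2 / (\sigma_g^2 + 0.5)$, the result of the mean-based formula.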
3. Heritability for unbalanced data
When genotypic effects are correlated
• So far, genotypic effects were assumed to be i.i.d.
• Often, use pedigree or markers to model genetic covariance
→ Generalized heritability (Oakey et al., 2006)
→ Simulation (Piepho & Möhring, 2007)
3. Heritability for unbalanced data
Generalized heritability
(Oakey et al., 2006)
$y = X\beta + Z_g g + \dots$
$y \sim MVN(X\beta, V)$ (observed data vector)
$g = (g_1, g_2, \dots, g_n)^T \sim MVN(0, G)$
Consider contrast $c^T g$, where c is a contrast vector
$H_c^2 = \frac{[\mathrm{cov}(c^T g, c^T \hat{g})]^2}{\mathrm{var}(c^T g)\, \mathrm{var}(c^T \hat{g})}$, where $\hat{g} = BLUP(g)$
→ Find c such that $H_c^2$ is maximized
3. Heritability for unbalanced data
$H_c^2 = \frac{[\mathrm{cov}(c^T g, c^T \hat{g})]^2}{\mathrm{var}(c^T g)\, \mathrm{var}(c^T \hat{g})} = \frac{c^T G Z_g^T P_v Z_g G c}{c^T G c}$
$P_v = V^{-1} - V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1}$
Constraint: $c^T G c = 1$
Method of Lagrange multipliers:
Maximize $c^T G Z_g^T P_v Z_g G c - \lambda (c^T G c - 1)$
$\lambda$ = first eigenvalue of $Z_g^T P_v Z_g G$
c = corresponding eigenvector
3. Heritability for unbalanced data
$\max_c H_c^2 = \lambda$ is a component of the full heritability
Full set of non-zero eigenvalues: $\lambda_1, \lambda_2, \dots, \lambda_s$
→ full heritability
→ eigenvectors $c_1, c_2, \dots, c_s$ = full set of orthogonal genotype contrasts
Generalized heritability
$H^2 = \frac{1}{s} \sum_{h=1}^{s} \lambda_h$
(Oakey et al., 2006)
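The eigenvalue definition can be sketched numerically as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name, the tolerance for "non-zero", the residual matrix R (so that V = Z G Zᵀ + R), and the small balanced example are all hypothetical, and G is assumed positive definite so that a Cholesky factor exists.

```python
import numpy as np


def generalized_heritability(X, Z, G, R, tol=1e-8):
    """Mean of the non-zero eigenvalues of Z' P_v Z G."""
    V = Z @ G @ Z.T + R
    Vinv = np.linalg.inv(V)
    XtVinv = X.T @ Vinv
    # P_v = V^-1 - V^-1 X (X' V^-1 X)^-1 X' V^-1
    P = Vinv - XtVinv.T @ np.linalg.solve(XtVinv @ X, XtVinv)
    # Z' P Z G has the same eigenvalues as the symmetric matrix L' Z' P Z L, G = L L'
    L = np.linalg.cholesky(G)
    eigvals = np.linalg.eigvalsh(L.T @ Z.T @ P @ Z @ L)
    nonzero = eigvals[eigvals > tol]
    return nonzero.mean(), nonzero


# Hypothetical balanced check: 5 genotypes, 3 replicates, var_g = 2, error variance = 3
n, r = 5, 3
Z = np.kron(np.eye(n), np.ones((r, 1)))   # plot-by-genotype incidence matrix
X = np.ones((n * r, 1))                   # intercept only
h2, lam = generalized_heritability(X, Z, G=2.0 * np.eye(n), R=3.0 * np.eye(n * r))
print(h2, lam)
```

In this balanced i.i.d. example all n − 1 non-zero eigenvalues equal $\sigma_g^2 / (\sigma_g^2 + \sigma^2 / r) = 2/3$, so the generalized heritability reduces to the usual entry-mean heritability; one eigenvalue is zero because the intercept absorbs the genotype mean.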
3. Heritability for unbalanced data
Questions
• Would a breeder select for $c^T g$?
• Why would we allow contrast vector c to be determined by the data?
• What information is contained in the s orthogonal contrasts corresponding to $\lambda_1, \lambda_2, \dots, \lambda_s$?
→ average of $\frac{[\mathrm{cov}(c^T g, c^T \hat{g})]^2}{\mathrm{var}(c^T g)\, \mathrm{var}(c^T \hat{g})}$ across all orthogonal contrasts?
3. Heritability for unbalanced data
Monte-Carlo simulation
• Can estimate variance and covariance of g and $\hat{g} = BLUP(g)$
→ From this can simulate many realizations of $(g, \hat{g})$
→ Simulate $H^2$ = squared correlation of g and $\hat{g}$
→ Simulate response to selection!
→ Simulate anything else you would want to use $H^2$ for in the balanced case!
(Piepho & Möhring, 2007)
Advantages
• Completely flexible
• Can handle any covariance structure
• Can directly simulate any statistic of interest
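The simulation idea can be sketched as below. This version generates data from the fitted model and applies the BLUP operator, which gives realizations of (g, ĝ) with the intended joint distribution; it is a simplified illustration, not the procedure of Piepho & Möhring (2007) as published. All function names, the design, and the variance values are assumptions, and variance components are treated as known.

```python
import numpy as np


def simulate_g_and_blup(X, Z, G, R, n_sim=1000, seed=1):
    """Draw n_sim realizations of (g, g_hat) with g_hat = G Z' P_v y."""
    rng = np.random.default_rng(seed)
    V = Z @ G @ Z.T + R
    Vinv = np.linalg.inv(V)
    XtVinv = X.T @ Vinv
    P = Vinv - XtVinv.T @ np.linalg.solve(XtVinv @ X, XtVinv)
    g = rng.standard_normal((n_sim, G.shape[0])) @ np.linalg.cholesky(G).T
    e = rng.standard_normal((n_sim, R.shape[0])) @ np.linalg.cholesky(R).T
    y = g @ Z.T + e                      # fixed effects omitted: P_v X = 0, so they cancel
    g_hat = y @ (G @ Z.T @ P).T
    return g, g_hat


def squared_correlation(g, g_hat):
    """Squared correlation of g and g_hat within each simulated dataset."""
    gc = g - g.mean(axis=1, keepdims=True)
    hc = g_hat - g_hat.mean(axis=1, keepdims=True)
    r = (gc * hc).sum(axis=1) / np.sqrt((gc**2).sum(axis=1) * (hc**2).sum(axis=1))
    return r**2


def selection_response(g, g_hat, n_select):
    """Mean true value of the n_select genotypes with the largest BLUP,
    expressed as a deviation from the mean of all genotypes."""
    top = np.argsort(-g_hat, axis=1)[:, :n_select]
    return np.take_along_axis(g, top, axis=1).mean(axis=1) - g.mean(axis=1)


# Hypothetical unbalanced design: 30 genotypes with 1, 2 or 3 plots each
reps = np.tile([1, 2, 3], 10)
Z = np.repeat(np.eye(30), reps, axis=0)
X = np.ones((Z.shape[0], 1))
g, g_hat = simulate_g_and_blup(X, Z, 2.0 * np.eye(30), 3.0 * np.eye(Z.shape[0]))
print(squared_correlation(g, g_hat).mean())
print(selection_response(g, g_hat, n_select=5).mean())
```

Because G and R enter as full matrices, the same code accepts pedigree- or marker-based covariance structures, which is where the ad hoc formulas stop being applicable.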
3. Heritability for unbalanced data
Example
• Rapeseed variety trials in Germany
• 120 cultivars (G) tested in 4 years (Y) and at 4 locations (L)
• At some locations, several trials (T) were performed
• The series was rather unbalanced
• Trials were laid out in randomized complete blocks
• Trial means were analyzed by the variance components model
L.Y.T : G + G.L + G.Y + G.L.Y + G.L.Y.T
3. Heritability for unbalanced data
Based on simulation
3. Heritability for unbalanced data
Example
• Sugar beet
• 26 breeding trials (6 x 6 simple lattices)
• Connected by checks
• 825 entries
• 33 crosses
• Pedigree data available (ad hoc measures do not apply)
T.R + C + Ts : T.R.B + X.G
T = trial, R = replicate, B = block
C = factor separating individual checks from entries
Ts = tester
G = genotype, X = dummy variable (1 for entries, 0 for checks)
3. Heritability for unbalanced data
Sillanpää (2011)
$H^2 = \frac{\sigma_p^2 - \sigma_e^2}{\sigma_p^2}$
$\sigma_p^2$ = phenotypic variance
$\sigma_e^2$ = residual variance from whole genome random marker regression model
$\hat{\sigma}_p^2 = p_c^T Q p_c / (n - 1)$
$p_c$ = vector of mean-centered observed phenotypes (genotype means) = $P_u p$
n = number of genotypes
Q = identity matrix I or realized relationship estimated from markers
3. Heritability for unbalanced data
Count data
$\eta = X\beta + Zu$ (GLMM)
$E(y \mid \eta) = g^{-1}(\eta)$, where $g(\cdot)$ denotes a link function
Example: Poisson distribution
$\lambda = E(y \mid \eta) = \exp(\eta)$
Challenge:
• Genetic variance on link-scale
• Error variance on observed scale (at least partly)
3. Heritability for unbalanced data
Foulley et al. (1987)
$H^2 = \frac{\sigma_g^2}{\sigma_g^2 + \bar{\lambda}^{-1}}$
$\bar{\lambda}$ = average of the Poisson parameter $\lambda$ across observations
(assumed no random effects here for simplicity only)
This is the answer to the question:
What is the variance of an imaginary error term on the link scale that leads to the same variance on the back-transformed scale as that of a Poisson random variable with expectation parameter $\lambda$?
3. Heritability for unbalanced data
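Sketch of derivation (for over-dispersed Poisson: variance = $\phi\lambda$)
$\eta^* = \eta + e$
What is a suitable variance of e ($\sigma_e^2$) on the link scale, such that the conditional variance of $g^{-1}(\eta^*)$ given $\eta$, $\mathrm{var}[g^{-1}(\eta^*) \mid \eta]$, approximately equals that of the observed data y?
$\mathrm{var}[g^{-1}(\eta^*) \mid \eta] = \mathrm{var}[\exp(\eta^*) \mid \eta] \approx \left( \frac{\partial \lambda^*}{\partial \eta^*} \right)^2 \sigma_e^2 = [\exp(\eta)]^2 \sigma_e^2 = \lambda^2 \sigma_e^2$, where $\lambda^* = \exp(\eta^*)$ and the derivative is evaluated at $\eta^* = \eta$
$\lambda^2 \sigma_e^2 = \phi\lambda \Rightarrow \sigma_e^2 = \phi \lambda^{-1}$ (equal to $\lambda^{-1}$ for $\phi = 1$)
(Delta method)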
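The quality of the delta-method approximation can be checked against the exact lognormal variance of exp(η + e). The sketch below is illustrative only: the function names, the over-dispersion argument `phi` in the heritability function, and the numerical values are assumptions.

```python
import math


def link_scale_residual_variance(lam, phi=1.0):
    """Delta-method residual variance on the log-link scale: phi / lambda."""
    return phi / lam


def poisson_heritability(var_g, lam_bar, phi=1.0):
    """H^2 = var_g / (var_g + phi / lam_bar); phi = 1 is the plain Poisson case."""
    return var_g / (var_g + link_scale_residual_variance(lam_bar, phi))


def lognormal_variance(eta, var_e):
    """Exact variance of exp(eta + e) for e ~ N(0, var_e)."""
    return math.exp(2.0 * eta + var_e) * (math.exp(var_e) - 1.0)


lam = 50.0                                   # hypothetical Poisson mean
var_e = link_scale_residual_variance(lam)    # 1 / lambda
exact = lognormal_variance(math.log(lam), var_e)
print(exact, lam)                            # exact back-transformed variance vs Poisson variance
print(poisson_heritability(var_g=0.1, lam_bar=lam))
```

For a mean of 50 the back-transformed variance is within a few percent of the Poisson variance; the approximation is first order in $\sigma_e^2 = 1/\lambda$, so it deteriorates for small counts.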
3. Heritability for unbalanced data
Over-dispersed binomial distribution
$\mathrm{var}[g^{-1}(\eta^*) \mid \eta] = \mathrm{var}(\pi^* \mid \eta) \approx \left( \frac{\partial \pi^*}{\partial \eta^*} \right)^2 \sigma_e^2 = [\pi (1 - \pi)]^2 \sigma_e^2$
$[\pi (1 - \pi)]^2 \sigma_e^2 = \phi \frac{\pi (1 - \pi)}{n} \Rightarrow \sigma_e^2 = \frac{\phi}{\pi (1 - \pi)\, n}$
(Bennewitz et al., 2014)
4. Predictive accuracy
Estimation of genetic values
• Classical plant breeding based on phenotypic data alone (field trials)
• Hunting for single genes:
  → Use of marker data for mapping of quantitative trait loci (QTL) in simple segregating populations, linkage mapping
  → Association mapping in larger populations with diverse structure (multiple crosses, diverse breeding material, gene bank data)
  → Marker-assisted selection (MAS) based on detected QTL
• Giving up the hunt:
  → Just try to improve estimate of genotypic value (breeding value) using all (or most of the) markers (Meuwissen et al., 2001)
4. Predictive accuracy
Key idea of genomic prediction (genomic selection)
Predict genotypic value $g_i$ of the i-th genotype by regression on marker types:
$g_i = \sum_{k=1}^{M} u_k z_{ik}$  $(i = 1, 2, \dots, n)$
$z_{ik}$ = regressor variable for the i-th genotype and k-th marker (k = 1, …, M)
$u_k$ = regression coefficients
Example: Biallelic marker (SNP) with alleles A1 and A2, DH lines:
$z_{ik} = 1$ for A1A1
$z_{ik} = -1$ for A2A2
$z_{ik} = 0$ for A1A2 or when the marker genotype is missing
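The coding rule above can be written as a short lookup. In this sketch the genotype labels, the function name, and the marker effects are hypothetical; only the coding 1 / −1 / 0 comes from the slide.

```python
import numpy as np

# Coding from the slide: A1A1 -> 1, A2A2 -> -1, A1A2 or missing -> 0
CODES = {"A1A1": 1, "A2A2": -1, "A1A2": 0}


def code_markers(genotype_calls):
    """Map marker genotype calls to z_ik; missing or unknown calls become 0."""
    return np.array([[CODES.get(call, 0) for call in row] for row in genotype_calls])


# Two genotypes, three markers; None marks a missing call
calls = [["A1A1", "A2A2", None],
         ["A1A2", "A1A1", "A2A2"]]
Z = code_markers(calls)
u = np.array([0.5, -0.2, 0.1])   # hypothetical marker effects
g = Z @ u                        # g_i = sum_k u_k z_ik
print(Z)
print(g)
```

Coding a missing call as 0 is equivalent to imputing the midpoint between the two homozygotes, so that marker contributes nothing to the predicted value of that genotype.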
4. Predictive accuracy
Genomic prediction
$g = Zu$
$Z = \{z_{ik}\}$ = marker (SNP) covariate design matrix
u = vector of random SNP effects $u_k$
• Estimate u from training dataset with phenotyped genotypes
• Predict g for unphenotyped genotypes
• There are many alternative models / methods to predict g
  - RR-BLUP / G-BLUP, Bayesian methods (ABC…)
  - Machine learning methods, Artificial neural networks (ANNs)
  - Reproducing Kernel Hilbert Spaces (RKHS) etc.
• Very successful in animal breeding
• Increasingly popular in plant breeding
4. Predictive accuracy
The basic ridge-regression (RR) BLUP model:
$p = 1_n \mu + Zu + e$
where
p = adjusted genotype means (2-stage approach) = observed phenotype
$1_n$ = n-vector of ones
$\mu$ = common intercept
Z = n × p matrix containing the SNP marker information (in this dimension, p is the number of markers)
u = random SNP marker effects with $u \sim N(0, I_p \sigma_u^2)$
e = residual error associated with p, with $e \sim N(0, I_n \sigma_e^2)$
→ $\mathrm{var}(g = Zu) = G = ZZ^T \sigma_u^2$
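A minimal sketch of RR-BLUP for this model, assuming the two variance components are known (in practice they would be estimated, e.g. by REML). The function name and the simulated data are hypothetical.

```python
import numpy as np


def rr_blup(p, Z, var_u, var_e):
    """Return (mu_hat, u_hat) for p = 1 mu + Z u + e with known variances."""
    n = len(p)
    V = var_u * (Z @ Z.T) + var_e * np.eye(n)       # var(p) = G + I sigma_e^2
    Vinv_one = np.linalg.solve(V, np.ones(n))
    mu_hat = (Vinv_one @ p) / Vinv_one.sum()        # GLS estimate of the intercept
    u_hat = var_u * Z.T @ np.linalg.solve(V, p - mu_hat)   # BLUP of marker effects
    return mu_hat, u_hat


# Simulated illustration: 40 DH lines, 200 markers coded -1 / 1
rng = np.random.default_rng(0)
n, n_markers = 40, 200
Z = rng.choice([-1.0, 1.0], size=(n, n_markers))
u_true = rng.normal(0.0, np.sqrt(0.01), n_markers)
p = 10.0 + Z @ u_true + rng.normal(0.0, 1.0, n)

mu_hat, u_hat = rr_blup(p, Z, var_u=0.01, var_e=1.0)
g_hat = Z @ u_hat                                   # predicted genotypic values
print(mu_hat, np.corrcoef(g_hat, Z @ u_true)[0, 1])
```

The marker-effect solution is algebraically identical to ridge regression with penalty $\sigma_e^2 / \sigma_u^2$ applied to $p - 1_n \hat{\mu}$; the form used here only requires solving an n × n system, which is convenient when markers far outnumber genotypes.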
4. Predictive accuracy
k-fold cross-validation
• Split data into k parts (folds)
• Use k−1 parts for estimation of model
• Use k-th part for validation
Predictive ability
• Correlation between $\hat{g}$ and p: $r_{p,\hat{g}}$
Predictive accuracy
• Correlation between $\hat{g}$ and g: $r_{g,\hat{g}}$
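The cross-validation steps can be sketched as follows. This is a simplified illustration: it uses plain ridge regression with a fixed penalty `lam` and the training-set mean as intercept, and it averages the fold-wise correlations (pooling predictions across folds is another common choice). The function name and the simulated data are assumptions.

```python
import numpy as np


def cv_predictive_ability(p, Z, lam, k=5, seed=0):
    """k-fold CV estimate of the correlation between g_hat and observed p."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(p)), k)
    correlations = []
    for test_idx in folds:
        train = np.ones(len(p), dtype=bool)
        train[test_idx] = False                      # k-1 folds for estimation
        mu = p[train].mean()
        Zt = Z[train]
        u_hat = np.linalg.solve(Zt.T @ Zt + lam * np.eye(Z.shape[1]),
                                Zt.T @ (p[train] - mu))
        g_hat = Z[test_idx] @ u_hat                  # predict the held-out fold
        correlations.append(np.corrcoef(g_hat, p[test_idx])[0, 1])
    return float(np.mean(correlations))


# Simulated illustration: 100 genotypes, 20 markers
rng = np.random.default_rng(1)
n, n_markers = 100, 20
Z = rng.choice([-1.0, 1.0], size=(n, n_markers))
p = Z @ rng.normal(0.0, 1.0, n_markers) + rng.normal(0.0, 1.0, n)
print(cv_predictive_ability(p, Z, lam=1.0))
```

With real data only the predictive ability is observable in this way, because the true genotypic values g are unknown; the predictive accuracy has to be inferred.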
4. Predictive accuracy
Estimation of predictive accuracy
$r_{g,\hat{g}} = \frac{r_{p,\hat{g}}}{\hat{H}}$
(e.g. Dekkers, 2007)
4. Predictive accuracy
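Rationale:
Assume $s_{g,\hat{g}} = s_{p,\hat{g}}$ and $H^2 = s_g^2 / s_p^2$ ($\Leftrightarrow s_g^2 = H^2 s_p^2$)
$r_{g,\hat{g}} = \frac{s_{g,\hat{g}}}{\sqrt{s_g^2 s_{\hat{g}}^2}} = \frac{s_{p,\hat{g}}}{\sqrt{s_g^2 s_{\hat{g}}^2}} = \frac{s_{p,\hat{g}}}{\sqrt{H^2 s_p^2 s_{\hat{g}}^2}} = \frac{1}{H} \frac{s_{p,\hat{g}}}{\sqrt{s_p^2 s_{\hat{g}}^2}} = \frac{r_{\hat{g},p}}{H}$
where $r_{\hat{g},p} = \frac{s_{\hat{g},p}}{\sqrt{s_{\hat{g}}^2 s_p^2}}$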
4. Predictive accuracy
Problems
• Many ways to estimate H!
• Models for estimating H not always commensurate with RR-BLUP
• Equations for r assume i.i.d. sampling (independent errors)
4. Predictive accuracy
Method 1: $H_{m1}^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma^2 / r}$
Method 2: $H_{m2}^2 = \frac{\sigma_g^2}{\sigma_g^2 + v / 2}$
Method 3: $H_{m3}^2 = 1 - \frac{v_{BLUP}}{2 \sigma_g^2}$
4. Predictive accuracy
Method 4
$H_{m4}^2 = \frac{E(s_g^2)}{E(s_p^2)}$
$s_g^2 = \frac{1}{n-1} \sum_{i=1}^{n} (g_i - \bar{g}_{\cdot})^2 = g^T P_u g$ with $P_u = \frac{1}{n-1} \left( I_n - \frac{1}{n} J_n \right)$
$s_p^2 = \frac{1}{n-1} \sum_{i=1}^{n} (p_i - \bar{p}_{\cdot})^2 = p^T P_u p$
For example: $E(s_g^2) = \mathrm{trace}(P_u G)$, where $G = ZZ^T \sigma_u^2$
→ Models for genomic selection and H are commensurate
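The expected sample variances of Method 4 can be sketched numerically. The code assumes, in line with the RR-BLUP model, that var(p) = G + I σ_e², and treats the variance components as known; the function names, the simulated marker matrix, and the variance values are hypothetical.

```python
import numpy as np


def centering_matrix(n):
    """P_u = (I_n - J_n / n) / (n - 1), so that s^2 = x' P_u x."""
    return (np.eye(n) - np.ones((n, n)) / n) / (n - 1)


def heritability_method4(Z, var_u, var_e):
    """H^2_m4 = E(s_g^2) / E(s_p^2) with G = Z Z' var_u."""
    n = Z.shape[0]
    P_u = centering_matrix(n)
    G = var_u * (Z @ Z.T)
    e_s2_g = np.trace(P_u @ G)                          # E(s_g^2)
    e_s2_p = np.trace(P_u @ (G + var_e * np.eye(n)))    # E(s_p^2), assuming var(p) = G + I var_e
    return e_s2_g / e_s2_p


# Simulated illustration: 30 genotypes, 100 markers coded -1 / 1
rng = np.random.default_rng(2)
Z = rng.choice([-1.0, 1.0], size=(30, 100))
print(heritability_method4(Z, var_u=0.01, var_e=1.0))
```

Since trace(P_u I_n) = 1, the denominator is simply E(s_g²) + σ_e², so the estimator uses the same covariance structure as the genomic prediction model itself.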
4. Predictive accuracy
Method 5
$E(r_{g,\hat{g}}) \approx \frac{E(s_{g,\hat{g}})}{\sqrt{E(s_g^2)\, E(s_{\hat{g}}^2)}}$
Plug in estimates of the three expected values
4. Predictive accuracy
Method 7
$\rho_i^2 = \frac{[\mathrm{cov}(g_i, \hat{g}_i)]^2}{\mathrm{var}(g_i)\, \mathrm{var}(\hat{g}_i)}$
→ estimate from mixed model equations (MME)
$\hat{\rho}_{m7}^2 = \frac{1}{n} \sum_{i=1}^{n} \hat{\rho}_i^2$
(Mrode and Thompson, 2005; Piepho and Möhring, 2007)
4. Predictive accuracy
Simulation of datasets
• Consider a single trial laid out as an α-design
• Simulate block and plot effects using the marker and error variances obtained from real datasets (AgReliant, KWS)
• The true breeding values were simulated as $g = Zu$ using Z and $\sigma_u^2$ from real data
• The phenotypic data were calculated as: genotypic value + rep + rep.block + plot error
• The correlation between the true and the predicted breeding value ($r_{g,\hat{g}}$) was used as a benchmark
• Use Methods 1-7 to estimate $r_{g,\hat{g}}$
4. Predictive accuracy
Figure: Predictive accuracy (estimates less than 0 were set to 0 whereas estimates
greater than 1 were set to 1) for all the seven methods in each of the four scenarios.
4. Predictive accuracy
Table: The means of the estimated heritability for all simulated datasets. M0 is the square of the true correlation between the predicted and the true simulated breeding values.

Scenario   M0     M1     M2     M3     M4     M5
1          0.71   0.32   0.48   0.48   0.34   0.67
2          0.42   0.09   0.15   0.18   0.08   0.33
3          0.73   0.51   0.49   0.50   0.40   0.72
4          0.52   0.14   0.13   0.13   0.07   0.39

(columns M0 to M5 = methods)
5. Summary
Heritability
• Heritability can be estimated by several ad hoc methods
• Response to selection can be approximated by plugging in ad hoc estimates
• But in more complex cases the approximation is poor or not available
• Any statistic of interest can be obtained by simulation (parametric bootstrap)
Predictive accuracy (PA)
• Indirect methods use an estimator of H in the denominator
• None of the available estimators of H works very well for estimating PA
• Direct methods for estimating PA work best (new Method 5 and Method 7 from animal breeding preferred)
6. References
Bennewitz, J., Böglein, S., Stratz, P., Rodehutscord, M., Piepho, H.P., Kjaer,
W., Bessei, W. (2014): Genetic parameters for feather pecking and
aggressive behaviour in a large F2-cross of laying hens using generalized
linear mixed models. Poultry Science 93, 810-817.
Estaghvirou, B., Ogutu, J.O., Schulz-Streeck, T., Knaak, C., Ouzenova, M.,
Gordillo, A., Piepho, H.P. (2013): Evaluation of approaches for estimating
prediction accuracy in genomic selection in plant and animal breeding. BMC
Genomics 14, 860.
Estaghvirou, B., Ogutu, J.O., Piepho, H.P. (2014): Influence of outliers on
accuracy and robustness of genomic prediction. G3 (online)
Piepho, H.P., Möhring, J. (2007): Computing heritability and selection
response from unbalanced plant breeding trials. Genetics 177, 1881-1888.
Thanks!