Computational Genomics

Statistical Genomics
Lecture 23: Cross validation
Zhiwu Zhang
Washington State University
Administration
• Homework 5, due April 13, Wednesday, 3:10PM
• Final exam: May 3, 120 minutes (3:10-5:10PM), 50
Course evaluation and response
Genomic selection methods with packages in R
• GS by GWAS
• rrBLUP
• gBLUP
• cBLUP
• sBLUP
• Bayesian
• LASSO
Outline
• GS by GWAS
• Over fitting
• Cross validation
• K-fold validation
• Jack knife
• Re-sampling
• Two ways of calculating accuracy
• Bias and correction
Setup GAPIT
#source("http://www.bioconductor.org/biocLite.R")
#biocLite("multtest")
#install.packages("gplots")
#install.packages("scatterplot3d") # Download link: http://cran.r-project.org/package=scatterplot3d
library('MASS') # required for ginv
library(multtest)
library(gplots)
library(compiler) #required for cmpfun
library("scatterplot3d")
source("http://www.zzlab.net/GAPIT/emma.txt")
source("http://www.zzlab.net/GAPIT/gapit_functions.txt")
Import data and simulate phenotype
myGD=read.table(file="http://zzlab.net/GAPIT/data/mdp_numeric.txt",head=T)
myGM=read.table(file="http://zzlab.net/GAPIT/data/mdp_SNP_information.txt",head=T)
myCV=read.table(file="http://zzlab.net/GAPIT/data/mdp_env.txt",head=T)
#Simulate 10 QTNs on the first half of the chromosomes (1 to 5)
X=myGD[,-1]
index1to5=myGM[,2]<6
X1to5 = X[,index1to5]
taxa=myGD[,1]
set.seed(99164)
GD.candidate=cbind(taxa,X1to5)
source("~/Dropbox/GAPIT/Functions/GAPIT.Phenotype.Simulation.R")
mySim=GAPIT.Phenotype.Simulation(GD=GD.candidate,GM=myGM[index1to5,],h2=.5,NQTN=10,
effectunit =.95,QTNDist="normal",CV=myCV,cveff=c(.51,.51))
setwd("~/Desktop/temp")
Prediction with PC and ENV
myGAPIT <- GAPIT(
Y=mySim$Y,
GD=myGD,
GM=myGM,
PCA.total=3,
CV=myCV,
group.from=1,
group.to=1,
group.by=10,
QTN.position=mySim$QTN.position,
#SNP.test=FALSE,
memo="GLM")
ry2=cor(myGAPIT$Pred[,8],mySim$Y[,2])^2
ru2=cor(myGAPIT$Pred[,8],mySim$u)^2
par(mfrow=c(2,1), mar = c(3,4,1,1))
plot(myGAPIT$Pred[,8],mySim$Y[,2])
mtext(paste("R square=",ry2,sep=""), side = 3)
plot(myGAPIT$Pred[,8],mySim$u)
mtext(paste("R square=",ru2,sep=""), side = 3)
Figure (two panels, x axis myGAPIT$Pred[, 8]): top, mySim$Y[, 2] with R square=0.0214198362063903; bottom, mySim$u with R square=0.66245823745266.
Choosing the top ten SNPs
ntop=10
index=order(myGAPIT$P)
top=index[1:ntop]
myQTN=cbind(myGAPIT$PCA[,1:4], myCV[,2:3],myGD[,c(top+1)])
Prediction with top ten SNPs
myGAPIT2 <- GAPIT(
Y=mySim$Y,
GD=myGD,
GM=myGM,
#PCA.total=3,
CV=myQTN,
group.from=1,
group.to=1,
group.by=10,
QTN.position=mySim$QTN.position,
SNP.test=FALSE,
memo="GLM+QTN")
ry2=cor(myGAPIT2$Pred[,8],mySim$Y[,2])^2
ru2=cor(myGAPIT2$Pred[,8],mySim$u)^2
par(mfrow=c(2,1), mar = c(3,4,1,1))
plot(myGAPIT2$Pred[,8],mySim$Y[,2])
mtext(paste("R square=",ry2,sep=""), side = 3)
plot(myGAPIT2$Pred[,8],mySim$u)
mtext(paste("R square=",ru2,sep=""), side = 3)
Figure (two panels, x axis predicted value Pred[, 8]): top, mySim$Y[, 2] with R square=0.185090090074047; bottom, mySim$u with R square=0.813735024203838.
Prediction with top 200 SNPs
ntop=200
index=order(myGAPIT$P)
top=index[1:ntop]
myQTN=cbind(myGAPIT$PCA[,1:4], myCV[,2:3],myGD[,c(top+1)])
myGAPIT2 <- GAPIT(
Y=mySim$Y,
GD=myGD,
GM=myGM,
#PCA.total=3,
CV=myQTN,
group.from=1,
group.to=1,
group.by=10,
QTN.position=mySim$QTN.position,
SNP.test=FALSE,
memo="GLM+QTN")
Figure (two panels, x axis myGAPIT2$Pred[, 8]): top, mySim$Y[, 2] with R square=0.171036001292668; bottom, mySim$u with R square=0.94300576514178.
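Fitting more selected SNPs and then evaluating on the same individuals is where over fitting can creep in. The sketch below is a toy illustration in base R, not GAPIT: the phenotype and the "markers" are pure noise, and all names (n, X, in_sample_r2) are hypothetical. It only shows that in-sample R square cannot go down as predictors are added.

```r
# Toy illustration (not GAPIT): in-sample R square rises as more predictors
# are added, even when the predictors carry no signal at all.
set.seed(1)
n <- 100                                 # hypothetical number of individuals
y <- rnorm(n)                            # phenotype with no genetic signal
X <- matrix(rnorm(n * 80), nrow = n)     # 80 random "markers"

in_sample_r2 <- function(k) {
  fit <- lm(y ~ X[, 1:k, drop = FALSE])  # fit on all individuals
  cor(fitted(fit), y)^2                  # evaluate on the same individuals
}

r2 <- sapply(c(1, 10, 40, 80), in_sample_r2)
names(r2) <- c("1", "10", "40", "80")    # number of predictors in each model
print(round(r2, 3))
```

Because the fitted models are nested, each R square is at least as large as the previous one, which is why accuracy should be judged on individuals that were not used for fitting.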
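Validation
All individuals are divided into a training set and a testing set.
• Training: genotype and phenotype are used to estimate the SNP effects
• Testing: genotype and the SNP effects give the prediction
• Accuracy: prediction compared with the testing phenotype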
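A minimal sketch of one training/testing split, assuming simulated genotypes and ordinary least squares (lm) as a stand-in for the SNP effect estimation. It does not use GAPIT, and every name and parameter value here is hypothetical.

```r
# One training/testing split on simulated data (base R only).
set.seed(2)
n <- 200; m <- 20                               # individuals, markers
G <- matrix(rbinom(n * m, 2, 0.5), nrow = n)    # genotypes coded 0/1/2
beta <- c(rnorm(5), rep(0, m - 5))              # 5 markers with real effects
u <- as.vector(G %*% beta)                      # true genetic value
y <- u + rnorm(n, sd = sd(u))                   # heritability of about 0.5
dat <- data.frame(y = y, G)

test <- sample(n, 40)                           # 20% of individuals for testing
train <- setdiff(seq_len(n), test)

fit <- lm(y ~ ., data = dat[train, ])           # SNP effects from training only
pred <- predict(fit, newdata = dat[test, ])     # prediction from testing genotypes
accuracy <- cor(pred, y[test])                  # prediction vs testing phenotype
print(accuracy)
```

The testing phenotypes are used only in the last step, so the accuracy is not inflated by the fitting.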
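Cross validation
All individuals are divided as in validation, but the two sets exchange roles: the former testing set is used for training and the former training set for testing.
• Training: genotype and phenotype are used to estimate the SNP effects
• Testing: genotype and the SNP effects give the prediction
• Accuracy: prediction compared with the testing phenotype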
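Five fold Cross validation
Figure (by Yao Zhou): individuals are divided into five folds; each fold in turn is the inference (testing) and the remaining folds are the reference (training).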
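The five-fold scheme can be sketched as a loop over folds. This is an illustration on simulated data with lm standing in for the prediction model, not the GAPIT workflow; fold, pred and hold_accuracy are hypothetical names. Accuracy is computed once, after every individual has received a prediction.

```r
# Five-fold cross validation on simulated data (base R only).
set.seed(3)
n <- 200; m <- 20; K <- 5
G <- matrix(rbinom(n * m, 2, 0.5), nrow = n)    # genotypes coded 0/1/2
u <- as.vector(G %*% c(rnorm(5), rep(0, m - 5)))
dat <- data.frame(y = u + rnorm(n, sd = sd(u)), G)

fold <- sample(rep(1:K, length.out = n))        # random fold label per individual
pred <- rep(NA_real_, n)

for (k in 1:K) {
  inference <- fold == k                        # this fold is predicted
  fit <- lm(y ~ ., data = dat[!inference, ])    # the other folds are the reference
  pred[inference] <- predict(fit, newdata = dat[inference, ])
}

hold_accuracy <- cor(pred, dat$y)               # after all individuals are predicted
print(hold_accuracy)
```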
Jack Knife
Figure: each individual in turn is the inference, until every individual gets predicted.
Jack Knife: extreme case of K=N
• N: number of individuals
• K: number of folds
• Leave-one-out cross-validation
• Inference (testing) contains only one individual
• Not possible to calculate the correlation between observed and predicted within the inference
• Evaluation of accuracy must be held until every individual receives a prediction
• Resampling is not available
Re-sampling
• Sample part of the population, e.g., 20%, as inference (testing) and leave the rest as reference (training)
• Instantly evaluate the accuracy of the inference
• Repeat multiple times
• Average the accuracy across replicates
• Some individuals may never be in the testing set
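A minimal sketch of re-sampling, assuming simulated data, lm as a stand-in model, and ten replicates (all hypothetical choices). Accuracy is evaluated within each replicate and then averaged, and the last line counts individuals that were never sampled for testing.

```r
# Re-sampling: repeated random 20% inference sets (base R only).
set.seed(5)
n <- 200; m <- 20; n_rep <- 10
G <- matrix(rbinom(n * m, 2, 0.5), nrow = n)    # genotypes coded 0/1/2
u <- as.vector(G %*% c(rnorm(5), rep(0, m - 5)))
dat <- data.frame(y = u + rnorm(n, sd = sd(u)), G)

accuracy <- numeric(n_rep)
tested <- logical(n)                            # has this individual been tested?

for (r in seq_len(n_rep)) {
  inference <- sample(n, size = round(0.2 * n)) # 20% as inference (testing)
  fit <- lm(y ~ ., data = dat[-inference, ])    # the rest as reference (training)
  pred <- predict(fit, newdata = dat[inference, ])
  accuracy[r] <- cor(pred, dat$y[inference])    # instant accuracy of this replicate
  tested[inference] <- TRUE
}

print(mean(accuracy))                           # average across replicates
print(sum(!tested))                             # individuals never in the testing set
```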
Negative prediction accuracy
Massman JM, Gordillo A, Lorenzana RE, Bernardo R. Genomewide predictions from maize single-cross data. Theor Appl Genet. 2013 Jan;126(1):13-22.
Two ways of calculating correlation
Artifactual negative hold accuracy
Hold bias relates to number of folds
Problem of instant accuracy
Small sample causes bias
Correction of instant accuracy
$$\hat{r} = r\left[1 + \frac{1 - r^{2}}{2(n - 4)}\right]$$
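The correction can be written as a one-line R function. The sketch assumes that n is the number of individuals used to compute the correlation r (the slide does not define n), and correct_r is a hypothetical name.

```r
# Small-sample correction of a correlation, following the formula above.
# Assumption: n is the number of individuals behind the correlation r.
correct_r <- function(r, n) {
  stopifnot(all(n > 4))                  # the denominator 2(n - 4) must be positive
  r * (1 + (1 - r^2) / (2 * (n - 4)))
}

correct_r(0.5, 20)    # 0.5117188
correct_r(0.5, 200)   # the adjustment shrinks as n grows
```

The adjustment is largest for small n and vanishes when r is 0 or 1.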
Highlight
• GS by GWAS
• Over fitting
• Cross validation
• K-fold validation
• Jack knife
• Re-sampling
• Two ways of calculating accuracy
• Bias and correction