Machine learning for genomic selection: theory and practice
Stasia Grinberg
Manchester Institute of Biotechnology and
School of Computer Science
University of Manchester
EGF2014, Aberystwyth, 9th September
What is genomic selection (GS) and why is it useful?
• A process of selecting for a desired trait in a breeding population with the help of genomic information, in particular a set of markers densely covering the organism's genome.
• The idea is based on linkage disequilibrium (LD), an association of two or more loci located close to each other on the genetic map.
• Markers in LD with QTLs can lead to the discovery of QTLs and genes important for a particular trait, and ultimately to a better understanding of the organism's biology.
• GS has the potential to reduce the cost and time of a breeding program.
Machine Learning (ML) techniques for GS: an overview
• Lies at the intersection of computer science and statistics.
• Comprises a powerful and efficient set of algorithms capable of dealing with p ≫ n problems (many more markers than genotyped individuals).
• Linear and non-linear models capable of dealing with complex relationships between variables (e.g. SNP markers).
• A wide array of implementations and software readily available for PC and Mac: R, Weka, RapidMiner, etc.
Machine Learning (ML) techniques for GS: basic principle
ML in theory: linear models
Ridge and lasso regressions

Ridge and lasso regression bias against complicated models by imposing a penalty on large coefficients via a complexity parameter λ ∈ [0, ∞). λ is chosen via cross-validation, with λ = 0 corresponding to ordinary least squares regression.

• Ridge regression.
  • Shrinks coefficients towards each other and towards 0, averaging the coefficients of highly correlated variables.
  • Taking λ = σ_e²/σ_g² (residual error variance over genetic variance) corresponds to BLUP (best linear unbiased prediction).
• Lasso regression.
  • Induces sparsity in the solution by shrinking most coefficients all the way to 0.
  • Performs attribute selection (see the sketch after this list).
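Both penalties are available in R via the glmnet package. A minimal sketch follows, assuming glmnet is installed; the marker matrix X and trait y are simulated stand-ins, not data from the talk.

    library(glmnet)

    set.seed(1)
    X <- matrix(sample(c(-1, 0, 1), 200 * 500, replace = TRUE), nrow = 200)  # 200 plants x 500 SNPs
    y <- 0.8 * X[, 1] + 0.5 * X[, 2] + rnorm(200)                            # simulated trait

    cv_ridge <- cv.glmnet(X, y, alpha = 0)   # alpha = 0: ridge penalty, lambda chosen by CV
    cv_lasso <- cv.glmnet(X, y, alpha = 1)   # alpha = 1: lasso penalty

    # Lasso performs attribute selection: count markers kept at the CV-optimal lambda
    sum(coef(cv_lasso, s = "lambda.min")[-1] != 0)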
ML in theory: beyond linearity
Regression trees and random forest

• The idea behind regression trees: divide the attribute space into 'rectangular' subsets and approximate the output within each via a linear model. This results in a decision tree with binary splits (see the sketch after this slide).

[Figure: regression tree with binary splits on SNP markers (Contig33910_142, Contig41676_250, Contig17280_335, Contig7309_1548); leaf predictions range from 31.21 to 35.42.]

• Implicit attribute selection and user-friendly output, but low predictive power.
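A minimal sketch of fitting such a tree, assuming the rpart package; the simulated X and y from the glmnet sketch above stand in for real marker data.

    library(rpart)

    set.seed(1)
    X <- matrix(sample(c(-1, 0, 1), 200 * 500, replace = TRUE), nrow = 200)
    y <- 0.8 * X[, 1] + 0.5 * X[, 2] + rnorm(200)

    d <- data.frame(X, trait = y)                         # columns X1..X500 play the role of SNPs
    tree <- rpart(trait ~ ., data = d, method = "anova")  # regression tree with binary splits
    plot(tree); text(tree)                                # tree diagram like the one above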
ML in theory: beyond linearity
Regression trees and random forest

• Random forest combines many trees to obtain more accurate predictions, but at a loss of interpretability (see the sketch after this slide).

[Figure: mean squared error on the test set (roughly 25 to 55) against the number of trees (0 to 1,000).]

• Tuning parameters include the number of iterations and the depth of trees. One can even use tree stumps, degenerate trees with only one split.
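A minimal sketch, assuming the randomForest package; X and y are the same simulated stand-ins as before.

    library(randomForest)

    set.seed(1)
    X <- matrix(sample(c(-1, 0, 1), 200 * 500, replace = TRUE), nrow = 200)
    y <- 0.8 * X[, 1] + 0.5 * X[, 2] + rnorm(200)

    rf <- randomForest(x = X, y = y, ntree = 1000)  # average the predictions of 1,000 trees
    plot(rf)          # error against the number of trees, as in the figure above
    varImpPlot(rf)    # marker importance: the forest's implicit attribute ranking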
ML in action: Lolium perenne

• Phenotypic data: two diploid breeding populations, intermediate (F13) and late (F5) flowering-time sub-populations with 54 and 105 genotyped and phenotyped plants, respectively, plus a historic F11 population with 122 geno- and phenotyped plants.
• Genotypic data: ~3,000 polymorphic SNP markers.

F13
• DMD. BLUP: 0.23, Random forest: 0.24, Lasso: 0.30.
• NDF digestibility. BLUP: 0.21, Random forest: 0.23.
• WSC. BLUP: 0.14, Random forest: 0.16.
• Tot. yield (yr 1). (F11+F13, cont. adj.) RF, BLUP: 0.13-0.12; RF, BLUP: 0.09.

F5
• Nitrogen. BLUP: 0.15, Random forest: 0.10.
• Seed yield. BLUP: 0.18, Random forest: 0.13.
• WSC. BLUP: 0.17, Random forest: 0.06; (F11+F13+F5) BLUP: 0.22.
ML in action: yeast

• Phenotypic data: 1,008 haploid segregants phenotyped for 46 traits.
• Genotypic data: 11,623 SNP markers covering all 16 chromosomes.
[Figure: pairwise scatter plots of prediction accuracies (roughly 0.2 to 0.8) over the 46 yeast traits for Lasso, Random forest, GBM, and the Bloom et al. results.]
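GBM (gradient boosted trees) enters the comparison above; a minimal sketch, assuming the gbm package, with the same simulated stand-in data. Setting interaction.depth = 1 gives the tree stumps mentioned earlier.

    library(gbm)

    set.seed(1)
    X <- matrix(sample(c(-1, 0, 1), 200 * 500, replace = TRUE), nrow = 200)
    y <- 0.8 * X[, 1] + 0.5 * X[, 2] + rnorm(200)
    d <- data.frame(X, trait = y)

    boost <- gbm(trait ~ ., data = d, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 1,  # stumps: one split per tree
                 shrinkage = 0.01, cv.folds = 5)
    best_iter <- gbm.perf(boost, method = "cv")          # CV-chosen number of iterations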
ML for GS: some considerations

Accuracy of predictions can be affected by:
• Population size and homogeneity.
• Number and distribution of markers.
• The size of the organism's genome.
• Trait heritability.
• External (non-genetic) factors.
Acknowledgements

• University of Manchester
  • Ross D. King
• Aberystwyth University
  • Leif Skøt
  • Alan Lovatt
  • Andi Macfarlane
  • Matt Hegarty
  • Tina Blackmore
  • Kirsten Skøt
  • Ian Armstead
  • Rhys Kelly
  • Wayne Powell
Principal component analysis for F11, F13, F14 and F5.

[Figure: PC1 vs PC2 (both roughly -20 to 20) scatter of genotypes, coloured by population: LATE_F5, INT_F11, INT_F13, INT_F14.]
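A minimal sketch of such a population-structure check with base R's prcomp; the marker matrix `markers` and population factor `pop` are illustrative names, not data from the talk.

    set.seed(1)
    markers <- matrix(sample(c(-1, 0, 1), 200 * 500, replace = TRUE), nrow = 200)
    pop <- factor(rep(c("LATE_F5", "INT_F11", "INT_F13", "INT_F14"), each = 50))

    pca <- prcomp(markers, center = TRUE)        # PCA of the marker matrix, plants in rows
    plot(pca$x[, 1], pca$x[, 2], col = pop,
         xlab = "PC1", ylab = "PC2")             # plants coloured by population
    legend("topright", legend = levels(pop), col = 1:4, pch = 1)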
ML in theory: linear models
Lasso regression

[Figure: cross-validated mean-squared error (roughly 20 to 40) against log(λ) (-4 to 0) for the lasso; the counts along the top axis (578 down to 3) are the numbers of non-zero coefficients at each λ.]
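Assuming the cv_lasso fit from the earlier glmnet sketch, this figure comes out of the box:

    plot(cv_lasso)   # MSE vs log(lambda); top axis shows non-zero coefficient counts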