Machine learning for genomic selection: theory and practice

Stasia Grinberg
Manchester Institute of Biotechnology and School of Computer Science, University of Manchester
EGF2014, Aberystwyth, 9th September

What is genomic selection (GS) and why is it useful?

- A process of selecting for a desired trait in a breeding population with the help of genomic information, in particular a set of markers densely covering the organism's genome.
- The idea is based on linkage disequilibrium (LD), the non-random association of alleles at two or more loci located close to each other on the genetic map.
- Markers in LD with QTLs can lead to the discovery of QTLs and genes important for a particular trait, and ultimately to a better understanding of the organism's biology.
- GS has the potential to reduce the cost and time of a breeding programme.

Machine Learning (ML) techniques for GS: an overview

- ML lies at the intersection of computer science and statistics.
- It comprises a powerful and efficient set of algorithms capable of dealing with p >> n problems (many more markers than individuals).
- Linear and non-linear models can capture complex relationships between variables (e.g. SNP markers).
- A wide range of implementations and software is readily available for PC and Mac: R, Weka, RapidMiner, etc.

Machine Learning (ML) techniques for GS: basic principle

Train a model on individuals that have been both genotyped and phenotyped, then use it to predict the phenotypes of candidates that have only been genotyped.

ML in theory: linear models
Ridge and lasso regressions

Ridge and lasso regressions bias against complicated models by imposing a penalty on large coefficients via a complexity parameter λ ∈ [0, ∞). λ is chosen via cross-validation, with λ = 0 corresponding to ordinary least squares regression.

- Ridge regression:
  - Shrinks coefficients towards each other and towards 0, averaging the coefficients of highly correlated variables.
  - Taking λ = σ²/σ_g² (residual over genetic variance) corresponds to BLUP (best linear unbiased prediction).
- Lasso regression:
  - Induces sparsity in the solution by shrinking most coefficients all the way to 0.
  - Performs attribute selection.
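In symbols (a restatement of the slide's definitions; the variance labels in the BLUP identity are my reading of the garbled superscripts, following the usual ridge-regression BLUP convention):

    % Ridge: squared (L2) penalty on the coefficients; closed-form solution
    \hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2
                               = (X^{\top}X + \lambda I)^{-1} X^{\top} y
    % Lasso: absolute-value (L1) penalty, which sets many coefficients exactly to 0
    \hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1
    % With marker effects \beta ~ N(0, \sigma_g^2 I) and residuals e ~ N(0, \sigma_e^2 I),
    % the ridge solution is the BLUP of \beta when \lambda = \sigma_e^2 / \sigma_g^2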
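A minimal R sketch of both fits. The deck lists R among the available software but does not name packages, so the use of glmnet here, and the simulated geno/pheno objects, are assumptions for illustration:

    library(glmnet)

    # Placeholder data: 100 plants x 500 SNP scores, with two causal markers
    set.seed(1)
    geno  <- matrix(sample(-1:1, 100 * 500, replace = TRUE), nrow = 100)
    pheno <- geno[, 1] - geno[, 2] + rnorm(100, sd = 0.5)

    # alpha = 0 gives ridge, alpha = 1 gives lasso; lambda chosen by cross-validation
    cv_ridge <- cv.glmnet(geno, pheno, alpha = 0)
    cv_lasso <- cv.glmnet(geno, pheno, alpha = 1)
    plot(cv_lasso)  # CV mean squared error vs log(lambda), as in the appendix figure

    # Predict breeding values for new, genotyped-only candidates
    new_geno <- matrix(sample(-1:1, 10 * 500, replace = TRUE), nrow = 10)
    predict(cv_lasso, newx = new_geno, s = "lambda.min")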
ML in theory: beyond linearity
Regression trees and random forest

- The idea behind regression trees: divide the attribute space into 'rectangular' subsets and approximate the output within each via a linear model. This results in a decision tree with binary splits.

[Figure: an example regression tree with binary splits on SNP markers (Contig33910_142 >= 0.5, Contig41676_250 >= 0.5, Contig17280_335 < -0.5, Contig7309_1548 >= -0.5) and predicted trait values from 31.21 to 35.42 at the leaves.]

- Implicit attribute selection and user-friendly output, but low predictive power.
- Random forest combines many trees to obtain more accurate predictions, but at a loss of interpretability.

[Figure: test-set mean squared error as a function of the number of trees (0 to 1,000).]

- Tuning parameters include the number of iterations and the depth of the trees. One can even use tree stumps, degenerate trees with only one split. (Sketches of a single tree, a random forest and boosted stumps follow below.)
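A hedged R sketch of a single regression tree and a random forest; the rpart and randomForest packages are assumptions (the figure styles are consistent with them, but the deck does not say), and dat reuses the placeholder data from the glmnet sketch:

    library(rpart)
    library(randomForest)

    dat <- data.frame(pheno = pheno, geno)

    # One regression tree with binary splits, as in the tree figure
    tree <- rpart(pheno ~ ., data = dat)
    plot(tree); text(tree)

    # A random forest: many trees grown on bootstrap samples, predictions averaged
    rf <- randomForest(geno, pheno, ntree = 1000)
    plot(rf)  # out-of-bag MSE vs number of trees, analogous to the figure above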
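"Number of iterations" and "tree stumps" are the tuning knobs of boosted trees (GBM, which the yeast comparison later in the deck includes). A minimal sketch, assuming the gbm package:

    library(gbm)

    # Boosted regression stumps: interaction.depth = 1 means one split per tree;
    # n.trees is the number of boosting iterations
    boost <- gbm(pheno ~ ., data = dat, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 1, shrinkage = 0.01,
                 cv.folds = 5)
    best_iter <- gbm.perf(boost, method = "cv")  # iteration count chosen by CV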
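The accuracies on the next slides are plausibly correlations between predicted and observed phenotypes on held-out plants; the deck does not state the metric, so the following evaluation sketch is an assumption (continuing from the objects above):

    # Hold out 20 plants, train on the rest, report the predictive correlation
    test <- sample(nrow(geno), 20)
    rf   <- randomForest(geno[-test, ], pheno[-test])
    cor(predict(rf, geno[test, ]), pheno[test])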
ML in action: Lolium perenne

- Phenotypic data: two diploid breeding populations, an intermediate (F13) and a late (F5) flowering-time sub-population with 54 and 105 genotyped and phenotyped plants, respectively, plus a historic F11 population with 122 genotyped and phenotyped plants.
- Genotypic data: ~3,000 polymorphic SNP markers.

F13:
- DMD: BLUP 0.23, random forest 0.24, lasso 0.30.
- NDF digestibility: BLUP 0.21, random forest 0.23.
- WSC: BLUP 0.14, random forest 0.16.
- Tot. yield (yr 1), F11+F13, cont. adj.: RF, BLUP 0.13-0.12; RF, BLUP 0.09.

F5:
- Nitrogen: BLUP 0.15, random forest 0.10.
- Seed yield: BLUP 0.18, random forest 0.13.
- WSC: BLUP 0.17, random forest 0.06; (F11+F13+F5) BLUP 0.22.

ML in action: yeast

- Phenotypic data: 1,008 haploid segregants phenotyped for 46 traits.
- Genotypic data: 11,623 SNP markers covering all 16 chromosomes.

[Figure: pairwise scatter plots of prediction accuracies (roughly 0.2 to 0.8) across the 46 traits for lasso, random forest, GBM and the values reported by Bloom et al.]

ML for GS: some considerations

Accuracy of predictions can be affected by:
- Population size and homogeneity.
- Number and distribution of markers.
- The size of the organism's genome.
- Trait heritability.
- External (non-genetic) factors.

Acknowledgements

- University of Manchester: Ross D. King
- Aberystwyth University: Leif Skøt, Alan Lovatt, Andi Macfarlane, Matt Hegarty, Tina Blackmore, Kirsten Skøt, Ian Armstead, Rhys Kelly, Wayne Powell

Principal component analysis for F11, F13, F14 and F5

[Figure: PC1 vs PC2 of the genotype data, points marked by population: LATE_F5, INT_F11, INT_F13 and INT_F14.]

ML in theory: linear models
Lasso regression

[Figure: cross-validated mean squared error against log(lambda) for the lasso; the top axis shows the number of non-zero coefficients, from 578 down to 3.]
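For the PCA appendix slide above, a sketch of how such a plot could be produced in base R; the population labels here are attached at random purely for illustration:

    # PCA of the marker matrix (placeholder geno from the earlier sketches)
    pca <- prcomp(geno)
    pop <- factor(sample(c("LATE_F5", "INT_F11", "INT_F13", "INT_F14"),
                         nrow(geno), replace = TRUE))
    plot(pca$x[, 1], pca$x[, 2], col = pop, xlab = "PC1", ylab = "PC2")
    legend("topright", legend = levels(pop), col = seq_along(levels(pop)), pch = 1)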