application of machine learning to genotype-phenotype relationships
workshop on molecular evolution, north america 2011
Wednesday, August 3, 2011

expected learning outcomes
• general factors affecting phenotype
• understanding why analysis of genotype-phenotype relationships is not a simple problem
• basic understanding of tree-based statistical models
• basic understanding of random forests
• how random forests can be applied to the study of genotype-phenotype relationships

presentation outline
• very general background on genotype-phenotype relationships
• introduction to some machine learning and statistical methods
• first example: drug resistant tuberculosis
• more material on machine learning methods
• second example: cold-adaptation in tubulin proteins

a few definitions
• genotype - the class to which an organism belongs based upon its genetic material
  - partial genotypes are most often considered
    DNA sequence
    single nucleotide polymorphisms (SNPs)
    microsatellites
    protein sequences
• phenotype - the class to which an organism belongs based upon its observable physical characteristics
  - partial phenotypes are most often considered

factors affecting phenotype
• genetics: hereditary basis
  - epistasis: non-additive interaction between two or more loci
• epigenesis: interaction between cells and substances during development
• environment: external abiotic and biotic influences
• experimental error: inaccuracies and irregularities in measurement

general genotype-phenotype questions
• What differences in phenotype can be explained by genotype?
• What particular genotypic features are important in determining phenotype?
• How accurately can phenotype be predicted based on genotype data?
analytical challenges of genotype-phenotype data
• unordered categorical variables: nucleotides, amino acids, SNPs
• numerous levels of variables: four for nucleotides, twenty for amino acids
• mixture of variable types: categorical, continuous (numerical)
• potential for non-additive interactions between variables: epistasis

desired analysis attributes
• explanatory: provides a distinctive description
• predictive: can be applied to observations where phenotype is unknown
• quantitative: provides numeric measures of relationships and error
• flexible: accommodates different variable types, mixtures of variable types, and interaction among variables
• interpretable: provides means to understanding

candidate analysis methods
• sequence motif identification: specific patterns of nucleotides or amino acids
• phylogenetics: evolutionary or genealogical associations
• artificial neural networks: a supervised machine learning technique
• standard statistical methods: methodology subsumed under generalized linear models and extensions
  - analysis of variance (ANOVA)
  - logistic regression
  - discriminant analysis
  - kernel density estimation
  - nearest neighbor analysis

tree-based statistical models
• recursively partition a data set in two (binary split) based on the value of a single predictor variable
  - to best achieve homogeneous subsets of a categorical response variable (classification)
  - to best separate low and high values of a continuous response variable (regression)
• developed for social science research at the University of Michigan in the 1960s
• known as decision trees in machine learning
• improved by Breiman, Friedman, Olshen and Stone (1984)
  - established a firm theoretical basis
  - fixed methodological problems
  - developed improved algorithms

simple key to terrestrial plants (Embryophyta)
1  seed producing (Spermatophyta) → 2
   2  seeds in carpel (Magnoliophyta)
   2´ seeds not in carpel (“Gymnosperms”)
1´ not seed producing → 3
   3  fern-like (Filicophyta)
   3´ not fern-like → 4
      4  stems jointed (Equisetophyta)
      4´ stems not jointed → 5

key as a classification tree
terrestrial plants
  seed producing: seed plants
    seeds in carpel: angiosperms
    seeds not in carpel: gymnosperms
  not seed producing: seedless plants
    fern-like: ferns
    not fern-like: others
      stems jointed: horsetails
      stems not jointed: others

elements of initial tree growing procedure
1. set of binary questions
   - question: Is observation xi ∈ A? where A is a region of variable space, X
   - answer: yes, xi ∈ A, or no, xi ∉ A
2. goodness of fit criterion that can be evaluated for any split
   - often includes concepts of impurity, entropy/information, deviance/likelihood, least squares or least absolute deviation
3. stop-splitting rule
   - do not stop, but prune tree later
4. rule for assigning a value to every node
   - regression: mean of observations
   - classification: most frequent class
Breiman et al 1984 Classification and Regression Trees Wadsworth & Brooks/Cole

rifampin resistant tuberculosis

biological context: rifampin resistant tuberculosis
• rifampin is a commonly used antibiotic for treatment of tuberculosis
• rifampin inhibits RNA polymerization
• resistance to rifampin has spontaneously arisen in Mycobacterium tuberculosis many times
• genes involved in antibiotic resistance are borne on the bacterial chromosome and not on plasmids

practical relevance: rifampin resistant tuberculosis
• tuberculosis is the leading cause of death due to an infectious organism
• antibiotic resistance is a major problem in tuberculosis treatment
• conventional antibiotic resistance testing of tuberculosis is slow and expensive
• sequence-based antibiotic resistance prediction can eliminate the need for conventional testing

questions: rifampin resistant tuberculosis
• What positions in the amino acid sequence of the β-subunit of RNA polymerase are statistically associated with rifampin resistance?
• How accurately can rifampin resistance be predicted based on sequence data of the β-subunit of RNA polymerase?

data
• protein sequences
  - 173 partial amino acid sequences (22 amino acids) of β-subunit of RNA polymerase (positions 511-533 in the rpoB gene product)
• clinical isolates from throughout Japan and from New York City
• genotype is amino acid sequence
• phenotype is minimum inhibitory concentration (MIC) of rifampin (μg/ml)
  - 70 susceptible, 103 resistant

data
MIC (μg/ml)  number of strains  differences  amino acid differences from consensus
0.0625       48                 1            515:V
0.125        2                  0            none
0.25         2                  0            none
< 0.39       13                 2            521:P, 533:P
0.5          2                  1            533:P
1            3                  1            533:P
2            2                  1            511:R and 512:T
4            2                  1            516:V
8            6                  5            516:V, 526:G, 526:L, 526:Q and 533:P
12.5         3                  3            514:L and 516:V, 533:P
16           3                  3            526:L, 526:N, 529:K
32           1                  0            none
> 32         15                 15           511:R and 516:Y, 513:K, 513:L, 526:D, 526:Y, 531:L, 531:W, 533:P
50           1                  1            531:L
64           7                  7            531:L
100          1                  1            531:L
128          19                 19           513:L, 516:Y, 526:Y, 531:L
200          1                  1            526:D
> 200        7                  7            513:L, 516:A and 526:D, 526:P, 526:Y, 531:L
256          13                 13           513:K, 516:Y and 526:N, 526:P, 531:L
512          18                 18           526:P, 526:R, 526:Y, 531:L, 531:W
> 512        4                  4            516:Y, 526:D, 526:P, 531:W
Moghazeh et al 1996, Ohno et al 1996, Taniguchi et al 1996

rifampin resistance: regression tree
all isolates: n = 173, 6.373 μg/ml
  position 531 is S: n = 121, 1.596 μg/ml
    position 526 is H: n = 90, 0.355 μg/ml
    position 526 is D, G, L, N, P, Q, R or Y: n = 31, 125.79 μg/ml
  position 531 is L or W: n = 52, 159.65 μg/ml
Cummings and Segal BMC Bioinformatics 5:137 2004

rifampin resistance: classification tree
all isolates: 103 resistant, 70 susceptible
  position 531 is S: 51 resistant, 70 susceptible
    position 526 is H: 20 resistant, 70 susceptible
    position 526 is D, G, L, N, P, Q, R or Y: 31 resistant, 0 susceptible
  position 531 is L or W: 52 resistant, 0 susceptible
Cummings and Segal BMC Bioinformatics 5:137 2004

variable importance
[figure: bar chart of mean decrease in Gini index (y axis, 0 to 20) by amino acid position (x axis: 511, 512, 513, 514, 515, 516, 521, 526, 529, 531, 533)]
Cummings and Segal BMC Bioinformatics 5:137 2004

improving on single trees with random forests

random forest algorithm
for (i = 0; i < n; i++)
    subsample observations
    for each sample
        calculate tree model (unpruned)
            for each node
                sample k predictor variables
                choose best partition
            end for each
        end calculate
    end for each
end for
aggregate models
calculate summary statistics, variable importance, …

advantages of random forest
• relatively low error (perhaps the lowest of any method)
• no over-fitting as more trees are added
• elegant handling of missing values
• only partially “black-box” (e.g., results include variable importance, outlier detection)
• can be used for supervised and unsupervised learning problems

why random forest works so well
• variance reduction by averaging over models
• randomization steps decrease correlation between individual models in the ensemble

single tree versus random forest
                         single tree  random forest
variance explained       0.679        0.861
classification accuracy  0.884        0.942
Cummings and Segal BMC Bioinformatics 5:137 2004

results for rifampin resistance: confusion matrix
             susceptible  resistant  class error
susceptible  70           0          0
resistant    9            94         0.0874
overall prediction accuracy: 0.942
Cummings and Segal BMC Bioinformatics 5:137 2004

rifampin resistance summary
• very few of the polymorphic amino acid positions are important in rifampin resistance
• most of the variance in rifampin resistance can be explained with two or three amino acid changes
• some variation in rifampin resistance is determined by variables other than positions 511-533 in the rpoB gene product

cold adaptation of tubulins

biological context: microtubules and tubulins
• microtubules are highly conserved structures that are essential for cell division, locomotion and the cytoskeleton
• microtubules are formed by assembly (polymerization) of tubulin subunits and microtubule associated proteins (MAPs)
• microtubules are formed from 13 protofilaments (heterodimers of α- and β-tubulins associated longitudinally)
• polymerization of tubulin subunits generally occurs at relatively warm temperatures (30-37 ˚C), which favor hydrophobic interactions between α- and β-tubulins
• microtubules disassemble (tubulins depolymerize) at cold temperatures (< 20 ˚C)

biological context: psychrophilic organisms
• psychrophilic organisms, by definition, grow and reproduce (not just survive) at relatively cold temperatures (e.g., -4 to 10 ˚C)
• psychrophilic organisms include a phylogenetically diverse range of organisms (e.g., fishes, multiple protist groups)
• work on North Atlantic and Antarctic fishes has shown that properties intrinsic to tubulin subunits are responsible for cold adaptation

questions: cold-adaptation of tubulins
• What positions in the amino acid sequences of α- and β-tubulins are associated with cold-adaptation?
• How accurately can psychrophilic species be predicted based on tubulin sequence data?

study design
1. generate DNA sequences from a phylogenetically broad sample of psychrophilic and mesophilic protists
2. collect appropriate sequences from GenBank to augment the sample
3. analyze inferred protein sequences using machine learning/statistical methods to identify important positions
4. examine the important positions in the context of the protein structure and general patterns of sequence evolution

data
• protein sequences
  - 117 alpha-tubulin (149-451 amino acids, some partial sequences): 89 warm, 28 cold
  - 135 beta-tubulin (114-445 amino acids, some partial sequences): 99 warm, 36 cold
• genotype is amino acid sequence
• phenotype is habitat (cold, warm)

phylogenetically diverse sample (new data only)
groups: dinoflagellates, cryptophytes, chrysophytes, bacillariophytes, prymnesiophytes, prasinophytes, chlorophytes, kinetoplastids/bodonids, choanoflagellates
genera: 3 2 2 5 1 3 1
species/isolates: 5 2 5 9 3 3 2 3 3 3 3
Gast et al unpublished

results for α-tubulin: confusion matrix
      cold  warm  class error
cold  25    3     0.1071
warm  4     85    0.0449
overall prediction accuracy: 0.9402
Cummings et al unpublished

results for β-tubulin: confusion matrix
      cold  warm  class error
cold  31    5     0.1389
warm  2     97    0.0202
overall prediction accuracy: 0.9481
Cummings et al unpublished

calculating permutation accuracy importance
1. prediction accuracy on the out-of-bag data is calculated
2. for each predictor variable
   2.1. a single variable is permuted
   2.2. prediction accuracy on the out-of-bag data is calculated with the variable permuted
   2.3. differences in results between the original and permuted data sets are calculated, averaged over all trees, and normalized to yield permutation accuracy importance
3. procedure is repeated for all predictor variables

permutation accuracy importance
[figure: two panels, alpha tubulin and beta tubulin, plotting mean decrease in accuracy for the most important positions; x axes run from 24 to 34 and from 14 to 24; positions labelled across the two panels: V23, I24, A397, V238, N167, A375, Q85, I86, S340, T376, T234, Q31, V260, A270, Y172, L318, S38, T190, S170, W388, N300]
Cummings et al unpublished

β- and α-tubulins: all domains
[figure]
Cummings et al unpublished

β- and α-tubulins: all domains with GTP
[figure]
Cummings et al unpublished

β- and α-tubulins: nucleotide binding domains
[figure]
Cummings et al unpublished

β- and α-tubulins: drug binding domains
[figure]
Cummings et al unpublished

β- and α-tubulins: MAP binding domains
[figure]
Cummings et al unpublished

β- and α-tubulins: all domains with GTP
[figure]
Cummings et al unpublished

tubulin summary
• relatively few of the hundreds of variable amino acid positions are statistically associated with cold-adaptation of tubulins
• the nucleotide binding domains contain the greatest number of important positions, followed by the drug binding domain
• important positions generally differ between α- and β-tubulins

other examples
• RNA editing in plant mitochondrial genomes (BMC Bioinformatics 5:132)
• limb loss in tetrapods (J Mol Evol 67:581–593)
• malaria: target gene, clinical symptoms (Sci Transl Med 1:2ra5)
• prediction of data analysis runtime for job scheduling on grid computing system (PCGRID’11/IPDPS’11)
• malaria: genome wide association studies (microsatellites and SNPs)
  - drug resistance (in preparation)
  - parasite clearance times (in preparation)

variable importance for GARLI runtime
[figure]
93% of variance explained by the model!
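
supplementary sketch: random forest with permutation accuracy importance
The random forest algorithm and the permutation accuracy importance procedure outlined in the earlier slides can be illustrated with a small, self-contained sketch. It is not the implementation used in the studies cited here. It assumes Gini impurity as the split criterion, binary questions of the form "is the residue at position j equal to a given value?", bootstrap sampling with replacement, and it skips the final normalization step, so it reports the raw mean decrease in out-of-bag accuracy. The function names and the toy data set are hypothetical.

```python
import itertools
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(rows, labels, k, rng):
    """Grow an unpruned tree, sampling k predictor variables at each node."""
    if len(set(labels)) == 1:
        return labels[0]  # pure node becomes a leaf
    best = None
    for j in rng.sample(range(len(rows[0])), k):
        for value in sorted({r[j] for r in rows}):
            # binary question: is the residue at position j equal to value?
            yes = [i for i, r in enumerate(rows) if r[j] == value]
            no = [i for i, r in enumerate(rows) if r[j] != value]
            if not yes or not no:
                continue
            score = (len(yes) * gini([labels[i] for i in yes])
                     + len(no) * gini([labels[i] for i in no]))
            if best is None or score < best[0]:
                best = (score, j, value, yes, no)
    if best is None:  # sampled variables cannot split: most frequent class
        return Counter(labels).most_common(1)[0][0]
    _, j, value, yes, no = best
    return (j, value,
            grow_tree([rows[i] for i in yes], [labels[i] for i in yes], k, rng),
            grow_tree([rows[i] for i in no], [labels[i] for i in no], k, rng))

def predict(tree, row):
    """Follow the binary questions from the root down to a leaf label."""
    while isinstance(tree, tuple):
        j, value, yes_branch, no_branch = tree
        tree = yes_branch if row[j] == value else no_branch
    return tree

def permutation_importance(rows, labels, n_trees=100, k=2, seed=0):
    """Mean decrease in out-of-bag accuracy per predictor (not normalized)."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    totals, used = [0.0] * p, 0
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        oob = sorted(set(range(n)) - set(boot))      # out-of-bag observations
        if not oob:
            continue
        tree = grow_tree([rows[i] for i in boot],
                         [labels[i] for i in boot], k, rng)
        base = sum(predict(tree, rows[i]) == labels[i] for i in oob)
        for j in range(p):
            column = [rows[i][j] for i in oob]
            rng.shuffle(column)                      # permute one variable
            hits = sum(
                predict(tree, rows[i][:j] + (v,) + rows[i][j + 1:]) == labels[i]
                for i, v in zip(oob, column))
            totals[j] += (base - hits) / len(oob)
        used += 1
    return [t / used for t in totals]

# Hypothetical toy data: three positions, only the first determines the class.
rows = list(itertools.product("SLW", "AG", "DEK")) * 3
labels = ["susceptible" if r[0] == "S" else "resistant" for r in rows]
importance = permutation_importance(rows, labels)
print([round(v, 3) for v in importance])
```

In this toy setting the first position should receive a much larger importance than the two uninformative positions, which mirrors how a few positions dominate the importance plots in the slides. Allowing splits on arbitrary subsets of residues, as in the trees shown earlier, would need a more elaborate search than the one-value-versus-rest question used here.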
acknowledgments
• Mark Segal, University of California, San Francisco (tuberculosis)
• Rebecca Gast, Woods Hole Oceanographic Institution (tubulins)
• Tiana Kohlsdorf, Universidade de São Paulo, and Gunter Wagner, Yale University (limb loss)
• Shannon Takala and Christopher Plowe, University of Maryland School of Medicine (malaria)
• some of my group at University of Maryland (various studies)
  - Adam Bazinet
  - Matthew Conte (now in Tom Kocher’s lab)
  - Daniel Myers (now at MIT)
• National Science Foundation and other funding sources

questions
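
supplementary sketch: reading the confusion matrices
The class error and overall prediction accuracy reported with the tubulin confusion matrices can be reproduced from the counts alone. The sketch below assumes that rows are observed classes and columns are predicted classes, which the slides do not state explicitly; the function name `summarize` is hypothetical.

```python
def summarize(matrix):
    """Return per-class error and overall accuracy for a square confusion matrix.

    Assumes rows are observed classes and columns are predicted classes.
    """
    class_error = [1 - row[i] / sum(row) for i, row in enumerate(matrix)]
    correct = sum(row[i] for i, row in enumerate(matrix))
    total = sum(sum(row) for row in matrix)
    return class_error, correct / total

# Counts from the tubulin confusion matrices; class order is (cold, warm).
alpha = [[25, 3], [4, 85]]
beta = [[31, 5], [2, 97]]

for name, matrix in (("alpha-tubulin", alpha), ("beta-tubulin", beta)):
    errors, accuracy = summarize(matrix)
    print(name, [round(e, 4) for e in errors], round(accuracy, 4))
# prints:
# alpha-tubulin [0.1071, 0.0449] 0.9402
# beta-tubulin [0.1389, 0.0202] 0.9481
```

Class error is the share of a row that falls off the diagonal, and overall accuracy is the diagonal total divided by the number of sequences (117 for α-tubulin, 135 for β-tubulin).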