application of machine learning
to genotype-phenotype relationships
workshop on molecular evolution, north america
2011
Wednesday, August 3, 11
1
expected learning outcomes
•
general factors affecting phenotype
•
understanding why analysis of genotype-phenotype
relationships is not a simple problem
•
basic understanding of tree-based statistical models
•
basic understanding of random forests
•
how random forests can be applied to the study of genotype-phenotype relationships
presentation outline
•
very general background on genotype-phenotype relationships
•
introduction to some machine learning and statistical methods
•
first example: drug resistant tuberculosis
•
more material on machine learning methods
•
second example: cold-adaptation in tubulin proteins
a few definitions
•
genotype
- the class to which an organism belongs based upon its genetic
material
- partial genotypes are most often considered
DNA sequence
single nucleotide polymorphisms (SNPs)
microsatellites
protein sequences
•
phenotype
- the class to which an organism belongs based upon its
observable physical characteristics
- partial phenotypes are most often considered
factors affecting phenotype
•
genetics: hereditary basis
- epistasis: non-additive interaction between two or more loci
•
epigenesis: interaction between cells and substances during
development
•
environment: external abiotic and biotic influences
•
experimental error: inaccuracies and irregularities in
measurement
general genotype-phenotype questions
•
What differences in phenotype can be explained by genotype?
•
What particular genotypic features are important in determining
phenotype?
•
How accurately can phenotype be predicted based on
genotype data?
analytical challenges of genotype-phenotype data
•
unordered categorical variables: nucleotides, amino acids,
SNPs
•
numerous levels of variables: four, twenty
•
mixture of variable types: categorical, continuous (numerical)
•
potential for non-additive interactions between variables:
epistasis
desired analysis attributes
•
explanatory: provides a distinctive description
•
predictive: can be applied to observations where phenotype is
unknown
•
quantitative: provides numeric measures of relationships and
error
•
flexible: accommodates different variable types, mixtures of
variable types, and interaction among variables
•
interpretable: provides means to understanding
candidate analysis methods
•
sequence motif identification: specific patterns of nucleotides or
amino acids
•
phylogenetics: evolutionary or genealogical associations
•
artificial neural networks: a supervised machine learning
technique
•
standard statistical methods: methodology subsumed under
generalized linear models and extensions
- analysis of variance (ANOVA)
- logistic regression
- discriminant analysis
- kernel density estimation
- nearest neighbor analysis
tree-based statistical models
•
recursively partition a data set in two (binary split) based on the
value of a single predictor variable
- to best achieve homogeneous subsets of a categorical response
variable (classification)
- to best separate low and high values of a continuous response
variable (regression)
•
developed for social science research at the University of
Michigan in the 1960s
•
known as decision trees in machine learning
•
improved by Breiman, Friedman, Olshen and Stone (1984)
- established a firm theoretical basis
- fixed methodological problems
- developed improved algorithms
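For an unordered categorical predictor, the number of candidate binary splits grows quickly with the number of levels. A minimal sketch of that count, using the standard result that L levels can be divided into two non-empty groups in 2^(L-1) - 1 ways (the function name is illustrative and not from the slides):

```python
def n_binary_splits(n_levels):
    """Number of ways to divide n_levels unordered categories into two non-empty groups."""
    return 2 ** (n_levels - 1) - 1

# four nucleotides versus twenty amino acids
for name, levels in [("nucleotides", 4), ("amino acids", 20)]:
    print(name, n_binary_splits(levels))
```

This is one reason tree software often uses shortcuts rather than an exhaustive search when a categorical predictor has many levels.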
simple key to terrestrial plants (Embryophyta)
1 seed producing (Spermatophyta) → 2
2 seeds in carpel (Magnoliophyta)
2´ seeds not in carpel (“Gymnosperms”)
1´ not seed producing → 3
3 fern-like (Filicophyta)
3´ not fern-like → 4
4 stems jointed (Equisetophyta)
4´ stems not jointed → 5
key as a classification tree
terrestrial plants
  seed producing: seed plants
    seeds in carpel: angiosperms
    seeds not in carpel: gymnosperms
  not seed producing: seedless plants
    fern-like: ferns
    not fern-like: others
      stems jointed: horsetails
      stems not jointed: others
elements of initial tree growing procedure
1. set of binary questions
- question: Is observation x_i ∈ A? where A is a region of variable space, X
- answer: yes, x_i ∈ A or no, x_i ∉ A
2. goodness of fit criterion that can be evaluated for any split
- often includes concepts of impurity, entropy/information, deviance/likelihood, least squares or least average deviation
3. stop-splitting rule
- do not stop, but prune tree later
4. rule for assigning a value to every node
- regression: mean of observations
- classification: most frequent class
Breiman et al 1984 Classification and Regression Trees Wadsworth & Brooks/Cole
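Elements 1 and 2 can be illustrated with a short sketch: it enumerates every binary question of the form "is the value in set A?" for one categorical predictor and scores each split by weighted Gini impurity. The toy residues and labels are hypothetical, and Gini impurity is only one of the criteria listed; this is an illustration, not the procedure used in the cited book.

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Try every two-way grouping of the predictor's levels.

    Returns (levels sent to the 'yes' branch, weighted impurity of the split).
    """
    levels = sorted(set(values))
    n = len(labels)
    best = (None, gini(labels))  # start from the unsplit node
    for size in range(1, len(levels)):
        for group in combinations(levels, size):
            a = set(group)
            yes = [y for v, y in zip(values, labels) if v in a]
            no = [y for v, y in zip(values, labels) if v not in a]
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / n
            if score < best[1]:
                best = (a, score)
    return best

# Hypothetical toy data: residue at one position and a resistance label
residues = ["S", "S", "S", "L", "L", "W", "S", "L"]
status = ["sus", "sus", "res", "res", "res", "res", "sus", "res"]
yes_levels, impurity = best_binary_split(residues, status)
print(yes_levels, impurity)
```

Growing a full tree repeats this search within each resulting subset, then assigns each terminal node its most frequent class.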
rifampin resistant tuberculosis
biological context: rifampin resistant tuberculosis
•
rifampin is a commonly used antibiotic for treatment of
tuberculosis
•
rifampin inhibits RNA polymerization
•
resistance to rifampin has spontaneously arisen in
Mycobacterium tuberculosis many times
•
genes involved in antibiotic resistance are borne on the
bacterial chromosome and not on plasmids
practical relevance: rifampin resistant tuberculosis
•
tuberculosis is the leading cause of death due to an infectious
organism
•
antibiotic resistance is a major problem in tuberculosis
treatment
•
conventional antibiotic resistance testing of tuberculosis is slow
and expensive
•
sequence-based antibiotic resistance prediction can eliminate
the need for conventional testing
questions: rifampin resistant tuberculosis
•
What positions in the amino acid sequence of the β-subunit of
RNA polymerase are statistically associated with rifampin
resistance?
•
How accurately can rifampin resistance be predicted based on
sequence data of the β-subunit of RNA polymerase?
data
•
protein sequences
173 partial amino acid sequences (22 amino acids) of β-subunit of
RNA polymerase (positions 511-533 in the rpoB gene product)
•
clinical isolates from throughout Japan and from New York City
•
genotype is amino acid sequence
•
phenotype is minimum inhibitory concentration (MIC) of rifampin
(μg/ml)
70 susceptible, 103 resistant
data
MIC (μg/ml)   number of strains   differences   amino acid differences from consensus
0.0625        48                  1             515:V
0.125         2                   0             none
0.25          2                   0             none
< 0.39        13                  2             521:P, 533:P
0.5           2                   1             533:P
1             3                   1             533:P
2             2                   1             511:R and 512:T
4             2                   1             516:V
8             6                   5             516:V, 526:G, 526:L, 526:Q and 533:P
12.5          3                   3             514:L and 516:V, 533:P
16            3                   3             526:L, 526:N, 529:K
32            1                   0             none
> 32          15                  15            511:R and 516:Y, 513:K, 513:L, 526:D, 526:Y, 531:L, 531:W, 533:P
50            1                   1             531:L
64            7                   7             531:L
100           1                   1             531:L
128           19                  19            513:L, 516:Y, 526:Y, 531:L
200           1                   1             526:D
> 200         7                   7             513:L, 516:A and 526:D, 526:P, 526:Y, 531:L
256           13                  13            513:K, 516:Y and 526:N, 526:P, 531:L
512           18                  18            526:P, 526:R, 526:Y, 531:L, 531:W
> 512         4                   4             516:Y, 526:D, 526:P, 531:W
Moghazeh et al 1996, Ohno et al 1996, Taniguchi et al 1996
rifampin resistance: regression tree
all observations: n = 173, 6.373 μg/ml
  position 531 = L, W: n = 52, 159.65 μg/ml
  position 531 = S: n = 121, 1.596 μg/ml
    position 526 = D, G, L, N, P, Q, R, Y: n = 31, 125.79 μg/ml
    position 526 = H: n = 90, 0.355 μg/ml
Cummings and Segal BMC Bioinformatics 5:137 2004
rifampin resistance: classification tree
all observations: 103 resistant, 70 susceptible
  position 531 = L, W: 52 resistant, 0 susceptible
  position 531 = S: 51 resistant, 70 susceptible
    position 526 = D, G, L, N, P, Q, R, Y: 31 resistant, 0 susceptible
    position 526 = H: 20 resistant, 70 susceptible
Cummings and Segal BMC Bioinformatics 5:137 2004
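The accuracy of this single tree can be worked out from its three terminal nodes. The sketch below assumes each terminal node predicts its most frequent class, as in the tree growing procedure; the node labels are a reading of the figure, and the variable names are illustrative.

```python
# Terminal nodes of the classification tree: (resistant, susceptible) counts
leaves = {
    "531 = L or W": (52, 0),
    "531 = S, 526 = D, G, L, N, P, Q, R or Y": (31, 0),
    "531 = S, 526 = H": (20, 70),
}

# Each leaf predicts its majority class, so the larger count is classified correctly
correct = sum(max(counts) for counts in leaves.values())
total = sum(sum(counts) for counts in leaves.values())
tree_accuracy = correct / total

print(correct, total, round(tree_accuracy, 3))
```

The 20 misclassified strains are the resistant isolates that carry the consensus residue at both positions 531 and 526.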
variable importance
[figure: bar chart of mean decrease in Gini index (axis 0 to 20) by amino acid position (511, 512, 513, 514, 515, 516, 521, 526, 529, 531, 533)]
Cummings and Segal BMC Bioinformatics 5:137 2004
improving on single trees with random forests
random forest algorithm
for (i = 0; i < n; i++)
    subsample observations
    for each sample
        calculate tree model (unpruned)
            for each node
                sample k predictor variables
                choose best partition
            end for each
        end calculate
    end for each
end for
aggregate models
calculate summary statistics, variable importance, …
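The pseudocode above can be made concrete with a small pure-Python sketch for categorical predictors. It is illustrative only, not the implementation used in the studies presented here. Assumptions: bootstrap resampling stands in for "subsample observations", splits take the form "predictor j equals one level" scored by Gini impurity, models are aggregated by majority vote, and a node falls back to all predictors when none of the k sampled ones can split it. The toy data and all names are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, columns):
    """Best 'predictor j equals level' split among the given columns, or None."""
    best = None
    for j in columns:
        for level in sorted(set(r[j] for r in rows)):
            yes = [i for i, r in enumerate(rows) if r[j] == level]
            no = [i for i, r in enumerate(rows) if r[j] != level]
            if yes and no:
                score = sum(len(s) * gini([labels[i] for i in s]) for s in (yes, no))
                if best is None or score < best[0]:
                    best = (score, j, level, yes, no)
    return best

def grow_tree(rows, labels, k, rng):
    """Grow an unpruned tree, sampling k predictors at each node."""
    if len(set(labels)) == 1:
        return labels[0]                      # pure node becomes a leaf
    p = len(rows[0])
    best = best_split(rows, labels, rng.sample(range(p), k))
    if best is None:                          # simplification: retry with all predictors
        best = best_split(rows, labels, range(p))
    if best is None:                          # identical rows: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    _, j, level, yes, no = best

    def child(idx):
        return grow_tree([rows[i] for i in idx], [labels[i] for i in idx], k, rng)

    return (j, level, child(yes), child(no))

def predict_tree(tree, row):
    """Follow the binary questions down to a leaf label."""
    while isinstance(tree, tuple):
        j, level, yes, no = tree
        tree = yes if row[j] == level else no
    return tree

def random_forest(rows, labels, n_trees=50, k=1, seed=0):
    """One tree per bootstrap sample of the observations."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], k, rng))
    return forest

def predict_forest(forest, row):
    """Aggregate the trees by majority vote."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

# Hypothetical toy data: residues at two positions and a resistance label
rows = [("S", "H")] * 6 + [("L", "H")] * 4 + [("S", "Y")] * 3 + [("W", "H")] * 2
labels = ["sus"] * 6 + ["res"] * 9
forest = random_forest(rows, labels)
print(predict_forest(forest, ("S", "H")), predict_forest(forest, ("L", "H")))
```

Summary statistics such as out-of-bag error and variable importance would be computed from the observations left out of each bootstrap sample, which this sketch does not track.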
advantages of random forest
•
relatively low error (perhaps the lowest of any method)
•
resistant to over-fitting: error does not increase as more trees are added
•
elegant handling of missing values
•
only partially “black-box” (e.g., results include variable
importance, outlier detection)
•
can be used for supervised and unsupervised learning
problems
why random forest works so well
•
variance reduction by averaging over models
•
randomization steps decrease correlation between individual
models in the ensemble
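Both points can be read off a standard variance identity. As a sketch, assume the predictions of the B trees, written T_b, are identically distributed with variance σ² and pairwise correlation ρ (these symbols are introduced here for illustration and do not appear in the slides):

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

Averaging more trees shrinks the second term toward zero, while sampling predictors at each node lowers ρ and so shrinks the first term, which averaging alone cannot reduce.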
single tree versus random forest
                          single tree   random forest
variance explained        0.679         0.861
classification accuracy   0.884         0.942
Cummings and Segal BMC Bioinformatics 5:137 2004
results for rifampin resistance: confusion matrix
              susceptible   resistant   class error
susceptible   70            0           0
resistant     9             94          0.0874
overall prediction accuracy: 0.942
Cummings and Segal BMC Bioinformatics 5:137 2004
rifampin resistance summary
•
very few of the polymorphic amino acid positions are important
in rifampin resistance
•
most of the variance in rifampin resistance can be explained
with two or three amino acid changes
•
some variation in rifampin resistance is determined by variables
other than positions 511-533 in the rpoB gene product
cold adaptation of tubulins
biological context: microtubules and tubulins
•
microtubules are highly conserved structures that are essential
for cell division, locomotion and the cytoskeleton
•
microtubules are formed by assembly (polymerization) of tubulin
subunits and microtubule associated proteins (MAPs)
•
microtubules are formed from 13 protofilaments (heterodimers
of α- and β-tubulins associated longitudinally)
•
polymerization of tubulin subunits generally occurs at
relatively warm temperatures (30-37 ˚C), which favor
hydrophobic interactions between α- and β-tubulins
•
microtubules disassemble (tubulins depolymerize) at cold
temperatures (< 20 ˚C)
biological context: psychrophilic organisms
•
psychrophilic organisms, by definition, grow and reproduce (not
just survive) at relatively cold temperatures (e.g., -4 to 10 ˚C)
•
psychrophilic organisms include a phylogenetically diverse
range of organisms (e.g., fishes, multiple protist groups)
•
work on North Atlantic and Antarctic fishes has shown
properties intrinsic to tubulin subunits are responsible for cold
adaptation
questions: cold-adaptation of tubulins
•
What positions in the amino acid sequences of α- and β-tubulins are associated with cold-adaptation?
•
How accurately can psychrophilic species be predicted based
on tubulin sequence data?
study design
1. generate DNA sequences from a phylogenetically broad
sample of psychrophilic and mesophilic protists
2. collect appropriate sequences from GenBank to augment the
sample
3. analyze inferred protein sequences using machine learning/
statistical methods to identify important positions
4. examine the important positions in the context of the protein
structure and general patterns of sequence evolution
data
•
protein sequences
117 alpha-tubulin (149-451 amino acids, some partial sequences)
89 warm, 28 cold
135 beta-tubulin (114-445 amino acids, some partial sequences)
99 warm, 36 cold
•
genotype is amino acid sequence
•
phenotype is habitat (cold, warm)
phylogenetically diverse sample (new data only)
dinoflagellates
cryptophytes
chrysophytes
bacillariophytes
prymnesiophytes
prasinophytes
chlorophytes
kinetoplastids/
bodonids
choanoflagellates
genera
3
2
2
5
1
3
1
species/isolates
5
2
5
9
3
3
2
3
3
3
3
Gast et al unpublished
results for α-tubulin: confusion matrix
       cold   warm   class error
cold   25     3      0.1071
warm   4      85     0.0449
overall prediction accuracy: 0.9402
Cummings et al unpublished
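The class errors and overall accuracy follow directly from the four counts in the α-tubulin matrix. A minimal sketch, assuming rows are the observed class and columns the predicted class (the function names are illustrative):

```python
# α-tubulin confusion matrix: confusion[observed][predicted]
confusion = {
    "cold": {"cold": 25, "warm": 3},
    "warm": {"cold": 4, "warm": 85},
}

def class_error(matrix, cls):
    """Fraction of observations of class cls that were assigned to another class."""
    row = matrix[cls]
    return 1 - row[cls] / sum(row.values())

def overall_accuracy(matrix):
    """Fraction of all observations on the diagonal of the matrix."""
    correct = sum(matrix[c][c] for c in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total

for cls in confusion:
    print(cls, round(class_error(confusion, cls), 4))
print("accuracy", round(overall_accuracy(confusion), 4))
```

Because warm sequences outnumber cold ones about three to one, the per-class errors are more informative than the overall accuracy alone.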
results for β-tubulin: confusion matrix
       cold   warm   class error
cold   31     5      0.1389
warm   2      97     0.0202
overall prediction accuracy: 0.9481
Cummings et al unpublished
calculating permutation accuracy importance
1. prediction accuracy on the out-of-bag data is calculated
2. for each predictor variable
2.1. a single variable is permuted
2.2. prediction accuracy on the out-of-bag data is calculated with
the variable permuted
2.3. the difference in accuracy between the original and permuted data
sets is calculated, averaged over all trees, and normalized to
yield permutation accuracy importance
3. procedure is repeated for all predictor variables
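The core of the procedure, permuting one predictor and measuring the drop in accuracy, can be sketched as follows. This is a simplification: a single fixed rule stands in for the forest, one evaluation set stands in for the per-tree out-of-bag data, and the normalization step is omitted. The rule, data, and names are hypothetical.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model returns the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean decrease in accuracy after permuting each predictor in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between predictor j and the labels
            permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical rule standing in for a fitted model: only position 0 matters
def toy_model(row):
    return "cold" if row[0] == "V" else "warm"

rows = [("V", "A"), ("V", "T"), ("I", "A"), ("I", "T")] * 5
labels = [toy_model(r) for r in rows]
importance = permutation_importance(toy_model, rows, labels)
print(importance)
```

Permuting the predictor the rule ignores leaves accuracy unchanged, so its importance is zero, while permuting the informative predictor lowers accuracy.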
permutation accuracy importance
[figure: two panels, alpha tubulin and beta tubulin, ranking amino acid positions by mean decrease in accuracy (axis ranges 24 to 34 and 14 to 24); positions labeled: V23, I24, A397, V238, N167, A375, Q85, I86, S340, T376, T234, Q31, V260, A270, Y172, L318, S38, T190, S170, W388, N300]
Cummings et al unpublished
β- and α- tubulins: all domains
Cummings et al unpublished
β- and α- tubulins: all domains with GTP
Cummings et al unpublished
β- and α- tubulins: nucleotide binding domains
Cummings et al unpublished
β- and α- tubulins: drug binding domains
Cummings et al unpublished
β- and α- tubulins: MAP binding domains
Cummings et al unpublished
β- and α- tubulins: all domains with GTP
Cummings et al unpublished
tubulin summary
•
relatively few of the hundreds of variable amino acid positions are
statistically associated with cold-adaptation of tubulins
•
the nucleotide binding domains contain the largest number of important
positions, followed by the drug binding domains
•
important positions generally differ between α- and β-tubulins
other examples
•
RNA editing in plant mitochondrial genomes (BMC
Bioinformatics 5:132)
•
limb loss in tetrapods (J Mol Evol 67:581–593)
•
malaria: target gene, clinical symptoms (Sci Transl Med 1:2ra5)
•
prediction of data analysis runtime for job scheduling on grid
computing system (PCGRID’11/IPDPS’11)
•
malaria: genome wide association studies (microsatellites and
SNPs)
- drug resistance (in preparation)
- parasite clearance times (in preparation)
variable importance for GARLI runtime
93% of variance explained by the model
acknowledgments
•
Mark Segal, University of California, San Francisco (tuberculosis)
•
Rebecca Gast, Woods Hole Oceanographic Institution (tubulins)
•
Tiana Kohlsdorf, Universidade de São Paulo, and Gunter Wagner, Yale
University (limb loss)
•
Shannon Takala and Christopher Plowe, University of Maryland School of
Medicine (malaria)
•
some of my group at University of Maryland (various studies)
- Adam Bazinet
- Matthew Conte (now in Tom Kocher’s lab)
- Daniel Myers (now at MIT)
•
National Science Foundation and other funding sources
questions