Exploring Disease Data
CS 640 Bioinformatics
Some Definitions
• Data mining: extracting hidden patterns and
useful information from large data sets. Examples: clustering,
machine learning.
Should not be:
"Torturing data until it confesses ... and if
you torture it enough, it will confess to
anything" - Jeff Jonas, IBM
• Machine learning: the ability of a program to
learn from experience. Examples: neural networks,
decision trees, rule-based methods, MDR (multifactor dimensionality reduction).
Methods
• Regression methods: modeling the relationship
between a dependent variable and one or more
independent variables.
• Data mining methods: search the space of
possible models efficiently. Better suited to non-linear
and high-dimensional data, or data with many
potential interactions.
• Exhaustive Search: search all possible models
for the best one.
Linear regression
• Models the outcome as a linear combination of the
parameters (but not necessarily of the
independent variables).
• Ex: Let y = incidence of disease, with n data points and independent variables A and B:
1) $y_i = b_0 + b_1 A_i + \varepsilon_i$,  i = 1, …, n
2) $y_i = b_0 + b_2 B_i^2 + \varepsilon_i$,  i = 1, …, n
where $b_0$, $b_1$, $b_2$ are parameters and $\varepsilon_i$ is the error term.
In both examples, the disease is modeled as linear in the parameters, although model 2 is quadratic in the variable B.
Linear regression
Given a sample, we estimate the parameters
(e.g., using least squares) to arrive at the fitted
linear regression model:
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ [1]
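A minimal sketch of the closed-form least-squares estimates for this single-predictor model, using NumPy; the data values are made up for illustration and are not from the slides.

```python
import numpy as np

# Hypothetical data: x = exposure level, y = disease incidence
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Closed-form least-squares estimates for y_hat = b0 + b1 * x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x       # fitted values
residuals = y - y_hat     # estimated error terms
print(b0, b1)
```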
Multiple regression
• Relates the outcome to a linear combination of
predictor variables.
• Ex: Let y = incidence of disease, with n data
points and independent variables $x_1, \dots, x_p$:
$y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip} + \varepsilon_i$,  i = 1, …, n
Best-fit model:
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \dots + \hat{\beta}_p x_{ip}$
For each unit increase in $x_{ip}$, $\hat{y}_i$ is expected to
increase by $\hat{\beta}_p$.
The fitted surface for 2 input variables is a plane:
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2}$
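A minimal sketch of fitting the multiple-regression model above with NumPy's least-squares solver; the data and variable names are illustrative, not from the slides.

```python
import numpy as np

# Hypothetical data: two predictors and an outcome (e.g., disease incidence)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.0, 3.5, 6.0, 6.5, 8.0])

# Add an intercept column so the model is y_hat = b0 + b1*x1 + b2*x2
X_design = np.column_stack([np.ones(len(y)), X])
coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b1, b2 = coefs
print(b0, b1, b2)
```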
Logistic regression[1]
• Often used when the outcome is binary; relates
the log-odds of the probability of an
event to a linear combination of predictor
variables. Ex:
• $\ln\left(p/(1-p)\right) = \alpha + \beta x_B + \gamma x_C + i\,x_B x_C$,
where $x_B$ and $x_C$ are measured binary
indicator variables, the regression
coefficients $\beta$ and $\gamma$ represent main
effects, and $i$ represents the interaction.
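A minimal sketch of fitting this log-odds model, assuming scikit-learn is available; the interaction term is built explicitly as the product xB * xC, and the data are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary indicators x_B, x_C and a binary disease outcome
xB = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
xC = np.array([0, 1, 0, 1, 1, 1, 0, 0, 1, 1])
y  = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1])

# Design matrix: main effects plus the interaction term x_B * x_C
X = np.column_stack([xB, xC, xB * xC])

model = LogisticRegression()
model.fit(X, y)

# intercept_ plays the role of alpha; coef_ holds the main-effect and
# interaction coefficients (on the log-odds scale)
print(model.intercept_, model.coef_)
```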
Other statistical methods [1]
• Bayesian model selection: a statistical
approach incorporating both prior
distributions for parameters and observed
data into the model.
• Maximum likelihood: a statistical method
used to make inferences about the
combination of parameter values resulting
in the highest probability of obtaining the
observed data
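As a toy illustration of the maximum-likelihood idea (not from the slides), this sketch scans candidate values of a disease probability p and keeps the one that maximizes the binomial likelihood of hypothetical case counts.

```python
import numpy as np

# Hypothetical observation: 30 affected individuals out of 100 sampled
affected, n = 30, 100

# Evaluate the binomial log-likelihood over a grid of candidate p values
p_grid = np.linspace(0.01, 0.99, 99)
log_lik = affected * np.log(p_grid) + (n - affected) * np.log(1 - p_grid)

# Keep the value of p with the highest likelihood of the observed data
p_mle = p_grid[np.argmax(log_lik)]
print(p_mle)  # close to the analytic MLE, affected / n = 0.3
```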
Modeling Terminology[1]
• Saturated: a statistical model that is as full
as possible (saturated) with parameters.
• Marginal effects: the effects of one
parameter averaged over the possible
values taken by other parameters
• Entropy: the uncertainty associated with a
random variable
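A minimal sketch of the entropy definition above (Shannon entropy, in bits) for a discrete random variable; the example distributions are illustrative.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                    # ignore zero-probability outcomes
    return -np.sum(p * np.log2(p))

# A fair coin carries 1 bit of uncertainty; a biased coin carries less
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.9, 0.1]))   # ~0.47
```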
Modeling Terminology[1]
• Cross-validation: partitioning a data set into n
subsets, then using each subset in turn as the
test set while using the other n-1 to train (see the sketch after this list).
• Overfitting: a model that provides a good fit to a
specific data set but generalizes poorly.
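A minimal sketch of the n-fold cross-validation described above, written in plain NumPy so the fold logic is explicit; the fit and score callables and the toy data are placeholders for whatever training and evaluation routines a study actually uses.

```python
import numpy as np

def cross_validate(X, y, fit, score, n_folds=5, seed=0):
    """Split the data into n_folds subsets; each subset is the test set once,
    while the remaining n_folds - 1 subsets are used for training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit(X[train], y[train])              # train on the other folds
        scores.append(score(model, X[test], y[test]))  # evaluate on held-out fold
    return np.mean(scores)

# Toy usage with a trivial "predict the training mean" model
X_demo = np.arange(20).reshape(-1, 1).astype(float)
y_demo = X_demo.ravel() * 2.0 + 1.0
fit_mean = lambda X, y: y.mean()
score_mse = lambda model, X, y: -np.mean((y - model) ** 2)
print(cross_validate(X_demo, y_demo, fit_mean, score_mse))
```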
Marginal Effects [2]
Marginal penetrance. Ex: the probability P(D | A = Aa), irrespective of what value B has.

Table II. Penetrance values for combinations of genotypes from two
single nucleotide polymorphisms exhibiting interactions in the absence
of independent main effects.

Genotype                AA (0.25)   Aa (0.50)   aa (0.25)   Marginal penetrance B
BB (0.25)                   0           1           0               0.5
Bb (0.50)                   1           0           1               0.5
bb (0.25)                   0           1           0               0.5
Marginal penetrance A      0.5         0.5         0.5

Genotype frequencies are given in parentheses.
Marginal penetrance values for the A and B genotypes.
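A small sketch that reproduces the marginal penetrance values in Table II by averaging each row and column of the penetrance matrix, weighted by the genotype frequencies; only NumPy is assumed.

```python
import numpy as np

# Penetrance values from Table II: rows = BB, Bb, bb; columns = AA, Aa, aa
penetrance = np.array([[0.0, 1.0, 0.0],
                       [1.0, 0.0, 1.0],
                       [0.0, 1.0, 0.0]])
freq_A = np.array([0.25, 0.50, 0.25])   # P(AA), P(Aa), P(aa)
freq_B = np.array([0.25, 0.50, 0.25])   # P(BB), P(Bb), P(bb)

# Marginal penetrance of each B genotype: average over A, weighted by P(A)
marginal_B = penetrance @ freq_A
# Marginal penetrance of each A genotype: average over B, weighted by P(B)
marginal_A = freq_B @ penetrance

print(marginal_B)  # [0.5 0.5 0.5] -- no marginal effect of B
print(marginal_A)  # [0.5 0.5 0.5] -- no marginal effect of A
```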
Weka [3]
• A collection of visualization tools and algorithms
for data analysis and predictive modeling.
• Preprocessing tools for reading data in a variety of
formats and transforming it.
• Classification algorithms include regression,
neural networks, support vector machines, and decision
trees. Displays include ROC curves.
• Clustering: k-means, expectation maximization
• Visualization includes scatter-plot, bar graph
References
[1] Cordell, 2009. Detecting gene–gene interactions that underlie human diseases. Nature Reviews Genetics.
[2] McKinney et al., 2006. Machine Learning for Detecting Gene-Gene Interactions: A Review. Biomedical Genomics and Proteomics.
[3] Weka site: http://www.cs.waikato.ac.nz/ml/weka