W08: Gene Expression Analysis

Gene expression analysis
Curtis Huttenhower
Slides courtesy of:
Amy Caudy (Princeton)
Gavin Sherlock (Stanford)
Matt Hibbs (Jackson Labs)
Florian Markowetz (Cancer Research UK)
Olga Troyanskaya (Princeton)
Harvard School of Public Health
Department of Biostatistics
03-26-14
Gene expression analyses
• Unsupervised
– Clustering (class discovery)
– Ordination
– Coexpression (network construction)
• Supervised
– Differential expression (class comparison)
– Biomarker discovery (class prediction)
Supervised analysis
= learning from examples, classification
– We have already seen groups of healthy and
sick people. Now let’s diagnose the next person
walking into the hospital.
– We know that these genes have function X (and
these others don’t). Let’s find more genes with
function X.
– We know many gene-pairs that are functionally
related (and many more that are not). Let’s
extend the number of known related gene pairs.
Known structure in the data needs to be
generalized to new data.
Unsupervised analysis
= pattern finding
– Are there groups of genes that behave similarly
in all conditions?
– Disease X is very heterogeneous. Can we
identify more specific sub-classes for more
targeted treatment?
– What are the major patterns of variation (genes,
pathways) active under these conditions?
No structure is known. We first need to find
it. Exploratory analysis.
Supervised analysis
[Cartoon dialogue]
"Calvin, I still don't know the difference between cats and dogs …"
"Don't worry! I'll show you once more: Class 1: cats … Class 2: dogs."
"Oh, now I get it!!"
Unsupervised analysis
[Cartoon dialogue]
"Calvin, I still don't know the difference between cats and dogs …"
"I don't know it either. Let's try to figure it out together …"
Unsupervised analysis: clustering
Visualizing Data
[Figure: line plot and heat map of log-ratio expression across yeast samples at increasing optical densities (OD 0.26 through OD 7.30) for genes including MAK16, ACH1, HSP26, HSP30, and NHP2, plus several uncharacterized YBL/YBR/YCR/YDL ORFs.]
Visualizing Data (cont.)
[Figure: "Expression During Sporulation" — log ratio vs. time (hours, 0-10) for 51 gene series (Series1-Series51) plotted together.]
What is clustering?
• Reordering of gene (or experiment)
expression vectors in the dataset so that
similar patterns are next to each other (or in
separate groups)
• Identify subsets of genes (or experiments)
that are related by some measure
Why cluster?
• Dimensionality reduction: datasets are too large to extract information without reorganizing the data
• "Guilt by association": if unknown gene i is similar in expression to known gene j, maybe they are involved in the same or a related pathway
[Figure: genes x conditions expression matrix.]
Clustering Techniques
• Algorithm (Method)
– Hierarchical
– K-means
– Self-Organizing Maps
– QT-Clustering
– NNN
– …
• Distance Metric
– Euclidean (L2)
– Pearson Correlation
– Spearman Correlation
– Manhattan (L1)
– Kendall's τ
– …
Distance Measures
• Choice of distance measure is important for most clustering
techniques
• Pair-wise measures – compare vectors of numbers
– e.g. genes x & y, each with n measurements
– Euclidean distance
– Pearson correlation
– Spearman correlation
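For reference, the standard definitions of these three measures for gene vectors x and y with n measurements (the original slide showed the formulas as images, so these are reconstructed from the standard definitions):

d_{\mathrm{Euc}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

r_{\mathrm{Pearson}}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

\rho_{\mathrm{Spearman}}(x, y) = r_{\mathrm{Pearson}}(\mathrm{rank}(x), \mathrm{rank}(y))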
Distance Measures
[Figure: example profile pairs compared under Euclidean distance, Pearson correlation, and Spearman correlation.]
Hierarchical clustering
• Imposes (pair-wise) hierarchical structure on
all of the data
• Often good for visualization
• Basic method (agglomerative):
1. Calculate all pair-wise distances
2. Join the closest pair
3. Calculate the new pair's distance to all others
4. Repeat from step 2 until everything is joined
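A minimal Python sketch of these four steps (illustrative only; the function and variable names are mine, and production tools such as SciPy implement this far more efficiently):

```python
import numpy as np

def agglomerative(points, linkage="single"):
    """Agglomerative clustering per the 4 steps above; returns merge history."""
    clusters = {i: [i] for i in range(len(points))}  # each point starts alone
    merges, next_id = [], len(points)

    def dist(a, b):
        # Steps 1 & 3: pair-wise Euclidean distances between cluster members,
        # reduced by min (single linkage) or max (complete linkage).
        d = [np.linalg.norm(points[i] - points[j])
             for i in clusters[a] for j in clusters[b]]
        return min(d) if linkage == "single" else max(d)

    while len(clusters) > 1:
        ids = list(clusters)
        # Step 2: find and join the closest pair of clusters.
        d, a, b = min((dist(a, b), a, b)
                      for i, a in enumerate(ids) for b in ids[i + 1:])
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1  # Step 4: repeat until everything is joined
    return merges
```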
Single Linkage Clustering
• Uses the nearest neighbor
[Figure: two clusters of points; the closest pair across clusters sets the inter-cluster distance.]
• This method produces long chains, which form straggly clusters.
Complete Linkage Clustering
• Uses the furthest neighbor
[Figure: two clusters of points; the furthest pair across clusters sets the inter-cluster distance.]
• This method tends to produce very tight clusters of similar patterns.
Average Linkage Clustering
• Uses the average pair-wise distance (only shown for two cases in the figure)
[Figure: two clusters of points; the red and blue '+' signs mark the centroids of the two clusters.]
Centroid Linkage Clustering
• Uses the distance between cluster centroids
[Figure: two clusters of points; the red and blue '+' signs mark the centroids of the two clusters.]
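In practice these linkage criteria map directly onto the `method` argument of SciPy's `linkage` function; a small sketch on toy data (the array shapes and the cluster count of 4 are arbitrary choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 6))  # toy matrix: 50 genes x 6 conditions

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(data, method=method)                 # full merge tree
    labels = fcluster(Z, t=4, criterion="maxclust")  # cut into 4 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```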
Hierarchical clustering: problems
• Highly sensitive to similarity measure, algorithm
• Clusters everything, hard to define distinct clusters
• Genes assigned to clusters on the basis of all
experiments
• Optimizing node ordering hard (finding the optimal
solution is NP-hard)
• Can be driven by one strong cluster – a problem for gene expression because data in the row space are often highly correlated
K-means Clustering
• Groups genes into a pre-defined number of
independent clusters
• Basic algorithm:
1. Define k = number of clusters
2. Randomly initialize each cluster with a seed (often
with a random gene)
3. Assign each gene to the cluster with the most
similar seed
4. Recalculate all cluster seeds as means (or
medians) of genes assigned to the cluster
5. Repeat 3 & 4 until convergence
(e.g. no genes move, means don't change much)
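A minimal NumPy sketch of these steps (assumes Euclidean distance and that no cluster empties out mid-run; the function name and defaults are mine):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: seed each cluster with a randomly chosen gene (row of X).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each gene to the cluster with the nearest seed.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 4: recalculate seeds as the mean of each cluster's genes.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):  # Step 5: stop when means settle
            break
        centers = new
    return labels, centers
```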
K-means example
[Figure: successive k-means iterations on example data, alternating assignment and mean-update steps.]
K-means: problems
• Have to set k ahead of time
– Ways to choose an "optimal" k: minimize within-cluster variation compared to random or held-out data
• You’ll get k clusters whether they exist or not
• Each gene only belongs to exactly 1 cluster
• One cluster has no influence on the others
(one dimensional clustering)
• Genes assigned to clusters on the basis of
all experiments
Can a gene belong to N clusters?
• Fuzzy clustering: each gene's relationship to a cluster is probabilistic
• A gene can belong to many clusters
• More biologically realistic, but harder to get to work well/fast
• Harder to interpret
[Figure: a gene with membership 0.85 in one cluster and 0.15 in another.]
Advanced clustering methods
• Fuzzy clustering
• Clustering with resampling
• Biclustering
• Clustering based on physical properties (spring models, "attraction of points")
• Dimensionality reduction
Clustering Tools
• TIGR MeV
– http://www.tm4.org/mev.html
• Sleipnir
– http://huttenhower.sph.harvard.edu/sleipnir
• Cluster & JavaTreeView
– http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
– http://jtreeview.sourceforge.net/
• CLICK & EXPANDER
– http://www.cs.tau.ac.il/~rshamir/expander/expander.html
Never underestimate the power of Excel in conjunction with Python!
Exercise Caution
• Typically, when constructing a microarray
dataset, certain filters are applied to retain
only the ‘interesting’ genes.
• Clustering imposes an ordering/cluster
structure on the genes, whether one exists or
not.
• This effect is accentuated by filtering out genes.
An Example
Bryan, 2004
The Result of Filtering
Groups that did not exist in the full dataset suddenly appear!
Bryan, 2004
Cluster Evaluation
• Mathematical consistency
– Compare coherency of clusters to background
• Look for functional consistency in clusters
– Requires a gold standard, often based on GO,
KEGG, MSigDB, etc.
– ROC curves/AUC or precision/recall
• Evaluate likelihood of enrichment in clusters
– Hypergeometric distribution, etc.
– Several tools available (DAVID, GSEA, others)
Gene set overlap
• Inputs:
– One (or more) result set(s) of discrete (“hard”) clusters
– One (or more) characterized gene sets to test
• Probability of observing x or more genes with a common annotation in a cluster of n genes
– Hypergeometric or Fisher's exact tests are appropriate
– N = total number of genes in the genome
– M = number of genes with the annotation
– n = number of genes in the cluster
– x = number of genes in the cluster with the annotation

p\text{-value} = \sum_{j=x}^{n} \frac{\binom{M}{j} \binom{N-M}{n-j}}{\binom{N}{n}}
• Multiple hypothesis correction required if testing
multiple functions (Bonferroni, FDR, etc.)
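This tail sum is available directly in SciPy; a sketch with made-up counts (the values of N, M, n, and x below are illustrative, not from the slides):

```python
from scipy.stats import hypergeom

N = 6000  # total genes in genome (illustrative)
M = 120   # genes with the annotation
n = 50    # genes in the cluster
x = 8     # annotated genes observed in the cluster

# P(X >= x): the survival function at x-1 makes the tail inclusive of x.
p_value = hypergeom.sf(x - 1, N, M, n)
print(f"p = {p_value:.3g}")
```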
Gene set enrichment
• Inputs:
– One (or more) result set(s) of genes in rank order
– One (or more) characterized gene sets to test
• Probability of observing x or more genes from a pathway in the top/bottom n ranked result genes
• Multiple hypothesis
correction still required over
all characterized gene sets
tested.
GO term Enrichment Tools
• GSEA (http://www.broadinstitute.org/gsea/)
• GOrilla (http://cbl-gorilla.cs.technion.ac.il/)
• DAVID (http://david.abcc.ncifcrf.gov/)
• AmiGO & Princeton’s GoTermFinder
– http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
– http://go.princeton.edu
• GOLEM (http://function.princeton.edu/GOLEM)
Sealfon et al., 2006
Advanced analysis methods –
a very brief overview
More Unsupervised Methods
• Search-based approaches
– Start with query gene/condition, find similar
– Also referred to as “prioritization”
• Singular Value Decomposition (SVD) &
Principal Component Analysis (PCA)
– Decomposition of data matrix into
“patterns”, “weights”, and “contributions”
– Real names are “principal components”,
“singular values”, and “left/right eigenvectors”
– Used to remove noise, reduce dimensionality,
identify common/dominant signals
SVD (& PCA)
• SVD is a general method for matrix decomposition,
PCA is performing SVD on centered data
• Projects data into another orthonormal basis
• New basis ordered by variance explained

X = U \Sigma V^{t}

where X is the original data matrix, the columns of U are the "eigen-conditions", \Sigma holds the singular values, and the rows of V^{t} are the "eigen-genes".
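A short NumPy sketch of the decomposition on toy data (column-centering turns the SVD into PCA, as noted above; shapes are arbitrary):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 8))  # toy: genes x conditions

Xc = X - X.mean(axis=0)                            # center columns for PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # X = U @ diag(s) @ Vt

var_explained = s**2 / np.sum(s**2)  # new basis ordered by variance
scores = U[:, :2] * s[:2]            # genes projected onto top 2 components
print(np.round(var_explained[:3], 3))
```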
Supervised methods – a very
brief introduction
Supervised vs. Unsupervised
• Unsupervised methods can find novel profile
groupings
• Supervised methods take known groupings
and create rules for reliably assigning genes
or conditions into these groups
[Figures: hierarchical clustering of lung cancers; microbiome ordination.]
Only use unsupervised methods A) as a qualitative, visual guide, or B) when you don't know what you're looking for.
Supervised analysis: setup
• Training set
– Data: samples (conditions) or genes
– Labels: classes of interest (e.g. case/control for
conditions, function annotations for genes)
• Test set
– Data: as above without labels.
– E.g. Genes without known function
• Goal: generalization
– Build a classifier from the training data that is
good at predicting the right class for new data.
Learning to classify expression profiles
Think of a space with #genes dimensions (yes, it's hard to picture for more than 3). Each sample corresponds to a point in this space, on the same principle as PCA and ordination. If gene expression is similar under some conditions, the points will be close to each other; if gene expression overall is very different, the points will be far apart.
[Figure: samples plotted by expression of gene 1 vs. expression of gene 2.]
Which line separates best?
[Figure: two classes of points with four candidate separating lines, labeled A, B, C, and D.]
No sharp knife, but a …
Support Vector Machines
• Choose the maximal-margin separating hyperplane
• The data points closest to the separating hyperplane are the support vectors
[Figure: two classes separated by the maximal-margin hyperplane; support vectors lie on the margin.]
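A scikit-learn sketch of a linear SVM on toy two-class expression data (the data shapes and the shifted "marker" genes are invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))  # 40 samples x 100 genes
y = np.repeat([0, 1], 20)       # two classes, 20 samples each
X[y == 1, :5] += 1.5            # shift 5 marker genes in class 1

clf = SVC(kernel="linear")      # maximal-margin linear hyperplane
clf.fit(X, y)
print("support vectors per class:", clf.n_support_)
```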
How well did we do?
• Training error: how well do we do on the data we trained the classifier on?
• But how well will we do in the future, on new data?
• Test error: how well does the classifier generalize?
[Figure: the same classifier (= line) applied to new data from the same classes.]
The classifier will usually perform worse than before: test error > training error.
Cross-validation
Train the classifier on one part of the data (giving the training error) and test it on the held-out part (giving the test error).
[Diagram: data split into a "Train" portion and a "Test" portion.]
K-fold Cross-validation (here for K = 3)
Step 1: Train | Train | Test
Step 2: Train | Test | Train
Step 3: Test | Train | Train
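The same K = 3 scheme in scikit-learn (the classifier choice and toy data are mine):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))
y = np.repeat([0, 1], 30)
X[y == 1, :5] += 1.0            # weak signal in 5 genes

# Each of the K=3 folds serves once as the held-out test set.
cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv)
print("per-fold accuracy:", scores.round(2), "mean:", scores.mean().round(2))
```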
Other supervised methods
• Bayesian networks
• Neural networks
• Linear discriminant analysis (LDA)
• Logistic regression
• Boosting
• Decision trees
Summary II
• Supervised and unsupervised learning
… are needed everywhere in biology and medicine
• Microarrays = points in high-dimensional spaces
• Classifiers = surfaces (hyperplanes) in these spaces
• Support Vector Machines use maximal margin
hyperplanes as classifiers
• Classifier performance: Test error > training error
• Cross-validation is the right way to evaluate
classifier performance
Identifying biomarkers
Class discovery
Class comparison
Class prediction
The problem
• Have samples in two groups A and B
• Want to identify biomarker genes between A and B
• Challenges:
– Data are not normally distributed
– Expression data are often noisy
[Figure: expression distributions for groups A and B.]
• Goal:
robust and reliable methods for identification of
biomarker genes
Hierarchical clustering of lung cancers
[Figure: "Patient survival for adenocarcinoma subgroups" — cumulative survival vs. time (months, 0-60) for Groups 1-3; p = 0.002 for Group 1 vs. Group 3.]
Nonparametric t-test
• Want to pick genes with:
– Maximal difference in mean expression between samples
– Minimal variance of expression within samples
[Figure: example distributions for groups A and B illustrating each criterion.]
Nonparametric t-test
Group 1: n_1 samples, with average expression \bar{X}_1
Group 2: n_2 samples, with average expression \bar{X}_2

t statistic:

t = \frac{\bar{X}_1 - \bar{X}_2}{S_{\bar{X}_1 - \bar{X}_2}}, \qquad S_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}

p-value (from column permutations):

p_j = \frac{\mathrm{count}\left(t_j^{\mathrm{perm}} \ge t_j^{\mathrm{obs}}\right)}{\mathrm{count}(\mathrm{permutations})}

[Figure: heat map color scale from low expression to high expression.]
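A NumPy sketch of the permutation scheme above (I use a two-sided comparison, a common choice, whereas the slide's formula is written one-sided; the function name is mine):

```python
import numpy as np

def permutation_t_test(a, b, n_perm=10_000, seed=0):
    rng = np.random.default_rng(seed)

    def tstat(x, y):
        # t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2), as on the slide
        se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
        return (x.mean() - y.mean()) / se

    t_obs = tstat(a, b)
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)   # shuffle the column labels
        hits += abs(tstat(perm[:len(a)], perm[len(a):])) >= abs(t_obs)
    return t_obs, hits / n_perm
```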
Wilcoxon rank-sum test
• Tests for a difference in location (e.g. medians) between two samples
• Uses rank data
• Good for non-normal data
Original data:
gene 1: 2 | 0 | 3 | 5 | 9
gene 2: 0 | 1 | 2 | 3 | 5
gene 3: 7 | 4 | 3 | 2 | 1
Ranks:
gene 1: 2 | 1 | 3 | 4 | 5
gene 2: 1 | 2 | 3 | 4 | 5
gene 3: 5 | 4 | 3 | 2 | 1
• Identifies genes with skewed distribution of ranks
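In SciPy this is `ranksums` (or the closely related `mannwhitneyu`); a sketch with invented expression values:

```python
import numpy as np
from scipy.stats import ranksums

a = np.array([5.1, 4.8, 6.2, 5.9])  # group A expression (toy values)
b = np.array([3.0, 3.4, 2.8, 3.9])  # group B expression

stat, p = ranksums(a, b)            # Wilcoxon rank-sum test
print(f"statistic = {stat:.2f}, p = {p:.3g}")
```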
Microarrays: MeV
[Screenshots: the MeV interface, www.tm4.org/mev]
Analysis summary
• Unsupervised methods
– Clustering
• Hierarchical clustering
• K-means clustering
– Decomposition
• SVD/PCA
• Supervised methods
– Require examples with known answers
• Need both positive and negative examples
– Support Vector Machines
• Biomarker identification
– Nonparametric t-test
– Rank sum test