Multidimensional data analysis
Kathleen Marchal
Clustering
Overview
• Case studies
• Clustering
– Distance/similarity measures
– Clustering algorithms
Case study: clustering gene expression
[Figure: the same expression matrix (genes G1…Gn × patients P1…Pm) viewed in two ways:
with patients/conditions as observations and genes as variables (patient profiles), or
with genes as observations and patients/conditions as variables (gene profiles).]
Case study: clustering gene expression
[Figure: gene profiles. Each gene (observation) of the matrix (genes G1…Gn × patients P1…Pm)
is a point in the space spanned by the patients/conditions (variable 1 = patient 1,
variable 2 = patient 2, …); genes that lie close together form a gene cluster.]
Case study: clustering gene expression
• Study of the mitotic cell cycle of Saccharomyces cerevisiae with oligonucleotide arrays
(Cho et al. 1999) - 15 time points (E=18)
• time points 90 & 100 min deleted (Zhang et al. 1999; Tavazoie et al., 1999)
Original dataset: 6178 genes
Preprocessing:
• select the 4634 most variable genes (25% most variable)
• variance normalized
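A minimal sketch of this kind of preprocessing in Python; the data matrix, variable names and the fraction kept are illustrative, not taken from the original study:

```python
import numpy as np

def preprocess(expr, keep_fraction=0.25):
    """Keep the most variable genes (rows) and normalize each profile to unit variance.

    expr: array of shape (n_genes, n_conditions); keep_fraction is illustrative.
    """
    variances = expr.var(axis=1)
    n_keep = int(np.ceil(keep_fraction * expr.shape[0]))
    idx = np.argsort(variances)[::-1][:n_keep]            # indices of the most variable genes
    selected = expr[idx]
    # variance normalization: divide each gene profile by its standard deviation
    normalized = selected / selected.std(axis=1, keepdims=True)
    return normalized, idx
```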
Case study: clustering gene expression
Gene (observation) = vector with expression values (the values of that gene for the
different experimental conditions)
[Figure: genes Gene 1 … Gene n, each represented by a vector of expression values,
e.g. (2, …, 6), (4, …, 6), (6, …, 4).]
dim(variables) < dim(observations) (the classical situation)
Case study: clustering gene expression
1) Represent the data in a 3-dimensional space
2) Measure distances between expression vectors
3) Group genes with minimal distance
[Figure: Gene 1 and Gene 2 plotted as vectors in this space.]
Case study: clustering gene expression
[Figure repeated: gene profiles as points in the space spanned by the patients/conditions
(variable 1 = patient 1, variable 2 = patient 2, …); nearby genes form a gene cluster.]
Case study: clustering gene expression
[Figure: expression over time (0-150 min) of cell-cycle genes such as CLN3, CLN2 and Swi4
during the M/G1 phase; genes with a similar dynamic profile are grouped by clustering.]
Case study: clustering gene expression
Measure expression of all genes
• over time (dynamic profile)
• in different conditions
→ Clustering: identify coexpressed genes
→ Motif finding: identify the mechanism of coregulation
Case study: clustering patient profiles
[Figure: the same expression matrix (genes G1…Gn × patients P1…Pm) viewed either with
patients/conditions as observations and genes as variables (patient profiles), or with
genes as observations and patients/conditions as variables (gene profiles).]
Clustering: group observations based on the values we have for their variables
Case study: clustering patient profiles
[Figure: clustering of patient profiles separates the BASAL, HER2, LUMA and LUMB subtypes.]
Case study: clustering patient profiles
…but there are confounding factors
– expression signals contain components related to age, drug usage, gender, …
…and there are redundant signals
Feature selection: select those genes that are most distinctive for the phenotype of interest
Case study: clustering patient profiles
Dimensionality reduction (eigengenes):
project the observations on the first PC (or the first two PCs)
PC1 coordinate of the first patient:
P1_PC1 = a11 · P1_gene1 + a12 · P1_gene2 + …
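A sketch of this eigengene projection using scikit-learn PCA; the matrix shape and variable names are illustrative assumptions (patients in rows, genes in columns):

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative patients x genes matrix (rows = observations/patients, columns = genes)
X = np.random.default_rng(0).normal(size=(50, 200))

pca = PCA(n_components=2)
scores = pca.fit_transform(X)    # each patient projected onto PC1 and PC2 ("eigengenes")
# scores[0, 0] corresponds to P1_PC1 = a11*P1_gene1 + a12*P1_gene2 + ... (on centered data)
```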
General goal clustering
Exploratory data analysis
• visualizing data
• understanding general characteristics of data
Generalization
• infer something about an object (e.g. a gene) based on how it
relates to other objects in the cluster (guilt by association)
Clustering methods are unsupervised
Overview
• Case studies
• Clustering
– Distance/similarity measures
• Euclidean distance
• Pearson correlation
• Spearman correlation
• Distances and rescaling
– Clustering algorithms
Similarity/distance measures
• Many clustering methods employ a distance
(similarity) measure to assess the distance between
– a pair of profiles
– a cluster and a profile
– a pair of clusters
• given a distance value, it is straightforward to convert it into a
similarity value
Similarity/distance measures
• Properties of metrics
• Two of the easiest and most commonly used similarity
measures for gene expression data are Euclidean distance and
Pearson correlation coefficient.
Distance measures
[Figure: two points $x_1=(x_{11},x_{12})$ and $x_2=(x_{21},x_{22})$ in the plane.]
Euclidean distance: $D(x_1,x_2)=\sqrt{(x_{11}-x_{21})^2+(x_{12}-x_{22})^2}$
Manhattan distance: $D(x_1,x_2)=|x_{11}-x_{21}|+|x_{12}-x_{22}|$
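As a small illustration (the expression vectors are invented), both distances can be computed with SciPy:

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock

x1 = np.array([2.0, 4.0, 6.0])     # illustrative expression vectors
x2 = np.array([6.0, 6.0, 4.0])

d_euclidean = euclidean(x1, x2)    # square root of the sum of squared coordinate differences
d_manhattan = cityblock(x1, x2)    # sum of absolute coordinate differences
```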
Similarity measures
Pearson correlation:
• a statistical measure of the strength of a linear relationship between two
vectors X and Y, giving a value between +1 and −1 inclusive, where 1 is
total positive correlation, 0 is no correlation, and −1 is total negative
correlation.
• measures how similar the directions are in which two expression vectors
point (similarity measure)
• assumes normality (X and Y normally distributed)
[Figure: plot of vector (gene) X versus vector (gene) Y for the different observed variables (time points).]
Similarity measures
Pearson correlation:
$s(x,y)=\dfrac{\sum_{i=1}^{p}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{p}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{p}(y_i-\bar{y})^2}}$
with $\bar{x}=\tfrac{1}{p}\sum_{i=1}^{p}x_i$ and $\bar{y}=\tfrac{1}{p}\sum_{i=1}^{p}y_i$
Pearson correlation is the rescaled covariance (the index i runs over the measured
conditions, e.g. patients or time points, and p is their number).
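A minimal sketch of the formula above in NumPy (the example vectors are invented):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two expression vectors (rescaled covariance)."""
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])
r = pearson(x, y)                    # close to +1: strong positive linear relationship
# equivalent: np.corrcoef(x, y)[0, 1]
```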
Similarity measures
Pearson correlation, geometric interpretation: cosine of the angle between the two
vectors A and B
[Figure: in the (X1, X2) plane, Pearson correlation = 1 when A and B point in the same
direction; Pearson correlation = 0 when A and B are orthogonal.]
Similarity measures
Spearman correlation:
• measure of the strength of a monotonic relationship between paired data
• non-parametric: no normality assumption
• calculated as Pearson's correlation on the ranked values of the data
If the data are monotonically related, the ranks are perfectly correlated.
Similarity measures
Spearman correlation:
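A small sketch (with invented, monotonic but non-linear data) showing that Spearman correlation is Pearson correlation computed on the ranks:

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr, rankdata

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(x)                                      # monotonic but non-linear relationship

rho, _ = spearmanr(x, y)                           # 1.0: perfect rank (monotonic) correlation
r_ranks, _ = pearsonr(rankdata(x), rankdata(y))    # same value: Pearson on the ranks
```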
Overview distance/similarity measures
Similarity measures vs rescaling
! Note that Euclidean distance is sensitive to scaling and differences in average
expression level, whereas correlation is not.
[Figure: vectors A, A' and A'' (scaled/shifted versions of A) and B in the (X1, X2) plane.
The Euclidean distance between B and, respectively, A, A' and A'' is different, whereas
the Pearson correlation is the same.]
Choice of distance measure is crucial for data interpretation.
Similarity measures vs rescaling
Euclidean distance can be large while the Pearson correlation = 1
[Figure: points A, A' and A'' in the (X1, X2) plane and their expression profiles over
time. All points are perfectly correlated, yet their Euclidean distances differ. Suppose
these points represent gene expression profiles in the n-dimensional space.]
$D(x_1,x_2)=\sqrt{(x_{11}-x_{21})^2+\dots+(x_{1n}-x_{2n})^2}$
Similarity measures vs rescaling
Mean centering: subtract the profile mean so that the new mean is 0
$x_{1j} \leftarrow x_{1j}-\bar{x}_1,\quad j=1,\dots,n,\qquad \bar{x}_1=\tfrac{1}{n}\sum_{j=1}^{n}x_{1j}$
Variance rescaling: divide the centered values by the standard deviation of the profile
$x_{1j} \leftarrow \dfrac{x_{1j}-\bar{x}_1}{\sqrt{\tfrac{1}{n}\sum_{j=1}^{n}(x_{1j}-\bar{x}_1)^2}}$
[Figure: expression profile A'' over conditions 1…n before and after centering/rescaling.]
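A minimal sketch of both operations (z-scoring) on an invented profile, illustrating that rescaling removes scale and offset differences:

```python
import numpy as np

def rescale(profile):
    """Mean-center a profile and rescale it to unit variance (z-scores)."""
    centered = profile - profile.mean()      # mean centering: new mean = 0
    return centered / centered.std()         # variance rescaling: new standard deviation = 1

a  = np.array([1.0, 3.0, 2.0, 4.0])
a2 = 10 * a + 100                            # same shape, different scale and offset (like A'')

print(np.allclose(rescale(a), rescale(a2)))  # True: Euclidean distance after rescaling is 0
```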
Similarity measures vs rescaling
[Figure: left, the original profiles A, A' and A'' in the (X1, X2) plane: Euclidean
distance ≠ 0, Pearson correlation = 1. Right, after mean centering and variance rescaling
the profiles coincide: Euclidean distance = 0, Pearson correlation = 1.]
Similarity measures vs rescaling
Problem with rescaling: sensitive to noise
• Euclidean distance
• Pearson correlation
[Figure: profiles A, A' and A'' in the (X1, X2) plane before and after rescaling,
illustrating the effect of noise.]
That is why the noise genes (or the genes that do not change their expression value over
the conditions) are often prefiltered.
Similarity measures vs rescaling
[Figure: the same profiles A, A' and A'' compared using Pearson correlation
(variance rescaling), Spearman correlation (rank-based rescaling) and Euclidean distance.]
Overview
• Case studies
• Clustering
– Distance/similarity measures
– Clustering algorithms
• Hierarchical clustering
• K-means
Clustering algorithms
Given: expression profiles for a set of genes or experiments/individuals/time
points (whatever columns represent)
• organize profiles into clusters such that profiles in the same cluster are
highly similar to each other (within cluster similarity)
• profiles from different clusters have low similarity to each other (between
cluster distance)
Clustering algorithms
• Hierarchical algorithms work by successive splitting (divisive clustering) or merging
(agglomerative clustering) of the groups, depending on a measure of distance or similarity
between objects, to form a hierarchy of clusters.
• Partitioning algorithms search for a partition of the data that optimizes a global
measure of quality for the groups, usually based on distance between objects.
Hierarchical clustering
• Agglomerative: This is a "bottom up" approach: each
observation starts in its own cluster, and pairs of clusters are
merged as one moves up the hierarchy.
• Divisive: This is a "top down" approach: all observations start
in one cluster, and splits are performed recursively as one
moves down the hierarchy.
Hierarchical clustering
Agglomerative method (phylogenetic classification)
– Calculate pairwise distances between genes (distance matrix)

          Gene 1     Gene 2     Gene 3     Gene 4     Gene 5
Gene 1    0          d(G1,G2)   d(G1,G3)   d(G1,G4)   d(G1,G5)
Gene 2    d(G2,G1)   0          d(G2,G3)   d(G2,G4)   d(G2,G5)
Gene 3    d(G3,G1)   d(G3,G2)   0          d(G3,G4)   d(G3,G5)
Gene 4    d(G4,G1)   d(G4,G2)   d(G4,G3)   0          d(G4,G5)
Gene 5    d(G5,G1)   d(G5,G2)   d(G5,G3)   d(G5,G4)   0

Metrics:
– Pearson correlation
– Euclidean distance, …
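A sketch of computing such a distance matrix with SciPy (the expression matrix is randomly generated for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

expr = np.random.default_rng(1).normal(size=(5, 10))        # 5 illustrative gene profiles

D_euclidean = squareform(pdist(expr, metric='euclidean'))    # 5x5 symmetric distance matrix
D_pearson   = squareform(pdist(expr, metric='correlation'))  # 1 - Pearson correlation as a distance
```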
Hierarchical clustering

          Gene 1   Gene 2   Gene 3   Gene 4   Gene 5
Gene 1    0        1        5        6        3
Gene 2    3        0        6        4        4
Gene 3    2        5        0        8        6
Gene 4    5        4        4        0        3
Gene 5    4        6        3        6        0

– The closest pair (gene 1, gene 2) is merged into a new cluster C1.
How to calculate the distance between C1 and the remainder of the genes, or between two
clusters?
Hierarchical clustering
When clusters $c_u$ and $c_v$ are merged into a new cluster $M$, the average-linkage
distance to any other cluster $c_k$ can be updated from the previous iteration:
$D_{avg}(k,u)=\dfrac{\sum_{i\in c_k,\,j\in c_u} d(i,j)}{|c_k|\,|c_u|},\qquad
D_{avg}(k,v)=\dfrac{\sum_{i\in c_k,\,j\in c_v} d(i,j)}{|c_k|\,|c_v|}$
$D_{avg}(k,M)=\dfrac{\sum_{i\in c_k,\,j\in c_v} d(i,j)+\sum_{i\in c_k,\,j\in c_u} d(i,j)}{|c_k|\,|c_v|+|c_k|\,|c_u|}
=\dfrac{|c_v|\,D_{avg}(k,v)+|c_u|\,D_{avg}(k,u)}{|c_v|+|c_u|}$
Hierarchical clustering
Different variants of average linkage exist, based on how this updating is done; a sketch
using SciPy follows below.
• unweighted pair-group method average (UPGMA): the distance between any two clusters X
and Y is taken to be the average of all distances between pairs of objects "x" in X and
"y" in Y, that is, the mean distance between elements of each cluster (formula on the
previous page)
• weighted pair-group average (WPGMA): identical to UPGMA except that the size (number of
objects contained in a cluster) of the respective clusters is used as a weight
• within-group (centroid-based) clustering: similar to UPGMA except that clusters are
merged and a cluster average is used for further calculations rather than the individual
cluster elements
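A sketch of the three variants with SciPy's hierarchical clustering (illustrative data; SciPy names the methods 'average', 'weighted' and 'centroid'):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

expr = np.random.default_rng(1).normal(size=(20, 10))   # illustrative gene x condition matrix
d = pdist(expr, metric='euclidean')                      # condensed pairwise distance matrix

Z_upgma    = linkage(d, method='average')      # UPGMA (unweighted pair-group average)
Z_wpgma    = linkage(d, method='weighted')     # WPGMA (weighted pair-group average)
Z_centroid = linkage(expr, method='centroid')  # centroid-based: clusters represented by their mean
```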
Hierarchical clustering
– The two selected clusters (genes) are merged to produce a new object
(e.g. the average of the two merged objects)

          Gene 3     Gene 4     Gene 5     C1
Gene 4    d(G4,G3)   0          d(G4,G5)   d(G4,C1)
Gene 5    d(G5,G3)   d(G5,G4)   0          d(G5,C1)
C1        d(C1,G3)   d(C1,G4)   d(C1,G5)   0

– Distances are recalculated (between genes, between merged objects, between genes &
merged objects)
The process is repeated until all genes are clustered.
Hierarchical clustering
Extracting clusters: by cutting the tree at a certain level, which determines the
number of clusters
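A minimal sketch of cutting the dendrogram with SciPy (the data and the cut levels are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

expr = np.random.default_rng(1).normal(size=(20, 10))     # illustrative data
Z = linkage(pdist(expr, metric='euclidean'), method='average')

labels_k = fcluster(Z, t=4, criterion='maxclust')     # cut so that exactly 4 clusters remain
labels_h = fcluster(Z, t=2.5, criterion='distance')   # or cut the tree at a fixed height
```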
Hierarchical clustering
• Properties
– deterministic
– user-defined parameters:
• metric definition
• rule
• cut-off value
• Advantages
– visualisation possible: dendrogram
– length of the branches is indicative of the distance between the clusters
• Disadvantages
– the number of clusters is user-defined
K-means clustering
K-means clustering seeks to partition a set of data into a specified number of groups K
by minimizing some numerical criterion, low values of which are considered indicative of
a 'good solution', e.g. try to find the partition of the n observations into K groups
that minimizes the within-group sum of squares.
The problem appears simple: consider every possible partition of the n individuals into
K groups and select the one with the lowest within-group sum of squares.
• However, enumeration of every possible partition is in practice impossible.
• Greedy approach: search for the minimum value of the clustering criterion by rearranging
existing partitions and keeping the new one only if it provides an improvement.
Such (non-exhaustive) algorithms of course do not guarantee finding the minimum of the
criterion.
K-means clustering
• Drawbacks:
– K-means works well for compact clusters of similar size, but often fails when the shape
of the cluster is more complex, or when there is a large difference in the number of
points between clusters.
– The objective function decreases as a function of the
number of clusters in a nested sequence of partitions (a
new partition is obtained by splitting in two one cluster
from the previous partition). Given this property, the best
partitioning of the data would be when K = n clusters and
each point is a cluster by itself. To address this problem,
either the number of classes must be known beforehand,
or some additional criteria must be used that penalizes
partitions with large numbers of clusters.
K-means clustering
Criterion: distance
to the cluster
centroid is
minimal
K-means clustering
Predefined number of clusters = 5
Initialisation : randomly choose cluster centers (dark points)
K-means clustering
Attribute each point (gene) to cluster with closest center
K-means clustering
Attribute each point (gene) to cluster with closest center
K-means clustering
Recalculate cluster centers = mean expression profile of genes in cluster
K-means clustering
Repeat the whole process
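A minimal NumPy sketch of the steps above (random initialisation, assignment, centroid update, repeat); it is illustrative only, and a library routine such as sklearn.cluster.KMeans would normally be used instead:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: random centers, assign to closest center, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # 1) random initial cluster centers
    for _ in range(n_iter):
        # 2) attribute each point (gene) to the cluster with the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) recalculate each center as the mean expression profile of its genes
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

X = np.random.default_rng(1).normal(size=(200, 10))   # illustrative gene x condition data
labels, centers = kmeans(X, k=5)
```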
K-means clustering
• Properties
– user-defined parameters
• number of clusters
• number of iterations
– non-deterministic: dependent on the initialisation
• Advantages
– easy to understand
– fast
• Disadvantages
– the number of clusters has to be user-specified
– the outcome is parameter-sensitive (careful parameter fine-tuning is essential)
– all genes in the dataset will be clustered: the presence of noisy genes will disturb the
average profile and the quality of the cluster of interest
K-means clustering
Sensitivity of K-means towards parameter settings
K-means, nr. of clusters: 10; nr. of iterations: 100
→ number of clusters too low: big clusters containing noise
K-means clustering
K-means, nr. of clusters: 60; nr. of iterations: 100
Conclusion
• Case studies
• Clustering
– Distance/similarity measures
• Euclidean
• Pearson
• Spearman
– Clustering algorithms
• Agglomerative hierarchical
• K means
Cluster validation
[Figure: the same dataset can be analysed with different preprocessing choices
(Preprocessing 1, 2), different clustering algorithms (Algorithm 1, 2, 3) and different
parameter settings (Setting 1, 2, 3), each giving a different partition.]
Why cluster validation?
• Different algorithms, parameters
• Intrinsic properties of the dataset (sensitivity to noise, to
outliers)
• Internal validation (one experiment)
• Comparisons of partitions (between cluster
runs)
– Relative validation
– External validation
Cluster validation
• Statistical validation
– Comparison within one experiment
• Cluster coherence testing (internal)
• Figure of Merit (internal)
• Sensitivity analysis
– Comparing between clustering results of
different runs (relative, external)
• RAND index
• Jaccard coefficient
• Biological validation
Cluster validation
• Statistical validation
– Comparison within one experiment
• Cluster coherence testing
• Sensitivity analysis
• Figure of Merit
– Comparing between clustering results of
different runs
• RAND index
• Jaccard coefficient
• Biological validation
Cluster validation (internal)
Cluster coherence testing
Based on biological intuition, a cluster result can be considered reliable if
the within cluster distance is small (i.e. all genes retained are tightly
coexpressed) and the cluster has an average profile well delineated from
the remainder of the dataset (maximal intercluster distance)
• Dunn’s validity index
• Silhouette coefficient
Remark: when comparing the outcome of different algorithms with one of these statistical
metrics, keep in mind that an algorithm that internally optimizes the same metric used for
validation will tend to score better.
Cluster validation (internal)
Cluster coherence testing
Silhouette coefficient:
$s(i)=\dfrac{b_i-a_i}{\max(a_i,\,b_i)}$
• $a_i$ = average dissimilarity of i with all other data within the same cluster
(intra-cluster distance)
• $b_i$ = average dissimilarity of i to the closest neighboring cluster (inter-cluster
distance)
The average s(i) over all data of a cluster is a measure of how tightly grouped all the
data in the cluster are.
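A sketch using scikit-learn to compute silhouette values (the data and the cluster number are invented):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.random.default_rng(2).normal(size=(100, 10))          # illustrative expression data
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

s_per_point = silhouette_samples(X, labels)   # s(i) = (b_i - a_i) / max(a_i, b_i) per point
s_overall   = silhouette_score(X, labels)     # mean s(i): higher = tighter, better-separated clusters
```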
Cluster validation (internal)
Cluster coherence testing
The Dunn index aims to identify dense and well-separated clusters. It is defined as the
ratio between the minimal inter-cluster distance, min(d(i,j)), and the maximal
intra-cluster distance, max(d(k)); the larger the Dunn index, the better.
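A sketch of one common Dunn index variant (cluster diameter = largest within-cluster distance, inter-cluster distance = smallest distance between members of two different clusters):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    """Dunn index: minimal inter-cluster distance divided by maximal intra-cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    max_intra = max(pdist(c).max() for c in clusters if len(c) > 1)     # largest cluster diameter
    min_inter = min(cdist(a, b).min()                                   # closest pair of clusters
                    for i, a in enumerate(clusters) for b in clusters[i + 1:])
    return min_inter / max_intra    # the larger, the denser and better separated the clusters
```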
Cluster validation (internal)
Figure of Merit (sensitivity towards an experiment)
Leave one out cross validation
The result of this process is a family of partitions C1, …, CP, each one computed over a
slightly different dataset. The agreement between all these partitions gives a measure of
the consistency of the algorithm, and their predictive power (over the removed column)
gives a measure of the ability of the algorithm to generate meaningful partitions.
Cluster validation (internal)
Figure of Merit (sensitivity towards an experiment)
• The tested cluster algorithm is applied to all experimental conditions except the
left-out condition
• Hypothesis: if the cluster algorithm is robust, it can predict the measured values of
the left-out condition
• To estimate the predictive power of the algorithm, the FOM is calculated; this is
repeated for all conditions and the aggregate FOM is calculated
FOM when condition e is left out (leave out each condition in turn), with clusters
C1, …, CK built on the remaining conditions and n genes in total:
$FOM(e,K)=\sqrt{\dfrac{1}{n}\sum_{k=1}^{K}\sum_{i\in C_k}\big(x_{ie}-\bar{x}_{C_k,e}\big)^2}$
Summed over all m left-out conditions:
$FOM(K)=\sum_{e=1}^{m}FOM(e,K)$
FOM is the root mean square deviation, in the left-out condition e, of the individual gene
expression levels relative to their cluster means.
Cluster validation (internal)
Figure of Merit (predictive test)
[Figure: clusters C1…C5 are determined on all conditions except the left-out one; within
each cluster the squared deviations of the left-out condition from the cluster mean,
(x − x̄)², are summed. This sum of squares should be minimal.]
$FOM(e,K)=\sqrt{\dfrac{1}{n}\sum_{k=1}^{K}\sum_{i\in C_k}\big(x_{ie}-\bar{x}_{C_k,e}\big)^2}$
Define the quality of a clustering algorithm as the spread of the expression values inside
the clusters, measured on the sample (condition) that was not used for clustering.
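A sketch of the aggregate FOM computation, using k-means as the tested algorithm (the choice of algorithm and cluster number are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def figure_of_merit(expr, n_clusters=5):
    """Leave out each condition, cluster on the rest, and measure the RMS deviation of the
    left-out condition around its cluster means; sum over all left-out conditions."""
    n_genes, n_cond = expr.shape
    total = 0.0
    for e in range(n_cond):                                  # leave out condition e
        rest = np.delete(expr, e, axis=1)
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(rest)
        sq = 0.0
        for k in range(n_clusters):
            left_out = expr[labels == k, e]
            if left_out.size:
                sq += np.sum((left_out - left_out.mean()) ** 2)
        total += np.sqrt(sq / n_genes)                       # FOM(e, K)
    return total                                             # FOM(K)
```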
Cluster validation
Sensitivity analysis = a way of assigning confidence to cluster membership
– create new in silico replicas of the dataset of interest by adding a small amount of
noise to the original data
– treat the new datasets as the original one and cluster them
– genes consistently clustered together over all in silico replicas are considered robust
towards adding noise
How to determine the noise?
Cluster validation
• Statistical validation
– Comparison within one experiment
• Cluster coherence testing
• Sensitivity analysis
• Figure of Merit
– Comparing between clustering results of
different runs
• RAND index
• Jaccard coefficient
• Biological validation
Cluster validation
Comparing cluster results
Two partitions of n objects into K groups: CA = {CA1, …, CAK} and CB = {CB1, …, CBK}. Each
element CAk and CBk of CA and CB is called a cluster and is identified by its index k. Let
k = IA(x) be the index of the cluster to which a vector x belongs in partition CA (e.g.,
IA(x) = 3 means that x belongs to cluster CA3).
A measure of disagreement between two partitions is the error measure ε(CA,CB), i.e. the
proportion of objects that belong to different clusters.
Cluster labels are unknown:
[Figure: four clustering runs (Cluster exp 1-4), each producing clusters C1, C2, C3, …;
the labels C1, C2, … are arbitrary and cannot be matched directly across runs.]
We need a measure that is label-independent.
Cluster validation
Comparing cluster results
Cluster labels are unknown, so identify pairs of genes (vectors) that cluster consistently
together:
• when two genes cluster together frequently, this indicates they truly belong together
• the more frequently genes cluster together, the more stable (robust) the clustering
How to assess
• RAND INDEX (Yeung et al. 2001)
• Jaccard coefficient (Ben-Hur et al. 2002)
Cluster validation
RAND index
• a statistic designed to assess the degree of agreement between two partitions
• the Rand statistic measures the proportion of pairs of vectors that agree by belonging
either to the same cluster (a) or to different clusters (d) in both partitions
Adjusted RAND index
• adjusted so that the expected value of the RAND index between two random partitions is
zero
The Rand index is defined as the fraction of agreement, i.e. the number of pairs of objects
that are either in the same group in both partitions (a) or in different groups in both
partitions (d), divided by the total number of pairs of objects (a + b + c + d).
The Rand index lies between 0 and 1.
$RAND=\dfrac{a+d}{a+b+c+d}$
a: the number of object pairs that are clustered together in partitioning 1 and 2
b: the number of object pairs that are clustered together in partitioning 1 but not in
partitioning 2
c: the number of object pairs that are clustered together in partitioning 2 but not in
partitioning 1
d: the number of object pairs that are put in different clusters in both partitionings
Cluster validation (relative)
Jaccard coefficient
• the Jaccard coefficient measures the proportion of pairs that belong to the same cluster
(a) in both partitions, relative to all pairs that belong to the same cluster in at least
one of the two partitions (a + b + c)
• the measure is a proportion of agreement between the partitions, but in contrast with
the Rand statistic, the Jaccard coefficient does not consider the pairs that are separated
(belong to different clusters) in both partitions (d)
The Jaccard coefficient is defined as the number of pairs of objects that belong to the
same cluster in both partitions (a) divided by all pairs that belong to the same cluster
in at least one partition (a + b + c).
$J=\dfrac{a}{a+b+c}$
a: the number of object pairs that are clustered together in partitioning 1 and 2
b: the number of object pairs that are clustered together in partitioning 1 but not in
partitioning 2
c: the number of object pairs that are clustered together in partitioning 2 but not in
partitioning 1
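A small sketch counting the pair types a, b, c, d for two partitions and computing both indices (the label vectors are invented; scikit-learn's adjusted_rand_score gives the adjusted variant):

```python
from itertools import combinations

def pair_counts(labels_a, labels_b):
    """a: together in both partitions, b: together only in A,
    c: together only in B, d: separated in both partitions."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            a += 1
        elif same_a:
            b += 1
        elif same_b:
            c += 1
        else:
            d += 1
    return a, b, c, d

a, b, c, d = pair_counts([1, 1, 2, 2, 3], [1, 1, 1, 2, 3])
rand    = (a + d) / (a + b + c + d)   # fraction of pairs on which the two partitions agree
jaccard = a / (a + b + c)             # ignores pairs that are separated in both partitions
```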
Cluster validation
The previous indices are based on counting the number of pairs of vectors that are placed
in the same or different clusters in each partition. For each partition C the relationship
between two vectors, whether they belong to the same cluster or not, can be represented by
a similarity matrix d(i,j), defined by d(i,j) = 1 if xi and xj belong to the same cluster
and d(i,j) = 0 if they belong to different clusters.
[Figure: the binary matrices d(i,j) over genes G1…Gn for partitioning 1 and partitioning 2,
which can be compared entry by entry.]
The Rand index is inversely proportional to the square of the Euclidean distance between
the matrices dA and dB.
Cluster validation
• Statistical validation
– Comparison within one experiment
• Cluster coherence testing
• Sensitivity analysis
• Figure of Merit
– Comparing between clustering results of
different runs
• RAND index
• Jaccard coefficient
• Biological validation
Cluster validation
[Diagram: starting from the dataset,
• small clusters: contain genes with a highly similar profile (+), but some information is
given up in the first step (−) → validate "core" clusters;
• big clusters: contain all real positives (+), but an increasing number of false positives
(−) → extend clusters.
Validation and extension rely on motif finding (DNA level) and on literature/knowledge.]
Cluster validation
Biological validation: test whether a cluster is enriched for genes of a GO category K
using the hypergeometric distribution.
N: total number of genes
K: number of genes in ontology class K
n: number of genes in the cluster = number of trials
x: number of genes in the cluster from GO category K = number of successes
$P(X=x)=\dfrac{\binom{K}{x}\binom{N-K}{n-x}}{\binom{N}{n}}$
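A sketch of this enrichment calculation with SciPy's hypergeometric distribution (the numbers N, K, n, x are invented for illustration):

```python
from scipy.stats import hypergeom

# N: total number of genes, K: genes in the GO category,
# n: genes in the cluster (trials), x: cluster genes in the category (successes)
N, K, n, x = 6000, 40, 100, 8

p_exact      = hypergeom.pmf(x, N, K, n)       # P(X = x)
p_enrichment = hypergeom.sf(x - 1, N, K, n)    # P(X >= x): enrichment p-value of the cluster
```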