
MULTIVARIATE ANALYSIS CLUSTERING
Table of contents
Multivariate analysis clustering
  Introduction
  Case study 1: Clustering gene expression profiles
  Case study 2: Clustering patient profiles
Distance/similarity metrics
  Euclidean distance
  Pearson correlation
  Spearman correlation
  Data rescaling and distance measures
Cluster algorithms
  Hierarchical clustering
Cluster validation
  Introduction
  Statistical validation
  FOM (Figure of Merit)
  Adding noise (sensitivity analysis)
  Comparison of partitions
  Biological validation
Kathleen Marchal / Ghent University
INTRODUCTION
The general goal of clustering is data visualization, understanding data characteristics and inferring something about
an object based on its relation with other objects (guilt by association). Clustering is an unsupervised technique which
uses a distance/similarity metric in combination with an algorithm.
CASE STUDY 1: CLUSTERING GENE EXPRESSION PROFILES
When a biological system is surveyed under different conditions, or when its behavior is profiled during the course of a
dynamic process (i.e. time profiling experiments), different microarray/RNA profiling experiments are performed. It can be
expected that genes involved in the process of interest alter their expression levels between conditions, i.e.
between different time points. Moreover, genes that are co-regulated (either at the transcriptional or at the posttranslational
level) will behave similarly. The goal of clustering is to group together genes that behave similarly (have a similar
expression pattern). Clustering is an unsupervised method (no training data necessary).
An expression profiling experiment results in a huge data matrix in which the rows represent the genes (n rows) and
the columns (m columns) the experimental conditions tested (mutants, time points, patients, referring to the
variables). Each entry (i,j) in the data matrix corresponds to the expression measurement of gene i in condition j. Each
gene (observation) can therefore be represented as a vector in m-dimensional space (see figure). Clustering
techniques aim at finding the natural grouping among the data points. Points that lie close together in m-dimensional
space are grouped together. The mathematical definition of 'close' depends on the distance metric that is used.
Example in 3D space (3 experimental conditions are measured, corresponding to the variables). Points represent the genes or observations. Two
points closely together in this space will have almost identical coordinates (or expression values in the different experiments).
In general the distance metrics to measure 'closeness' are defined in such a way that points with similar coordinates
are considered close to each other in the m-dimensional space. Observations that are considered close to each
other thus have similar expression profiles throughout all variables, which can be interpreted as similar expression
behavior (e.g. coexpression or coregulation of the grouped genes).
CASE STUDY 2: CLUSTERING PATIENT PROFILES
Note that besides clustering genes, experiments can also be clustered. In this case each experiment (now treated as
an observation) is represented in an n-dimensional space (n equal to the number of genes tested, which are now
treated as the variables) by an n-dimensional experiment vector. Distances are calculated between the 'experiment'
vectors.
Figure: the same expression data matrix read in two orientations. For gene profiles, the genes G1, G2, ..., Gn are the observations (rows) and the patients/conditions P1, P2, ..., Pm are the variables (columns). For patient profiles, the patients/conditions are the observations and the genes are the variables.
Because in this case the number of variables (genes) is usually larger than the number of observations
(patients/conditions), clustering is preceded by a dimensionality reduction (PCA). This is needed because most of the
variation in gene expression profiles between patients could be due to only a few genes; directly clustering
e.g. 200 patients for which 10,000 genes have been measured can easily be obscured by genes that have identical
expression profiles throughout all patients or by noise in the data set. Using PCA for dimensionality reduction, the
eigenvectors of the covariance matrix of this data matrix (called eigengenes) are used to perform the clustering (see
exercises). Assuming that the directions of highest variability (the eigenvectors with the largest eigenvalues)
correspond to the biologically most important directions of variation, reducing the dimensionality allows removing
noise and redundant signals from the data and should improve the clustering in the patient dimension.
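A minimal sketch of this preprocessing step, assuming a patients-by-genes matrix X (patient profiles as observations); the toy data, the choice of 10 principal components and the use of scikit-learn/SciPy are illustrative assumptions, not part of the course text.

```python
# Sketch: PCA-based dimensionality reduction before clustering patients.
# X is a (patients x genes) matrix; 10 components is an arbitrary choice.
import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10000))          # 200 patients, 10000 genes (toy data)

pca = PCA(n_components=10)                 # keep the directions of largest variance
scores = pca.fit_transform(X)              # patients projected onto the "eigengenes"

# Cluster the patients in the reduced space instead of the raw 10000-dim space.
Z = linkage(scores, method="average", metric="euclidean")
labels = fcluster(Z, t=4, criterion="maxclust")
print(labels[:10], pca.explained_variance_ratio_[:3])
```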
DISTANCE/SIMILARITY METRICS
The difference between distance and similarity metrics is that a similarity metric increases as vectors become more
similar, while a distance metric decreases as vectors become more similar. A popular distance metric is the
Euclidean distance; a popular similarity metric is the Pearson correlation. Most clustering algorithms work with distance
metrics. Similarity metrics can be converted into distance metrics.
EUCLIDEAN DISTANCE
Euclidean distance is a distance metric derived from the Minkowski metric. The Minkowski metric of order r between two m-dimensional vectors x and y is

d(x, y) = ( Σ_{i=1}^{m} |x_i − y_i|^r )^(1/r)

• Euclidean distance (r = 2): measures the absolute distance between two points in space. This corresponds to the norm of the vector difference of x and y.
• Manhattan distance (r = 1) (for binary clustering).
Geometric interpretation
The red line (a) corresponds to the Euclidean distance, the norm of the vector difference of X and Y. Geometrically this is a straight line, the "as the crow flies" distance
between the points X and Y. The green lines correspond to the Manhattan distance, which geometrically corresponds to a strictly horizontal and vertical path.
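A small sketch of the two Minkowski special cases on made-up expression vectors; the values and the helper function name are illustrative only.

```python
# Sketch: Minkowski metric with r=2 (Euclidean) and r=1 (Manhattan).
import numpy as np

x = np.array([2.0, 4.0, 1.5])   # expression of gene x in 3 conditions (toy values)
y = np.array([1.0, 3.0, 3.5])

def minkowski(x, y, r):
    """General Minkowski distance: (sum |x_i - y_i|^r)^(1/r)."""
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

euclidean = minkowski(x, y, 2)          # same as np.linalg.norm(x - y)
manhattan = minkowski(x, y, 1)          # sum of horizontal/vertical steps
print(euclidean, np.linalg.norm(x - y))
print(manhattan, np.abs(x - y).sum())
```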
PEARSON CORRELATION
Pearson's correlation is a statistical measure of the strength of a linear relationship between two variables X and Y,
giving a value between +1 and −1 inclusive, where 1 means a perfect positive correlation between X and Y, 0 means no
correlation, and −1 means perfect negative correlation. For two profiles x and y measured over m conditions,

r(x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / ( sqrt(Σ_i (x_i − x̄)²) · sqrt(Σ_i (y_i − ȳ)²) )

• Pearson correlation assumes the data to be normally distributed.
• It is the rescaled covariance between the expression profiles of genes x and y (rescaled by the standard deviations of either expression profile).
• For mean-centered data it measures how similar the directions are in which two expression vectors point (similarity metric). The Pearson correlation can be interpreted as the cosine of the angle between vectors A and B.
Figure: expression vectors A and B in two-dimensional space (two time points measured). In one panel the vectors point in the same direction (angle a = 0) and the Pearson correlation equals 1; in the other panel the vectors are orthogonal and the Pearson correlation equals 0.
Mathematical proof: from equation (4) in the proof it can be seen that cos(a, b) is identical to s(x, y) when the expression vectors are mean
centered (x̄ = 0 and ȳ = 0).
A distance metric for two variables X and Y, known as Pearson's distance, can be defined from their correlation coefficient as d(X, Y) = 1 − r(X, Y).
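A short sketch illustrating the two statements above on toy profiles: the Pearson correlation equals the cosine of the angle between the mean-centered vectors, and Pearson's distance is 1 − r. The numbers are made up.

```python
# Sketch: for mean-centered vectors the Pearson correlation equals the cosine
# of the angle between them; Pearson's distance is then 1 - r.
import numpy as np

x = np.array([1.0, 3.0, 2.0, 5.0])
y = np.array([2.0, 6.0, 4.5, 9.0])

r = np.corrcoef(x, y)[0, 1]                       # Pearson correlation

xc, yc = x - x.mean(), y - y.mean()               # mean-center both profiles
cosine = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

pearson_distance = 1 - r                          # similarity turned into a distance
print(r, cosine, pearson_distance)
```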
SPEARMAN CORRELATION
Spearman's correlation coefficient is a statistical measure of the strength of a monotonic relationship between paired
data. Unlike Pearson's correlation, there is no requirement of normality; hence it is a nonparametric
statistic. The calculation of Spearman's correlation coefficient and the subsequent significance testing only require
that the paired data are monotonically related.
Let us consider some examples to illustrate it. The following table gives x and y values for the relationship. From the
graph we can see that this is a perfectly increasing monotonic relationship.
The calculation of Pearson's correlation for these data gives a value of 0.699, which does not reflect that there is indeed a
perfect relationship between the data, because Pearson's correlation only tests whether there is a linear relationship
between the data. Spearman's correlation for these data, however, is 1, reflecting the perfect monotonic relationship.
Spearman's correlation works by calculating Pearson's correlation on the ranked values of the data. Ranking (from low
to high) is obtained by assigning a rank of 1 to the lowest value, 2 to the next lowest, and so on.
If we look at the plot of the ranked data, then we see that they are perfectly linearly related.
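A small sketch of this point on a toy, perfectly monotonic but non-linear relationship (not the values from the table above): Spearman's correlation is simply Pearson's correlation computed on the ranks.

```python
# Sketch: Spearman's correlation is Pearson's correlation computed on ranks.
import numpy as np
from scipy.stats import pearsonr, spearmanr, rankdata

x = np.arange(1, 11, dtype=float)
y = np.exp(x)                                  # perfectly increasing, non-linear

r_pearson, _ = pearsonr(x, y)                  # < 1: only measures linearity
r_spearman, _ = spearmanr(x, y)                # 1: perfect monotonic relationship
r_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))   # identical to Spearman

print(round(r_pearson, 3), r_spearman, r_on_ranks)
```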
In the figures below various samples and their corresponding sample correlation coefficient values are presented. The
first three represent the “extreme” monotonic correlation values of -1, 0 and 1:
Invariably what we observe in a sample are values as follows:
DATA RESCALING AND DISTANCE MEASURES
Note that the Euclidean distance is sensitive to scaling and to differences in average expression level, whereas correlation is
not. In the example below the Euclidean distance between B and, respectively, A, A' and A'' differs, whereas the
Pearson correlation is the same.
Figure: points A, A' and A'' lie along the same direction, so the angle a with B (and hence the Pearson correlation) is the same for all of them, while their Euclidean distances to B differ.

Figure: all points A, A' and A'' are perfectly correlated (Pearson correlation = 1), yet their Euclidean distance is not the same and can be large. Suppose these points represent gene expression profiles in the n-dimensional space: plotted as expression against time, A, A' and A'' have the same shape but different absolute expression levels.
The Pearson correlation thus implies an internal variance rescaling (it is the variance-rescaled covariance).
Depending on how you want to identify clusters of gene expression, other measures and/or data rescalings can be
applied. Most frequently, gene expression profiles are mean centered and variance rescaled, so that the Pearson-correlation-derived distance and the Euclidean distance give the same results.
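A short sketch of this equivalence, assuming population-standard-deviation z-scoring: for mean-centered, variance-rescaled profiles of length m, the squared Euclidean distance equals 2·m·(1 − r). The profiles are toy values.

```python
# Sketch: after mean-centering and variance rescaling (z-scoring, ddof=0),
# squared Euclidean distance and Pearson correlation carry the same information:
# d^2 = 2*m*(1 - r) for profiles of length m.
import numpy as np

a = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
b = 100.0 + 20.0 * a + np.array([0.3, -0.2, 0.1, 0.0, -0.1])   # shifted/scaled copy

def zscore(v):
    return (v - v.mean()) / v.std()          # population std (ddof=0)

az, bz = zscore(a), zscore(b)
r = np.corrcoef(a, b)[0, 1]
d2 = np.sum((az - bz) ** 2)

m = len(a)
print(d2, 2 * m * (1 - r))                   # the two numbers agree
```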
Figure: before rescaling, the profiles A, A' and A'' have Euclidean distance ≠ 0 although their Pearson correlation is 1; after mean centering and variance rescaling they coincide, so the Euclidean distance is 0 and the Pearson correlation is still 1.
A disadvantage of variance rescaling, or of the Pearson correlation (which implicitly performs the variance
rescaling), is that noise in the data is blown up: a coincidentally (noisy) measured expression value in one condition can
result in detecting false positive correlations with the Pearson correlation.
Figure: after variance rescaling, a noisy profile (A'') can coincidentally appear well correlated with truly changing profiles (A, A').
The Spearman correlation looks at ranks instead of absolute expression values and thus performs an internal rescaling by
converting all absolute values to ranks (the range of variation becomes the same for all profiles). Two expression profiles will
be correlated if their values have similar ranks in the different conditions. By using ranks, it is more robust to noise. In
the figure below, the noisy profile (red) that by coincidence correlates with the truly changing profiles will be retained
when calculating the Pearson correlation or the Euclidean distance on the rescaled data, but not when calculating the
rank-based correlation.
Figure: the same data after variance rescaling (compared with the Pearson correlation or the Euclidean distance) and after rank-based rescaling (compared with the Spearman correlation). Variance rescaling brings a noisy low-range profile (raw values around 1–2) into the same range as a truly changing profile (raw values around 4000–7000), so the two appear correlated; rank-based rescaling does not retain this spurious correlation.
OVERVIEW DISTANCE/SIMILARITY MEASURES
CLUSTER ALGORITHMS
A way to classify the clustering algorithms is based on how the algorithm forms the groups: hierarchical algorithms
work on successive splitting (divisive clustering) or merging (agglomerative clustering) of the groups, depending on a
measure of distance or similarity between objects, to form a hierarchy of clusters, while partitioning algorithms search
for a partition of the data that optimizes a global measure of quality for the groups, usually based on distance
between objects.
Hierarchical algorithms are also classified by the way the distances or similarities are updated (linkage) after splitting
or merging of the groups, which has a great influence on the resulting clustering. Hierarchical algorithms can generate
partitions of the data as well, and are extensively used for this purpose, because each level of the hierarchy is a
partition of the data.
Another way to classify the algorithms is based on their output: in hard clustering the output is a partition of the data,
while in soft (i.e., fuzzy) clustering the output is a membership function, so each pattern can belong to more than one
group with some degree of membership. A fuzzy clustering naturally defines a partition of the data, given by the
maximum membership of each object. The quality of a clustering algorithm is often based on its ability to form
meaningful partitions.
The selection of a particular algorithm should be strongly related to the problem at hand. Each algorithm has its own
strengths and weaknesses, and is better adapted to a particular task. For example, hierarchical clustering algorithms
are extremely powerful for exploratory data analysis because they do not need prior specification of the number of
clusters, and their output can be visualized as a tree structure, called a dendrogram. On the other hand, when using
partitioning techniques, the groups are usually defined by a representative vector, simplifying the description of the
resulting clusters.
HIERARCHICAL CLUSTERING
Agglomerative (bottom-up) and divisive (top-down) algorithms can be distinguished.

• Agglomerative: a "bottom up" approach; each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
• Divisive: a "top down" approach; all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
In the context of expression profiling this is also referred to as Eisen clustering, because M. Eisen was the first to use an
agglomerative hierarchical clustering algorithm on expression data: Eisen MB, Spellman PT, Brown PO, Botstein D.
Cluster analysis and display of genome-wide expression patterns.
Proc Natl Acad Sci U S A. 1998 Dec 8;95(25):14863-8.
http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=9843981
Method outline:
Agglomerative hierarchical clustering creates a hierarchical tree of similarities between the vectors, called a
dendrogram. It starts with a family of clusters with one vector each, and merges the clusters iteratively based on
some distance measure until there is only one cluster left, containing all the vectors. For a problem with n objects to
be clustered, the algorithm starts with n clusters containing only one vector each, Ci = {xi}, i = 1, 2, . . . , n. The initial
distance between each pair of clusters is defined by the distance between their elements d(Ci,Cj) = d(xi, xj ). The
algorithm repeatedly merges the two nearest clusters, and updates all the distances relative to the newly formed
cluster, until there is only one cluster left, containing all the vectors.
The figure below shows an example of the complete process applied to 5 genes and 3 experiments (5 vectors of length 3).
Initially there are 5 clusters, C1, . . . ,C5, each one containing one gene. In the first step, as each cluster contains only
one element, the distance between the clusters is defined as the Euclidean distances between the vectors that belong
to them. The closest vectors are genes 1 and 4, so they are merged into a new cluster C14. To continue the process,
the distances between the unchanged clusters and the new cluster C14 are computed as a function of their distance to
C1 and C4. There is a need to compute the distances d(C2,C14), d(C3,C14), and d(C5,C14). Distances between
nonchanging clusters (in this instance, genes 2, 3, and 5) do not need to be updated. Based on the new distances, a
new pair of nearest clusters is selected. In this case clusters C3 and C5 are merged into a new cluster C35 , and the new
distances are computed for this new cluster relative to the unchanged clusters C14 and C2. In the new set of distances,
the nearest clusters are C2 and C14, so they are merged into a new cluster C124. Finally, the two remaining clusters,
C35 and C124, are merged into a final cluster, C12345, that includes all five genes.
The dendrogram tree (Figure 4.6f) summarizes the whole process. The length of the horizontal lines indicates the distance
between the clusters. The process does not define a single partition of the data, but a sequence of nested partitions C1 =
{C1,C2,C3,C4,C5}, C2 = {C14,C2,C3,C5}, C3 = {C14,C2,C35}, C4 = {C124,C35}, and C5 = {C12345}, each partition containing
one cluster less than the previous partition. To obtain a partition with K clusters, the process must be stopped K − 1 steps
before the end. For example, stopping the process before the last merging (K = 2) will result in two clusters, C124 and
C35 (Figure 4.6f).
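A minimal sketch of the agglomerative procedure with SciPy, assuming a 5 genes × 3 experiments matrix of made-up values (not the data behind the figure); average linkage and Euclidean distance are illustrative choices.

```python
# Sketch of agglomerative hierarchical clustering with SciPy on 5 toy genes.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

genes = ["gene1", "gene2", "gene3", "gene4", "gene5"]
X = np.array([[1.0, 2.0, 3.0],     # gene 1
              [4.0, 4.5, 5.0],     # gene 2
              [0.5, 3.0, 6.0],     # gene 3
              [1.1, 2.1, 2.9],     # gene 4 (close to gene 1)
              [0.6, 3.2, 5.8]])    # gene 5 (close to gene 3)

# Each merge step is recorded in the linkage matrix Z: every row is
# (cluster_i, cluster_j, distance, number of leaves in the new cluster).
Z = linkage(X, method="average", metric="euclidean")
print(Z)

tree = dendrogram(Z, labels=genes, no_plot=True)   # tree structure; plot with matplotlib if desired
print(tree["ivl"])                                  # leaf order in the dendrogram
```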
Example on gene expression data: initially each cluster contains a single gene.

• A pairwise distance matrix is calculated for all genes (clusters):

         Gene 1     Gene 2     Gene 3     Gene 4     Gene 5
Gene 1   0          d(G1,G2)   d(G1,G3)   d(G1,G4)   d(G1,G5)
Gene 2   d(G2,G1)   0          d(G2,G3)   d(G2,G4)   d(G2,G5)
Gene 3   d(G3,G1)   d(G3,G2)   0          d(G3,G4)   d(G3,G5)
Gene 4   d(G4,G1)   d(G4,G2)   d(G4,G3)   0          d(G4,G5)
Gene 5   d(G5,G1)   d(G5,G2)   d(G5,G3)   d(G5,G4)   0

• The distance matrix is searched for the two most similar genes (clusters).
• The two selected clusters are merged to produce a new object (here genes 1 and 2 are merged into C1) and the distance matrix is updated:

         Gene 3     Gene 4     Gene 5     C1
Gene 3   0          d(G3,G4)   d(G3,G5)   d(G3,C1)
Gene 4   d(G4,G3)   0          d(G4,G5)   d(G4,C1)
Gene 5   d(G5,G3)   d(G5,G4)   0          d(G5,C1)
C1       d(C1,G3)   d(C1,G4)   d(C1,G5)   0
The process is repeated until all objects are in one cluster.
The question is how do we calculate the distance between those new objects and existing genes/objects?
Hierarchical clustering variants:
Several variants of hierarchical clustering exist, which differ from each other in the metrics and rules used to
calculate the distances. The metric defines the mathematical formulation of the distance (e.g. Euclidean,
correlation-based or mutual information). The rules (linkage) define how the distance between merged objects is calculated.
http://www.molmine.com/help/algorithms/linkage.htm
• Single linkage clustering: the distance between two clusters i and j is calculated as the minimum distance between a member of cluster i and a member of cluster j, i.e. the distance between the two most similar profiles in the two clusters. Consequently, this technique is also referred to as the minimum, or nearest-neighbour, method. This method tends to produce clusters that are 'loose', because clusters can be joined if any two members are close together. In particular, this method often results in 'chaining', or the sequential addition of single samples to an existing cluster (see weblink above). This produces trees with many long, single-addition branches representing clusters that have grown by accretion.

• Complete linkage: the maximum or furthest-neighbour method. The distance between two clusters is calculated as the greatest distance between members of the relevant clusters, i.e. the distance between the two least similar profiles. Not surprisingly, this method tends to produce very compact clusters of elements, and the clusters are often very similar in size.

• Average linkage clustering: the distance between two clusters is calculated as the average distance between the profiles of the two clusters.
Based on these rules, if we have just merged Cu and Cv into Cj, the distance from Cj to each other cluster Ck can be calculated efficiently as follows:
Different variants of average linkage exist.

1. Unweighted pair-group method average (UPGMA): the distance between any two clusters X and Y is taken to be the average of all distances between pairs of objects x in X and y in Y, that is, the mean distance between elements of each cluster. The two clusters with the lowest average distance are joined. In the formula below (10), the averages are weighted by the number of taxa in each cluster at each step.
2. Weighted pair-group method with averaging (WPGMA): identical to UPGMA except that the distance to a newly merged cluster is calculated as the simple average of the distances to its two subclusters, rather than an average weighted by the number of objects they contain. Though computationally easier, when there are unequal numbers of entities in the clusters, the distances in the original matrix do not contribute equally to the intermediate calculations. This method (rather than UPGMA) should be used when the cluster sizes are suspected to be greatly uneven. Note that the terms weighted and unweighted refer to the final result, not the math by which it is achieved: the simple averaging in WPGMA produces a weighted result, and the proportional averaging in UPGMA produces an unweighted result.
Previous iteration: clusters k, u and v; u and v are merged into M.

D_avg(k, u) = ( Σ_{i,j} d(i_k, j_u) ) / ( |c_k| · |c_u| )

D_avg(k, v) = ( Σ_{i,j} d(i_k, j_v) ) / ( |c_k| · |c_v| )

D_avg(k, M) = ( Σ_{i,j} d(i_k, j_v) + Σ_{i,j} d(i_k, j_u) ) / ( |c_k| · |c_v| + |c_k| · |c_u| )
            = ( |c_k| · |c_v| · D_avg(k, v) + |c_k| · |c_u| · D_avg(k, u) ) / ( |c_k| · |c_v| + |c_k| · |c_u| )

Calculation of the UPGMA distance update.
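A small numerical check of the update rule above, under the assumption that clusters are simple lists of points and D_avg is the mean over all pairs; the cluster contents and helper names are made up.

```python
# Sketch: the average-linkage distance from cluster k to the merged cluster
# M = u + v equals the size-weighted average of D_avg(k, u) and D_avg(k, v).
import numpy as np
from itertools import product

def d(p, q):
    return np.linalg.norm(np.asarray(p) - np.asarray(q))

def d_avg(c1, c2):
    """Mean distance over all pairs (one element from each cluster)."""
    return np.mean([d(p, q) for p, q in product(c1, c2)])

k = [(0.0, 0.0), (1.0, 0.0)]
u = [(5.0, 5.0)]
v = [(6.0, 5.0), (6.0, 6.0), (7.0, 5.0)]

lhs = d_avg(k, u + v)                                   # recomputed from scratch
rhs = (len(u) * d_avg(k, u) + len(v) * d_avg(k, v)) / (len(u) + len(v))
print(lhs, rhs)                                         # identical
```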
3. Within-group clustering (centroid-based clustering): similar to UPGMA except that, when clusters are merged, a cluster average (centroid) is used for further calculations rather than the individual cluster elements.
4. Ward's method: the total sum of the squared deviations from the mean of a cluster is calculated, and clusters are joined in such a manner that they produce the smallest increase in the sum of squared error.
Extracting clusters: cutting the tree at a certain level determines the number of clusters.
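A short sketch of extracting flat clusters from the tree with SciPy's fcluster, either by a distance threshold or by a requested number of clusters; the data and threshold value are illustrative.

```python
# Sketch: cutting the dendrogram at a distance threshold or at a chosen K.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (10, 4)),      # two artificial groups of genes
               rng.normal(3, 0.3, (10, 4))])
Z = linkage(X, method="complete")

by_distance = fcluster(Z, t=2.0, criterion="distance")   # cut at distance 2.0
by_count = fcluster(Z, t=2, criterion="maxclust")         # ask for 2 clusters
print(by_distance)
print(by_count)
```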
Properties of hierarchical clustering:
• deterministic
• user-defined parameters:
  o metric definition
  o rules (linkage)
  o cut-off value
Advantages
• visualisation is possible: the dendrogram
• the (horizontal) length of the branches is indicative of the distance between the clusters
Disadvantages
• The number of clusters has to be defined by the user, i.e. the user defines where to cut the tree by choosing a threshold distance. The choice of this distance is arbitrary.
• If the wrong clusters or genes are merged during the first iterations of the algorithm (e.g. because of inadequate user-defined parameters or significant noise in the data), the error cannot be repaired.
(based on Arabidopsis cDNA arrays by Maleck et al., 2000)
K-MEANS CLUSTERING (EXAMPLE OF A PARTITIONING ALGORITHM)
Most of the partitioning algorithms are based on the minimization of an objective function that computes the quality
of the clusters. The most common objective function is the squared error to the centers of the clusters. Let C = {C1, . . . , CK}
be a clustering of the data, and let μk be a vector representing the center of the cluster Ck, for k = 1, . . . , K. The
objective function J is defined by

J = (1/n) Σ_{k=1}^{K} Σ_{x_i ∈ C_k} || x_i − μ_k ||²
The objective function J is the average squared distance between each point and the center of the cluster to which the
point belongs. It can also be interpreted as a measure of how good the centers μk are as representatives of the
clusters. One limitation of this objective function is the need for a cluster center to represent the points. Usually, for
this objective function, the centers are defined as the average of all points in the group,

μ_k = (1/n_k) Σ_{x_i ∈ C_k} x_i

where n_k is the number of points in cluster Ck.
Partitioning algorithms based on the minimization of an objective suffer from two major drawbacks. The first is that they
work well with compact clusters of similar size, but often fail when the shape of the clusters is more complex, or when
there is a large difference in the number of points between clusters. The second drawback is that the objective
function decreases as a function of the number of clusters in a nested sequence of partitions (a new partition is
obtained by splitting one cluster of the previous partition in two). Given this property, the best partitioning of the
data would be obtained when K = n and each point is a cluster by itself. To address this problem, either the number of
classes must be known beforehand, or some additional criterion must be used that penalizes partitions with large
numbers of clusters.

The objective function is only a measure of the quality of a partition of the data. The naive way to find the best
partition is to compute the objective function for all possible partitions, and select the one that has the minimum
value of the objective function. The number of possible partitions grows exponentially with n, the size of the set, so this
is infeasible except for very small problems (see below).
K-means clustering seeks to partition a set of data into a specified number of groups K by minimizing some numerical
criterion, low values of which are considered indicative of a 'good' solution. The most commonly used approach is to
try to find the partition of the n objects into K groups which minimizes, e.g., the within-group sum of squares over all variables.
The problem then appears relatively simple: consider every possible partition of the n individuals into K groups
and select the one with the lowest within-group sum of squares. Unfortunately, in practice the problem is not so
straightforward. The numbers involved are so vast that complete enumeration of every possible partition remains
impossible.

The impracticality of examining every possible solution has led to the development of iterative algorithms designed to
search for the minimum value of the clustering criterion by rearranging existing partitions and keeping the new one
only if it provides an improvement. Such algorithms do of course not guarantee finding the global minimum of the criterion,
e.g. they can get stuck in local minima.
One of the most common iterative algorithms is the k-means algorithm, broadly used because of its simplicity of
implementation, its convergence speed, and the usually good quality of the clusters (for a limited family of problems).
The algorithm is presented with a set of n vectors x1, . . . , xn and a number K of clusters, and computes the centroids
μ1, μ2, . . . , μK that minimize the objective function J. One of the most used implementations of the algorithm,
called Forgy's cluster algorithm, starts with a random selection of the centroids, and then repeatedly assigns each
vector to the nearest centroid and updates the centroid positions, until convergence is reached, i.e. when the update
process no longer changes the position of the centroids. The procedure can be done in batch or iteratively. In the first
case all the vectors are assigned to a centroid before the update is performed; in the second case, the centroids are
updated after each assignment is made.
The essential steps in these algorithms are as follows:
• Find some initial partition of the individuals into the required number of groups (such an initial partition could, for example, be provided by a hierarchical clustering solution).
• Calculate the change in the clustering criterion produced by moving each individual from its own cluster to another cluster.
• Make the change that leads to the greatest improvement in the value of the clustering criterion.
• Repeat steps 2 and 3 until no move of an individual causes the clustering criterion to improve.
Method outline
• k initial cluster centers are randomly chosen.
• Each gene is attributed to the cluster to which it is closest in distance.
• For each cluster, a new cluster center is calculated as the average profile of the genes belonging to that cluster.

Properties
• User-defined parameters:
  o distance criterion
  o number of initial clusters (not necessarily the number of final clusters)
  o number of iterations (or run until convergence)
• Non-deterministic: the outcome depends on the initialization.

The process is repeated until the cluster centers are stabilized (the algorithm has converged) or until a user-defined number of
iterations has been completed. The clusters are located in the regions of the space where the density of the points is highest.
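A minimal sketch of the method outline above (Forgy-style batch k-means) on toy data; in practice a library implementation such as scikit-learn's KMeans would be used, and the function written here is only illustrative.

```python
# Sketch of batch k-means: random initial centers, assignment to the nearest
# center, recomputation of each center as the average profile of its genes.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iter):
        # assign every gene to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the average profile of its genes
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                 # converged
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (20, 6)), rng.normal(4, 0.5, (20, 6))])
labels, centers = kmeans(X, k=2)
print(labels)
```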
Advantages:
• easy to understand
• fast

Disadvantages:
• the number of clusters has to be user-specified
• the outcome is parameter sensitive (elaborate parameter fine-tuning is essential)
• all genes in the dataset will be clustered: noisy genes will also end up in a cluster and disturb the average profile and the quality of the cluster of interest.
Example: Cho et al. 1998, yeast cell cycle dataset; interesting clusters, relevant for the biological process studied, are marked by circles.

Fig. K-means, 10 clusters, 100 iterations. Result: 10 clusters with many genes and no well-defined profiles.
Fig. K-means, 100 clusters, 100 iterations. Result: 100 clusters with well-defined profiles containing a limited number of genes. Because too many clusters were selected, some clusters are detected multiple times and should be merged manually afterwards.
CLUSTER VALIDATION
INTRODUCTION
Depending on
• the preprocessing,
• the distance metric,
• the algorithmic parameters,
clustering will produce different results. Even clustering on unrelated data will still produce clusters, although they
might not be biologically meaningful. Therefore cluster validation after clustering is of utmost importance. In the
following, different methods to compare cluster results are described. Clusters can be compared from a statistical
point of view: the coherence of a cluster can be tested based on a distance measure, or the robustness of a cluster
result can be analyzed by some kind of sensitivity analysis. Of course it is very hard to select the best cluster output,
since the 'real biological' solution will only be known if the biological system studied is completely characterized. Still,
from a biological point of view there are ways to validate a cluster result, such as motif finding, testing for enrichment of
functional classes within a cluster, etc. These biological validations can give an indication of the validity of a cluster.
Fig. 1 Dependence of cluster result
STATISTICAL VALIDATION
Clustering is usually defined as a process that aims to group similar objects or as unsupervised learning. An open
problem with clustering algorithms is the validation of results. As a data mining tool, a clustering algorithm is good if it
generates new testable hypotheses, but as an analysis tool, it should be able to generate meaningful results that can
be related to properties of the objects under study. There are two basic ways to compute the quality of a clustering
algorithm. The first one is based on calculating properties of the resulting clusters, such as compactness, separation,
and roundness. This is described as internal validation because it does not require additional information about the
data. The second is based on comparisons of the partitions, and can be applied to two different situations. When the
partitions are generated by the same algorithm with different parameters, it is called relative validation, and it does
not include additional information. If the partitions to be compared are the ones generated by the clustering
algorithm and the true partition of the data (or a subset of the data), then it is called external validation. External and
relative validations are mainly based on comparison between different partitions of the data.
PROPERTIES OF CLUSTERS (INTERNAL VALIDATION)
For internal validation, the evaluation of the resulting clusters is based on the clusters themselves, without additional
information or repeats of the clustering process. This family of techniques is based on the assumption that the
algorithms should search for clusters whose members are close to each other and far from members of other clusters.
In terms of expression clustering: a cluster result can be considered reliable if the within-cluster distance is small (i.e.
all genes retained are tightly coexpressed) and the cluster has an average profile well delineated from the remainder
of the dataset (maximal inter-cluster distance).

• Dunn's validity index
• Silhouette coefficient
• Gap statistic (part of the GeneShaving algorithm)
Dunn's index

The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio between the minimal
inter-cluster distance and the maximal intra-cluster distance. For each cluster partition, the Dunn index can be calculated
by the following formula:

D = min_{i≠j} d(i, j) / max_k d'(k)

where d(i, j) represents the distance between clusters i and j, and d'(k) measures the intra-cluster distance of
cluster k. The inter-cluster distance d(i, j) between two clusters may be any of a number of distance measures, such as the
distance between the centroids of the clusters. Similarly, the intra-cluster distance d'(k) may be measured in a variety of
ways, such as the maximal distance between any pair of elements in cluster k. Since internal criteria seek clusters
with high intra-cluster similarity and low inter-cluster similarity, algorithms that produce clusters with a high Dunn index
are more desirable.
Figure: min d(i, j) is the minimal inter-cluster distance; max d'(k) is the maximal intra-cluster distance.
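A small sketch of Dunn's index, assuming centroid distances as the inter-cluster distance and the largest pairwise distance as the intra-cluster distance (both are just one of the possible choices mentioned above); the data and function name are made up.

```python
# Sketch of Dunn's index: minimal inter-cluster distance divided by maximal
# intra-cluster distance.
import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist, cdist

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    inter = min(cdist(a.mean(axis=0, keepdims=True),
                      b.mean(axis=0, keepdims=True))[0, 0]
                for a, b in combinations(clusters, 2))
    intra = max(pdist(c).max() if len(c) > 1 else 0.0 for c in clusters)
    return inter / intra

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (15, 3)), rng.normal(5, 0.3, (15, 3))])
labels = np.array([0] * 15 + [1] * 15)
print(dunn_index(X, labels))        # large value = dense, well-separated clusters
```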
Silhouette coefficient
Assume the data have been clustered via any technique, such as k-means, into k clusters. For each data point i, let a(i)
be the average dissimilarity of i with all other data within the same cluster. Any measure of dissimilarity can be used
but distance measures are the most common. We can interpret a(i) as how well i is assigned to its cluster (the smaller
the value, the better the assignment). We then define the average dissimilarity of point i to a cluster C as the average
of the distance from i to points in C.
Let b(i) be the lowest average dissimilarity of i to any other cluster, of which i is not a member. The cluster with this
lowest average dissimilarity is said to be the "neighboring cluster" of i because it is the next best fit cluster for point i.
We now define the silhouette of point i as

s(i) = ( b(i) − a(i) ) / max{ a(i), b(i) }

which can also be written piecewise as s(i) = 1 − a(i)/b(i) if a(i) < b(i); 0 if a(i) = b(i); and b(i)/a(i) − 1 if a(i) > b(i).
From the above definition it is clear that −1 ≤ s(i) ≤ 1.
For s(i) to be close to 1 we require a(i) << b(i). As a(i) is a measure of how dissimilar i is to its own cluster, a small
value means it is well matched. Furthermore, a large b(i) implies that i is badly matched to its neighbouring cluster.
Thus an s(i) close to one means that i is appropriately clustered. If s(i) is close to negative one, then by the same logic
we see that i would be more appropriately clustered in its neighbouring cluster. An s(i) near zero means that i is on
the border of two natural clusters.
The average s(i) over all data of a cluster is a measure of how tightly grouped all the data in the cluster are. Thus the
average s(i) over all data of the entire dataset is a measure of how appropriately the data has been clustered. If there
are too many or too few clusters, as may occur when a poor choice of k is used in the k-means algorithm, some of the
clusters will typically display much narrower silhouettes than the rest. Thus silhouette plots and averages may be used
to determine the natural number of clusters within a dataset.
Remark: when comparing the outcome of different algorithms using one of these statistical metrics, one should keep in
mind that an algorithm which internally uses the validation metric as its clustering distance metric will probably tend to
perform better on that metric.
Figure: silhouette coefficient. a(i) is the average dissimilarity of point i with all other data within the same cluster; b(i) is the average dissimilarity of i to the closest neighboring cluster.
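A short sketch of using the average silhouette width to compare choices of k, using scikit-learn on toy data with three real groups; the data and the candidate values of k are illustrative.

```python
# Sketch: average silhouette width for several choices of k.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.4, (30, 5)) for m in (0, 3, 6)])

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # expect the peak near k = 3
```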
FOM (FIGURE OF MERIT)
If there are m samples (conditions), the algorithm is run m times on the subsets obtained by
removing one sample at a time. This approach is referred to as leave-one-out validation. The figure below shows an
example with 5 objects (n = 5) and 6 samples (m = 6), where the clustering algorithm is applied to two subsets of the
data, removing the first and second columns and obtaining two partitions C1 and C2, respectively. The process can be
repeated for the six columns, obtaining six partitions of the data.
The result of this process is a family of partitions C1, . . . , CP, each one computed over a slightly different dataset. The
agreement between all these partitions gives a measure of the consistency of the algorithm, and their predictive
power (over the removed column) gives a measure of the ability of the algorithm to generate meaningful partitions.
Yeung's figure of merit (FOM) is based on the assumption that the clusters represent different biological groups and that,
therefore, genes in the same cluster have similar expression profiles in additional samples. This assumption leads to a
definition of the quality of a clustering algorithm as the spread of the expression values inside the clusters, measured
on the sample that was not used for clustering. Let m be the number of samples, n the number of objects (genes), and
K the number of clusters. Let Cj = {Cj1, . . . , CjK} be the partition obtained by the algorithm when removing the
sample Sj. The FOM for sample Sj is computed as

FOM(K, j) = sqrt( (1/n) Σ_{k=1}^{K} Σ_{i ∈ Cjk} ( x_ij − x̄_kj )² )

where x̄_kj is the jth element of the average of the vectors in Cjk.
The FOM for the algorithm is computed as the sum over the samples:

FOM(K) = Σ_{j=1}^{m} FOM(K, j)
If the clusters for a partition define compact sets of values in the removed sample, then their average distances to
their centroids should be small. Yeung’s FOM is the average measure of the compactness of these sets. The lower the
FOM, the better the clusters are to predict the removed data and therefore, the more consistent the result of the
clustering algorithm.
Although simple at first glance, application of the FOM procedure to real biological data is not straightforward. First,
the FOM as a measure of the cluster predictive power tends to decrease as the number of clusters increases,
because more clusters means smaller average cluster sizes. The more clusters are present, the smaller the
sets are. Sets with a smaller number of points are more likely to be compact, and thus Σ_{i ∈ Cjk} (x_ij − x̄_kj)² will tend to be
smaller. A solution to this problem is to adjust the values using a model-based correction factor sqrt((n − K)/n),
leading to

FOM_adjusted(K, j) = FOM(K, j) / sqrt( (n − K) / n )

with n the number of genes (objects) and K the number of clusters in the specific partition obtained when removing sample Sj.
The result when using this formula is called the adjusted FOM. In practice, when n is large and K is small, as in the
clustering of microarray data, the correction factor is close to one and does not greatly affect the results. Last but not
least, one can wonder whether the number of experiments available allows such a methodology to be used.
Figure of Merit (predictive test): conditions C1–C4 are used to determine the clusters, and condition C5 is the left-out condition. For the left-out condition, the sum of squared deviations of each gene's value from its cluster mean, (x15 − x̄)² + (x25 − x̄)² + ... + (xn5 − x̄)², should be minimal.
Example of the application of FOM to expression clustering. The methodology is related to jackknife-based or leave one
out cross validation-like methodologies. The cluster algorithm to be tested is iteratively applied to all experimental
conditions except for one left-out condition. If the algorithm performs well, we expect that if we look at the genes from
a given cluster, their values for the left-out condition will be highly coherent. Therefore we compute the FOM for a
clustering result by summing, for the left out condition, the squares of the deviations of each gene relative to the mean
of the genes in its cluster for this condition. FOM measures the within cluster similarity of the expression values of the
removed experiment and therefore reflects the predictive power of the clustering. It is expected that removing one
experiment from the data should not interfere with the cluster output if the output is robust. For cluster validation,
each condition is subsequently used as a validation condition and the aggregate FOM over all conditions is used to
compare cluster algorithms.
Yeung KY, Haynor DR, Ruzzo WL. Validating clustering for gene expression data. Bioinformatics. 2001 Apr;17(4):309-18.
http://bioinformatics.oupjournals.org/cgi/content/abstract/17/4/309
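A minimal sketch of the leave-one-out FOM described above, assuming k-means as the clustering algorithm under test and the square-root form of Yeung's FOM; the data are simulated and the function name is made up.

```python
# Sketch of the leave-one-out FOM: cluster on all conditions except one, then
# measure the within-cluster spread in the left-out condition.
import numpy as np
from sklearn.cluster import KMeans

def fom(X, K):
    n, m = X.shape                        # n genes, m conditions
    total = 0.0
    for j in range(m):
        train = np.delete(X, j, axis=1)   # remove condition j before clustering
        labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(train)
        sse = sum(((X[labels == k, j] - X[labels == k, j].mean()) ** 2).sum()
                  for k in range(K))
        total += np.sqrt(sse / n)         # FOM(K, j); lower = better prediction
    return total                          # aggregate FOM(K) over all conditions

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(mu, 0.5, (40, 6)) for mu in (0, 2, 4)])
for K in (2, 3, 4):
    print(K, round(fom(X, K), 3))
```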
ADDING NOISE (SENSITIVITY ANALYSIS)
Gene expression levels can be considered as a superposition of real biological signals and small, sometimes consistent,
experimental errors. A way of assigning confidence to the cluster membership of a gene consists of creating new in silico
replicas (i.e., simulated replicas) of the dataset of interest by adding a small amount of artificial noise to the original
data. Based on the intuition that, ideally, real repeats of the original dataset would contain exactly the same
measurements as the original dataset except for random experimental noise, the artificial noise used to generate the
in silico datasets is chosen to approximate the experimental noise in the dataset. These newly generated datasets are
subsequently treated in the same way as the original dataset and clustered. If the biological signal is much more
pronounced than the experimental noise signals in the measurements of one particular gene, adding small artificial
variations (in the range of the experimental noise present in the dataset) to the expression profile of such gene will
not drastically influence its overall profile and therefore will not affect its cluster membership. The result (cluster
membership) of that particular gene can be considered as being robust towards what is called a sensitivity analysis
and a reliable confidence can be assigned to the cluster result of that gene. However, for genes of which the
measured signal has a low signal to noise ratio or for which the noise signal is within the range of the biological signal,
the outcome of the clustering result will be much more sensitive towards adding artificial noise. Through some
robustness statistic, sensitivity analysis lets us detect which clusters are robust within the range of experimental noise
and are therefore trustworthy for further analysis.
One of the reasons for this is, as mentioned before, that by rescaling, the range of variation of the noisy profiles is
equalized to the range of variation of profiles containing significant signals. Accidentally, these noisy profiles might
have a profile sufficiently similar to that of a gene that underwent a real biological alteration. Adding artificial noise
might therefore affect the overall cluster result. In conclusion, sensitivity analysis can be helpful in pinpointing genes
with a low signal-to-noise ratio. The arbitrary step in this method consists of choosing the noise level that will be used
for the sensitivity analysis.
For instance, Bittner et al. (2000) perturb the data by adding random Gaussian noise with zero mean and a standard
deviation that is estimated as the median standard deviation for the log-ratios for all genes across the experiments (as
was mentioned earlier in most cDNA datasets, the log-ratio is used as an estimate of the differential expression level).
Moreover, using Gaussian noise is based on the strong assumption that errors on the log-ratio are normally
distributed. This implicitly assumes that ratios are unbiased estimators of relative expression. Reality shows often
otherwise.
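A small sketch of such a sensitivity analysis on toy data: Gaussian noise at a crude, Bittner-style noise level is added, the data are reclustered, and the perturbed partition is compared with the original one using the adjusted Rand index (see the next section). All names and parameter choices are illustrative.

```python
# Sketch: add noise, recluster, compare the perturbed and original partitions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(mu, 0.5, (30, 8)) for mu in (0, 2, 4)])

original = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
sigma = np.median(X.std(axis=1))                 # crude stand-in for the noise level

scores = []
for rep in range(20):
    noisy = X + rng.normal(0.0, sigma, size=X.shape)   # in silico replica
    perturbed = KMeans(n_clusters=3, n_init=10, random_state=rep).fit_predict(noisy)
    scores.append(adjusted_rand_score(original, perturbed))

print(round(np.mean(scores), 3))    # close to 1 = robust clustering
```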
Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Radmacher M, Simon R, Yakhini Z, Ben-Dor A, Sampas N,
Dougherty E, Wang E, Marincola F, Gooden C, Lueders J, Glatfelter A, Pollock P, Carpten J, Gillanders E, Leja D, Dietrich K,
Beaudry C, Berens M, Alberts D, Sondak V. Molecular classification of cutaneous malignant melanoma by gene
expression profiling. Nature. 2000 Aug 3;406(6795):536-40.
COMPARISON OF PARTITIONS
Clustering results are sensitive towards choice of the cluster algorithm used and the specific parameter settings of a
particular algorithm. Many cluster algorithms are available, each of them with different underlying statistics and
inherent assumptions on the data. The best way to infer biological knowledge from a clustering experiment is to use
different algorithms with different parameter settings. Clusters detected by most algorithms will reflect the
pronounced signals in the dataset. Biologists tend to prefer algorithms with a deterministic output since this gives the
illusion that what they find is “right”. However, nondeterministic algorithms offer an advantage for cluster validation
since their use implicitly includes a form of sensitivity analysis.
Cluster results are compared by means of statistical measures.
The RAND index and the Jaccard coefficient are also used to compare cluster results obtained from different datasets (e.g.
different platforms, cDNA versus Affymetrix).
Assume that there exist two partitions of the same set of n objects into K groups: CA = {CA1, . . . , CAK} and CB = {CB1, . . . , CBK}.
Each element CAk and CBk of CA and CB is called a cluster and is identified by its index k. Let k = IA(x) be
the index of the cluster to which a vector x belongs in the partition CA (e.g., if IA(x) = 3, then the vector x belongs to
the cluster CA3). A natural measure of disagreement (or error) between the two partitions is the error measure
ε(CA,CB), defined as the proportion of objects that belong to different clusters.
The first observation that can be made is that if the partitions are the same, but the order of the indices is changed
in one of them (e.g., if CB1 = CA2, CB2 = CA1, CB3 = CA3, . . . , CBK = CAK), then, for any vector x in CA1, IA(x) = 1 ≠ 2 =
IB(x), and the error ε(CA,CB) is greater than zero. It is clear that the disagreement between two partitions should not
depend on the indices used to label their clusters.
Figure: the same groups found in different cluster experiments (cluster exp 1 to 4) can carry different labels (C1, C2, C3, ...), so a comparison of partitions should not depend on the cluster indices.
A way to compare two partitions CA and CB without labeling the clusters is based on a pairwise comparison between
the vectors. For each pair of vectors x, y (x ≠ y), there are four possible situations:
(a) x and y fall in the same cluster in both CA and CB,
(b) x and y fall in the same cluster in CA but in different clusters in CB,
(c) x and y fall in different clusters in CA but in the same cluster in CB,
(d) x and y fall in different clusters in both CA and CB.
The disagreement between CA and CB is quantified by the number of pairs of vectors that fall in situations
(b) and (c). Let a, b, c and d be the numbers of pairs of distinct vectors that belong to situations (a), (b), (c), and (d),
respectively, and let M = n(n − 1)/2 = a + b + c + d be the total number of pairs of distinct vectors.
The following indices measure the agreement between the two partitions:
RAND index

The Rand index or Rand measure (named after William M. Rand), in statistics and in particular in data clustering, is a
measure of the similarity between two data partitions. The Rand index is a statistic designed to assess the degree of
agreement between two partitions. It is defined as the fraction of agreement, that is, the number of pairs
of objects that are either in the same groups in both partitions (a) or in different groups in both partitions (d), divided by
the total number of pairs of objects (a + b + c + d). The Rand index lies between 0 and 1.

• a: the number of object pairs that are clustered together in partitioning 1 and in partitioning 2
• b: the number of object pairs that are clustered together in partitioning 1 but not in partitioning 2
• c: the number of object pairs that are clustered together in partitioning 2 but not in partitioning 1
• d: the number of object pairs that are put in different clusters in both partitionings

a and d indicate agreement between the cluster results; b and c indicate disagreement.

RAND = (a + d) / (a + b + c + d)
The RAND index counts the frequency with which pairs of genes are clustered together in two cluster results. If each pair
of genes clusters together reliably, this indicates stable clustering results.
When comparing partitions with different numbers of clusters, the adjusted RAND index can be used. Note that the
adjusted RAND index can also be negative, as it corrects for the expected index value (if interested, see the paper of Ka
Yee Yeung and Walter L. Ruzzo, "Details of the Adjusted Rand index and Clustering algorithms", supplement to the paper
"An empirical study on Principal Component Analysis for clustering gene expression data", for more information on the
adjusted RAND index).
Jaccard coefficient

J = a / (a + b + c)
The Rand statistic measures the proportion of pairs of vectors that agree by belonging either to the same cluster (a) or
to different clusters (d) in both partitions. The Jaccard coefficient measures the proportion of pairs that belong to the
same cluster (a) in both partitions, relative to all pairs that belong to the same cluster in at least one of the two
partitions (a+b+ c). In both cases, the measure is a proportion of agreement between the partitions, but in contrast
with the Rand statistic, the Jaccard coefficient does not consider the pairs that are separated (belong to different
clusters) in both partitions (d).
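A small sketch computing the pair counts a, b, c, d and the two indices from them; the two example partitions are arbitrary labelings made up for illustration.

```python
# Sketch: Rand index and Jaccard coefficient from the pair counts a, b, c, d.
from itertools import combinations

def pair_counts(p1, p2):
    a = b = c = d = 0
    for i, j in combinations(range(len(p1)), 2):
        same1, same2 = p1[i] == p1[j], p2[i] == p2[j]
        if same1 and same2:
            a += 1
        elif same1 and not same2:
            b += 1
        elif not same1 and same2:
            c += 1
        else:
            d += 1
    return a, b, c, d

p1 = [0, 0, 0, 1, 1, 2]
p2 = [0, 0, 1, 1, 1, 2]
a, b, c, d = pair_counts(p1, p2)
print("Rand   :", (a + d) / (a + b + c + d))
print("Jaccard:", a / (a + b + c))
```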
The previous indices are based on counting the number of pairs of vectors that are placed in the same or in different
clusters in each partition. For each partition C, the relationship between two vectors, i.e. whether they belong
to the same cluster or not, can be represented by a similarity matrix d(i, j), defined by d(i, j) = 1 if xi and xj belong to the
same cluster, and d(i, j) = 0 if they belong to different clusters. The advantage of using this matrix instead of the four
numbers a, b, c, and d is that it allows additional comparisons: let dA and dB be the similarity matrices induced by two
partitions CA and CB; two similarity indices can then be computed as a function of the correlation and the covariance of these
matrices.
Figure: example similarity matrices d(i, j) (with entries 1 if genes Gi and Gj are in the same cluster and 0 otherwise) induced by partitioning 1 and partitioning 2 over genes G1, G2, …, Gn. Identical matrices correspond to identical partitions, irrespective of how the clusters are labeled.
The corresponding equation shows that the Rand index decreases as the squared Euclidean distance between the
matrices dA and dB increases.
BIOLOGICAL VALIDATION
There are different ways of validating in silico the outcome of a cluster experiment biologically:
MOTIF FINDING
Coexpressed genes are expected to be coregulated either at transcriptional or at posttranslational level. When the
mechanism of coregulation occurs at transcriptional level coregulated genes are expected to contain in their upstream
regions (promoter regions) a consensus sequence. This is a short DNA sequence that is recognized by a transcriptional
regulator. If the genes contained in the same cluster indeed share such a consensus motif, the presence of this motif
points towards transcriptional coregulation. A common mechanism of regulation between genes might explain the
similarities in their behavior during a gene expression profiling experiment and therefore biologically confirms the
cluster output (the fact that they have tightly coexpressed profiles). It should be noted, however, that the opposite is not
true: not finding a common motif is no indication of a biologically irrelevant clustering.
Fig. 2 Location of the motif in a promoter region of a gene. Motifs are recognized by transcriptional regulators.
GO ENRICHMENT
The result of a clustering analysis is a list of genes found to be coexpressed under several conditions. One assumes
that all genes from such coexpressed cluster are involved in the same pathway or biological process (guilt by
association). If we can prove this is indeed true, we have an indication of the biological relevance of our clustering
result. Using information in ontology databases (MIPS, gene ontology, KEGG) can help us to identify functional
coherence for our clustering results.
Gene ontology provides for each protein a structured description of its molecular function, biological process
and cellular component for a number of different organisms. The ontologies are hierarchically organized, starting
from a general root and going down to leaf nodes that describe the function of the protein with an increasing level
of detail (deeper in the tree gives a more refined description). Genes that have related functions share at least part of
their ontological description (up to a certain level). Making use of these ontologies allows us to calculate the degree to
which genes in a cluster are enriched in the same function, i.e. whether their similarity in ontological terms is significantly
different from what can be expected by chance. Gene enrichment is calculated by e.g. a hypergeometric test.
Example 1: Hypergeometric distribution (binomial)
The results of the expression profiling experiment of Cho et al. (1998), studying the yeast cell cycle (Saccharomyces
cerevisiae) in a synchronized culture, are often used as a benchmark data set. It contains 6220 expression profiles taken
over 17 time points (measurements over 10-min intervals, covering nearly two cell cycles; also see
http://cellcycle-www.stanford.edu). One of the reasons that these data are so frequently used as benchmark data for the validation of
new clustering algorithms is that the majority of the genes included in the data have been functionally
classified and that a functional classification scheme exists (MIPS database, see
http://mips.gsf.de/proj/yeast/catalogues/funcat/index.html), making it possible to biologically validate the results.
We start with a simple example to explain the hypergeometric distribution:
A day's production of 850 manufactured parts contains 50 parts that do not conform to customer requirements. Two
parts are selected at random, without replacement, from the day's production. What is the probability that both
selected parts are nonconforming?
This example experiment consists of 2 trials. However, in this experiment the trials are not independent. Suppose the
first sample had been replaced before the second was taken (which was not done here); in that case the trials would have
been independent and the number of nonconforming parts would have been a binomial random variable.
Let A and B denote the events that the first and the second part are nonconforming, respectively.
P(B|A) = 49/849 and P(A) = 50/850. This means that knowledge that the first part is nonconforming makes it slightly
less likely that the second selected part is nonconforming. Now, let P(X = k) be the probability that exactly k of these 2
parts do not conform to customer requirements:
P(X=0) = (800/850)(799/849)
P(X=2) =(50/850)(49/849)
P(X=1) = (50/850)(800/849)+(800/850)(50/849)
A general formula for computing the probability when samples are selected without replacement is the following (this
is the probability mass function of the random variable X, if X follows the hypergeometric distribution):

P(X = k) = [ C(f, k) · C(g − f, n − k) ] / C(g, n)

with C(a, b) a binomial coefficient, which can be calculated as C(a, b) = a! / ( b! (a − b)! ). This means there are C(a, b) ways to choose b
different elements, disregarding their order, from a set of a elements. The parameters in the simple example case:
g: total number of entities (here number of produced parts)
f: number of entities in a given class (here parts that do not conform to customer requirements)
n: number of trials (number of parts selected)
x: number of successes (number of parts which belong to the parts that do not conform to customer requirements)
So the probability that the two selected parts conform to the norms is:

P(X = 0) = [ C(50, 0) · C(800, 2) ] / C(850, 2)
In terms of genes the parameters are interpreted as:
g: total number of entities (total number of genes of which measurements are available and that have a functional
description)
f: number of entities in a given class (number of genes in ontology class 1)
n: number of trials (number of genes in a cluster)
x: number of successes (number of genes in the cluster that belong to ontology class 1)
Assume that a certain clustering method finds a set of clusters in the Cho et al. (1998) data. We could objectively look
for functionally enriched clusters as follows: suppose that one of the clusters has n genes where k genes belong to a
certain functional category in the MIPS database and suppose that this functional category in its turn contains f genes
in total. Also suppose that the total data set contains g genes (in the case of Cho et al. (1998) g would be 6220). Using
the cumulative hypergeometric probability distribution, we could calculate the probability or p-value that this degree
of enrichment could have occurred by chance, i.e., what is the chance of finding at least k genes in this specific cluster
from this specific functional category by chance?
P = 1 − Σ_{i=0}^{k−1} [ C(f, i) · C(g − f, n − i) ] / C(g, n) = Σ_{i=k}^{min(n, f)} [ C(f, i) · C(g − f, n − i) ] / C(g, n)
These p-values can be calculated for each functional category in each cluster. Since there are about 200 functional
categories in the MIPS database, correcting for multiple hypothesis testing means that only clusters for which the p-value
< 0.0003 for a certain functional category are said to be significantly enriched (level of significance 0.05). Note that
these p-values can also be used to compare the results of functionally matching clusters identified by two different
clustering algorithms on the same data.
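A short sketch of this enrichment p-value using SciPy's hypergeometric distribution; the counts g, f, n and k are made-up illustrative numbers, not values from the Cho et al. data.

```python
# Sketch: P(X >= k) for k cluster genes out of n falling in a functional
# category of f genes, with g annotated genes in total.
from scipy.stats import hypergeom

g = 6220   # total number of annotated genes in the data set
f = 120    # genes in the functional category
n = 40     # genes in the cluster
k = 10     # cluster genes that belong to the category

# hypergeom.sf(k - 1, g, f, n) = P(X >= k)
p_value = hypergeom.sf(k - 1, g, f, n)
print(p_value)   # compare against the multiple-testing-corrected threshold
```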
This hypergeometric distribution calculation (for additional info on the statistics see Draghici et al., 2003) has been
implemented in OntoExpress (http://vortex.cs.wayne.edu/Projects.html#Onto-Express) for a few vertebrate
organisms and GO4G (http://www.esat.kuleuven.ac.be/~dna/BioI/Software.html) for human.
This course is based on the tutorial: Clustering: revealing intrinsic dependencies in microarray data by
Marcel Brun, Charles D. Johnson, and Kenneth S. Ramos