Cancer data: gene clustering

Exercises Clustering
Clustering Exercises
CONTENTS
Libraries
  Loading libraries
  Installing libraries
Cancer data: gene clustering
  Uploading the data
  Rescaling data
  Distance calculations
    Euclidean distance
    Correlation coefficient
  Clustering with correlation measure
    Hierarchical clustering with correlation
    Kmeans with correlation
  Clustering with Euclidean distance
    Hierarchical clustering with Euclidean distance on rescaled data
    Hierarchical clustering with the Euclidean distance on the original (non-rescaled) data
    K means clustering with Euclidean distance on the original (non-rescaled) data
    K means clustering with Euclidean distance on the rescaled data
Clustering in gene direction / yeast cell cycle
  Upload dataset
  Filter data
  Rescaling (data is already rescaled)
  Calculate distance
  Cluster hierarchical (for information, not covered in class)
    Run clustering
    Visualize data
  Cluster with K means
  Calculate functional enrichment of a selected cluster
  Remark: scaling and heatmaps
LIBRARIES
For this tutorial, we will use some R libraries, providing methods which are not part of the basic R package.
LOADING LIBRARIES
In principle, these libraries should be installed on your machine before the beginning of the practicals.
library(gplots)
library(multtest)
library(matrixStats)
library(rgl)
library(biomaRt)
setRepositories()
# when prompted, enter: 1 2 3 4 5 (all CRAN and Bioconductor repositories)
If the libraries are loaded properly, you do not need to install them, and you can skip the next section.
INSTALLING LIBRARIES
If the libraries are not installed on your machine, you can install them yourself easily, provided you have an
internet connection. To install the required libraries, log in as system administrator, open R and type the following
command.
# CRAN packages
install.packages(c("gplots", "matrixStats", "rgl"), dependencies = TRUE)
# Bioconductor packages (BiocManager replaces the obsolete biocLite() script)
install.packages("BiocManager")
BiocManager::install(c("multtest", "biomaRt"))
CANCER DATA: GENE CLUSTERING
UPLOADING THE DATA
library(multtest)
data(golub, package = "multtest")
# this imports the matrix golub, the matrix golub.gnames and the vector golub.cl
dim(golub)
# [1] 3051   38
typeof(golub)
# [1] "double"
# we will now assign column names
colnames(golub) <- factor(golub.cl, levels = 0:1, labels = c("ALL", "AML"))
rownames(golub) <- golub.gnames[, 2]
# row names are needed if you want the heatmap to show the names of the genes
# (otherwise it just numbers the rows from 1 to the number of genes in the cluster)
golub.gnames[1,]
# [1] "36"
# [2] "AFFX-HUMISGF3A/M97935_MA_at (endogenous control)"
# [3] "AFFX-HUMISGF3A/M97935_MA_at"
Input: gene expression data collected by Golub et al., Science, Vol. 286:531-537, 1999.
Following Golub et al., three preprocessing steps were applied to the normalized matrix of intensity values available
on the website: (i) thresholding: floor of 100 and ceiling of 16,000; (ii) filtering: exclusion of genes with max / min
≤ 5 or (max - min) ≤ 500, where max and min refer respectively to the maximum and minimum intensities for a
particular gene across mRNA samples; (iii) base 10 logarithmic transformation.
Boxplots of the expression levels for each of the 38 samples revealed the need to standardize the expression levels
within arrays before combining data across samples. The data were then summarized by a 3,051 × 38 matrix X =
(x_ji), where x_ji denotes the expression level for gene j in tumor mRNA sample i.
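The three preprocessing steps can be sketched on a small made-up intensity matrix. The matrix raw and its values are hypothetical and only serve as an illustration; the thresholds are the ones quoted above. The golub matrix loaded above is already preprocessed, so this code is not needed for the exercises.

```r
# Hypothetical raw intensities: 4 genes (rows) x 3 samples (columns)
raw <- matrix(c( 50, 200, 20000,
                150, 160,   170,
                100, 900,  3000,
                300, 400,  2500),
              nrow = 4, byrow = TRUE)

# (i) thresholding: floor of 100 and ceiling of 16,000
thr <- pmin(pmax(raw, 100), 16000)

# (ii) filtering: drop genes with max/min <= 5 or (max - min) <= 500
gene_max <- apply(thr, 1, max)
gene_min <- apply(thr, 1, min)
keep <- (gene_max / gene_min > 5) & (gene_max - gene_min > 500)

# (iii) base 10 logarithm of the retained genes
logged <- log10(thr[keep, , drop = FALSE])
dim(logged)
```

In this toy example only the second gene is removed, because its intensities hardly vary across the samples.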
RESCALING DATA
Rescale in the gene direction (for gene clustering)
# centering and variance rescaling (the genes are the variables, so mean and sd are taken over the rows)
# centering
row_mean = apply(golub, 1, mean, na.rm=TRUE)
golub_meancentered = golub - row_mean
dim(golub_meancentered)
# variance rescaling
SD = apply(golub_meancentered, 1, sd, na.rm=TRUE)
golub_rescaled = golub_meancentered/SD
dim(golub_rescaled)
# check: every row should now have standard deviation 1 and mean 0
SD = apply(golub_rescaled, 1, sd)
golub_rescaled[1,1:10]
mean(as.matrix(golub_rescaled[1,]), na.rm=TRUE)
dim(golub_rescaled)
colnames(golub_rescaled)<-factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
rownames(golub_rescaled)= golub.gnames[,2]
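As a cross-check of the row-wise rescaling, the same result can be obtained with the base R function scale(). The sketch below uses a small random matrix x (hypothetical data, not the golub matrix). Because scale() works on columns, the matrix is transposed before and after.

```r
set.seed(1)
x <- matrix(rnorm(5 * 8), nrow = 5)   # hypothetical toy data: 5 genes x 8 samples

# manual row-wise centering and variance rescaling, as in the code above
centered <- x - apply(x, 1, mean)
manual   <- centered / apply(centered, 1, sd)

# same operation with scale(); it standardizes columns, hence the two transposes
via_scale <- t(scale(t(x)))

all.equal(manual, via_scale, check.attributes = FALSE)
round(rowMeans(manual), 10)       # row means, should all be 0
round(apply(manual, 1, sd), 10)   # row standard deviations, should all be 1
```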
In this tutorial, we apply different clustering methods implemented in R, to group genes according to their
expression profile. For this, we will use the profiles of expression of genes differentially expressed between two
cancer types (AML and ALL).
We have thus selected x genes that show a significantly different level of
expression between the 27 patients suffering from ALL (acute lymphoblastic leukemia) and the 11 patients
suffering from AML (acute myeloblastic leukemia). The dataset has already been filtered so that only genes
that vary sufficiently across the different patients are retained.
DISTANCE CALCULATIONS
EUCLIDEAN DISTANCE
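## Calculate Euclidean distances without variance rescaling and centering
## (dist() expects the spelling "euclidean"; "euclidian" is rejected as an invalid distance method)
d.euclidian <- dist(golub, method="euclidean")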
m <- as.matrix(d.euclidian)
dim(m)
m[1,1:20]
#         1         2         3         4         5         6         7         8         9        10
#  0.000000  2.834725  9.208276 16.706824 14.854179 14.786835 26.396032 25.876721 26.280625 26.910743
#        11        12        13        14        15        16        17        18        19        20
#  6.339802  3.462704  6.168800 15.689718 18.149789 10.537101 15.078885  5.128725  7.183134  7.471017
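To see what dist() computes, the Euclidean distance between two genes can be recomputed by hand as the square root of the summed squared differences over the samples. This is a minimal sketch on a hypothetical random matrix x, not on the golub data.

```r
set.seed(2)
x <- matrix(rnorm(3 * 6), nrow = 3)   # hypothetical toy data: 3 genes x 6 samples

d <- as.matrix(dist(x, method = "euclidean"))

# distance between gene 1 and gene 2, computed by hand
by_hand <- sqrt(sum((x[1, ] - x[2, ])^2))

all.equal(d[1, 2], by_hand)
```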
## distance calculation on the rescaled data
d.euclidian_rescaled <- dist(golub_rescaled, method="euclidean")
m <- as.matrix(d.euclidian_rescaled)
dim(m)
#(3051,3051)
m[1,1:20]
#        1        2        3        4        5        6        7        8        9       10
# 0.000000 3.961071 6.147213 8.204167 8.626962 8.682193 8.701672 8.653082 8.799778 8.918089
#       11       12       13       14       15       16       17       18       19       20
# 7.223047 5.907884 5.835805 8.391445 8.344461 8.388069 8.951537 6.750473 7.993741 7.576723
Cluster results with the Euclidean distance on the original space and on the rescaled space are not the same, because the distances themselves differ (compare the two outputs above).
CORRELATION COEFFICIENT
m<-cor(t(golub))
dim(m)
#3051 3051
d.correlation <- as.dist(1 - cor(t(golub)))
dim(as.matrix(d.correlation))
head(d.correlation)
#[1] 0.2120282 0.5106518 0.9095724 1.0057360 1.0186550 1.0232310
#For comparison you can also calculate the correlation on the rescaled data.
#You will observe exactly the same similarity matrix
d.correlation_rescaled <- as.dist(1 - cor(t(golub_rescaled)))
dim(as.matrix(d.correlation_rescaled))
head(d.correlation_rescaled)
#[1] 0.2120282 0.5106518 0.9095724 1.0057360 1.0186550 1.0232310
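Cluster results with the correlation distance on the original space and on the rescaled space are exactly the
same, because the Pearson correlation between two genes does not change when each gene is centered and rescaled.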
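The invariance can be checked on a small example, together with a link between the two distance measures: for rows that are centered and rescaled with sd(), the squared Euclidean distance equals 2 * (n - 1) * (1 - r), where n is the number of samples and r the correlation. The outputs above agree with this: for genes 1 and 2, sqrt(2 * 37 * 0.2120282) is about 3.9611, the Euclidean distance found on the rescaled data. The sketch below uses a hypothetical random matrix x.

```r
set.seed(3)
x <- matrix(rnorm(6 * 10), nrow = 6)  # hypothetical toy data: 6 genes x 10 samples
z <- t(scale(t(x)))                   # row-wise centering and variance rescaling
n <- ncol(x)

# the correlation is unchanged by the rescaling
all.equal(cor(t(x)), cor(t(z)), check.attributes = FALSE)

# squared Euclidean distance on the rescaled rows equals 2 * (n - 1) * (1 - r)
d_euc <- as.matrix(dist(z, method = "euclidean"))
d_cor <- 1 - cor(t(x))
all.equal(d_euc^2, 2 * (n - 1) * d_cor, check.attributes = FALSE)
```

So on rescaled data the Euclidean distance is an increasing function of the correlation distance, and both rank pairs of genes in the same order.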
CLUSTERING WITH CORRELATION MEASURE
HIERARCHICAL CLUSTERING WITH CORRELATION
SMALL NUMBER OF CLUSTERS
## Read the help on the hierarchical clustering method.
help(hclust)
## Build a tree with the correlation distance
hclust.correlation <- hclust(d.correlation, method = 'complete')
## Plot the tree
X11()
plot(hclust.correlation)
The plot is barely readable, due to the large number of genes. We can slightly decrease the overlap between labels
by decreasing the font size.
x11(width=16,height=8)
plot(hclust.correlation,cex=0.7)
Cut the hierarchical tree into a predefined number of clusters
## Each object (gene) is assigned to a cluster; clusters.correlation10 gives the
## assignment of each gene to one of the 10 clusters
clusters.correlation10 <- cutree(hclust.correlation, k=10)
## Count the number of genes per cluster
table(clusters.correlation10)
#   1   2   3   4   5   6   7   8   9  10
# 308 237 258 580 315 464 345 238 235  71
# 308 + 237 + 258 + 580 + 315 + 464 + 345 + 238 + 235 + 71 = 3051
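The same steps (distance, tree, cut, count) can be tried on a small example where the result is easy to inspect. The matrix x and the gene names below are hypothetical. The last table illustrates that cutting one tree into more clusters only splits existing clusters further.

```r
set.seed(4)
x <- matrix(rnorm(12 * 6), nrow = 12)   # hypothetical toy data: 12 genes x 6 samples
rownames(x) <- paste0("gene", 1:12)

d    <- as.dist(1 - cor(t(x)))          # correlation distance between genes
tree <- hclust(d, method = "complete")

cl3 <- cutree(tree, k = 3)              # one cluster label per gene
table(cl3)                              # cluster sizes
sum(table(cl3)) == nrow(x)              # every gene is in exactly one cluster

cl6 <- cutree(tree, k = 6)
table(cl3, cl6)                         # each of the 6 clusters lies inside one of the 3
```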
Plot the profiles of the genes in cluster 1
In the original space (i.e. using the original, non-rescaled data, where rescaling refers to
mean centering and variance rescaling)
Using the built-in color scheme
X11()
heatmap.2(golub[which(clusters.correlation10==1),], scale="none", cexRow=0.5,
cexCol=0.8, col=topo.colors(20), trace="none", Colv=FALSE, dendrogram = 'row')
In the rescaled space (i.e. using the rescaled data in the plot)
Using the built-in color scheme
X11()
heatmap.2(golub_rescaled[which(clusters.correlation10==1),], scale="none",
cexRow=0.5, cexCol=0.8, col=topo.colors(20), trace="none", Colv=FALSE,
dendrogram = 'row')
In the rescaled space, the differences in absolute values between the rows are gone.
GENERATE MORE CLUSTERS
Cut the tree in 100 clusters
Find the assignment for each gene to one of the 100 clusters
clusters.correlation100 <- cutree(hclust.correlation, k=100)
## Count the number of genes per cluster
table(clusters.correlation100)
#   1   2   3   4   5   6   7   8   9  10
#  23  62  45  68  29   9  16  75 101  52
#  11  12  13  14  15  16  17  18  19  20
#  29 114  32  10  11  26  30  28  19  24
#  21  22  23  24  25  26  27  28  29  30
#  68  34  27  35  22  60  15  49  42  81
#  31  32  33  34  35  36  37  38  39  40
#  39  55  72  26  70  54  44  48  25  22
#  41  42  43  44  45  46  47  48  49  50
#  33  45  36  29  25  19  85  22  19  11
#  51  52  53  54  55  56  57  58  59  60
#  23  26   7  44  33  15  28  42  18  34
#  61  62  63  64  65  66  67  68  69  70
#  25  27  12  23  26  46  24  19  12  14
#  71  72  73  74  75  76  77  78  79  80
#  13  11  28  24  17  50  15  13  22  44
#  81  82  83  84  85  86  87  88  89  90
#  13   8  28  18  24  15  16  10   7  18
#  91  92  93  94  95  96  97  98  99 100
#  15  26  16  12  14  17  14  10  12   8
Plot the profiles of the genes in cluster 1
In the original space (using the color scheme built in)
heatmap.2(golub[which(clusters.correlation100==1),], scale="none", cexRow=0.5,
cexCol=0.8, col=topo.colors(20), trace="none", Colv=FALSE, dendrogram = 'row')
In the original space (using a recoloring based on the global dataset)
X11()
tmp.data <-golub
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub[which(clusters.correlation100==1),], col=bluered(40), scale="none", density.info='none',
trace="none", breaks=breaks.tmp, Colv=FALSE, dendrogram = 'row')
In the rescaled space (using the built in color scheme)
X11()
heatmap.2(golub_rescaled[which(clusters.correlation100==1),], scale="none", cexRow=0.5,
cexCol=0.8, col=topo.colors(20), trace="none", Colv=FALSE, dendrogram = 'row')
In the rescaled space (using a recoloring based on the global rescaled dataset, with a range of about -5 to 5)
X11()
tmp.data <-golub_rescaled
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub_rescaled[which(clusters.correlation100==1),], col=bluered(40), scale="none",
density.info='none', trace="none", breaks=breaks.tmp, Colv=FALSE, dendrogram = 'row')
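The effect of computing the color breaks on the global dataset rather than on the plotted cluster can be illustrated without drawing a heatmap. In the sketch below, full and sub are hypothetical stand-ins for the complete dataset and for one cluster. With global breaks a given expression value always falls in the same color bin, whichever cluster is plotted. With breaks based on the cluster itself, the bin depends on the range of that cluster.

```r
set.seed(5)
full <- matrix(rnorm(20 * 5), nrow = 20)   # hypothetical "global" dataset
sub  <- full[1:4, ]                        # hypothetical cluster (a subset of the rows)

n_col <- 40
breaks_global <- seq(min(full), max(full), length.out = n_col + 1)
breaks_local  <- seq(min(sub),  max(sub),  length.out = n_col + 1)

# color bin (1..40) that the same value receives under each scheme
v <- sub[1, 1]
bin_global <- findInterval(v, breaks_global, all.inside = TRUE)
bin_local  <- findInterval(v, breaks_local,  all.inside = TRUE)
c(global = bin_global, local = bin_local)
```

A vector such as breaks_global is what the heatmap.2() calls above receive through their breaks argument.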
Conclusion:
- All genes are assigned to a cluster.
- The larger the number of clusters, the fewer genes each cluster contains.
- Cluster 1 contains a strong signal, because in the rescaled space the maximum value is reached there. In the
remainder of the dataset this is not the case (i.e. the signal is based on one patient).
SEARCH FOR A CLUSTER THAT CONTAINS A BIOMARKER
Gene 1042 is an example of a gene that is differentially expressed between both cancer types (biomarker)
plot(golub[1042,])
golub.gnames[1042,]
# [1] "2354"            "CCND3 Cyclin D3" "M92287_at"
(for some plots rownames(golub) = golub.gnames[,1] was used; in those plots, look for 2354)
Find the cluster to which this biomarker belongs and make a heatmap of the corresponding cluster, explain what is
guilt by association (for more explanation on the heatmap function go to the end of the doc):
clusters.correlation100[1042]
#=21
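Listing the genes that share a cluster with a gene of interest is the practical side of guilt by association: genes with a similar profile are candidates for a related function. A minimal sketch on hypothetical toy data follows. The index idx and the gene names are made up. Comparing against cl[idx] avoids hard-coding the cluster number.

```r
set.seed(6)
x <- matrix(rnorm(15 * 6), nrow = 15)   # hypothetical toy data: 15 genes x 6 samples
rownames(x) <- paste0("gene", 1:15)
cl <- cutree(hclust(as.dist(1 - cor(t(x))), method = "complete"), k = 4)

idx <- 7                                # hypothetical gene of interest
members <- which(cl == cl[idx])         # indices of all genes in the same cluster
names(members)                          # their names, including the gene itself
```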
In the original space (using a recoloring based on the full original dataset)
X11()
tmp.data <-golub
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub[which(clusters.correlation100==21),], col=bluered(40), scale="none", density.info='none', trace="none",
breaks=breaks.tmp, Colv=FALSE, Rowv=FALSE)
# In this first plot the rows are not reordered (Rowv=FALSE); in the second plot they are. They are reordered based
# on the Euclidean distance in the data that is shown (here the original data, so absolute differences in
# expression are taken into account).
## scale fixed based on the global dataset (the range is -1.6 to 3.8)
tmp.data <-golub
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub [which(clusters.correlation100==21),], col=bluered(40),
scale="none", density.info='none', trace="none", breaks=breaks.tmp, Colv=FALSE,
dendrogram = 'row')
In the rescaled space (using a recoloring based on the global rescaled dataset)
#scale fixed based on the global rescaled dataset
tmp.data <-golub_rescaled
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub_rescaled[which(clusters.correlation100==21),], col=bluered(40),
scale="none", density.info='none', trace="none", breaks=breaks.tmp, Colv=FALSE,
dendrogram = 'row')
[Figure: heatmap of cluster 21 in the rescaled space; color key: Value, from -4 to 4]
The cluster that contains the discriminating gene consists of genes that are consistently lower in the
AML patients than in the ALL patients, and more homogeneous among the AML patients. The plot in the absolute
(original) space clearly shows that not all genes have the same absolute expression level.
Print the gene names
golub.gnames[which(clusters.correlation100 ==21),2]
(this list is unsorted)
[1] "RAS-RELATED PROTEIN RAB-11A"
[2] "HYPOTHETICAL MYELOID CELL LINE PROTEIN 2"
[3] "KIAA0064 gene"
[4] "Proteasome subunit p42"
[5] "KIAA0200 gene"
[6] "CD38 CD38 antigen (p45)"
[7] "KIAA0216 gene"
[8] "KIAA0235 gene, partial cds"
[9] "Calmodulin Type I"
[10] "EIF2A Eukaryotic translation initiation factor 2A"
[11] "CYC1 Cytochrome c-1"
[12] "NPY Neuropeptide Y"
[13] "CARS Cysteinyl-tRNA synthetase"
[14] "RPS6KA2 Ribosomal protein S6 kinase, 90kD, polypeptide 2"
[15] "IEF SSP 9502 mRNA"
[16] "GB DEF = Translation initiation factor eIF-2 gamma subunit mRNA"
[17] "SDH2 Succinate dehydrogenase 2, flavoprotein (Fp) subunit"
[18] "(clone S20iii15) mRNA, 3' end of cds"
[19] "RB1 Retinoblastoma 1 (including osteosarcoma)"
[20] "Terminal transferase mRNA"
[21] "Transposon-like element mRNA"
[22] "RASA1 GTPase-activating protein ras p21 (RASA)"
[23] "Recombination activating protein (RAG-1) gene"
[24] "MGMT 6-O-methylguanine-DNA methyltransferase (MGMT)"
[25] "CD72 CD72 antigen"
[26] "DCK Deoxycytidine kinase"
[27] "POLYPOSIS LOCUS PROTEIN 1"
[28] "NUCLEOLYSIN TIA-1"
[29] "TCF12 Transcription factor 12 (HTF4, helix-loop-helix transcription factors 4)"
[30] "CCND3 Cyclin D3"
[31] "GB DEF = Recombination acitivating protein (RAG2) gene, last exon"
[32] "FLI1 Friend leukemia virus integration 1"
[33] "BLK Protein-tyrosine kinase blk"
[34] "TFIID subunit TAFII55 (TAFII55) mRNA"
[35] "JTV-1 (JTV-1) mRNA"
[36] "PPP2R4 Protein phosphatase 2A, regulatory subunit B' alpha-1"
[37] "Liver 2,4-dienoyl-CoA reductase mRNA"
[38] "JUN V-jun avian sarcoma virus 17 oncogene homolog"
[39] "PLATELET-ACTIVATING FACTOR ACETYLHYDROLASE 45 KD SUBUNIT"
[40] "GTBP DNA G/T mismatch-binding protein"
[41] "Clone 23693 mRNA sequence"
[42] "GLYCYLPEPTIDE N-TETRADECANOYLTRANSFERASE"
[43] "Clone 23721 mRNA sequence"
[44] "Hlark mRNA"
[45] "Embryonic ectoderm development protein homolog (eed) mRNA, partial cds"
[46] "SRP19 Signal recognition particle 19 kD protein"
[47] "PFKL Phosphofructokinase (liver type)"
[48] "MULTIFUNCTIONAL AMINOACYL-TRNA SYNTHETASE"
[49] "NKG2-D TYPE II INTEGRAL MEMBRANE PROTEIN"
[50] "P2 gene for c subunit of mitochondrial ATP synthase gene extracted from H.sapiens gene for mitochondrial ATP synthase c subunit (P2 form)"
[51] "COATOMER BETA' SUBUNIT"
[52] "GLRX Glutaredoxin (thioltransferase)"
[53] "mRNA (clone C-2k) mRNA for serine/threonine protein kinase"
[54] "NUCLEAR FACTOR RIP140"
[55] "RABAPTIN-5 protein"
[56] "RPS26 Ribosomal protein S26"
[57] "PROBABLE G PROTEIN-COUPLED RECEPTOR LCR1 HOMOLOG"
[58] "Carboxyl Methyltransferase, Aspartate, Alt. Splice 1"
[59] "Drebrin E"
[60] "Transcriptional activator hSNF2b"
[61] "IARS Isoleucine-tRNA synthetase"
[62] "RD Radin blood group"
[63] "Histone H4 gene, clone FO108"
[64] "MEF2A gene (myocyte-specific enhancer factor 2A, C9 form) extracted from Human myocyte-specific enhancer factor 2A (MEF2A) gene, first coding"
[65] "GB DEF = Smooth muscle LIM protein (h-SmLIM) mRNA"
[66] "Death domain containing protein CRADD mRNA"
[67] "L2-9 transcript of unrearranged immunoglobulin V(H)5 pseudogene"
[68] "GB DEF = Immunoglobulin-related 14.1 protein mRNA"
Find your gene of interest in the cluster (the gene with index 30):
1. Note that the clustering is performed with the correlation measure, i.e. in the rescaled space, so the rescaling
here only affects the plotting.
2. In the plot in the absolute space, the gene is overexpressed in most of the ALL patients and
underexpressed in most of the AML patients (the gene is quite pronouncedly expressed in the absolute space, which
probably makes it a good biomarker). Other genes that tend to be closely related in the absolute space are genes
50, 9 and 57 (indicated in blue).
3. In the rescaled space, the gene with index 30 is more closely related to e.g. genes 60, 44, 53 and 54 (indicated
in green).
Conclusion on scaling the plots: recoloring the plots based on the global
dataset makes it possible to compare the plots of the different clusters with
each other. The signal of cluster 1 is clearly much more pronounced than the
signal of the cluster to which the biomarker belongs (the genes there also
change consistently, but vary less over the different patients).
Plotting the cluster in the rescaled data makes it obvious why the genes belong
together. Plotting it in the absolute scale tells you whether a gene is highly
or lowly expressed.
Comparing the plots in the absolute versus the rescaled space, after reordering
the rows and columns based on the Euclidean distance, shows the effect of
rescaling or not rescaling before clustering (some rows that clearly belong
together under correlation appear very distant in the Euclidean space on the
unscaled data).
KMEANS WITH CORRELATION
SMALL NUMBER OF CLUSTERS
kmeangolub10 = kmeans(d.correlation,10)
table(kmeangolub10$cluster)
#   1   2   3   4   5   6   7   8   9  10
# 183 234 282 365 302 338 327 431 250 339
# 183 + 234 + 282 + 365 + 302 + 338 + 327 + 431 + 250 + 339 = 3051, the number of genes (dim(golub)[1])
All genes have been assigned to a cluster, which is a property of k-means.
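One point worth knowing about the call above: kmeans() expects a data matrix, not a distance object. As far as we can tell, it converts its input with as.matrix(), so a dist object becomes the full square distance matrix, and each gene is then described by its vector of correlation distances to all genes. The sketch below, on hypothetical toy data, checks this by comparing the two calls under the same random seed.

```r
set.seed(7)
x <- matrix(rnorm(30 * 6), nrow = 30)   # hypothetical toy data: 30 genes x 6 samples
d <- as.dist(1 - cor(t(x)))             # correlation distance, as in the tutorial

set.seed(100); km_dist   <- kmeans(d, centers = 3)
set.seed(100); km_matrix <- kmeans(as.matrix(d), centers = 3)
identical(km_dist$cluster, km_matrix$cluster)

dim(km_dist$centers)   # 3 x 30: each center has one coordinate per gene
```

K-means then minimizes Euclidean distances between these rows of the distance matrix, which is related to, but not the same as, clustering on the correlation distance itself.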
LARGE NUMBER OF CLUSTERS
Perform K means clustering with 100 clusters
kmeangolub100= kmeans(d.correlation,100)
table(kmeangolub100$cluster)
#   1   2   3   4   5   6   7   8   9  10
#  38  28  40  25  36  42  37  27  36  30
#  11  12  13  14  15  16  17  18  19  20
#  17  21  25  25  24  34  33  38  23  28
#  21  22  23  24  25  26  27  28  29  30
#  24  39  30  24  61  31  27  25  28  30
#  31  32  33  34  35  36  37  38  39  40
#  25  29  43  21  28  17  49  48  24  29
#  41  42  43  44  45  46  47  48  49  50
#  32  35  28  23  29  21  18  25  34  36
#  51  52  53  54  55  56  57  58  59  60
#  26  27  24  32  33  30  29  41  57  46
#  61  62  63  64  65  66  67  68  69  70
#  23  15  19  32  40  26  23  15  31  32
#  71  72  73  74  75  76  77  78  79  80
#  29  24  32  33  24  50  16  27  50  43
#  81  82  83  84  85  86  87  88  89  90
#  25  37  28  50  34  30  34  28  41  25
#  91  92  93  94  95  96  97  98  99 100
#  21  31  26  25  22  45  27  18  27  28
Gene 1042 is an example of a gene that is differentially expressed between both cancer types (biomarker)
plot(golub[1042,])
golub.gnames[1042,]
# [1] "2354"            "CCND3 Cyclin D3" "M92287_at"
Find the cluster to which this biomarker belongs and make a heatmap of the corresponding cluster, explain what is
guilt by association:
kmeangolub100$cluster[1042]
#=36 (this is variable because K means is stochastic)
Heatmap in the original space
Rescaled color scheme according to the full data, genes sorted according to the Euclidean distance.
X11()
tmp.data <-golub
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub[which(kmeangolub100$cluster == kmeangolub100$cluster[1042]
),], col=bluered(40), scale="none", density.info='none', trace="none", breaks=breaks.tmp,
Colv=FALSE, dendrogram = 'row')
[Figure: two heatmaps in the original space of the k-means cluster that contains CCND3 Cyclin D3 (one of them labelled Run3); columns: ALL/AML samples; rows: gene names; color key: Value, from -2 to 3]
Heatmap in the rescaled space
Rescaled color scheme according to the full data, genes and columns sorted according to the Euclidean
distance in rescaled space
X11()
tmp.data <-golub_rescaled
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub_rescaled[which(kmeangolub100$cluster == kmeangolub100$cluster[1042]
),], col=bluered(40), scale="none", density.info='none', trace="none", breaks=breaks.tmp,
Colv=FALSE, dendrogram = 'row')
[Figure: heatmap in the rescaled space of the k-means cluster that contains CCND3 Cyclin D3; columns: ALL/AML samples; rows: gene names; color key: Value, from -4 to 4]
golub.gnames[which(kmeangolub100$cluster == kmeangolub100$cluster[1042]),2]
[1] "KIAA0128 gene, partial cds"
[2] "Macmarcks"
[3] "SNRPN Small nuclear ribonucleoprotein polypeptide N"
[4] "SPTAN1 Spectrin, alpha, non-erythrocytic 1 (alpha-fodrin)"
[5] "IEF SSP 9502 mRNA"
[6] "GB DEF = mRNA fragment"
[7] "Inducible protein mRNA"
[8] "Protein phosphatase 2A 74 kDa regulatory subunit (delta or B\" subunit)"
[9] "ADA Adenosine deaminase"
[10] "IL7R Interleukin 7 receptor"
[11] "Oncoprotein 18 (Op18) gene"
[12] "CCND3 Cyclin D3"
[13] "HKR-T1"
[14] "Fetal Alz-50-reactive clone 1 (FAC1) mRNA"
[15] "Transcriptional repressor (CTCF) mRNA"
[16] "FRAP FK506 binding protein 12-rapamycin associated protein"
[17] "ALDR1 Aldehyde reductase 1 (low Km aldose reductase)"
[18] "DAGK1 Diacylglycerol kinase, alpha (80kD)"
[19] "SOX4 SRY (sex determining region Y)-box 4"
[20] "SERYL-TRNA SYNTHETASE"
[21] "GB DEF = Retrotransposon"
[22] "Adenosine triphosphatase, calcium"
[23] "GNB1 Guanine nucleotide binding protein (G protein), beta polypeptide 1"
[24] "Transcriptional activator hSNF2b"
[25] "C-myb gene extracted from Human (c-myb) gene, complete primary cds, and five complete alternatively spliced cds"
[26] "Spinal Muscular Atrophy 4"
[27] "LBR Lamin B receptor"
[28] "TCRA T cell receptor alpha-chain"
[29] "ZNF91 Zinc finger protein 91 (HPF7, HTF10)"
[30] "Estrogen sulfotransferase mRNA"
[31] "Transcriptional activator hSNF2b"
The genes marked in green are the ones that are also closest in the rescaled space; the ones marked in blue are closest in the absolute space.
Conclusion: the clusters of k-means correspond to some extent to those of hierarchical clustering.
STOCHASTICITY OF K MEANS
Rerun the clustering using 100 clusters
kmeangolub100= kmeans(d.correlation,100)
table(kmeangolub100$cluster)
kmeangolub100$cluster[1042]
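How does this show the stochasticity of k-means?
K-means is a stochastic algorithm: it starts from randomly chosen initial cluster centers, so the cluster sizes and the cluster number assigned to gene 1042 change from run to run.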
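If a reproducible result is needed, the random number generator can be seeded before the call, and the nstart argument of kmeans() can be used to try several random starts and keep the best one. The sketch below uses hypothetical toy data. The seed value and nstart = 25 are arbitrary choices for illustration.

```r
set.seed(8)
x <- matrix(rnorm(40 * 5), nrow = 40)   # hypothetical toy data: 40 genes x 5 samples

km_a <- kmeans(x, centers = 4)          # two runs with different random starts
km_b <- kmeans(x, centers = 4)
table(km_a$cluster, km_b$cluster)       # cross-tabulate the two partitions

set.seed(123); km_c <- kmeans(x, centers = 4, nstart = 25)
set.seed(123); km_d <- kmeans(x, centers = 4, nstart = 25)
identical(km_c$cluster, km_d$cluster)   # same seed, same result
```

Cluster numbers are arbitrary labels, so two runs are best compared through a cross-tabulation of the assignments rather than by comparing the cluster numbers directly.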
CLUSTERING WITH EUCLIDEAN DISTANCE
HIERARCHICAL CLUSTERING WITH EUCLIDEAN DISTANCE ON RESCALED DATA
## Build a tree with the Euclidean distance
## (if you copy-paste this line, check the quotes: the pasted quote characters may not be valid in R)
hclust.euclidian_rescaled <- hclust(d.euclidian_rescaled, method = 'complete')
## Plot the tree
plot(hclust.euclidian_rescaled)
## Each object (gene) is assigned to a cluster;
## hclust_clusters.euclidean_rescaled gives the assignment of each gene to one
## of the 100 clusters
hclust_clusters.euclidean_rescaled <- cutree(hclust.euclidian_rescaled, k=100)
## Count the number of genes per cluster
table(hclust_clusters.euclidean_rescaled)
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
 23  62  45  68  29   9  16  75 101  52  29 114  32  10  11  26  30  28  19
 20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38
 24  68  34  27  35  22  60  15  49  42  81  39  55  72  26  70  54  44  48
 39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57
 25  22  33  45  36  29  25  19  85  22  19  11  23  26   7  44  33  15  28
 58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76
 42  18  34  25  27  12  23  26  46  24  19  12  14  13  11  28  24  17  50
 77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95
 15  13  22  44  13   8  28  18  24  15  16  10   7  18  15  26  16  12  14
 96  97  98  99 100
 17  14  10  12   8
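The cutree(), table() and which() steps used here can be tried on a small simulated matrix. This is a sketch under stated assumptions: the matrix `x`, the row names, the number of clusters and the query index are made up for illustration and stand in for golub, golub.gnames and gene 1042.

```r
# Toy data: 20 "genes" x 6 samples, with row names standing in for gene names
set.seed(1)
x <- matrix(rnorm(20 * 6), nrow = 20,
            dimnames = list(paste0("gene", 1:20), NULL))

tree <- hclust(dist(x), method = "complete")   # complete linkage, Euclidean
clusters <- cutree(tree, k = 4)                # one cluster label per gene

table(clusters)                                # number of genes per cluster

query <- 7                                     # row index of the query gene
members <- which(clusters == clusters[query])  # all genes sharing its cluster
rownames(x)[members]
```

The query gene is always part of its own cluster, so `members` contains at least `query`.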
golub.gnames[which(hclust_clusters.euclidean_rescaled == hclust_clusters.euclidean_rescaled[1042]),2]
[1] "RAS-RELATED PROTEIN RAB-11A"
[2] "HYPOTHETICAL MYELOID CELL LINE PROTEIN 2"
[3] "KIAA0064 gene"
[4] "Proteasome subunit p42"
[5] "KIAA0200 gene"
[6] "CD38 CD38 antigen (p45)"
[7] "KIAA0216 gene"
[8] "KIAA0235 gene, partial cds"
[9] "Calmodulin Type I"
[10] "EIF2A Eukaryotic translation initiation factor 2A"
[11] "CYC1 Cytochrome c-1"
[12] "NPY Neuropeptide Y"
[13] "CARS Cysteinyl-tRNA synthetase"
[14] "RPS6KA2 Ribosomal protein S6 kinase, 90kD, polypeptide 2"
[15] "IEF SSP 9502 mRNA"
[16] "GB DEF = Translation initiation factor eIF-2 gamma subunit mRNA"
[17] "SDH2 Succinate dehydrogenase 2, flavoprotein (Fp) subunit"
[18] "(clone S20iii15) mRNA, 3' end of cds"
[19] "RB1 Retinoblastoma 1 (including osteosarcoma)"
[20] "Terminal transferase mRNA"
[21] "Transposon-like element mRNA"
[22] "RASA1 GTPase-activating protein ras p21 (RASA)"
[23] "Recombination activating protein (RAG-1) gene"
[24] "MGMT 6-O-methylguanine-DNA methyltransferase (MGMT)"
[25] "CD72 CD72 antigen"
[26] "DCK Deoxycytidine kinase"
[27] "POLYPOSIS LOCUS PROTEIN 1"
[28] "NUCLEOLYSIN TIA-1"
[29] "TCF12 Transcription factor 12 (HTF4, helix-loop-helix transcription factors 4)"
[30] "CCND3 Cyclin D3"
[31] "GB DEF = Recombination acitivating protein (RAG2) gene, last exon"
[32] "FLI1 Friend leukemia virus integration 1"
[33] "BLK Protein-tyrosine kinase blk"
[34] "TFIID subunit TAFII55 (TAFII55) mRNA"
[35] "JTV-1 (JTV-1) mRNA"
[36] "PPP2R4 Protein phosphatase 2A, regulatory subunit B' alpha-1"
[37] "Liver 2,4-dienoyl-CoA reductase mRNA"
[38] "JUN V-jun avian sarcoma virus 17 oncogene homolog"
[39] "PLATELET-ACTIVATING FACTOR ACETYLHYDROLASE 45 KD SUBUNIT"
[40] "GTBP DNA G/T mismatch-binding protein"
[41] "Clone 23693 mRNA sequence"
[42] "GLYCYLPEPTIDE N-TETRADECANOYLTRANSFERASE"
[43] "Clone 23721 mRNA sequence"
[44] "Hlark mRNA"
[45] "Embryonic ectoderm development protein homolog (eed) mRNA, partial cds"
[46] "SRP19 Signal recognition particle 19 kD protein"
[47] "PFKL Phosphofructokinase (liver type)"
[48] "MULTIFUNCTIONAL AMINOACYL-TRNA SYNTHETASE"
[49] "NKG2-D TYPE II INTEGRAL MEMBRANE PROTEIN"
[50] "P2 gene for c subunit of mitochondrial ATP synthase gene extracted from H.sapiens gene for mitochondrial ATP synthase c
subunit (P2 form)"
[51] "COATOMER BETA' SUBUNIT"
[52] "GLRX Glutaredoxin (thioltransferase)"
[53] "mRNA (clone C-2k) mRNA for serine/threonine protein kinase"
[54] "NUCLEAR FACTOR RIP140"
[55] "RABAPTIN-5 protein"
[56] "RPS26 Ribosomal protein S26"
[57] "PROBABLE G PROTEIN-COUPLED RECEPTOR LCR1 HOMOLOG"
[58] "Carboxyl Methyltransferase, Aspartate, Alt. Splice 1"
[59] "Drebrin E"
[60] "Transcriptional activator hSNF2b"
[61] "IARS Isoleucine-tRNA synthetase"
[62] "RD Radin blood group"
[63] "Histone H4 gene, clone FO108"
[64] "MEF2A gene (myocyte-specific enhancer factor 2A, C9 form) extracted from Human myocyte-specific enhancer factor 2A
(MEF2A) gene, first coding"
[65] "GB DEF = Smooth muscle LIM protein (h-SmLIM) mRNA"
[66] "Death domain containing protein CRADD mRNA"
[67] "L2-9 transcript of unrearranged immunoglobulin V(H)5 pseudogene"
[68] "GB DEF = Immunoglobulin-related 14.1 protein mRNA"
For comparison: this should give exactly the same result as hierarchical clustering with the correlation measure, whether that is performed on the original or on the rescaled data, provided the same number of clusters is used (see the hierarchical clustering with correlation above, p. 9).
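The reason can be checked numerically. For two genes that are rescaled to mean 0 and standard deviation 1 over n samples, the squared Euclidean distance equals 2 * (n - 1) * (1 - r), where r is their Pearson correlation. The sketch below uses simulated data and assumes that the correlation-based distance is defined as 1 - r; the object names are illustrative.

```r
set.seed(1)
x <- matrix(rnorm(30 * 10), nrow = 30)   # 30 toy genes, 10 samples
n <- ncol(x)

# Standardize each row (gene): mean 0, standard deviation 1
x_rescaled <- t(scale(t(x)))

d_eucl <- as.matrix(dist(x_rescaled))    # Euclidean distance, rescaled space
r      <- cor(t(x))                      # Pearson correlation between genes

# Identity: d^2 = 2 * (n - 1) * (1 - r)
all.equal(d_eucl^2, 2 * (n - 1) * (1 - r), check.attributes = FALSE)

# Complete linkage depends only on the ordering of the distances,
# so both distance measures lead to the same partition
cl_eucl <- cutree(hclust(dist(x_rescaled), method = "complete"), k = 5)
cl_corr <- cutree(hclust(as.dist(1 - r), method = "complete"), k = 5)
identical(unname(cl_eucl), unname(cl_corr))
```

Because the Euclidean distance in the rescaled space is a monotone function of 1 - r, the ranking of gene pairs is the same under both measures.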
Heatmaps
Plot results in the absolute space: color scheme according to the full data, genes sorted
X11()
tmp.data <-golub
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub[which(hclust_clusters.euclidean_rescaled == hclust_clusters.euclidean_rescaled[1042]
),], col=bluered(40), scale="none", density.info='none', trace="none", breaks=breaks.tmp, Colv=FALSE,
dendrogram ='row')
Heatmap in the rescaled space
Color scheme according to the full data, genes sorted according to the Euclidean distance in the
rescaled space
X11()
tmp.data <-golub_rescaled
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub_rescaled[which(hclust_clusters.euclidean_rescaled ==
hclust_clusters.euclidean_rescaled[1042] ),], col=bluered(40), scale="none", density.info='none',
trace="none", breaks=breaks.tmp,Colv=FALSE, dendrogram ='row')
HIERARCHICAL CLUSTERING WITH THE EUCLIDEAN DISTANCE ON THE ORIGINAL (NONRESCALED) DATA
## Build a tree with the Euclidean distance
hclust.euclidian <- hclust(d.euclidian, method = 'complete')
## Plot the tree
plot(hclust.euclidian)
hclust_clusters.euclidean <- cutree(hclust.euclidian, k=100)
table(hclust_clusters.euclidean)
golub.gnames[which(hclust_clusters.euclidean == hclust_clusters.euclidean[1042]),2]
[1] "Macmarcks"
[2] "ADA Adenosine deaminase"
[3] "26-kDa cell surface protein TAPA-1 mRNA"
[4] "CCND3 Cyclin D3"
[5] "VIL2 Villin 2 (ezrin)"
[6] "PROTEASOME IOTA CHAIN"
[7] "PAGA Proliferation-associated gene A (natural killer-enhancing factor A)"
[8] "Alpha-tubulin mRNA"
[9] "SOX4 SRY (sex determining region Y)-box 4"
[10] "TOP2B Topoisomerase (DNA) II beta (180kD)"
[11] "PROBABLE G PROTEIN-COUPLED RECEPTOR LCR1 HOMOLOG"
[12] "C-myb gene extracted from Human (c-myb) gene, complete primary cds, and five complete alternatively spliced cds"
[13] "ZNF91 Zinc finger protein 91 (HPF7, HTF10)"
Comparing the gene list with the heatmap shows that the genes in blue are close in the absolute space, whereas the genes in green are the closest in the rescaled space.
The clustering has now been performed in the absolute space (unscaled data). It therefore recruits other genes into the cluster, namely genes that are close in the absolute space but not necessarily close in the rescaled space. Only the genes in green are close in both the absolute and the rescaled space, so they end up in a cluster together with the query gene irrespective of whether the clustering is performed in the absolute or in the rescaled space.
Heatmap in the original space, color scheme according to the full data
X11()
tmp.data <-golub
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub[which(hclust_clusters.euclidean == hclust_clusters.euclidean[1042] ),],
col=bluered(40), scale="none", density.info='none', trace="none", breaks=breaks.tmp, Colv=FALSE,
dendrogram ='row')
[Figure: heatmap in the original space of the 13 genes in the cluster of the query gene. Rows: genes, ordered by the row dendrogram; columns: the 38 samples, labelled ALL or AML; color key from -2 to 3.]
Heatmap in the rescaled space
color scheme according to the full data
x11()
tmp.data <-golub_rescaled
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub_rescaled[which(hclust_clusters.euclidean == hclust_clusters.euclidean[1042] ),],
col=bluered(40), scale="none", density.info='none', trace="none", breaks=breaks.tmp,Colv=FALSE,
dendrogram ='row')
[Figure: heatmap in the rescaled space of the same 13 genes. Rows: genes, ordered by the row dendrogram; columns: samples, labelled ALL or AML; color key from -4 to 4.]
Using the scaled versus the non-scaled data gives different clusters. With the Euclidean distance, the definition of 'closeness' depends on whether the clustering is performed on the original or on the rescaled data. Without rescaling, genes are grouped together only if they show the same profile and are also expressed to the same degree. Compare these results with those obtained by hierarchical clustering with the Euclidean distance on the rescaled data (p. 21).
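A small worked example of this effect, using three made-up expression profiles (the values and the names g1, g2, g3 are assumptions for illustration, not taken from golub). Gene g2 has the same shape as g1 at a ten times higher level; gene g3 has a similar level to g1 but a different shape.

```r
# Three toy expression profiles (illustrative values, not from golub)
x <- rbind(g1 = c(1, 2, 3, 4),
           g2 = c(10, 20, 30, 40),  # same shape as g1, ten times higher level
           g3 = c(2, 1, 4, 3))      # similar level to g1, different shape

x_rescaled <- t(scale(t(x)))        # per-gene mean 0, standard deviation 1

d_original <- as.matrix(dist(x))
d_rescaled <- as.matrix(dist(x_rescaled))

round(d_original["g1", ], 2)  # g3 is the nearest neighbour of g1
round(d_rescaled["g1", ], 2)  # g2 is the nearest neighbour of g1
```

In the original space the expression level dominates the distance, whereas after rescaling only the shape of the profile matters.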
K MEANS CLUSTERING WITH EUCLIDEAN DISTANCE ON THE ORIGINAL (NON-RESCALED) DATA
kmeangolub500eucl = kmeans(d.euclidian, 500)
table(kmeangolub500eucl$cluster)
kmeangolub500eucl$cluster[1042]
golub.gnames[which(kmeangolub500eucl$cluster == kmeangolub500eucl$cluster[1042]),2]
# => cluster x has 6 genes; note that this can vary when you rerun the clustering
[1] "CCND3 Cyclin D3"
[2] "Cellular oncogene c-fos (complete sequence)"
[3] "HLA-SB alpha gene (class II antigen) extracted from Human HLA-SB(DP) alpha gene"
[4] "PROTEASOME IOTA CHAIN"
[5] "PAGA Proliferation-associated gene A (natural killer-enhancing factor A)"
[6] "C-myb gene extracted from Human (c-myb) gene, complete primary cds, and five complete
alternatively spliced cds"
[1] "CCND3 Cyclin D3"
[2] "Cellular oncogene c-fos (complete sequence)"
[3] "HLA-SB alpha gene (class II antigen) extracted from Human HLA-SB(DP) alpha gene"
[4] "PROTEASOME IOTA CHAIN"
[5] "PAGA Proliferation-associated gene A (natural killer-enhancing factor A)"
[6] "C-myb gene extracted from Human (c-myb) gene, complete primary cds, and five complete
alternatively spliced cds"
[1] "Macmarcks"
[2] "CCND3 Cyclin D3"
[3] "HLA-SB alpha gene (class II antigen) extracted from Human HLA-SB(DP) alpha gene"
[4] "PROTEASOME IOTA CHAIN"
[5] "PAGA Proliferation-associated gene A (natural killer-enhancing factor A)"
[6] "TOP2B Topoisomerase (DNA) II beta (180kD)"
Heatmap in the original space
X11()
tmp.data <-golub
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub[which(kmeangolub500eucl$cluster ==
kmeangolub500eucl$cluster[1042]),], col=bluered(40), scale="none",
density.info='none', trace="none", breaks=breaks.tmp, Colv=FALSE, dendrogram ='row')
[Figures: heatmaps in the original space of the cluster containing the query gene, for two separate K-means runs. Rows: genes, ordered by the row dendrogram; columns: samples, labelled ALL or AML; color key from -2 to 3.]
Heatmap in the rescaled space
X11()
tmp.data <-golub_rescaled
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub_rescaled[which(kmeangolub500eucl$cluster ==
kmeangolub500eucl$cluster[1042]),], col=bluered(40), scale="none",
density.info='none', trace="none", breaks=breaks.tmp, Colv=FALSE, dendrogram ='row')
[Figures: heatmaps in the rescaled space of the cluster containing the query gene, for the same two K-means runs. Rows: genes, ordered by the row dendrogram; columns: samples, labelled ALL or AML; color key from -4 to 4.]
Because the clustering was performed in the original space, we do not expect the genes to be close in the rescaled space as well, and the heatmaps confirm this: the genes are closer in the non-rescaled space.
Repeat the exercise with K-means and the Euclidean distance in the rescaled space. The results should be comparable with K-means using correlation (provided the same number of clusters is chosen). However, the results will never be exactly the same. Why?
K MEANS CLUSTERING WITH EUCLIDEAN DISTANCE ON THE RESCALED DATA
K means clustering performed on the rescaled data with the Euclidean distance
kmeangolub500eucl = kmeans(d.euclidian_rescaled, 500)
table(kmeangolub500eucl$cluster)
kmeangolub500eucl$cluster[1042]
# => cluster x has 6 genes; note that this can vary when you rerun the clustering
golub.gnames[which(kmeangolub500eucl$cluster == kmeangolub500eucl$cluster[1042]),2]
[1] "Macmarcks"
[2] "YWHAZ Tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, zeta
polypeptide"
[3] "CCND3 Cyclin D3"
[4] "VIL2 Villin 2 (ezrin)"
[5] "Transcriptional activator hSNF2b"
(this one is plotted)
[1] "KIAA0140 gene"
[2] "Macmarcks"
[3] "YWHAZ Tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, zeta
polypeptide"
[4] "CCND3 Cyclin D3"
[5] "Hlark mRNA"
[6] "VIL2 Villin 2 (ezrin)"
[7] "SOX4 SRY (sex determining region Y)-box 4"
[1] "Macmarcks"
[2] "YWHAZ Tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, zeta
polypeptide"
[3] "CCND3 Cyclin D3"
[4] "Hlark mRNA"
[5] "VIL2 Villin 2 (ezrin)"
[6] "SOX4 SRY (sex determining region Y)-box 4"
[7] "Transcriptional activator hSNF2b"
[2] "YWHAZ Tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, zeta
polypeptide"
[3] "CCND3 Cyclin D3"
[4] "VIL2 Villin 2 (ezrin)"
[5] "SOX4 SRY (sex determining region Y)-box 4"
[6] "Transcriptional activator hSNF2b"
[1] "Macmarcks"
[2] "SNRPN Small nuclear ribonucleoprotein polypeptide N"
[3] "SPTAN1 Spectrin, alpha, non-erythrocytic 1 (alpha-fodrin)"
[4] "Inducible protein mRNA"
[5] "CCND3 Cyclin D3"
[6] "Transcriptional activator hSNF2b"
[7] "TCRA T cell receptor alpha-chain"
These results are more similar to the ones obtained with K-means or hierarchical clustering using correlation than to the ones obtained with K-means on the original, non-rescaled data.
Plotted in the original space
X11()
tmp.data <-golub
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub[which(kmeangolub500eucl$cluster ==
kmeangolub500eucl$cluster[1042]),], col=bluered(40), scale="none",
density.info='none', trace="none", breaks=breaks.tmp, Colv=FALSE,
dendrogram ='row')
[Figure: heatmap in the original space of the cluster containing the query gene (K-means on the rescaled data). Rows: genes, ordered by the row dendrogram; columns: samples, labelled ALL or AML; color key from -2 to 3.]
Plotted in the rescaled space
X11()
tmp.data <-golub_rescaled
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub_rescaled[which(kmeangolub500eucl$cluster ==
kmeangolub500eucl$cluster[1042]),], col=bluered(40), scale="none",
density.info='none', trace="none", breaks=breaks.tmp, Colv=FALSE,
dendrogram ='row')
[Figure: heatmap in the rescaled space of the cluster containing the query gene (K-means on the rescaled data). Rows: genes, ordered by the row dendrogram; columns: samples, labelled ALL or AML; color key from -4 to 4.]
CLUSTERING IN GENE DIRECTION/ YEAST CELL CYCLE
UPLOAD DATASET
cho <-read.table("combined_NA.txt", header=TRUE,sep="\t")
head(cho)
data =cho[,2:79]
names= cho[,1]
typeof(data)
rownames(data)=names
FILTER DATA
Filter data if at least one NA value is present in a row
any(is.na(data[1,]))
row.has.na <- apply(data, 1, function(x){any(is.na(x))})
# This returns a logical vector denoting whether there is any NA in a row. You can use it to see
# how many rows you'll have to drop:
sum(row.has.na)
# and finally drop them
data.filtered <- data[!row.has.na,]
names.filtered<- names[!row.has.na]
dim(data.filtered)
rownames(data.filtered)
Filter data if the percentage of NA values in a row is higher than a threshold (here: more than 8 NA values)
any(is.na(data[1,]))
row.has.na <- apply(data, 1, function(x){sum(is.na(x))>8})
# This returns a logical vector denoting whether a row has more than 8 NA values. You can use it to see
# how many rows you'll have to drop:
sum(row.has.na)
# and finally drop them
data.filtered <- data[!row.has.na,]
names.filtered<- names[!row.has.na]
dim(data.filtered)
typeof(data.filtered)
rownames(data.filtered)
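The threshold can also be expressed as a fraction of missing values per row rather than as a fixed count. The following is a sketch on a toy matrix; the names `x`, `max_fraction_na` and the threshold of 25% are assumptions for illustration and are not part of the exercise data.

```r
# Toy matrix: 5 genes x 10 conditions with some missing values (illustrative)
x <- matrix(as.numeric(1:50), nrow = 5,
            dimnames = list(paste0("gene", 1:5), NULL))
x[1, 1:2] <- NA   # 20% missing
x[3, 1:6] <- NA   # 60% missing
x[5, 4]   <- NA   # 10% missing

max_fraction_na <- 0.25               # hypothetical threshold
fraction_na <- rowMeans(is.na(x))     # fraction of NA values per gene
keep <- fraction_na <= max_fraction_na

x_filtered <- x[keep, , drop = FALSE] # drop = FALSE keeps a matrix
rownames(x_filtered)
```

Here only gene3 is dropped. rowMeans(is.na(x)) avoids the explicit apply() loop and does not depend on the number of columns.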
RESCALING
#variance rescaling and centering
#centering
row_mean = apply(as.matrix(data.filtered), 1, mean, na.rm=TRUE)
tg_meancentered = as.matrix(data.filtered)- row_mean
dim(tg_meancentered)
rowMeans(tg_meancentered, na.rm=TRUE) #(should be 0)
#variance rescaling
SD = apply(as.matrix(tg_meancentered), 1, sd, na.rm=TRUE)
tg_rescaled = as.matrix(tg_meancentered)/SD
dim(tg_rescaled)
SD = apply(as.matrix(tg_rescaled), 1, sd, na.rm=TRUE)#(should be one)
tg_rescaled[1,1:10]
mean(as.matrix(tg_rescaled[1,]), na.rm=TRUE)
rownames(tg_rescaled)
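The two rescaling steps above (centering each row, then dividing by the row standard deviation) can be cross-checked against the base R function scale(), which works on columns and therefore needs a transpose before and after. The toy matrix below is an assumption for illustration; it includes one NA to show that both routes skip missing values in the same way.

```r
set.seed(1)
x <- matrix(rnorm(4 * 8, mean = 5, sd = 2), nrow = 4)  # 4 toy genes, 8 conditions
x[2, 3] <- NA                                          # one missing value

# Step by step: center each row, then divide by its standard deviation
row_mean <- rowMeans(x, na.rm = TRUE)
row_sd   <- apply(x, 1, sd, na.rm = TRUE)
x_manual <- (x - row_mean) / row_sd

# Same result in one call: scale() works on columns, so transpose twice
x_scale <- t(scale(t(x)))

all.equal(x_manual, x_scale, check.attributes = FALSE)
rowMeans(x_scale, na.rm = TRUE)        # all (numerically) 0
apply(x_scale, 1, sd, na.rm = TRUE)    # all 1
```

Subtracting a vector of row means from a matrix works because R recycles the vector down the columns, so element i is applied to row i.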
CALCULATE DISTANCE
X= cor(t(as.matrix(data.filtered)), use="pairwise.complete.obs")
dim(X)
# hclust() expects dissimilarities, so convert the correlation (a similarity) into a distance
d.correlation <- as.dist(1 - X)
dim(as.matrix(d.correlation))
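A correlation coefficient is a similarity (1 means identical shape), while hclust() works with dissimilarities. A common convention, assumed here, is to use 1 - r as the distance. The sketch below uses three made-up profiles, one of them with a missing value, to show how use = "pairwise.complete.obs" and the conversion behave.

```r
# Toy profiles (illustrative): g2 follows g1, g3 is the mirror image of g1
x <- rbind(g1 = c(1, 2, 3, 4, 5),
           g2 = c(2, 4, NA, 8, 10),   # the NA is skipped pairwise
           g3 = c(5, 4, 3, 2, 1))

r <- cor(t(x), use = "pairwise.complete.obs")  # similarity: 1 = same shape
d <- as.dist(1 - r)                            # distance: 0 = same shape

m <- as.matrix(d)
round(m, 2)
```

With this convention perfectly correlated genes are at distance 0 and perfectly anti-correlated genes at distance 2.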
HIERARCHICAL CLUSTERING (FOR INFORMATION, NOT COVERED IN CLASS)
RUN CLUSTERING
hclust.correlation <- hclust(d.correlation, method = 'complete')
hclust_clusters.correlation <- cutree(hclust.correlation , k=50)
## Each object (gene) is assigned to a cluster; hclust_clusters.correlation gives the assignment of each gene to
## one of the clusters
## Count the number of genes per cluster
table(hclust_clusters.correlation)
hclust_clusters.correlation
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23
135 117  66 137 102 155 142 109 165  74 165 146 165 121  99 116  95  74  90 166  82 105 108
 24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46
 68 165  98 107  99  88 139  72 227 109  69 180  99  43 154  84 101  86 171 133  85 106  71
 47  48  49  50
146 142 145  45
test = as.matrix((tg_rescaled[which(hclust_clusters.correlation==33),]))
# which() returns the indices of the genes in the dataset on which the clustering was performed (the filtered dataset, data.filtered);
# tg_rescaled has the same dimensions as data.filtered
VISUALIZE DATA
In the rescaled space
breaks.tmp <- seq(min(tg_rescaled,na.rm=TRUE), max(tg_rescaled,na.rm=TRUE), length=(40+1) )
heatmap.2(test, col=bluered(40), scale="none", density.info='none', trace="none", na.rm = TRUE,
breaks=breaks.tmp, Colv=FALSE, dendrogram ='row')
CLUSTER WITH K MEANS
x=100
kmeancorr= kmeans(d.correlation, x)
table(kmeancorr$cluster)
typeof(kmeancorr)
# "list"
Make heatmap in the rescaled space (of the cluster to which gene 4 belongs)
X11()
breaks.tmp <- seq(min(tg_rescaled,na.rm=TRUE), max(tg_rescaled,na.rm=TRUE), length=(40+1) )
heatmap.2(as.matrix(tg_rescaled[which(kmeancorr$cluster == kmeancorr$cluster[4] ),]), col=bluered(40),
scale="none", density.info='none', trace="none", dendrogram="row", Colv=FALSE, breaks=breaks.tmp)
X11()
breaks.tmp <- seq(min(tg_rescaled,na.rm=TRUE), max(tg_rescaled,na.rm=TRUE), length=(40+1) )
heatmap.2(as.matrix(tg_rescaled[which(kmeancorr$cluster == kmeancorr$cluster[4] ),]), col=bluered(40),
scale="none", density.info='none', trace="none", Colv=FALSE, Rowv=FALSE, breaks=breaks.tmp)
Heatmap in the original space
# na.rm=TRUE is needed to define breaks.tmp: depending on the filtering you can still have NA values
X11()
breaks.tmp <- seq(min(data.filtered,na.rm=TRUE), max(data.filtered,na.rm=TRUE), length=(40+1) )
heatmap.2(as.matrix(data.filtered [which(kmeancorr$cluster == kmeancorr$cluster[4]),]),
col=bluered(40), scale="none", density.info='none', trace="none", dendrogram="row",
Colv=FALSE,breaks=breaks.tmp)
X11()
breaks.tmp <- seq(min(data.filtered,na.rm=TRUE), max(data.filtered,na.rm=TRUE), length=(40+1) )
heatmap.2(as.matrix(data.filtered [which(kmeancorr$cluster == kmeancorr$cluster[4]),]),
col=bluered(40), scale="none", density.info='none', trace="none", Rowv=FALSE,
Colv=FALSE,breaks=breaks.tmp)
PLOT all pictures at once
Heatmaps in the rescaled space
breaks.tmp <- seq(min(tg_rescaled,na.rm=TRUE), max(tg_rescaled,na.rm=TRUE), length=(40+1) )
for (n in 1:100)
{
  # create a PNG file for the heatmap (be careful with the quotes when copy-pasting)
  png(paste(n,".png",sep=""),
      width = 5*300,    # 5 x 300 pixels
      height = 10*300,
      res = 300,        # 300 pixels per inch
      pointsize = 8)
  heatmap.2(as.matrix(tg_rescaled[which(kmeancorr$cluster == n),]), col=bluered(40), scale="none",
            density.info='none', trace="none", breaks=breaks.tmp, Colv=FALSE, dendrogram = 'row')
  dev.off()
}
Plot expression profiles instead of heatmaps for one cluster
X11() # needed, otherwise nothing is plotted
ClusterNr=18
k=1:78
indices = which(kmeancorr$cluster == ClusterNr)
datafiltered_t =t(data.filtered)
datafiltered_t_selected=datafiltered_t[,indices]
matplot(k, datafiltered_t_selected, type ='l')
PLOT all expression profiles at once in rescaled scale
ClusterNr=100
k=1:78
for (n in 1:ClusterNr)
{
  # create a PNG file for the expression profiles (be careful with the quotes when copy-pasting)
  png(paste(n,".png",sep=""),
      width = 5*300,    # 5 x 300 pixels
      height = 10*300,
      res = 300,        # 300 pixels per inch
      pointsize = 8)
  indices = which(kmeancorr$cluster == n)
  tg_rescaled_t=t(tg_rescaled)
  tg_rescaled_t_selected= tg_rescaled_t[,indices]
  matplot(k, tg_rescaled_t_selected, type ='l')
  dev.off()
}
Plot all Heatmaps in the absolute space
breaks.tmp <- seq(min(data.filtered,na.rm=TRUE), max(data.filtered,na.rm=TRUE), length=(40+1) )
for (n in 1:100)
{
  # create a PNG file for the heatmap (be careful with the quotes when copy-pasting)
  png(paste(n,".png",sep=""),
      width = 5*300,    # 5 x 300 pixels
      height = 10*300,
      res = 300,        # 300 pixels per inch
      pointsize = 8)
  heatmap.2(as.matrix(data.filtered [which(kmeancorr$cluster == n),]), col=bluered(40), scale="none",
            density.info='none', trace="none", breaks=breaks.tmp, Colv=FALSE, dendrogram = 'row')
  dev.off()
}
PLOT all expression profiles at once in absolute scale
ClusterNr=100
k=1:78
for (n in 1:ClusterNr)
{
  # create a PNG file for the expression profiles (be careful with the quotes when copy-pasting)
  png(paste(n,".png",sep=""),
      width = 5*300,    # 5 x 300 pixels
      height = 10*300,
      res = 300,        # 300 pixels per inch
      pointsize = 8)
  indices = which(kmeancorr$cluster == n)
  datafiltered_t =t(data.filtered)
  datafiltered_t_selected=datafiltered_t[,indices]
  matplot(k, datafiltered_t_selected, type ='l')
  dev.off()
}
(making plots in the rescaled space does not change a lot as data are already rescaled)
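The four loops above all write to files named 1.png, 2.png and so on, which means that each loop overwrites the files of the previous one when they are run in the same working directory. One possible way around this is sketched below: a hypothetical helper, `plot_cluster_profiles()`, that adds a prefix to the file name. The helper, its arguments, the toy data and the use of tempdir() are assumptions for illustration and are not part of the exercise.

```r
# Hypothetical helper: one PNG per cluster, file name carries a prefix
plot_cluster_profiles <- function(x, cluster, out_dir, prefix) {
  files <- character(0)
  for (n in sort(unique(cluster))) {
    file <- file.path(out_dir, sprintf("%s_cluster_%03d.png", prefix, n))
    png(file, width = 5 * 300, height = 10 * 300, res = 300, pointsize = 8)
    # columns of the transposed matrix are the genes of cluster n
    matplot(t(x[cluster == n, , drop = FALSE]), type = "l",
            xlab = "condition", ylab = "expression")
    dev.off()
    files <- c(files, file)
  }
  files
}

# Toy usage (illustrative data and cluster labels)
set.seed(1)
x <- matrix(rnorm(12 * 6), nrow = 12)
cluster <- rep(1:3, each = 4)
files <- plot_cluster_profiles(x, cluster, tempdir(), "profiles_rescaled")
basename(files)
```

Calling the helper once per data set with a different prefix keeps the rescaled and the absolute plots side by side instead of replacing each other.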
CALCULATE FUNCTIONAL ENRICHMENT OF A SELECTED CLUSTER
ClusterNr=53
Names = names.filtered[which(kmeancorr$cluster == ClusterNr)]
outputfile = 'test.txt'
write.table(Names,outputfile,sep="\t", quote=FALSE, append=FALSE, col.names=FALSE,
row.names=FALSE)
X11()
breaks.tmp <- seq(min(data.filtered,na.rm=TRUE), max(data.filtered,na.rm=TRUE), length=(40+1) )
heatmap.2(as.matrix(data.filtered [which(kmeancorr$cluster == 56),]), col=bluered(40), scale="none",
density.info='none', trace="none", Rowv=FALSE, Colv=FALSE,breaks=breaks.tmp)
The first figure is cluster 53; the other clusters are in other files.
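The export step can be tried without the yeast data. In the sketch below the identifiers, the cluster labels and the use of a temporary file are assumptions for illustration; in the exercise the identifiers come from names.filtered and the labels from kmeancorr$cluster.

```r
# Hypothetical gene identifiers and cluster labels (for illustration only)
gene_ids <- c("YAL001C", "YAL002W", "YBR010W", "YCL040W", "YDR224C")
cluster  <- c(2, 1, 2, 3, 2)

selected <- gene_ids[cluster == 2]

outputfile <- tempfile(fileext = ".txt")   # temporary file instead of 'test.txt'
write.table(selected, outputfile, sep = "\t", quote = FALSE,
            col.names = FALSE, row.names = FALSE)

readLines(outputfile)                      # one identifier per line
```

Writing without quotes, row names or column names gives a plain list with one identifier per line, which is the form that enrichment web tools usually accept for pasting or uploading.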
Using biomart to translate the gene ids to gene names improves the enrichment
ClusterNr=53
Names = names.filtered[which(kmeancorr$cluster == ClusterNr)]
library(biomaRt)
# List the datasets available for biomart
mart <- useMart(biomart="ensembl")
listDatasets(mart)
# define biomart object
mart <- useMart(biomart="ensembl", dataset="scerevisiae_gene_ensembl")
# query biomart (this takes a while the first time. Don't cancel it.)
externalNames <- getBM(attributes='external_gene_name', filters = 'ensembl_gene_id', values = Names ,
mart = mart)
outputfile = 'test.txt'
write.table(externalNames,outputfile,sep="\t", quote=FALSE, append=FALSE, col.names=FALSE,
row.names=FALSE)
use website
http://geneontology.org/page/go-enrichment-analysis
Are the overrepresented GO categories involved in the cell cycle?
pyrimidine nucleotide catabolic process (GO:0006244)    1    0    6.520e-03    1.000e+00
positive regulation of cytoplasmic translational elongation (GO:1900249)    1    0    6.520e-03    1.000e+00
regulation of cytoplasmic translational elongation (GO:1900247)    1    0    6.520e-03    1.000e+00
regulation of ornithine metabolic process (GO:0090368)    1    0    6.520e-03    1.000e+00
negative regulation of phenotypic switching (GO:1900240)    1    0    6.520e-03    1.000e+00
positive regulation of transcription from RNA polymerase II promoter in response to calcium ion (GO:0061400)    1    0    6.520e-03    1.000e+00
dTDP biosynthetic process (GO:0006233)    1    0    6.520e-03    1.000e+00
positive regulation of mRNA binding (GO:1902416)    1    0    6.520e-03    1.000e+00
regulation of mRNA binding (GO:1902415)    1    0    6.520e-03    1.000e+00
regulation of phenotypic switching (GO:1900239)    1    0    6.520e-03    1.000e+00
regulation of mitotic cytokinesis (GO:1902412)    1    0    6.520e-03    1.000e+00
UTP biosynthetic process (GO:0006228)    1    0    6.520e-03    1.000e+00
dUDP biosynthetic process (GO:0006227)    1    0    6.520e-03    1.000e+00
seedling development (GO:0090351)    1    0    6.520e-03    1.000e+00
REMARK SCALING AND HEATMAPS
1) To use a comparable color scale for all datasets, I set the colors between the minimum and maximum values of the full data (either the rescaled or the original data). This allows the colors in the heatmaps to be compared between different clustering results. The default setting determines the color scale from the minimum and maximum values of the genes in the cluster, not from all genes in the dataset.
tmp.data <-golub
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub[which(clusters.correlation100==1),], col=bluered(40),
scale="none", density.info='none', trace="none", breaks=breaks.tmp)
[Figure: heatmap of cluster 1 in the original space, with rows (23 genes) and columns (samples, labelled ALL or AML) reordered by heatmap.2; color key from -1 to 2.]
tmp.data <-golub_rescaled
breaks.tmp <- seq(min(tmp.data), max(tmp.data), length=(40+1) )
heatmap.2(golub_rescaled [which(clusters.correlation100==1),],
col=bluered(40), scale="none", density.info='none', trace="none",
breaks=breaks.tmp)
2) If you make a heatmap and omit Colv=FALSE and Rowv=FALSE, the heatmap.2 function reorders the columns and rows based on the Euclidean distance between the genes of the cluster in the dataset that you plot. For example, if you plot the data in the original space, the genes are reordered according to their Euclidean distance in the original data. As a result, the ordering can differ between plots in the rescaled and in the absolute space. The ordering implied by the clustering can also differ from the ordering you see in the plot (e.g. when you clustered with the correlation coefficient and visualize the data in the absolute space). In some cases it is easier to switch off the reordering performed by the heatmap for the sake of the comparison.
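The first remark, on shared color breaks, can be illustrated without drawing a heatmap. The sketch below is an assumption-laden stand-in: it uses simulated data, colorRampPalette() from base R instead of bluered() from gplots, and a hypothetical helper `color_of()` that looks up the color a value receives under the shared breaks.

```r
set.seed(1)
full_data <- matrix(rnorm(200, sd = 2), nrow = 20)  # toy "full data set"
subset_a  <- full_data[1:5, ]                       # a toy cluster

n_colors <- 40
palette  <- colorRampPalette(c("blue", "white", "red"))(n_colors)

# n_colors + 1 break points spanning the full data, shared by all plots
breaks <- seq(min(full_data), max(full_data), length.out = n_colors + 1)

# Color assigned to a value under the shared breaks
color_of <- function(value) {
  palette[findInterval(value, breaks, all.inside = TRUE)]
}

# The range of one cluster is usually narrower than the shared range
range(subset_a)
range(breaks)

# A given value gets the same color in every plot that uses these breaks
color_of(0.5)
```

With breaks computed per cluster, the same expression value could be drawn in a different color from one heatmap to the next, which is what the shared breaks avoid.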
[Figure: heatmap of cluster 1 in the rescaled space, with rows (23 genes) and columns (samples, labelled ALL or AML) reordered by heatmap.2; color key from -2 to 4.]