Perform PCA with genes being the variables/patients as observations

PCA Exercises
PCA tutorial
CONTENTS
Libraries
INSTALLING BIOCONDUCTOR
Loading libraries
Installing libraries
EXERCISE 1: METABOLITE DATA
Load the data
Visualize the data
Calculate the PCA (version1)
PCA (version2)
Exercise 2: Cancer data
UPLOADING THE DATA
Perform PCA with genes being the variables/patients as observations
rescale the data
Calculate the PCA
Visualize the results
Selection of the genes with the highest absolute loading on the first PC
Selection of the genes with the highest absolute loading on the second PC
Perform PCA with genes being treated as observations/patients as variables
Calculate PCA
Visualize the data
LIBRARIES
For this tutorial, we will use some R libraries, providing methods which are not part of the basic R package.
INSTALLING BIOCONDUCTOR
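source("http://bioconductor.org/biocLite.R")
biocLite()
# On R >= 3.5, biocLite has been replaced by BiocManager:
# install.packages("BiocManager"); BiocManager::install()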
LOADING LIBRARIES
In principle, these libraries should be installed on your machine before the beginning of the practicals. Assuming
this is the case, you can load the libraries with the following instructions.
library(multtest)
library(matrixStats)
library(gplots)
library(rgl)
library(scatterplot3d)
If the libraries are loaded properly, you do not need to install them, and you can skip the next section.
INSTALLING LIBRARIES
If the libraries are not installed on your machine, you can install them yourself easily, provided you have an
internet connection. To install the required libraries, log in as system administrator, open R and type the following
command.
biocLite("multtest")
install.packages("scatterplot3d", dependencies = TRUE)
install.packages("rgl")
install.packages("gplots")
install.packages("matrixStats")
EXERCISE 1: METABOLITE DATA
Dataset from the library pcaMethods
Description of the dataset: a complete subset from a larger metabolite data set. This is the original, complete data set
and can be used to compare estimation results created with the incomplete data that is also provided (called
metaboliteData). The data was created for an Arabidopsis cold-stress experiment.
# Details: a matrix containing 52 timepoints (columns) and 154 metabolites (rows).
Standard convention: the rows are the observations and the columns are the variables.
LOAD THE DATA
load("C:/Data/Marchal/lessen/lessen_2015_2016/statistiek/PCA/course_PCA_2014/exercises/Data/metaboliteDataComplete.RData")
head(metaboliteDataComplete)
VISUALIZE THE DATA
#metaboliteDataComplete does not contain missing values whereas metaboliteData does
metaboliteDataComplete[1:3,]
mDC <-metaboliteDataComplete
# get dimensions of the matrix
dim(metaboliteDataComplete)
#[1] 154 52
# The data contains measurements of 154 metabolites at 52 timepoints. Here the variables are the metabolites and
# the observations are the timepoints: the plotted points are the observations and the axes are the variables, so the
# axes are the metabolites and we plot the time points. Metabolites with the same time profile that belong to the
# same pathway carry redundant information that can be removed by PCA.
# Make a scatter plot of the data. The time profiles of some metabolites seem to be correlated, i.e. the variables are
# not completely independent. Which metabolites would be correlated? E.g. the ones that are involved in similar
# pathways.
X11()
layout(matrix(1:10,ncol=5))
plot(mDC[1,],mDC[2,], xlab=rownames(mDC)[1], ylab=rownames(mDC)[2])
plot(mDC[1,],mDC[3,], xlab=rownames(mDC)[1], ylab=rownames(mDC)[3])
plot(mDC[1,],mDC[4,], xlab=rownames(mDC)[1], ylab=rownames(mDC)[4])
plot(mDC[1,],mDC[5,], xlab=rownames(mDC)[1], ylab=rownames(mDC)[5])
plot(mDC[1,],mDC[6,], xlab=rownames(mDC)[1], ylab=rownames(mDC)[6])
plot(mDC[100,],mDC[6,], xlab=rownames(mDC)[100], ylab=rownames(mDC)[6])
plot(mDC[110,],mDC[50,], xlab=rownames(mDC)[110], ylab=rownames(mDC)[50])
plot(mDC[110,],mDC[70,], xlab=rownames(mDC)[110], ylab=rownames(mDC)[70])
plot(mDC[110,],mDC[70,], xlab=rownames(mDC)[110], ylab=rownames(mDC)[70])
plot(mDC[90,],mDC[80,], xlab=rownames(mDC)[90], ylab=rownames(mDC)[80])
OR
layout(matrix(1:10,ncol=5))
for (i in 1:10) {
  for (j in 1:10) {
    plot(mDC[i,], mDC[j,], xlab=rownames(mDC)[i], ylab=rownames(mDC)[j])
  }
}
# Make a scatter plot of the data to check whether there is correlation between the profiles of the time points
# (variables are the time points, observations are the metabolites). This would also imply redundancy in the data
# that can be removed by PCA, but in the other direction. It is obvious that there is a strong correlation between
# repeats of the same time point. Now the plotted points are the metabolites (observations) and the time points
# are the axes (variables).
layout(matrix(1:10,ncol=5))
plot(mDC[,1],mDC[,2], xlab=colnames(mDC)[1], ylab=colnames(mDC)[2])
plot(mDC[,1],mDC[,3], xlab=colnames(mDC)[1], ylab=colnames(mDC)[3])
plot(mDC[,1],mDC[,4], xlab=colnames(mDC)[1], ylab=colnames(mDC)[4])
plot(mDC[,1],mDC[,5], xlab=colnames(mDC)[1], ylab=colnames(mDC)[5])
plot(mDC[,1],mDC[,6], xlab=colnames(mDC)[1], ylab=colnames(mDC)[6])
plot(mDC[,40],mDC[,6], xlab=colnames(mDC)[40], ylab=colnames(mDC)[6])
plot(mDC[,41],mDC[,50], xlab=colnames(mDC)[41], ylab=colnames(mDC)[50])
plot(mDC[,50],mDC[,49], xlab=colnames(mDC)[50], ylab=colnames(mDC)[49])
plot(mDC[,20],mDC[,30], xlab=colnames(mDC)[20], ylab=colnames(mDC)[30])
plot(mDC[,21],mDC[,22], xlab=colnames(mDC)[21], ylab=colnames(mDC)[22])
OR
X11()
layout(matrix(1:10,ncol=5))
for (i in 1:10) {
  for (j in 1:10) {
    plot(mDC[,i], mDC[,j], xlab=colnames(mDC)[i], ylab=colnames(mDC)[j])
  }
}
[Figure: pairwise scatter plots of time points, for example X4h.5 against X96h.5.]
Calculate the correlation of mDC:
Correlation between the time profiles.
X11()
dim(cor(mDC))
#[1] 52 52
round(cor(mDC),2)
heatmap(cor(mDC), scale="none", main="correlation between samples", cexRow=0.8, cexCol=0.8)
[Figure: heatmap "correlation between samples"; rows and columns are the 52 samples X0h to X96h.7.]
#Correlation of the transpose of mDC
dim(cor(t(mDC)))
round(cor(t(mDC)),2)
heatmap(cor(t(mDC)), scale="none", main="correlation between metabolites", cexRow=0.5, cexCol=0.5)
# This expresses how well the row vectors of the original matrix (i.e. the metabolite profiles over time) are correlated
# check the first 50 metabolites only
heatmap( cor(t(mDC))[1:50,1:50], scale="none", main="correlation between metabolites", cexRow=0.5, cexCol=0.5)
[Figure: heatmap "correlation between metabolites" for the first 50 metabolites.]
When metabolites are considered to be the variables, some of them are clearly correlated. So we can reduce the
dimensions.
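The difference between cor(mDC) and cor(t(mDC)) can be checked on a small example. The sketch below uses a made-up toy matrix (the names toy, m1 to m6 and t1 to t4 are illustrative and not part of the exercise data): cor() always correlates the columns, so the transpose is needed to correlate the rows.

```r
# Toy matrix: 6 metabolites (rows) x 4 samples (columns); all values are made up.
set.seed(1)
base <- rnorm(4)
toy <- rbind(
  m1 = base,
  m2 = 2 * base + 1,   # perfectly correlated with m1
  m3 = -base,          # perfectly anti-correlated with m1
  m4 = rnorm(4),
  m5 = rnorm(4),
  m6 = rnorm(4)
)
colnames(toy) <- paste0("t", 1:4)

cor_samples     <- cor(toy)      # 4 x 4: correlation between the columns (samples)
cor_metabolites <- cor(t(toy))   # 6 x 6: correlation between the rows (metabolites)

dim(cor_samples)
dim(cor_metabolites)
round(cor_metabolites[c("m1", "m2", "m3"), c("m1", "m2", "m3")], 2)
```

Rows that are linear functions of each other (m1, m2, m3) have correlations of 1 or -1, which is the kind of redundancy PCA removes.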
CALCULATE THE PCA (VERSION1)
The variables are the time points and the observations are the metabolites. We want to see whether there is a relation
between the metabolites. Because there was correlation between the time points (e.g. between repeats), we can reduce the
dimensions so that we can visualize the metabolites in fewer dimensions. In this case each PC is a linear combination of
time points.
# variance rescaling and centering (the code works on rows, so we take the transpose to get the variables, the
# timepoints, in the rows)
# centering
# For PCA the variables should always be mean centered and variance rescaled over the observations. If the
# variables are in the columns (as is the case here, because the time points are the variables), the matrix should be mean
# centered and variance rescaled per column. Because subtracting a vector from a matrix works along the rows,
# the calls to t() are needed.
row_mean = apply(t(mDC), 1, mean, na.rm=TRUE)
mDC_meancentered = t(mDC) - row_mean
dim(mDC_meancentered)
rowMeans(mDC_meancentered, na.rm=TRUE) #(should be 0)
#variance rescaling
SD = apply(mDC_meancentered, 1, sd, na.rm=TRUE)
mDC_rescaled = mDC_meancentered/SD
dim(mDC_rescaled)
SD = apply(mDC_rescaled, 1, sd)#(should be one)
mDC_rescaled[1,1:10]
mean(as.matrix(mDC_rescaled[1,]), na.rm=TRUE)
dim(mDC_rescaled)
#note we have to take the transpose of mDC_rescaled in order to have the same dimensions as mDC
dim(mDC)
dim(t(mDC_rescaled))
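The row-wise centering and rescaling can be cross-checked against R's built-in scale(), which works column-wise. This is a minimal sketch on a synthetic matrix (toy, obs1 to obs6 and var1 to var4 are made-up names), not on the metabolite data.

```r
set.seed(2)
toy <- matrix(rnorm(6 * 4, mean = 5, sd = 2), nrow = 6,
              dimnames = list(paste0("obs", 1:6), paste0("var", 1:4)))

# Manual route, as in the tutorial: put the variables in the rows,
# then centre and rescale each row.
tt       <- t(toy)
centered <- tt - rowMeans(tt)
rescaled <- centered / apply(centered, 1, sd)
manual   <- t(rescaled)            # back to observations x variables

# Built-in route: scale() centres and rescales each column.
builtin <- scale(toy, center = TRUE, scale = TRUE)

all.equal(manual, builtin, check.attributes = FALSE)
colMeans(manual)                   # should be 0 up to rounding error
apply(manual, 2, sd)               # should be 1
```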
# visualize the column centering using boxplots
# (for a boxplot, as for any statistical command in R, it is assumed that the variables (timepoints) are in the columns. For mDC
# this was originally the case, but for mDC_meancentered and mDC_rescaled we need to take the transpose.)
X11()
boxplot(mDC, las=2, col=rainbow(ncol(mDC)), main="Samples before centering", ylim=c(-2,3))
abline(h=0, col="gray20", lwd=2)
[Figure: boxplots "Samples before centering", one box per sample (X0h to X96h.7), y-axis from -2 to 3.]
X11()
boxplot(t(mDC_meancentered), las=2, col=rainbow(ncol(t(mDC_meancentered))), main="Samples after centering",
ylim=c(-2,3))
abline(h=0, col="gray20", lwd=2)
X11()
boxplot(t(mDC_rescaled), las=2, col=rainbow(ncol(t(mDC_rescaled))), main="Samples after rescaling and mean centering", ylim=c(-2,3))
abline(h=0, col="gray20", lwd=2)
[Figure: boxplots "Samples after centering", one box per sample (X0h to X96h.7), y-axis from -2 to 3.]
# Calculate the PCA. prcomp works on the variables, which must be in the columns; we want to make new time variables.
# For PCA the variables should always be in the columns and the observations in the rows. Here we want the timepoints to
# be the variables.
PCAres<-prcomp(mDC, scale = TRUE, center=TRUE)
typeof(PCAres)
print(PCAres)
OR
PCAres<-prcomp(t(mDC_rescaled), scale = FALSE, center=FALSE)
See which items are present in the list PCAres
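The two prcomp calls above should give the same result, and names() lists the items of the returned object. A small sketch on synthetic data (toy, pca_a and pca_b are illustrative names) that checks both points:

```r
set.seed(3)
toy <- matrix(rnorm(20 * 5), nrow = 20,
              dimnames = list(NULL, paste0("var", 1:5)))

# Route 1: let prcomp centre and rescale the columns.
pca_a <- prcomp(toy, center = TRUE, scale. = TRUE)
# Route 2: centre and rescale first, then switch both options off.
pca_b <- prcomp(scale(toy), center = FALSE, scale. = FALSE)

names(pca_a)                       # items present in the returned list
all.equal(pca_a$sdev, pca_b$sdev)
# A PC is only defined up to its sign, so compare absolute loadings.
all.equal(abs(pca_a$rotation), abs(pca_b$rotation))
```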
#Look at the loadings of the PC
head(PCAres$rotation)
               PC1         PC2          PC3         PC4         PC5        PC6
X0h    0.021863836  0.02390600 -0.463163721  0.05626445 -0.08596935  0.2551856
X0h.1  0.039028678  0.01769139 -0.346908304  0.34927246  0.01555764 -0.2710336
X0h.2  0.005461632 -0.01703487  0.033930967 -0.35370175  0.47293593 -0.3680019
X0h.3 -0.041738981 -0.12097550  0.220754264  0.26540362  0.45128467  0.2682409
X0h.4 -0.013439588  0.04087621 -0.009428514 -0.59870518  0.02383249 -0.1353698
X0h.5 -0.004382840  0.09862447  0.238903307  0.03477424 -0.58256605 -0.2192965
This shows the contribution of the different time points to the new PCs. PCAres$rotation is a 52 x 52 matrix that contains
the PC vectors in its columns. The rows indicate the contribution of each original timepoint to each of the PCs.
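Because the columns of the rotation matrix are the PC vectors, they should have unit length and be mutually orthogonal. A minimal check on synthetic data (the matrix toy and the column names t1 to t4 are made up for illustration):

```r
set.seed(4)
toy <- matrix(rnorm(30 * 4), nrow = 30,
              dimnames = list(NULL, paste0("t", 1:4)))
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

R <- pca$rotation
dim(R)                      # variables x components
colSums(R^2)                # squared length of each PC vector
round(crossprod(R), 10)     # t(R) %*% R, expected to be the identity matrix
```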
# calculate the summary
summary(PCAres)
Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6
Standard deviation     5.6000 2.6055 1.85088 1.45427 1.23946 1.07153
Proportion of Variance 0.6031 0.1305 0.06588 0.04067 0.02954 0.02208
Cumulative Proportion  0.6031 0.7336 0.79951 0.84018 0.86972 0.89180
                           PC7     PC8     PC9    PC10    PC11    PC12
Standard deviation     0.92986 0.84450 0.77595 0.67501 0.58490 0.54667
Proportion of Variance 0.01663 0.01371 0.01158 0.00876 0.00658 0.00575
Cumulative Proportion  0.90843 0.92214 0.93372 0.94248 0.94906 0.95481
                          PC13    PC14    PC15    PC16    PC17    PC18
Standard deviation     0.50560 0.49336 0.46365 0.44599 0.39925 0.38106
Proportion of Variance 0.00492 0.00468 0.00413 0.00383 0.00307 0.00279
Cumulative Proportion  0.95973 0.96441 0.96854 0.97237 0.97543 0.97822
                          PC19   PC20    PC21    PC22    PC23    PC24
Standard deviation     0.36972 0.3306 0.31363 0.29442 0.26589 0.25914
Proportion of Variance 0.00263 0.0021 0.00189 0.00167 0.00136 0.00129
Cumulative Proportion  0.98085 0.9830 0.98485 0.98651 0.98787 0.98917
                          PC25    PC26   PC27    PC28   PC29    PC30
Standard deviation     0.24716 0.22246 0.2161 0.20559 0.1905 0.18644
Proportion of Variance 0.00117 0.00095 0.0009 0.00081 0.0007 0.00067
Cumulative Proportion  0.99034 0.99129 0.9922 0.99300 0.9937 0.99437
The first PC already accounts for 60% of the variance in the data.
#proportion of the variance explained
var_explained=(PCAres$sdev[])^2
tot_var = sum((PCAres$sdev[])^2)
proportion_of_variance =var_explained/tot_var
#note these values are identical to summary(PCAres)
#plot the variance explained per PC
# with the first 4 PC a lot of the variance can already be explained
X11()
plot(proportion_of_variance, type="o", pch=20)
pie(proportion_of_variance)
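The claim that these values match summary() can be verified directly. The sketch below uses synthetic data (toy is a made-up matrix) and also derives the cumulative proportion; the 80% threshold is an arbitrary choice for illustration, not a rule from the course.

```r
set.seed(5)
toy <- matrix(rnorm(40 * 6), nrow = 40)
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

var_explained <- pca$sdev^2
prop_var      <- var_explained / sum(var_explained)
cum_var       <- cumsum(prop_var)

# summary() rounds its table, so compare with a matching tolerance.
imp <- summary(pca)$importance
all.equal(unname(imp["Proportion of Variance", ]), prop_var, tolerance = 1e-4)

# Smallest number of PCs whose cumulative proportion reaches 80%.
n_pc_80 <- which(cum_var >= 0.80)[1]
n_pc_80
```

With rescaled variables the total variance equals the number of variables, which is why sum(var_explained) is 6 here.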
[Figure: line plot and pie chart of the proportion of variance per PC (52 components).]
# calculate the scores: projection of the original data points on the PCs
predict(PCAres)[,1]
# first six values:
Xylose methoxyamine (4TMS)                 -2.2280074
Tyramine (3TMS)                            -3.4103018
trans-Sinapinic acid (2TMS)                 3.0918246
Threonic acid-1,4-lactone (2TMS), trans-   -3.2430423
Threonic acid (4TMS)                       -0.9644185
Succinic acid (2TMS)                       -4.1518490
predict(PCAres)[,2]
# first six values:
Xylose methoxyamine (4TMS)                  0.34982667
Tyramine (3TMS)                            -1.72048763
trans-Sinapinic acid (2TMS)                 1.25496585
Threonic acid-1,4-lactone (2TMS), trans-    0.10434601
Threonic acid (4TMS)                       -0.03492749
Succinic acid (2TMS)                        2.32706064
Note that predict() returns the projection of the original data points on the PCs. This can be obtained by
multiplying the rescaled data matrix with the PC matrix, which is what the predict function does.
From the theory course we know that $PX = Y$ corresponds to a rotation of the original axes and that $Y$ contains the
coordinates of the data points along each of the new axes.
In these examples PCAres$rotation has in its columns the vectors that correspond to the PCs, so it is $P^T$, the
transpose of the transformation matrix $P$ from the theory.
The transformation then becomes $Y^T = (PX)^T = X^T P^T$,
where, compared to the theory, $X^T$ has the observations in its rows and the variables in its columns, so it is a
154 x 52 matrix.
#projection on the first PC
X1 = t(mDC_rescaled) %*%PCAres$rotation[,1]
head(X1)
Xylose methoxyamine (4TMS)                 -2.2280074
Tyramine (3TMS)                            -3.4103018
trans-Sinapinic acid (2TMS)                 3.0918246
Threonic acid-1,4-lactone (2TMS), trans-   -3.2430423
Threonic acid (4TMS)                       -0.9644185
Succinic acid (2TMS)                       -4.1518490
OR
head(predict(PCAres)[,1])
#projection on the second PC
X2 = t(mDC_rescaled) %*%PCAres$rotation[,2]
#projection on the third PC
X3 = t(mDC_rescaled) %*%PCAres$rotation[,3]
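The manual projections and predict() can be compared for all PCs at once. A minimal sketch on synthetic data (toy, Xs and scores are illustrative names), assuming the same centering and rescaling as in prcomp:

```r
set.seed(6)
toy <- matrix(rnorm(25 * 4), nrow = 25,
              dimnames = list(paste0("obs", 1:25), paste0("var", 1:4)))
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

Xs     <- scale(toy)              # observations x variables, centred and rescaled
scores <- Xs %*% pca$rotation     # data matrix times the matrix of PC vectors

all.equal(scores, predict(pca), check.attributes = FALSE)
all.equal(scores, pca$x, check.attributes = FALSE)
```

Without new data, predict() simply returns the scores stored in pca$x, so all three objects hold the same numbers.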
#plot the original data (metabolites) in the new basis (scores)
plot(predict(PCAres)[,1],predict(PCAres)[,2])
abline(v=0, col="gray")
abline(h=0, col="gray")
text(predict(PCAres)[,1],predict(PCAres)[,2],labels=sub("X(.+h)(\\..)?","\\1",rownames(mDC)),cex=1,
col=rainbow(nrow(mDC)), adj = c(0,0))
# v and h give the positions of the vertical and horizontal lines
[Figure: metabolite scores, predict(PCAres)[, 1] on the x-axis against predict(PCAres)[, 2] on the y-axis, labelled with the metabolite names.]
We can plot the metabolites in two dimensions but do not observe a clear clustering of metabolites (plotting
without the metabolite names makes this clearer). However, if we plotted metabolites that occur in the same
pathway in the same color, we might see more structure in the data (we do not have that information). The
variance along the second PC is mainly affected by the outlier maltose.
We can also plot the loadings, i.e. the contributions of each original time point (axis) to the new PCs.
plot(PCAres$rotation[,1],PCAres$rotation[,2], pch=20, col="gray40")
abline(v=0, col="gray")
abline(h=0, col="gray")
text(PCAres$rotation[,1],PCAres$rotation[,2], labels=sub("X(.+h)(\\..)?","\\1",rownames(PCAres$rotation)),cex=1,
col=rainbow(ncol(mDC)), adj = c(0,0))
[Figure: loadings of the time points, PCAres$rotation[, 1] on the x-axis against PCAres$rotation[, 2] on the y-axis, labelled 0h to 96h.]
This plot shows that the variability along the first PC is mainly driven by the difference between time 0 (the reference)
and the later time points: time zero has almost no contribution to the first PC whereas the other timepoints do. The
variability along the second PC is mainly driven by the early and late time points, where the early timepoints have a
strong negative contribution and the late timepoints a strong positive contribution. Repeats of the time points
contribute equally to either component (they have similar loadings on either component, indicating that they are
measured consistently).
biplot(PCAres, col =c("blue", "black"))
[Figure: biplot of PC1 (x-axis) against PC2 (y-axis) showing the metabolites and the time points X0h to X96h.7.]
Blue are the metabolites (scores), black are the timepoints (loadings).
PCA (VERSION2)
We can also run PCA in the other direction (consider the metabolites as variables and the timepoints as
observations). This makes most sense as the number of variables is now larger than the number of observations
and thus a dimensionality reduction is required.
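When there are more variables than observations, the number of PCs is limited by the number of observations. The sketch below illustrates this on a synthetic matrix (n_obs, n_var and toy are made-up names and sizes, not the metabolite data).

```r
set.seed(7)
n_obs <- 8
n_var <- 30
toy <- matrix(rnorm(n_obs * n_var), nrow = n_obs)
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

length(pca$sdev)           # number of components returned: min(n_obs, n_var)
dim(pca$rotation)          # n_var x min(n_obs, n_var)
# After centring, at most n_obs - 1 components carry variance.
sum(pca$sdev > 1e-8)
```

For the metabolite data this means that at most 52 PCs are returned even though there are 154 metabolites.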
#variance rescaling and centering (variables in the rows)
#centering
row_mean = apply(mDC, 1, mean, na.rm=TRUE)
mDC_meancentered = mDC - row_mean
dim(mDC_meancentered)
rowMeans(mDC_meancentered, na.rm=TRUE) #(should be 0)
#variance rescaling
SD = apply(mDC_meancentered, 1, sd, na.rm=TRUE)
mDC_rescaled = mDC_meancentered/SD
dim(mDC_rescaled)
SD = apply(mDC_rescaled, 1, sd)#(should be one)
mDC_rescaled[1,1:10]
mean(as.matrix(mDC_rescaled[1,]), na.rm=TRUE)
rownames(mDC_rescaled)=rownames(mDC)
dim(mDC)
#[1] 154 52
dim(mDC_rescaled)
#[1] 154 52
Calculating PCA
PCAres<-prcomp(t(mDC_rescaled), scale = FALSE, center=FALSE)
OR
PCAres<-prcomp(t(mDC), scale = TRUE, center=TRUE)
Importance of components:
                          PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9
Standard deviation     8.6680 3.57568 3.33176 2.99490 2.42193 2.03065 1.81039 1.64788 1.56768
Proportion of Variance 0.4879 0.08302 0.07208 0.05824 0.03809 0.02678 0.02128 0.01763 0.01596
                          PC10    PC11    PC12    PC13    PC14    PC15
Standard deviation     1.51289 1.41419 1.36950 1.24157 1.22521 1.16774
Proportion of Variance 0.01486 0.01299 0.01218 0.01001 0.00975 0.00885
biplot(PCAres, col =c("blue", "black"))
Now the timepoints are the observations (scores, blue) and the metabolites are the variables (loadings, black).
X11()
plot(predict(PCAres)[,1],predict(PCAres)[,2])
abline(v=0, col="gray")
abline(h=0, col="gray")
text(predict(PCAres)[,1],predict(PCAres)[,2], labels=sub("X(.+h)(\\..)?","\\1",colnames(mDC)),cex=1,
col=rainbow(ncol(mDC)), adj = c(0,0))
EXERCISE 2: CANCER DATA
UPLOADING THE DATA
library(multtest)
data(golub, package = "multtest")
# this imports the matrix golub, the class vector golub.cl and the gene annotation golub.gnames
dim(golub)
#[1] 3051 38
typeof(golub)
#double
# we will now assign column names
colnames(golub)<-factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
Input: gene expression data collected by Golub et al., Science, Vol. 286:531-537, 1999.
Following Golub et al., three preprocessing steps were applied to the normalized matrix of intensity values available
on the website:
(i) thresholding: floor of 100 and ceiling of 16,000;
(ii) filtering: exclusion of genes with max / min ≤ 5 or (max - min) ≤ 500, where max and min refer
respectively to the maximum and minimum intensities for a particular gene across mRNA samples;
(iii) base 10 logarithmic transformation.
Boxplots of the expression levels for each of the 38 samples revealed the need to standardize the expression levels
within arrays before combining data across samples. The data were then summarized by a 3051 × 38 matrix $X =
(x_{ij})$, where $x_{ij}$ denotes the expression level for gene i in tumor mRNA sample j.
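The three preprocessing steps can be sketched in a few lines of R. The golub matrix in multtest is already preprocessed, so this example runs on made-up raw intensities (raw, thr, keep and logged are hypothetical names); it only illustrates the steps as described above and is not the original authors' code.

```r
set.seed(8)
# Hypothetical raw intensities: 200 genes (rows) x 6 samples (columns).
raw <- matrix(rlnorm(200 * 6, meanlog = 6, sdlog = 2), nrow = 200)

# (i) thresholding: floor of 100 and ceiling of 16,000
thr <- pmin(pmax(raw, 100), 16000)

# (ii) filtering: drop genes with max/min <= 5 or (max - min) <= 500
gmax <- apply(thr, 1, max)
gmin <- apply(thr, 1, min)
keep <- (gmax / gmin > 5) & (gmax - gmin > 500)
filtered <- thr[keep, , drop = FALSE]

# (iii) base 10 logarithmic transformation
logged <- log10(filtered)

dim(logged)
range(logged)
```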
Let's start with an example of a gene that is known to be a biomarker: is the gene differentially expressed
between both cancer types?
plot(golub[1042,])
golub.gnames[1042,]
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
boxplot(golub[1042,] ~ gol.fac)
# the boxplot shows that all expression values of CCND3 (Cyclin D3) in the ALL group are positive
# one-sample t-test to demonstrate that the gene's mean expression level in the ALL group is significantly
# greater than 0
t.test(golub[1042,gol.fac=="ALL"], mu=0, alternative = "greater")
# two-sample (Welch) t-test to demonstrate that the gene is differentially expressed
t.test(golub[1042,] ~ gol.fac, var.equal=FALSE)
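To make explicit what the two-sample test with var.equal=FALSE computes, here is a small sketch that reproduces the Welch statistic by hand. The expression values are simulated and purely hypothetical; only the group sizes (27 ALL, 11 AML) mirror golub.cl.

```r
set.seed(8)
# Hypothetical expression values for one gene: 27 ALL samples followed by 11 AML samples
expr <- c(rnorm(27, mean = 2, sd = 0.5), rnorm(11, mean = 0.5, sd = 0.8))
grp <- factor(rep(c("ALL", "AML"), times = c(27, 11)))

res <- t.test(expr ~ grp, var.equal = FALSE)

# Welch statistic by hand: difference in means divided by the unpooled standard error
a <- expr[grp == "ALL"]
b <- expr[grp == "AML"]
t_manual <- (mean(a) - mean(b)) / sqrt(var(a) / length(a) + var(b) / length(b))

all.equal(unname(res$statistic), t_manual)
```

Because the variances are not pooled, the test stays valid when the two groups differ in spread, which is a reasonable default for groups of unequal size.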
PERFORM PCA WITH GENES BEING THE VARIABLES/PATIENTS AS OBSERVATIONS
This is the most natural direction, as we have more variables (3051 genes) than observations (38 patients).
RESCALE THE DATA
# centering and variance rescaling: the variables are the genes, which are stored in the rows of golub;
# apply(..., 1, ...) works on the rows, so each gene is centered and rescaled without transposing first
#centering
row_mean = apply(golub, 1, mean, na.rm=TRUE)
golub_meancentered = golub- row_mean
dim(golub_meancentered)
#variance rescaling
SD = apply(golub_meancentered, 1, sd, na.rm=TRUE)
golub_rescaled = golub_meancentered/SD
dim(golub_rescaled)
apply(golub_rescaled, 1, sd)#(should be one)
golub_rescaled[1,1:10]
mean(as.matrix(golub_rescaled[1,]), na.rm=TRUE)
dim(golub_rescaled)
# prcomp() expects the variables in the columns; here the genes (variables) are still in the rows, so we
# calculate the PCA on the transpose of this matrix
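The row-wise centering and rescaling relies on R recycling a vector down the columns of a matrix. The following sketch, on a small simulated matrix `x` (hypothetical data, not golub), checks that the manual approach gives the same result as scale() applied to the transpose.

```r
set.seed(2)
x <- matrix(rnorm(5 * 8, mean = 3, sd = 2), nrow = 5)   # 5 hypothetical genes x 8 samples

# Manual version: the vector of length nrow(x) is recycled down each column,
# so row i is shifted by row_mean[i] and divided by SD[i]
row_mean <- apply(x, 1, mean)
x_centered <- x - row_mean
SD <- apply(x_centered, 1, sd)
x_rescaled <- x_centered / SD

# scale() works on columns, hence the two transposes
x_scaled <- t(scale(t(x)))

all.equal(x_rescaled, x_scaled, check.attributes = FALSE)
round(rowMeans(x_rescaled), 10)   # every row mean is (numerically) zero
apply(x_rescaled, 1, sd)          # every row standard deviation is one
```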
CALCULATE THE PCA
Patients are the observations and the genes the variables:
So transpose the matrix
gt = t(golub)
dim(gt)
#38 X 3051
Perform PCA
PCAres_t<-prcomp(gt, center = TRUE, scale. = TRUE)
head(PCAres_t$rotation)
OR (equivalent, using the data that were rescaled by hand)
PCAres_t<-prcomp(t(golub_rescaled), center = FALSE, scale. = FALSE)
head(PCAres_t$rotation)
#Check the results
summary(PCAres_t)
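The two ways of calling prcomp() should give the same decomposition. A minimal check on simulated data (the matrix `g` is hypothetical; genes in rows, patients in columns, as in golub) could look like this. Only the first five components are compared, because with six centered observations the sixth component has essentially zero variance and its direction is arbitrary.

```r
set.seed(3)
g <- matrix(rnorm(50 * 6), nrow = 50)     # 50 hypothetical genes x 6 hypothetical patients

# Route 1: let prcomp() center and scale the genes (the columns of t(g))
p1 <- prcomp(t(g), center = TRUE, scale. = TRUE)

# Route 2: rescale each gene by hand, then switch off centering and scaling
g_rescaled <- (g - rowMeans(g)) / apply(g, 1, sd)
p2 <- prcomp(t(g_rescaled), center = FALSE, scale. = FALSE)

all.equal(p1$sdev, p2$sdev)
# Loadings agree up to the sign of each component
all.equal(abs(p1$rotation[, 1:5]), abs(p2$rotation[, 1:5]), check.attributes = FALSE)
```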
#Proportion of the variance explained
var_explained=(PCAres_t$sdev[])^2
tot_var = sum((PCAres_t$sdev[])^2)
proportion_of_variance =var_explained/tot_var
#Note these values are identical to those reported by summary(PCAres_t)
#Plot the variance explained per component
plot(proportion_of_variance)
pie(proportion_of_variance)
[Figure: pie chart of the proportion of variance explained by each of the 38 principal components]
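As a cross-check of the manual variance calculation, the sketch below compares it with the values that summary() stores in its importance table. The data are simulated and the object names are illustrative only.

```r
set.seed(4)
x <- matrix(rnorm(20 * 5), nrow = 20)     # 20 hypothetical observations x 5 variables
p <- prcomp(x, center = TRUE, scale. = TRUE)

var_explained <- p$sdev^2
proportion_of_variance <- var_explained / sum(var_explained)
cumulative <- cumsum(proportion_of_variance)

# summary() keeps the same numbers, rounded to 5 decimals, in $importance
imp <- summary(p)$importance
all.equal(unname(imp["Proportion of Variance", ]), round(proportion_of_variance, 5))

# With scaled variables (and more observations than variables) the total variance
# equals the number of variables
sum(var_explained)
```

The cumulative proportion is often more convenient than a pie chart for deciding how many components to keep.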
#Project the data on the first two principal components
# the projection of the data onto the PCs is calculated by multiplying the original data with the
# new basis vectors (the PCs)
gt_scale<-scale(gt, center = PCAres_t$center, scale=PCAres_t$scale)
dim(gt_scale)
gt_scale[,1:2]
#             [,1]        [,2]
# ALL -0.559145162  0.17976638
# ALL -0.451135951 -0.78613128
OR
t(golub_rescaled)
dim(t(golub_rescaled))
t(golub_rescaled)[,1:2]
#             [,1]        [,2]
# ALL -0.559145162  0.17976638
# ALL -0.451135951 -0.78613128
dim(gt_scale)
#38 3051
dim(as.matrix(PCAres_t$rotation[,1]))
#3051 1
#projection on the first PC
X1 = gt_scale %*%PCAres_t$rotation[,1]
#projection on the second PC
X2 = gt_scale %*%PCAres_t$rotation[,2]
#projection on the third PC
X3 = gt_scale %*%PCAres_t$rotation[,3]
# note that X1 = gt_scale %*% PCAres_t$rotation[,1]
# is identical (up to a transpose) to
t(PCAres_t$rotation[,1])%*%t(gt_scale)
# or to
predict(PCAres_t)[,1]
head(predict(PCAres_t)[,1])
head(gt_scale %*%PCAres_t$rotation[,1])
dim(gt_scale)
dim(as.matrix(PCAres_t$rotation[,1]))
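The projection step can be verified for all components at once. The sketch below uses a small simulated matrix (hypothetical data) and checks that the scaled data multiplied by the rotation matrix reproduces the scores that prcomp() stores in `$x` and that predict() returns.

```r
set.seed(5)
x <- matrix(rnorm(12 * 4), nrow = 12)     # 12 hypothetical observations x 4 variables
p <- prcomp(x, center = TRUE, scale. = TRUE)

# Re-apply the centering and scaling stored by prcomp()
x_scale <- scale(x, center = p$center, scale = p$scale)

# Scores = scaled data times the rotation (loading) matrix
scores_manual <- x_scale %*% p$rotation

all.equal(scores_manual, p$x, check.attributes = FALSE)
all.equal(predict(p), p$x)
```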
VISUALIZE THE RESULTS
#plot the original data in the new basis (scores)
X11()
layout(matrix(1:2,ncol=2))
plot(predict(PCAres_t)[,1],predict(PCAres_t)[,2])
text(predict(PCAres_t)[,1],predict(PCAres_t)[,2], labels=golub.cl[],cex=0.8,col="red")
title("scores")
#plot the loadings (for each gene its loading on the first and second component); gene 1042 (CCND3) is highlighted in blue
plot(PCAres_t$rotation[,1],PCAres_t$rotation[,2])
points(PCAres_t$rotation[1042, 1],PCAres_t$rotation[1042,2],col ="blue")
title("loadings")
[Figure: two panels. Left, "scores": predict(PCAres_t)[, 2] against predict(PCAres_t)[, 1], with each patient labelled by class (0 = ALL, 1 = AML). Right, "loadings": PCAres_t$rotation[, 2] against PCAres_t$rotation[, 1], with annotations marking the genes that only contribute to the first component and the genes that only contribute to the second component.]
Left: the patients (observations) are plotted on their new axes (linear combinations of the variables). Right: loadings, or
contributions of the different genes to the PCs. Both PCs contribute to distinguishing the patients. Think of them as
two characteristic pathways of correlated genes that together are able to separate the
patients. When plotted in two dimensions the patients can be completely separated.
#plot on the first 3 PCs
library(rgl)
# colour 1 for AML (golub.cl == 1), colour 2 for ALL (golub.cl == 0)
mcol <- ifelse(golub.cl == 1, 1, 2)
plot3d(x = predict(PCAres_t)[,1], y = predict(PCAres_t)[,2], z = predict(PCAres_t)[,3], col= mcol)
OR
library(scatterplot3d)
scatterplot3d(x = predict(PCAres_t)[,1], y = predict(PCAres_t)[,2], z = predict(PCAres_t)[,3])
SELECTION OF THE GENES WITH THE HIGHEST ABSOLUTE LOADING ON THE FIRST PC
Search for the gene with the highest loading on the first PC (this is the gene that contributes most to the direction
of largest variation)
which(PCAres_t$rotation[,1]==max(PCAres_t$rotation[,1]))
#2821
golub.gnames[2821,2]
#"Metargidin precursor mRNA"
order(abs(PCAres_t$rotation[,1]), decreasing = TRUE)
sort(abs(PCAres_t$rotation[,1]), decreasing = TRUE)
#x=c(2,5,8,2,-4)
#abs(x)
#2, 5, 8, 2, 4
#sort(abs(x), decreasing =TRUE)
#8 5 4 2 2
#order(abs(x), decreasing =TRUE)
# 3 2 5 1 4
#Order the genes according to their absolute loading on the first PC
selection = order(abs(PCAres_t$rotation[,1]), decreasing = TRUE)
# plot the expression of the genes with a high loading across the different patients (gene expression profiles).
# Because these genes contribute to the same direction of variation they should be redundant (and thus have
# similar expression profiles)
library(gplots)
heatmap.2(golub[selection[1:10],] , scale="none", cexRow=0.5, cexCol=0.8, col=topo.colors(20),
trace="none", Rowv= FALSE, Colv=FALSE)
golub.gnames[selection[1:10],2]
Plot panel without reordering of genes and columns
[Figure: heatmap of the 10 selected genes (rows 1 to 10) across the 38 patients (columns labelled ALL or AML), with colour key and histogram]
Plot panel with reordering of the genes
heatmap.2(golub[selection[1:10],] , scale="none", cexRow=0.5, cexCol=0.8, col=topo.colors(20),
trace="none", Rowv= TRUE, Colv=FALSE)
golub.gnames[selection[1:10],2]
Plot panel b: with reordering of the genes (Rowv = TRUE); the columns keep their original order (Colv = FALSE)
Conclusion: none of the genes has a profile that exactly corresponds with the patient subdivision:
• Patients can only be subdivided by taking a combination of the eigengenes (or pathways)
• Genes with a high loading on the first component tend to be correlated (coexpressed), or anticorrelated if
the loading is negative (reminiscent of a shared pathway)
• The most important direction of variation more or less corresponds to the class distinction (with a few
exceptions)
• So we could successfully reduce the number of variables from the total number of genes measured to 2
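The link between high loadings and coexpression can be illustrated on simulated data. In the sketch below everything is hypothetical: a latent "pathway" signal drives ten of twenty simulated genes, five positively and five negatively, and the genes with the highest absolute loading on the first PC turn out to be exactly those ten, with correlation signs that follow the signs of their loadings.

```r
set.seed(6)
n_samples <- 100
n_genes <- 20
latent <- rnorm(n_samples)                         # hypothetical shared pathway activity
x <- matrix(rnorm(n_samples * n_genes), nrow = n_samples)
x[, 1:5]  <- x[, 1:5]  + 3 * latent                # genes 1-5 follow the pathway
x[, 6:10] <- x[, 6:10] - 3 * latent                # genes 6-10 move in the opposite direction

p <- prcomp(x, center = TRUE, scale. = TRUE)
top <- order(abs(p$rotation[, 1]), decreasing = TRUE)[1:10]
sort(top)

# Genes whose loadings have the same sign are correlated, opposite signs are anticorrelated
cors <- cor(x[, top])
load_sign <- sign(outer(p$rotation[top, 1], p$rotation[top, 1]))
all(sign(cors) == load_sign)
```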
SELECTION OF THE GENES WITH THE HIGHEST ABSOLUTE LOADING ON THE SECOND PC
selection2 = order(abs(PCAres_t$rotation[,2]), decreasing = TRUE)
heatmap.2(golub[selection2[1:10],] , scale="none", cexRow=0.5, cexCol=0.8, col=topo.colors(20),
trace="none", Rowv= TRUE, Colv=FALSE)
PERFORM PCA WITH GENES BEING TREATED AS OBSERVATIONS/PATIENTS AS VARIABLES
CALCULATE PCA
For the standard PCA the variables are the columns and the rows are the observations. So we perform PCA with
the genes as the observations and the patients as the variables.
How many components do we observe? (38)
PCAres<-prcomp(golub, scale = TRUE)
summary(PCAres)
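The number of components returned by prcomp() is the smaller of the number of observations and the number of variables. A small sketch on simulated data (a hypothetical 200 x 6 matrix standing in for the 3051 x 38 golub matrix) shows this for both directions.

```r
set.seed(7)
x <- matrix(rnorm(200 * 6), nrow = 200)   # 200 hypothetical genes x 6 hypothetical patients

# Genes as observations: 200 rows, 6 variables
p_genes_obs <- prcomp(x, scale. = TRUE)
length(p_genes_obs$sdev)

# Patients as observations: 6 rows, 200 variables
p_pat_obs <- prcomp(t(x), scale. = TRUE)
length(p_pat_obs$sdev)

# After centering, 6 observations span at most 5 dimensions,
# so the last component has (numerically) zero variance
p_pat_obs$sdev
```

In both directions the result has six components, but only when the patients are the observations is the last one degenerate.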
Manually calculate the proportion of the variance explained
var_explained=PCAres$sdev[]^2
tot_var = sum((PCAres$sdev[])^2)
proportion_of_variance =var_explained/tot_var
#note these values are identical to summary(PCAres)
Make a chart to visualize the amount of variance explained by each of the components
plot(proportion_of_variance)
pie(proportion_of_variance)
The first component (which corresponds to some linear combination of the patients) explains almost all the
variation.
[Figure: pie chart of the proportion of variance explained by each of the 38 principal components]
Project the data on the first two PCs: multiply the original data with the new basis vectors.
# centering and variance rescaling (the variables are the patients, i.e. the columns of golub, but the code below
# works on rows, so we take the transpose of the matrix first and transpose back afterwards)
#centering
row_mean = apply(t(golub), 1, mean, na.rm=TRUE)
golub_meancentered = t(golub) - row_mean
dim(golub_meancentered)
#variance rescaling
SD = apply(golub_meancentered, 1, sd, na.rm=TRUE)
golub_rescaled = golub_meancentered/SD
dim(golub_rescaled)
SD = apply(golub_rescaled, 1, sd)#(should be one)
golub_rescaled[1,1:10]
mean(as.matrix(golub_rescaled[1,]), na.rm=TRUE)
dim(golub_rescaled)
golub_rescaled = t(golub_rescaled)
head(golub_rescaled )
OR
#rescale the data
golub_scale<-scale(golub, center = PCAres$center, scale=PCAres$scale)
head(golub_scale)
Manual calculation of PCA (P*X=Y)
#projection on the first PC = scores on the first PC
dim(golub_scale)
#3051 38
dim(as.matrix(PCAres$rotation[,1]))
#38 1
X1= golub_scale %*%PCAres$rotation[,1]
dim(X1)
#3051 1
#projection on the second PC = scores on the second PC
X2= golub_scale %*%PCAres$rotation[,2]
#this calculation is identical to
predict(PCAres)[,1]
head(predict(PCAres)[,1])
head(golub_scale %*%PCAres$rotation[,1])
dim(golub_scale)
dim(as.matrix(PCAres$rotation[,1]))
VISUALIZE THE DATA
#Plot the data
#plot the observations on their new axes (each point, a gene, is now represented by its coordinates on the two
#most important eigenpatients)
layout(matrix(1:2,ncol=2))
plot(predict(PCAres)[,1],predict(PCAres)[,2])
points(predict(PCAres)[,1],predict(PCAres)[,2], col="red")
title("scores")
#plot the loadings (for each patient its loading on the first and second PC)
plot(PCAres$rotation[,1],PCAres$rotation[,2])
text(PCAres$rotation[,1],PCAres$rotation[,2], labels=golub.cl[],cex=0.8,col="red")
title("loadings")
#Combined plot
biplot(PCAres, col = c("red", "blue"))
[Figure: two panels. Left, "scores": predict(PCAres)[, 2] against predict(PCAres)[, 1], one point per gene. Right, "loadings": PCAres$rotation[, 2] against PCAres$rotation[, 1], with each patient labelled by class (0 = ALL, 1 = AML).]
plot(PCAres$rotation[,2],PCAres$rotation[,3])
text(PCAres$rotation[,2],PCAres$rotation[,3], labels=golub.cl[],cex=0.8,col="red")
title("loadings")
Left plot: the observations are the genes, plotted on axes that are linear combinations of the patient vectors.
Right plot: loadings of the patients on each of the PCs. Both the ALL and the AML patients contribute to the
loadings on the two axes, but on the second component they have opposite signs. So this component
probably best captures the class distinction.
[Figure: biplot of PCAres on PC1 and PC2. The genes (scores) are labelled by their row number and the patients (loadings) by their class, ALL or AML.]
#plot on the first 3 PCs
install.packages("scatterplot3d", dependencies = TRUE)
library(scatterplot3d)
scatterplot3d(x = predict(PCAres)[,1], y = predict(PCAres)[,2], z = predict(PCAres)[,3])
In the gene dimension, reducing the data does not help to cluster the genes better. We need more dimensions to
observe clusters. Dimensionality reduction is not so useful here because in this direction the number of observations
exceeds the number of variables.
Conclusion: PCA can be performed in both directions (scores and loadings are then swapped), but in general
PCA is most useful if the number of variables exceeds the number of observations, because with so many
variables many directions of variation are possible and you want to select the most pronounced ones.
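The swap between scores and loadings can be made concrete with a small sketch. It assumes PCA without centering or scaling, where the swap is exact: the scores of one direction equal the loadings of the other direction multiplied by the singular values, up to the sign of each component. With centering and scaling, as used in the exercises, the correspondence is only approximate. The matrix `x` is simulated and hypothetical.

```r
set.seed(9)
x <- matrix(rnorm(8 * 3), nrow = 8)       # 8 hypothetical rows x 3 hypothetical columns

# PCA in both directions, without centering or scaling
p_rows <- prcomp(x,    center = FALSE, scale. = FALSE)   # rows are the observations
p_cols <- prcomp(t(x), center = FALSE, scale. = FALSE)   # columns are the observations

# Singular values of x
d <- svd(x)$d

# Scores of the first PCA = loadings of the second PCA times the singular values (up to sign)
all.equal(abs(p_rows$x),
          abs(sweep(p_cols$rotation, 2, d, "*")),
          check.attributes = FALSE)
```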