PCA Exercises
PCA tutorial

CONTENTS
Libraries
  Installing Bioconductor
  Loading libraries
  Installing libraries
Exercise 1: Metabolite data
  Load the data
  Visualize the data
  Calculate the PCA (version1)
  PCA (version2)
Exercise 2: Cancer data
  Uploading the data
  Perform PCA with genes being the variables/patients as observations
    Rescale the data
    Calculate the PCA
    Visualize the results
    Selection of the genes with the highest absolute loading on the first PC
    Selection of the genes with the highest absolute loading on the second PC
  Perform PCA with genes being treated as observations/patients as variables
    Calculate PCA
    Visualize the data

LIBRARIES
For this tutorial, we will use some R libraries that provide methods which are not part of the base R distribution.

INSTALLING BIOCONDUCTOR
# Note: in current Bioconductor releases biocLite() has been superseded by BiocManager::install().
source("http://bioconductor.org/biocLite.R")
biocLite()

LOADING LIBRARIES
In principle, these libraries should be installed on your machine before the practicals begin. If so, you can load them with the following instructions.
library(multtest)
library(matrixStats)
library(gplots)
library(rgl)
library(scatterplot3d)
If the libraries load properly, you do not need to install them, and you can skip the next section.
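Before deciding whether the installation section is needed, it can help to check which of the packages are already available. The following is a minimal sketch and not part of the original course material; the names required_pkgs, is_installed and missing_pkgs are chosen here purely for illustration.

```r
# Packages used in this tutorial (names taken from the library() calls above).
required_pkgs <- c("multtest", "matrixStats", "gplots", "rgl", "scatterplot3d")

# installed.packages() lists everything in the current library paths;
# nothing is loaded or attached by this check.
is_installed <- required_pkgs %in% rownames(installed.packages())

# Report what still has to be installed.
missing_pkgs <- required_pkgs[!is_installed]
if (length(missing_pkgs) == 0) {
  message("All required packages are available.")
} else {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
```

If the message lists missing packages, only those need to be installed.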
INSTALLING LIBRARIES
If the libraries are not installed on your machine, you can easily install them yourself, provided you have an internet connection. To install the required libraries, log in as system administrator, open R and type the following commands.
biocLite("multtest")
install.packages("scatterplot3d", dependencies = TRUE)
install.packages("rgl")
install.packages("gplots")
install.packages("matrixStats")

EXERCISE 1: METABOLITE DATA
Dataset from the library pcaMethods.
Description: A complete subset from a larger metabolite data set. This is the original, complete data set and can be used to compare estimation results obtained with the incomplete data that is also provided (called metaboliteData). The data was created for an Arabidopsis cold-stress experiment.
Details: A matrix containing 52 timepoints (columns) and 154 metabolites (rows).
Standard convention: the rows are the observations and the columns are the variables.

LOAD THE DATA
load("C:/Data/Marchal/lessen/lessen_2015_2016/statistiek/PCA/course_PCA_2014/exercises/Data/metaboliteDataComplete.RData")
head(metaboliteDataComplete)

VISUALIZE THE DATA
# metaboliteDataComplete does not contain missing values, whereas metaboliteData does
metaboliteDataComplete[1:3,]
mDC <- metaboliteDataComplete
# get the dimensions of the matrix
dim(metaboliteDataComplete)
#[1] 154 52
# The data contain measurements of 154 metabolites at 52 timepoints. In this first view the variables are the metabolites and the observations are the timepoints. Plotted points are the observations and the axes are the variables, so here the axes are the metabolites and we plot the time points. Metabolites that have the same time profile and belong to the same pathway carry redundant information that can be removed by PCA.
# Make a scatter plot of the data. The time profiles of some metabolites seem to be correlated, i.e. the variables are not completely independent. Which metabolites would be correlated? E.g. the ones that are involved in similar pathways.
X11()
layout(matrix(1:10,ncol=5))
plot(mDC[1,],mDC[2,], xlab=rownames(mDC)[1], ylab=rownames(mDC)[2])
plot(mDC[1,],mDC[3,], xlab=rownames(mDC)[1], ylab=rownames(mDC)[3])
plot(mDC[1,],mDC[4,], xlab=rownames(mDC)[1], ylab=rownames(mDC)[4])
plot(mDC[1,],mDC[5,], xlab=rownames(mDC)[1], ylab=rownames(mDC)[5])
plot(mDC[1,],mDC[6,], xlab=rownames(mDC)[1], ylab=rownames(mDC)[6])
plot(mDC[100,],mDC[6,], xlab=rownames(mDC)[100], ylab=rownames(mDC)[6])
plot(mDC[110,],mDC[50,], xlab=rownames(mDC)[110], ylab=rownames(mDC)[50])
plot(mDC[110,],mDC[70,], xlab=rownames(mDC)[110], ylab=rownames(mDC)[70])
plot(mDC[110,],mDC[70,], xlab=rownames(mDC)[110], ylab=rownames(mDC)[70])
plot(mDC[90,],mDC[80,], xlab=rownames(mDC)[90], ylab=rownames(mDC)[80])
OR
# note: the double loop draws 100 plots, so the 10-panel layout is filled and overwritten several times
layout(matrix(1:10,ncol=5))
for(i in 1:10) {
  for(j in 1:10) {
    plot(mDC[i,],mDC[j,], xlab=rownames(mDC)[i], ylab=rownames(mDC)[j])
  }
}
# Make a scatter plot of the data to check whether there is correlation between the profiles of the time points (now the variables are the time points and the observations are the metabolites). Such correlation would also imply redundancy in the data that can be removed by PCA, but in the other direction. It is obvious that there is a strong correlation between repeats of the same time point.
# Now the plotted points are the metabolites (observations) and the time profiles are the axes (variables)
layout(matrix(1:10,ncol=5))
plot(mDC[,1],mDC[,1], xlab=colnames(mDC)[1], ylab=colnames(mDC)[1])
plot(mDC[,1],mDC[,3], xlab=colnames(mDC)[1], ylab=colnames(mDC)[3])
plot(mDC[,1],mDC[,4], xlab=colnames(mDC)[1], ylab=colnames(mDC)[4])
plot(mDC[,1],mDC[,5], xlab=colnames(mDC)[1], ylab=colnames(mDC)[5])
plot(mDC[,1],mDC[,6], xlab=colnames(mDC)[1], ylab=colnames(mDC)[6])
plot(mDC[,40],mDC[,6], xlab=colnames(mDC)[40], ylab=colnames(mDC)[6])
plot(mDC[,41],mDC[,50], xlab=colnames(mDC)[41], ylab=colnames(mDC)[50])
plot(mDC[,50],mDC[,49], xlab=colnames(mDC)[50], ylab=colnames(mDC)[49])
plot(mDC[,20],mDC[,30], xlab=colnames(mDC)[20], ylab=colnames(mDC)[30])
plot(mDC[,21],mDC[,22], xlab=colnames(mDC)[21], ylab=colnames(mDC)[22])
OR
X11()
layout(matrix(1:10,ncol=5))
for(i in 1:10) {
  for(j in 1:10) {
    plot(mDC[,i],mDC[,j], xlab=colnames(mDC)[i], ylab=colnames(mDC)[j])
  }
}
[Figure: grid of pairwise scatter plots of time points (axes labelled with column names such as X0h, X4h.4, X12h, X24h, X48h.3, X96h.5); each point is a metabolite.]

Calculate the correlation of mDC: correlation between the time profiles.
X11()
dim(cor(mDC))
#[1] 52 52
round(cor(mDC),2)
heatmap(cor(mDC), scale="none", main="correlation between samples", cexRow=0.8, cexCol=0.8)
[Figure: heatmap "correlation between samples"; rows and columns are the 52 time points (X0h ... X96h.7), reordered by the clustering.]
# Correlation of the transpose of mDC
dim(cor(t(mDC)))
round(cor(t(mDC)),2)
heatmap(cor(t(mDC)), scale="none", main="correlation between metabolites", cexRow=0.5, cexCol=0.5)
# This expresses how well the row vectors of the original matrix (i.e. the metabolite profiles over time) are correlated.
# check the first 50 metabolites only
heatmap(cor(t(mDC))[1:50,1:50], scale="none", main="correlation between metabolites", cexRow=0.5, cexCol=0.5)
[Figure: heatmap "correlation between metabolites" for the first 50 metabolites; rows and columns are labelled with the metabolite names (Shikimic acid (4TMS), Tyramine (3TMS), Maltose methoxyamine (8TMS), ...).]
When the metabolites are considered to be the variables, some of them are clearly correlated, so we can reduce the dimensions.

CALCULATE THE PCA (VERSION1)
The variables are the time points and the observations are the metabolites. We want to see whether there is a relation between the metabolites. Because there is correlation between the time points (e.g. between repeats), we can reduce the number of dimensions and visualize the metabolites in fewer dimensions. In this case each PC is a linear combination of time points.
# Centering and variance rescaling (in mDC the variables, i.e. the time points, are in the columns; after taking the transpose they are in the rows)
# Centering
# For PCA, each variable should always be mean centered and variance rescaled over the observations. Here the variables (the time points) are in the columns, so the matrix should be mean centered and variance rescaled per column. Because subtracting a vector from a matrix works along the rows, the transposes ('t') are needed.
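To illustrate the row-wise behaviour described above, here is a small sketch on a toy matrix (the matrix m and the variable names are made up for this illustration, they are not part of the exercise data). It also shows sweep() and scale() as alternative ways to center columns without transposing; the tutorial code itself uses the transpose.

```r
# Toy matrix: 2 rows, 3 columns (filled column by column).
m <- matrix(1:6, nrow = 2)

# A vector with one value per ROW is recycled down the columns,
# so element i of the vector is subtracted from every entry of row i.
row_centered <- m - rowMeans(m)

# To center the COLUMNS with the same arithmetic, transpose first,
# subtract the column means, and transpose back.
col_centered <- t(t(m) - colMeans(m))

# sweep() and scale() express the same column centering directly.
col_centered_sweep <- sweep(m, 2, colMeans(m))
col_centered_scale <- scale(m, center = TRUE, scale = FALSE)

rowMeans(row_centered)   # all zero
colMeans(col_centered)   # all zero
```

All three column-centered results contain the same numbers; scale() additionally stores the subtracted means as an attribute.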
row_mean = apply(t(mDC), 1, mean, na.rm=TRUE)
mDC_meancentered = t(mDC) - row_mean
dim(mDC_meancentered)
rowMeans(mDC_meancentered, na.rm=TRUE) # (should be 0)
# variance rescaling
SD = apply(mDC_meancentered, 1, sd, na.rm=TRUE)
mDC_rescaled = mDC_meancentered/SD
dim(mDC_rescaled)
SD = apply(mDC_rescaled, 1, sd) # (should be one)
mDC_rescaled[1,1:10]
mean(as.matrix(mDC_rescaled[1,]), na.rm=TRUE)
dim(mDC_rescaled)
# note: we have to take the transpose of mDC_rescaled in order to have the same dimensions as mDC
dim(mDC)
dim(t(mDC_rescaled))
# Visualize the column centering using boxplots. For a boxplot, as for any statistical command in R, it is assumed that the variables (the time points) are in the columns. For mDC this was originally the case, but for mDC_meancentered and mDC_rescaled we need to take the transpose.
X11()
boxplot(mDC, las=2, col=rainbow(ncol(mDC)), main="Samples before centering", ylim=c(-2,3))
abline(h=0, col="gray20", lwd=2)
[Figure: boxplots "Samples before centering", one box per time point (X0h ... X96h.7), y axis from -2 to 3.]
X11()
boxplot(t(mDC_meancentered), las=2, col=rainbow(ncol(t(mDC_meancentered))), main="Samples after centering", ylim=c(-2,3))
abline(h=0, col="gray20", lwd=2)
X11()
boxplot(t(mDC_rescaled), las=2, col=rainbow(ncol(t(mDC_rescaled))), main="Samples after rescaling and mean centering", ylim=c(-2,3))
abline(h=0, col="gray20", lwd=2)
[Figure: boxplots "Samples after centering", one box per time point (X0h ... X96h.7), y axis from -2 to 3.]
# Calculate the PCA. prcomp works on the variables, which must be in the columns; we want to make new time variables.
# For PCA the variables should always be in the columns and the observations in the rows. Here we want the time points to be the variables.
PCAres<-prcomp(mDC, scale = TRUE, center=TRUE)
typeof(PCAres)
print(PCAres)
OR
PCAres<-prcomp(t(mDC_rescaled), scale = FALSE, center=FALSE)
See which items are present in the list PCAres (e.g. with names(PCAres)).
# Look at the loadings of the PCs
head(PCAres$rotation)
               PC1         PC2          PC3         PC4         PC5        PC6
X0h    0.021863836  0.02390600 -0.463163721  0.05626445 -0.08596935  0.2551856
X0h.1  0.039028678  0.01769139 -0.346908304  0.34927246  0.01555764 -0.2710336
X0h.2  0.005461632 -0.01703487  0.033930967 -0.35370175  0.47293593 -0.3680019
X0h.3 -0.041738981 -0.12097550  0.220754264  0.26540362  0.45128467  0.2682409
X0h.4 -0.013439588  0.04087621 -0.009428514 -0.59870518  0.02383249 -0.1353698
X0h.5 -0.004382840  0.09862447  0.238903307  0.03477424 -0.58256605 -0.2192965
This shows the contribution of the different time points to the new PCs. PCAres$rotation is a 52 x 52 matrix that contains the PC vectors in its columns. The rows indicate the contribution of each original time point to each of the PCs.
# calculate the summary
summary(PCAres)
Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6
Standard deviation     5.6000 2.6055 1.85088 1.45427 1.23946 1.07153
Proportion of Variance 0.6031 0.1305 0.06588 0.04067 0.02954 0.02208
Cumulative Proportion  0.6031 0.7336 0.79951 0.84018 0.86972 0.89180
                           PC7     PC8     PC9    PC10    PC11    PC12
Standard deviation     0.92986 0.84450 0.77595 0.67501 0.58490 0.54667
Proportion of Variance 0.01663 0.01371 0.01158 0.00876 0.00658 0.00575
Cumulative Proportion  0.90843 0.92214 0.93372 0.94248 0.94906 0.95481
                          PC13    PC14    PC15    PC16    PC17    PC18
Standard deviation     0.50560 0.49336 0.46365 0.44599 0.39925 0.38106
Proportion of Variance 0.00492 0.00468 0.00413 0.00383 0.00307 0.00279
Cumulative Proportion  0.95973 0.96441 0.96854 0.97237 0.97543 0.97822
                          PC19   PC20    PC21    PC22    PC23    PC24
Standard deviation     0.36972 0.3306 0.31363 0.29442 0.26589 0.25914
Proportion of Variance 0.00263 0.0021 0.00189 0.00167 0.00136 0.00129
Cumulative Proportion  0.98085 0.9830 0.98485 0.98651 0.98787 0.98917
                          PC25    PC26   PC27    PC28   PC29    PC30
Standard deviation     0.24716 0.22246 0.2161 0.20559 0.1905 0.18644
Proportion of Variance 0.00117 0.00095 0.0009 0.00081 0.0007 0.00067
Cumulative Proportion  0.99034 0.99129 0.9922 0.99300 0.9937 0.99437
The first PC alone already accounts for about 60% of the variance in the data.
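As a quick check of the summary table, the proportions can be recomputed by hand from the standard deviations. This is only a sketch: the three values are copied from the output above, and the variable names sdev_first3 and n_variables are introduced here for illustration. It relies on the fact that, after variance rescaling, every one of the 52 time points has unit variance, so the total variance equals 52.

```r
# Standard deviations of the first three PCs, copied from summary(PCAres) above.
sdev_first3 <- c(5.6000, 2.6055, 1.85088)

# With scale = TRUE each of the 52 time points has unit variance,
# so the total variance equals the number of variables.
n_variables <- 52

# Variance of a PC = sdev^2; dividing by the total gives the proportion.
proportion <- sdev_first3^2 / n_variables
round(proportion, 3)   # about 0.603, 0.131, 0.066
cumsum(proportion)     # running total; compare with "Cumulative Proportion"
```

Small differences in the last digit are expected because the standard deviations in the printed summary are rounded.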
# proportion of the variance explained
var_explained = (PCAres$sdev)^2
tot_var = sum((PCAres$sdev)^2)
proportion_of_variance = var_explained/tot_var
# note: these values are identical to those in summary(PCAres)
# plot the variance explained per PC
# with the first 4 PCs a lot of the variance can already be explained
X11()
plot(proportion_of_variance, type="o", pch=20)
pie(proportion_of_variance)
[Figure: pie chart of the proportion of variance per PC (slices labelled 1 to 52).]
# calculate the scores: the projection of the original data points on the PCs
predict(PCAres)[,1]
predict(PCAres)[,2]
First six values of each:
Metabolite                                 predict(PCAres)[,1]  predict(PCAres)[,2]
Xylose methoxyamine (4TMS)                          -2.2280074           0.34982667
Tyramine (3TMS)                                     -3.4103018          -1.72048763
trans-Sinapinic acid (2TMS)                          3.0918246           1.25496585
Threonic acid-1,4-lactone (2TMS), trans-            -3.2430423           0.10434601
Threonic acid (4TMS)                                -0.9644185          -0.03492749
Succinic acid (2TMS)                                -4.1518490           2.32706064
Note that predict() returns the projection of the original data points on the PCs. This projection can be obtained by multiplying the PC matrix with the original (centered and rescaled) data matrix, which is what the predict function does. PCAres$rotation is the matrix with the PCs in its columns, i.e. the transformation matrix. From the theory course we know that $PX = Y$ corresponds to a rotation of the original axes, and that $Y$ contains the coordinates of the data points along each of the new axes.
In these examples, PCAres$rotation has in its columns the vectors that correspond to the PCs, so it is the transpose $P^T$ of the matrix $P$ from the theory. The transformation $PX = Y$ then becomes $Y^T = X^T P^T$, where, compared to the theory, $X^T$ has the observations in its rows and the variables in its columns, so it is a 154 x 52 matrix.
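The relation between the scores and the rotation matrix can be verified on a small example. The sketch below uses random toy data rather than the metabolite data (the names toy, pca_toy, toy_scaled and scores_manual are invented for this illustration) and checks that multiplying the centered and rescaled data by the rotation matrix reproduces what predict() returns.

```r
set.seed(1)

# Toy data: 20 observations (rows) and 4 variables (columns).
toy <- matrix(rnorm(20 * 4), nrow = 20)

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Center and rescale with the same values prcomp used ...
toy_scaled <- scale(toy, center = pca_toy$center, scale = pca_toy$scale)

# ... and project onto the PCs (the columns of the rotation matrix).
scores_manual <- toy_scaled %*% pca_toy$rotation

# predict() without new data returns the stored scores pca_toy$x.
all.equal(unname(scores_manual), unname(predict(pca_toy)))   # TRUE
```

The same check can be done column by column, which is how the projections on the individual PCs are computed in the exercise.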
# projection on the first PC
X1 = t(mDC_rescaled) %*% PCAres$rotation[,1]
head(X1)
# Xylose methoxyamine (4TMS)               -2.2280074
# Tyramine (3TMS)                          -3.4103018
# trans-Sinapinic acid (2TMS)               3.0918246
# Threonic acid-1,4-lactone (2TMS), trans- -3.2430423
# Threonic acid (4TMS)                     -0.9644185
# Succinic acid (2TMS)                     -4.1518490
OR
head(predict(PCAres)[,1])
# returns the same six values: -2.2280074 -3.4103018 3.0918246 -3.2430423 -0.9644185 -4.1518490
# projection on the second PC
X2 = t(mDC_rescaled) %*% PCAres$rotation[,2]
# projection on the third PC
X3 = t(mDC_rescaled) %*% PCAres$rotation[,3]
# plot the original data (metabolites) in the new basis (scores)
plot(predict(PCAres)[,1],predict(PCAres)[,2])
abline(v=0, col="gray")
abline(h=0, col="gray")
text(predict(PCAres)[,1],predict(PCAres)[,2],labels=sub("X(.+h)(\\..)?","\\1",rownames(mDC)),cex=1, col=rainbow(nrow(mDC)), adj = c(0,0))
# v and h give the positions of the vertical and horizontal lines
[Figure: scatter plot of the metabolite scores, predict(PCAres)[, 1] on the x axis against predict(PCAres)[, 2] on the y axis, each point labelled with its metabolite name.]
We can plot the metabolites in two dimensions but do not observe a clear clustering of metabolites (plotting without the metabolite names makes this clearer).
However, if we plotted metabolites that occur in the same pathway in the same color, we might see more structure in the data (we do not have that information). The variance along the second PC is mainly affected by the outlier maltose. We can also plot the loadings, i.e. the contributions of each original time point (axis) to the new PCs.
plot(PCAres$rotation[,1],PCAres$rotation[,2], pch=20, col="gray40")
abline(v=0, col="gray")
abline(h=0, col="gray")
text(PCAres$rotation[,1],PCAres$rotation[,2], labels=sub("X(.+h)(\\..)?","\\1",rownames(PCAres$rotation)),cex=1, col=rainbow(ncol(mDC)), adj = c(0,0))
[Figure: loadings plot, PCAres$rotation[, 1] on the x axis against PCAres$rotation[, 2] on the y axis, points labelled by time point (0h, 1h, 4h, 12h, 24h, 48h, 96h).]
This plot shows that the variability along the first PC is mainly driven by the difference between time 0 (the reference) and the later time points: time zero has almost no contribution to the first PC, whereas the other time points do. The variability along the second PC is mainly driven by the early and late time points: the early time points have a strong negative contribution and the late time points a strong positive contribution. Repeats of a time point contribute equally to either component (they have similar loadings on either component, indicating that they are measured consistently).
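The observation that repeats receive similar loadings can be reproduced on simulated data. The following sketch is an illustration only: the data are random, and the names signal, rep1, rep2, other and pca_toy are made up here. Two noisy copies of the same profile play the role of repeated time points.

```r
set.seed(42)
n <- 100
signal <- rnorm(n)

toy <- cbind(
  rep1  = signal + rnorm(n, sd = 0.05),  # two noisy "repeats" of one profile
  rep2  = signal + rnorm(n, sd = 0.05),
  other = rnorm(n)                       # an unrelated variable
)

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Loadings on the first PC: rep1 and rep2 receive nearly identical values;
# the unrelated variable usually contributes less.
round(pca_toy$rotation[, "PC1"], 3)
```

The sign of a loading vector is arbitrary, so only the relative values and the agreement between the two repeats should be interpreted.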
biplot(PCAres, col =c("blue", "black"))
[Figure: biplot of PC1 against PC2; metabolite names in blue, time points (X0h ... X96h.7) in black.]
Blue are the metabolites (scores), black are the time points (loadings).

PCA (VERSION2)
We can also run PCA in the other direction (consider the metabolites as variables and the time points as observations). This makes most sense, as the number of variables is now larger than the number of observations and thus a dimensionality reduction is required.
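When there are more variables than observations, the number of components is limited by the number of observations. The sketch below shows this on random toy data (the sizes and the names n_obs, n_vars, toy and pca_toy are assumptions made for this illustration): prcomp returns as many components as there are observations, and after centering the last one carries no variance.

```r
set.seed(7)
n_obs  <- 5    # few observations (rows)
n_vars <- 20   # many variables (columns)
toy <- matrix(rnorm(n_obs * n_vars), nrow = n_obs)

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

length(pca_toy$sdev)     # number of components returned: min(n_obs, n_vars)
dim(pca_toy$rotation)    # n_vars rows, one column per component
round(pca_toy$sdev, 6)   # the last value is numerically zero after centering
```

For the metabolite data in this direction (52 time points as observations, 154 metabolites as variables) this means at most 52 components, of which at most 51 carry variance.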
# Centering and variance rescaling (variables in the rows)
# centering
row_mean = apply(mDC, 1, mean, na.rm=TRUE)
mDC_meancentered = mDC - row_mean
dim(mDC_meancentered)
rowMeans(mDC_meancentered, na.rm=TRUE) # (should be 0)
# variance rescaling
SD = apply(mDC_meancentered, 1, sd, na.rm=TRUE)
mDC_rescaled = mDC_meancentered/SD
dim(mDC_rescaled)
SD = apply(mDC_rescaled, 1, sd) # (should be one)
mDC_rescaled[1,1:10]
mean(as.matrix(mDC_rescaled[1,]), na.rm=TRUE)
rownames(mDC_rescaled)=rownames(mDC)
dim(mDC)
#[1] 154 52
dim(mDC_rescaled)
#[1] 154 52

Calculating PCA
PCAres<-prcomp(t(mDC_rescaled), scale = FALSE, center=FALSE)
OR
PCAres<-prcomp(t(mDC), scale = TRUE, center=TRUE)
Importance of components (first 15 PCs):
PC    Standard deviation  Proportion of Variance
PC1   8.6680              0.4879
PC2   3.57568             0.08302
PC3   3.33176             0.07208
PC4   2.99490             0.05824
PC5   2.42193             0.03809
PC6   2.03065             0.02678
PC7   1.81039             0.02128
PC8   1.64788             0.01763
PC9   1.56768             0.01596
PC10  1.51289             0.01486
PC11  1.41419             0.01299
PC12  1.36950             0.01218
PC13  1.24157             0.01001
PC14  1.22521             0.00975
PC15  1.16774             0.00885
biplot(PCAres, col =c("blue", "black"))
Now the loadings (black) are the metabolites and the observations, i.e. the scores (blue), are the time points.
X11()
plot(predict(PCAres)[,1],predict(PCAres)[,2])
abline(v=0, col="gray")
abline(h=0, col="gray")
text(predict(PCAres)[,1],predict(PCAres)[,2], labels=sub("X(.+h)(\\..)?","\\1",colnames(mDC)),cex=1, col=rainbow(ncol(mDC)), adj = c(0,0))

EXERCISE 2: CANCER DATA
UPLOADING THE DATA
library(multtest)
data(golub, package = "multtest")
# this imports the matrix golub and the variables golub.cl and golub.gnames
dim(golub)
#[1] 3051 38
typeof(golub)
# double
# we will now assign column names
colnames(golub)<-factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
Input: gene expression data collected by Golub et al., Science, Vol. 286:531-537, 1999.
Following Golub et al., three preprocessing steps were applied to the normalized matrix of intensity values available on the website:
(i) thresholding: floor of 100 and ceiling of 16,000;
(ii) filtering: exclusion of genes with max/min ≤ 5 or (max-min) ≤ 500, where max and min refer respectively to the maximum and minimum intensities for a particular gene across mRNA samples;
(iii) base 10 logarithmic transformation.
Boxplots of the expression levels for each of the 38 samples revealed the need to standardize the expression levels within arrays before combining data across samples. The data were then summarized by a 3,051 x 38 matrix $X = (x_{ij})$, where $x_{ij}$ denotes the expression level for gene i in tumor mRNA sample j.
Let's start with an example of a gene that we know to be a biomarker: is the gene differentially expressed between the two cancer types?
plot(golub[1042,])
golub.gnames[1042,]
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
boxplot(golub[1042,] ~ gol.fac)
# the boxplot shows that the expression values of CCND3 (Cyclin D3) in ALL are all positive
# one-sample t-test to demonstrate that the gene's expression level within a factor level (here ALL) is significantly different from (here: greater than) 0
t.test(golub[1042,gol.fac=="ALL"], mu=0, alternative = "greater")
#two-sample t-test to demonstrate that the gene is differentially expressed
t.test(golub[1042,] ~ gol.fac, var.equal=FALSE)

PERFORM PCA WITH GENES BEING THE VARIABLES/PATIENTS AS OBSERVATIONS
This is the most logical direction, as here we have more variables than observations.

RESCALE THE DATA
#variance rescaling and centering (the variables are the genes, which are in the rows of golub, so we center and rescale each row; the transpose is taken afterwards, when the PCA is calculated)
#centering
row_mean = apply(golub, 1, mean, na.rm=TRUE)
golub_meancentered = golub - row_mean
dim(golub_meancentered)
#variance rescaling
SD = apply(golub_meancentered, 1, sd, na.rm=TRUE)
golub_rescaled = golub_meancentered/SD
dim(golub_rescaled)
apply(golub_rescaled, 1, sd) #(should be one)
golub_rescaled[1,1:10]
mean(as.matrix(golub_rescaled[1,]), na.rm=TRUE)
dim(golub_rescaled)
# before doing PCA make sure the variables are in the columns; this is not yet the case.
# So we will need to calculate the PCA on the transpose of this matrix.

CALCULATE THE PCA
Patients are the observations and the genes the variables, so transpose the matrix:
gt = t(golub)
dim(gt) #38 x 3051

Perform PCA
PCAres_t<-prcomp(gt, center = TRUE, scale = TRUE)
head(PCAres_t$rotation)
OR
PCAres_t<-prcomp(t(golub_rescaled), center = FALSE, scale= FALSE)
head(PCAres_t$rotation)

#Check the results
summary(PCAres_t)

#Proportion of the variance explained
var_explained=(PCAres_t$sdev)^2
tot_var = sum((PCAres_t$sdev)^2)
proportion_of_variance =var_explained/tot_var
#Note these values are identical to summary(PCAres_t)

#Plot the variance explained per component
plot(proportion_of_variance)
pie(proportion_of_variance)

[Figure: pie chart of the proportion of variance explained by each of the 38 components]

#Project the data on the first two principal components
# the projection of the data on the PCs is calculated by multiplying the original data by the new basis vectors (PCs)
# (the next line needs the center and scale stored by the first variant, prcomp(gt, center = TRUE, scale = TRUE); after the second variant use t(golub_rescaled) instead)
gt_scale<-scale(gt, center = PCAres_t$center, scale=PCAres_t$scale)
dim(gt_scale)
gt_scale[,1:2]
ALL -0.559145162 0.17976638
ALL -0.451135951 -0.78613128
OR
t(golub_rescaled)
dim(t(golub_rescaled))
t(golub_rescaled)[,1:2]
    [,1]         [,2]
ALL -0.559145162 0.17976638
ALL -0.451135951 -0.78613128

dim(gt_scale) #38 3051
dim(as.matrix(PCAres_t$rotation[,1])) #3051 1
#projection on the first PC
X1 = gt_scale %*%PCAres_t$rotation[,1]
#projection on the second PC
X2 = gt_scale %*%PCAres_t$rotation[,2]
#projection on the third PC
X3 = gt_scale %*%PCAres_t$rotation[,3]
# note that X1 = gt_scale %*%PCAres_t$rotation[,1]
# contains the same values as t(PCAres_t$rotation[,1])%*%t(gt_scale) (which is its transpose)
# or as predict(PCAres_t)[,1]
head(predict(PCAres_t)[,1])
head(gt_scale %*%PCAres_t$rotation[,1])
dim(gt_scale)
dim(as.matrix(PCAres_t$rotation[,1]))

VISUALIZE THE RESULTS
#plot the original data in the new basis (scores)
X11()
layout(matrix(1:2,ncol=2))
plot(predict(PCAres_t)[,1],predict(PCAres_t)[,2])
text(predict(PCAres_t)[,1],predict(PCAres_t)[,2], labels=golub.cl,cex=0.8,col="red")
title("scores")
#plot the loadings (for each gene its loading on the first and second component)
plot(PCAres_t$rotation[,1],PCAres_t$rotation[,2])
points(PCAres_t$rotation[1042, 1],PCAres_t$rotation[1042,2],col ="blue")
title("loadings")

[Figure: two panels. Left, "scores": predict(PCAres_t)[, 2] against predict(PCAres_t)[, 1], with the patients labelled 0 (ALL) or 1 (AML). Right, "loadings": PCAres_t$rotation[, 2] against PCAres_t$rotation[, 1], annotated "Genes that only contribute to second component" and "Genes that only contribute to first component".]

Left: the patients (observations) are plotted on their new axes (linear combinations of the variables). Right: loadings, or contributions of the different genes to the PCs. Both PCs contribute to distinguishing the patients. Think of them as two characteristic pathways of correlated genes that together are able to separate the patients. When plotted in two dimensions the patients can be completely separated.
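The manual projection and the variance calculation described above can be checked on a small made-up matrix. This is only a sketch: the matrix X, its size and the object names are assumptions for illustration and are not part of the golub exercise. It uses pca$x, which holds the same scores that predict() returns when no new data are supplied.

```r
# Toy data: 6 observations (rows) x 4 variables (columns); the values are made up.
set.seed(1)
X <- matrix(rnorm(24), nrow = 6, ncol = 4)

# PCA with centering and unit-variance scaling of each column.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Rebuild the centered and scaled matrix from the values stored by prcomp.
X_scaled <- scale(X, center = pca$center, scale = pca$scale)

# Scores = scaled data multiplied by the loadings (rotation matrix).
scores_manual <- X_scaled %*% pca$rotation

# Proportion of variance = squared standard deviations divided by their sum.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Compare with the scores stored by prcomp and with summary()
# (summary() rounds the proportions, hence the looser tolerance).
scores_match <- all.equal(unname(scores_manual), unname(pca$x))
prop_match <- all.equal(prop_var,
                        unname(summary(pca)$importance["Proportion of Variance", ]),
                        tolerance = 1e-4)
print(scores_match)
print(prop_match)
```

The same comparison should carry over to gt_scale and PCAres_t, provided PCAres_t was computed with centering and scaling switched on.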
#plot on the first 3 PCs
library(rgl)
#colour 1 for AML (golub.cl == 1), colour 2 for ALL
mcol = ifelse(golub.cl == 1, 1, 2)
plot3d(x = predict(PCAres_t)[,1], y = predict(PCAres_t)[,2], z = predict(PCAres_t)[,3], col= mcol)
OR
library(scatterplot3d)
scatterplot3d(x = predict(PCAres_t)[,1], y = predict(PCAres_t)[,2], z = predict(PCAres_t)[,3])

SELECTION OF THE GENES WITH THE HIGHEST ABSOLUTE LOADING ON THE FIRST PC
Search for the gene with the highest loading on the first PC (this is the gene that contributes most to the direction of largest variation):
which(PCAres_t$rotation[,1]==max(PCAres_t$rotation[,1])) #2821
golub.gnames[2821,2] #"Metargidin precursor mRNA"
order(abs(PCAres_t$rotation[,1]), decreasing = TRUE)
sort(abs(PCAres_t$rotation[,1]), decreasing = TRUE)
#x=c(2,5,8,2,-4)
#abs(x) #2, 5, 8, 2, 4
#sort(abs(x), decreasing =TRUE) #8 5 4 2 2
#order(abs(x), decreasing =TRUE) # 3 2 5 1 4
#Order the genes according to their absolute loading on the first PC
selection = order(abs(PCAres_t$rotation[,1]), decreasing = TRUE)
# plot the expression of those genes with a high loading across the different patients (gene expression profiles).
# Because the genes contribute to the same direction of variation they should be redundant (and thus have similar expression profiles)
library(gplots)
heatmap.2(golub[selection[1:10],] , scale="none", cexRow=0.5, cexCol=0.8, col=topo.colors(20), trace="none", Rowv= FALSE, Colv=FALSE)
golub.gnames[selection[1:10],2]

[Figure: heat map of the 10 selected genes across the 38 patients (columns labelled AML or ALL), with colour key and histogram. Plot panel without reordering of genes and columns.]

Plot panel with reordering of the genes:
heatmap.2(golub[selection[1:10],] , scale="none", cexRow=0.5, cexCol=0.8, col=topo.colors(20), trace="none", Rowv= TRUE, Colv=FALSE)
golub.gnames[selection[1:10],2]

[Figure: plot panel b, with reordering of genes and columns.]

Conclusion:
None of the genes has a profile that corresponds exactly to the patient subdivision: the patients can only be subdivided by taking a combination of the eigengenes (or pathways).
Genes with a high loading on the first component tend to be correlated (coexpressed), or anticorrelated if the loading is negative, which is reminiscent of genes in a similar pathway.
The most important direction of variation more or less corresponds to the class distinction (with a few exceptions).
So we could successfully reduce the number of variables from the total number of genes measured to 2.

SELECTION OF THE GENES WITH THE HIGHEST ABSOLUTE LOADING ON THE SECOND PC
selection2 = order(abs(PCAres_t$rotation[,2]), decreasing = TRUE)
heatmap.2(golub[selection2[1:10],] , scale="none", cexRow=0.5, cexCol=0.8, col=topo.colors(20), trace="none", Rowv= TRUE, Colv=FALSE)

PERFORM PCA WITH GENES BEING TREATED AS OBSERVATIONS/PATIENTS AS VARIABLES

CALCULATE PCA
For the standard PCA the variables are the columns and the rows are the observations. So we perform PCA with the genes as the observations and the patients as the variables.
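The row/column convention stated above can be tried out on a toy matrix before applying it to golub. This is a sketch under stated assumptions: the matrix M, its dimensions and the object names are invented for illustration. With the default settings used here, prcomp returns at most as many components as the smaller of the two dimensions.

```r
# Toy matrix: 50 "genes" in the rows, 5 "patients" in the columns (made-up values).
set.seed(2)
M <- matrix(rnorm(50 * 5), nrow = 50, ncol = 5)

# Rows are observations, columns are variables: here the genes are the observations.
pca_genes_obs <- prcomp(M, scale. = TRUE)
dim_rot_genes_obs <- dim(pca_genes_obs$rotation)  # one row per variable (patient)
dim_x_genes_obs <- dim(pca_genes_obs$x)           # one row per observation (gene)

# Transposing swaps the roles: now the patients are the observations.
pca_patients_obs <- prcomp(t(M), scale. = TRUE)
dim_rot_patients_obs <- dim(pca_patients_obs$rotation)  # one row per variable (gene)
dim_x_patients_obs <- dim(pca_patients_obs$x)           # one row per observation (patient)

# Number of components returned in each orientation.
n_comp_genes_obs <- length(pca_genes_obs$sdev)
n_comp_patients_obs <- length(pca_patients_obs$sdev)
print(c(n_comp_genes_obs, n_comp_patients_obs))
```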
How many components do we observe? (38)
PCAres<-prcomp(golub, scale = TRUE)
summary(PCAres)

Manually calculate the proportion of the variance explained
var_explained=PCAres$sdev^2
tot_var = sum((PCAres$sdev)^2)
proportion_of_variance =var_explained/tot_var
#note these values are identical to summary(PCAres)

Make a chart to visualize the amount of variance explained by each of the components
plot(proportion_of_variance)
pie(proportion_of_variance)
The first component (which corresponds to some linear combination of the patients) explains almost all the variation.

[Figure: pie chart of the proportion of variance explained by each of the 38 components]

Project the data on the first two PCs: multiply the original data by the new basis vectors.
#variance rescaling and centering (the variables are the patients, but the calculation below works on the rows, so to center and rescale the variables we take the transpose of the matrix)
#centering
row_mean = apply(t(golub), 1, mean, na.rm=TRUE)
golub_meancentered = t(golub) - row_mean
dim(golub_meancentered)
#variance rescaling
SD = apply(golub_meancentered, 1, sd, na.rm=TRUE)
golub_rescaled = golub_meancentered/SD
dim(golub_rescaled)
SD = apply(golub_rescaled, 1, sd) #(should be one)
golub_rescaled[1,1:10]
mean(as.matrix(golub_rescaled[1,]), na.rm=TRUE)
dim(golub_rescaled)
golub_rescaled = t(golub_rescaled)
head(golub_rescaled)
golub_scale = golub_rescaled #the projection below uses the name golub_scale
OR
#rescale the data
golub_scale<-scale(golub, center = PCAres$center, scale=PCAres$scale)
head(golub_scale)

Manual calculation of PCA (P*X=Y)
#projection on the first PC = scores on the first PC
dim(golub_scale) #3051 38
dim(as.matrix(PCAres$rotation[,1])) #38 1
X1= golub_scale %*%PCAres$rotation[,1]
dim(X1) #3051 1
#projection on the second PC = scores on the second PC
X2= golub_scale %*%PCAres$rotation[,2]
#the calculation of X1 is identical to predict(PCAres)[,1]
head(predict(PCAres)[,1])
head(golub_scale %*%PCAres$rotation[,1])
dim(golub_scale)
dim(as.matrix(PCAres$rotation[,1]))

VISUALIZE THE DATA
#Plot the data
#plot the observations on their new axes (each point, a gene, is now represented by its coordinates on the first, most important eigenpatients)
layout(matrix(1:2,ncol=2))
plot(predict(PCAres)[,1],predict(PCAres)[,2])
points(predict(PCAres)[,1],predict(PCAres)[,2], col="red")
title("scores")
#plot the loadings (for each patient its loading on the first and second PC)
plot(PCAres$rotation[,1],PCAres$rotation[,2])
text(PCAres$rotation[,1],PCAres$rotation[,2], labels=golub.cl,cex=0.8,col="red")
title("loadings")
#Combined plot
biplot(PCAres, col = c("red", "blue"))

[Figure: two panels. Left, "scores": predict(PCAres)[, 2] against predict(PCAres)[, 1]. Right, "loadings": PCAres$rotation[, 2] against PCAres$rotation[, 1], with the patients labelled 0 (ALL) or 1 (AML).]

#loadings on the second and third PC
plot(PCAres$rotation[,2],PCAres$rotation[,3])
text(PCAres$rotation[,2],PCAres$rotation[,3], labels=golub.cl,cex=0.8,col="red")
title("loadings")

Left plot: the genes (observations) are plotted on axes that are linear combinations of the patient vectors.
Right plot: loadings of the patients on each of the PCs. Both the ALL and the AML patients contribute to the loadings on the two axes, but for the second component they have opposite signs. So this component probably best captures the class distinction.
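The sign pattern described above (same sign on the first component, opposite signs per class on the second) can be reproduced with simulated data. The sketch below is an illustration only: the data, the class labels cls and all object names are made up and are not the golub data. Each simulated patient is the sum of a level shared by all patients and an effect that enters with opposite sign in the two classes.

```r
# Simulated expression matrix: 200 "genes" (rows) x 10 "patients" (columns).
set.seed(3)
n_genes <- 200
cls <- rep(c(0, 1), each = 5)            # hypothetical class labels of the patients
sgn <- ifelse(cls == 0, 1, -1)           # +1 for class 0, -1 for class 1

base_level <- 3 * rnorm(n_genes)         # gene level shared by all patients
class_effect <- rnorm(n_genes)           # gene effect with opposite sign per class
noise <- matrix(rnorm(n_genes * 10, sd = 0.1), nrow = n_genes, ncol = 10)
E <- outer(base_level, rep(1, 10)) + outer(class_effect, sgn) + noise

# PCA with the genes as observations and the patients as variables.
pca_toy <- prcomp(E, scale. = TRUE)
load1 <- pca_toy$rotation[, 1]
load2 <- pca_toy$rotation[, 2]

# Mean loading per class: same sign on PC1, opposite signs on PC2.
mean_load1 <- tapply(load1, cls, mean)
mean_load2 <- tapply(load2, cls, mean)
print(mean_load1)
print(mean_load2)
```

The overall sign of a principal component is arbitrary, so only the relative signs between the two classes carry meaning.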
[Figure: biplot of PCAres, PC1 against PC2. The genes (observations) are labelled by their row number, the patients (variables) by their class, ALL or AML.]

#plot on the first 3 PCs
install.packages("scatterplot3d", dependencies = TRUE) #only needed if the package is not yet installed
library(scatterplot3d)
scatterplot3d(x = predict(PCAres)[,1], y = predict(PCAres)[,2], z = predict(PCAres)[,3])

In the gene dimension, reducing the data does not help to cluster the genes better. We need more dimensions to observe clusters. Dimensionality reduction is not so useful here because in this direction the number of observations > variables.

Conclusion: PCA can be performed in the two dimensions (scores and loadings are then swapped), but in general PCA is most useful if the number of variables > observations, because then you expect that there are so many variables that more directions of variation are possible, and you want to select the most pronounced direction.
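The remark that scores and loadings are swapped can be made concrete with a small sketch. It rests on an assumption that differs from the exercises: centering and scaling are switched off, so that both analyses decompose exactly the same matrix. In that case the scores of one orientation equal the loadings of the other multiplied by the singular values, up to the arbitrary sign of each component. With centering, as used in the exercises, each orientation centers a different dimension and the correspondence only holds approximately. The matrix A and all object names are made up for illustration.

```r
# Toy matrix: 8 rows x 3 columns (made-up values).
set.seed(4)
A <- matrix(rnorm(8 * 3), nrow = 8, ncol = 3)

# PCA in both orientations, without centering or scaling.
p_rows <- prcomp(A, center = FALSE, scale. = FALSE)     # rows as observations
p_cols <- prcomp(t(A), center = FALSE, scale. = FALSE)  # columns as observations

# Singular values recovered from the standard deviations of each analysis.
d_rows <- p_rows$sdev * sqrt(nrow(A) - 1)
d_cols <- p_cols$sdev * sqrt(ncol(A) - 1)

# Scores of the first analysis versus loadings of the second, rescaled by the
# singular values; abs() removes the arbitrary sign of each component.
scores_rows <- abs(unname(p_rows$x))
loadings_cols_scaled <- abs(unname(p_cols$rotation %*% diag(d_rows)))
swap_match <- all.equal(scores_rows, loadings_cols_scaled)
sv_match <- all.equal(d_rows, d_cols)
print(swap_match)
print(sv_match)
```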