Graphical Exploration of Gene Expression Data by Spectral Map Analysis L. Wouters, H. Göhlmann, L. Bijnens, P. Lewi September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 1 Outline • Short introduction to microarray technology • Exploratory multivariate data analysis • Multivariate projection methods • Principal component analysis • Correspondence factor analysis • Spectral map analysis • Case data: Leukemia data (Golub, 1999) • Comparison of multivariate projection methods September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 2 Introduction to DNA Microarray Experiments Cell Labelling mRNA 18µm Chip September 2002 Gene Probe Hybridization Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium Scan 3 Structure of Microarray Data p samples Y= September 2002 n genes • p n (e.g. p = 38, n = 5327) Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 4 Exploratory Multivariate Analysis & Gene Expression Data • Supervised learning: Which genes discriminate known groups of subjects ? • Unsupervised learning: 1. Do gene expression profiles reveal groups of patients (class discovery) ? 2. Which genes correlate with these groups ? • • clustering methods projection methods September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 5 Multivariate Projection Methods & Gene Expression Data • Principal Component Analysis Lefkovits, I., et al. (1988) Hilsenbeck S.G., et al. (1999) • Correspondence Factor Analysis Fellenberg, K., et al. (2001) • Spectral Map Analysis September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 6 Key Steps in Multivariate Projection Methods Lewi, P.J. (1993) • Weighting by marginal means or other • Re-expression replace each table element by its logarithm • Closure (row, column, or double) divide each table element by marginal totals • Centering (row, column, or double) subtract from each table element marginal mean • Normalization (row, column, or global) divide each table element by standard deviation • Factorization generalized SVD • Projection of scores and loadings with proper scaling biplot September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 7 Principal Component Analysis • Pearson (1901), Hotelling (1933) • Optional log-transformation • Constant weighting of row and columns • Column centering • Column normalization • Centering and normalization handle tables asymmetrically: R-mode - Q-mode analysis September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 8 Correspondence Factor Analysis • Benzécri (1973), Greenacre (1984) • Weighting of row and columns by marginal row and column totals • Double closure (rank reduction) • Double centering • Global normalization • Distances are chi-square related September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 9 Spectral Map Analysis • Lewi (1976) • Constant weighting of row and columns, or weighting by marginal row and column totals • Logarithmic re-expression • Double centering (rank reduction) • Global normalization • Distances are contrasts (log-ratio) • Areas of symbols are proportional to marginal row-and columntotals or to a selected column September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 10 Spectral Map Analysis Double Centering • On log-basis related to double closure • Treats row- and column-space equivalently • Geometrically results in a projection of the data on a hyperplane orthogonal to the line of identiy • Number of dimensions is reduced by one • Effectively removes a “size” component from the data September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 11 Properties of Gene Expression Data • Presence of extreme high values: logarithmic re-expression • Importance of level of gene expression, differences at lower levels are less reliable: weighting for marginal totals • Extreme rectangular shape of matrices, e.g. 7000 rows, 20 columns: • assymetric factor scaling (downplay variability among genes) • label only most extreme genes September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 12 Case Study MIT Leukemia Data (Golub, 1999) • Training sample: • 38 bone marrow samples • 27 acute lymphoblastic leukemia ALL (T-lineage, B-lineage) • 11 acute myeloid leukemia AML • 6817 genes (5327 used) • arbitrary choice of 50 informative genes to discriminate between ALL and AML • Validation sample: • 34 bone marrow samples • 20 ALL (T-lineage, B-lineage), 14 AML September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 13 Class Discovery & Gene Detection • Use case study to evaluate & validate three multivariate projection methods: • Ability to discover (known) classes • Identify genes related to these classes • Quantify relationships September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 14 Principal Component Analysis 37 33 30 38 36 28 31 32 2934 35 12 PC2 3 % 2 8 22 25 27 23 10 6 7 3 11 9 17 18 14 4 1 21 AML ALL-T ALL-B 19 26 16 5 24 20 15 13 PC1 71 % September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 15 Principal Component Analysis Conclusions • Results of PCA obscured by presence of size component, i.e. overall level of expression of genes • Poor performance in class discovery • Useless for detecting genes related to classes September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 16 Correspondence Factor Analysis 20 13 15 21 PC2 10 % 24 26 19 5 AML 1 16 25 17 18 8 4 22 30 33 28 32 34 38 27 7 37 36 12 35 ALL-T 29 31 ALL-B 2 14 11 9 10 3 23 6 PC1 17 % September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 17 Correspondence Factor Analysis Conclusions • Excellent performance in class discovery • Difficult to detect genes related to classes • Difficult to interpret distances between genes, samples (chisquare values) September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 18 Spectral Map Analysis Marginal Means Weighted X82240 M89957 M38690 L08895 Z49194 21 M57466 20U36922 L33930 13 HG3576-HT3779 X62744 19 K01911 26 AML 24 16 15 PC2 12 % 4 1 25 5 22 8 18 27 ALL-T M87789 M63438 17 7 29 28 12 30 36 35 32 34 31 38 33 ALL-B 37 M57731 M57710 M28130 M84526 2 M83667 14 10 M12886 9 3 23 M27891 11 6 D83920 X05908 X04145 U93049 X00437 M28826 X76223 PC1 14 % September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 19 Spectral Map Analysis Conclusions • Excellent performance in class discovery • Allows to detect genes related to classes • Distances between objects are contrasts (ratios) • Suggests data reduction September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 20 Weighted Spectral Map Analysis 0.5 % Most Extreme Genes X82240 24 8 M89957 19 20 4 26 161 21 15 25 Z49194 K01911 27 L08895 13 22 185 M38690 L33930 U36922 12 AML M12886 7 PC2 32 % 11 3 M57466 ALL-T 6 X62744 14 35 HG3576-HT3779 23 10 17 X04145 9 X00437 X76223 ALL-B 28 M28826 2 U93049 32 29 34 M87789 31 M63438 38 33 30 36 37 M57731 M57710 M28130 M83667 M27891 D83920 X05908 M84526 PC1 43 % September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 21 Positioning of Validation Set 48 X82240 39 69 M89957 46 49 Z49194 K01911 4543 56 L08895 70 72 44 41 59 71 L33930 55 M38690 42 U36922 68 PC2 32 % AML M12886 47 ALL-T M57466 40 X62744 66 X04145 61 HG3576-HT3779 52 60 65 54 62 58 64 53 50 63 M87789 M63438 X00437 X76223 ALL-B M28826 67 U93049 57 51 M57731 M57710 M28130 M83667 M27891 D83920 X05908 M84526 PC1 43 % September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 22 Golub Data Set Conclusion • Weighted SMA allows to reduce data from 5327 genes to 27 genes without loss of important information • Positioning validation data indicates 27 genes are also predictive for new cases • Further data reduction possible ? • Quantify samples for gene specificity ? September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 23 Spectral Mapping as a Tool for Quantitative Differential Gene Expression AML ALL-T ALL-B September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 24 Conclusions • Weighted SMA outperforms the other projection methods in: • • • • • Visualizing data Class detection of biological samples Identifying important genes, confirmed by literature search Reducing the data Identifying & interpreting relations between genes and subjects • Quantifying differential gene expression September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 25 SPM Library • Principal component analysis PCA<-spectralmap(DATA,center="column",normal="column") plot(PCA,scale="uvc",label.tol=c(1,1),col.group=GRP) • Correspondence analysis CFA<spectralmap(DATA,logtrans=False,row.weight="mean”, col.weight="mean",closure="double") plot(CFA,scale="uvc",label.tol=c(1,1),col.group=GRP) • Spectral map analysis, weighted by marginal totals SPM<-spectralmap(DATA,row.weight="mean“,col.weight=“mean”) plot(SPM,scale="uvc",col.group=GRP) September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 26 References • Wouters, L., Göhlmann, H.W., Bijnens, L., Kass, S.U., Molenberghs, G., Lewi, P.J. (2002). Graphical Exploration of Gene Expression Data: A Comparative Study of Three Multivariate Methods. submitted Biometrics. • SPM-library availability: email: [email protected] http://alpha.luc.ac.be/~lucp1456/ September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 27
© Copyright 2026 Paperzz