Graphical Exploration

Graphical Exploration of Gene
Expression Data by Spectral Map
Analysis
L. Wouters, H. Göhlmann, L. Bijnens, P. Lewi
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
1
Outline
• Short introduction to microarray technology
• Exploratory multivariate data analysis
• Multivariate projection methods
• Principal component analysis
• Correspondence factor analysis
• Spectral map analysis
• Case data: Leukemia data (Golub, 1999)
• Comparison of multivariate projection methods
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
2
Introduction to DNA Microarray Experiments
Cell
Labelling
mRNA
18µm
Chip
September 2002
Gene
Probe
Hybridization
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
Scan
3
Structure of Microarray Data
p samples
Y=
September 2002
n genes
• p  n
(e.g. p = 38, n = 5327)
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
4
Exploratory Multivariate Analysis
&
Gene Expression Data
•
Supervised learning:
Which genes discriminate known groups of subjects ?
•
Unsupervised learning:
1. Do gene expression profiles reveal groups of patients (class
discovery) ?
2. Which genes correlate with these groups ?
•
•
clustering methods
projection methods
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
5
Multivariate Projection Methods
&
Gene Expression Data
• Principal Component Analysis
Lefkovits, I., et al. (1988)
Hilsenbeck S.G., et al. (1999)
• Correspondence Factor Analysis
Fellenberg, K., et al. (2001)
• Spectral Map Analysis
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
6
Key Steps in Multivariate Projection Methods
Lewi, P.J. (1993)
• Weighting by marginal means or other
• Re-expression
replace each table element by its logarithm
• Closure (row, column, or double)
divide each table element by marginal totals
• Centering (row, column, or double)
subtract from each table element marginal mean
• Normalization (row, column, or global)
divide each table element by standard deviation
• Factorization
generalized SVD
• Projection of scores and loadings with proper scaling
biplot
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
7
Principal Component Analysis
• Pearson (1901), Hotelling (1933)
• Optional log-transformation
• Constant weighting of row and columns
• Column centering
• Column normalization
• Centering and normalization handle tables asymmetrically:
R-mode - Q-mode analysis
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
8
Correspondence Factor Analysis
• Benzécri (1973), Greenacre (1984)
• Weighting of row and columns by marginal row and column totals
• Double closure (rank reduction)
• Double centering
• Global normalization
• Distances are chi-square related
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
9
Spectral Map Analysis
• Lewi (1976)
• Constant weighting of row and columns, or weighting by marginal
row and column totals
• Logarithmic re-expression
• Double centering (rank reduction)
• Global normalization
• Distances are contrasts (log-ratio)
• Areas of symbols are proportional to marginal row-and columntotals or to a selected column
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
10
Spectral Map Analysis
Double Centering
• On log-basis related to double closure
• Treats row- and column-space
equivalently
• Geometrically results in a projection of the
data on a hyperplane orthogonal to the
line of identiy
• Number of dimensions is reduced by one
• Effectively removes a “size” component
from the data
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
11
Properties of Gene Expression Data
• Presence of extreme high values:
logarithmic re-expression
• Importance of level of gene expression, differences at lower levels
are less reliable:
weighting for marginal totals
• Extreme rectangular shape of matrices, e.g. 7000 rows, 20
columns:
• assymetric factor scaling (downplay variability among genes)
• label only most extreme genes
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
12
Case Study
MIT Leukemia Data (Golub, 1999)
• Training sample:
• 38 bone marrow samples
• 27 acute lymphoblastic leukemia ALL (T-lineage, B-lineage)
• 11 acute myeloid leukemia AML
• 6817 genes (5327 used)
• arbitrary choice of 50 informative genes to discriminate
between ALL and AML
• Validation sample:
• 34 bone marrow samples
• 20 ALL (T-lineage, B-lineage), 14 AML
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
13
Class Discovery & Gene Detection
• Use case study to evaluate & validate three multivariate projection
methods:
• Ability to discover (known) classes
• Identify genes related to these classes
• Quantify relationships
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
14
Principal Component Analysis
37
33
30
38
36
28
31
32
2934
35
12
PC2 3 %
2
8
22
25
27
23
10
6
7
3
11
9
17
18
14
4
1
21
AML
ALL-T
ALL-B
19
26
16
5
24
20
15
13
PC1 71 %
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
15
Principal Component Analysis
Conclusions
• Results of PCA obscured by presence of size component, i.e.
overall level of expression of genes
• Poor performance in class discovery
• Useless for detecting genes related to classes
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
16
Correspondence Factor Analysis
20
13
15
21
PC2 10 %
24
26
19
5
AML
1
16
25
17 18
8
4
22
30
33
28
32
34 38
27
7
37
36
12
35
ALL-T
29
31
ALL-B
2
14
11
9
10
3
23
6
PC1 17 %
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
17
Correspondence Factor Analysis
Conclusions
•
Excellent performance in class discovery
•
Difficult to detect genes related to classes
•
Difficult to interpret distances between genes, samples (chisquare values)
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
18
Spectral Map Analysis
Marginal Means Weighted
X82240
M89957
M38690
L08895
Z49194
21
M57466
20U36922
L33930 13
HG3576-HT3779
X62744
19
K01911
26
AML
24
16 15
PC2 12 %
4
1
25
5
22
8 18
27
ALL-T
M87789
M63438
17
7
29
28
12
30
36
35
32
34
31
38
33
ALL-B
37
M57731
M57710
M28130
M84526
2
M83667
14
10
M12886
9
3
23
M27891
11 6
D83920
X05908
X04145
U93049
X00437
M28826
X76223
PC1 14 %
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
19
Spectral Map Analysis
Conclusions
• Excellent performance in class discovery
• Allows to detect genes related to classes
• Distances between objects are contrasts (ratios)
• Suggests data reduction
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
20
Weighted Spectral Map Analysis
0.5 % Most Extreme Genes
X82240
24
8
M89957
19
20
4 26 161
21
15
25 Z49194 K01911
27
L08895
13
22
185
M38690
L33930
U36922
12
AML
M12886
7
PC2 32 %
11
3
M57466
ALL-T
6
X62744
14
35
HG3576-HT3779
23
10
17
X04145
9
X00437
X76223
ALL-B
28
M28826
2
U93049
32
29
34
M87789 31
M63438
38
33
30
36
37
M57731
M57710
M28130
M83667
M27891
D83920
X05908
M84526
PC1 43 %
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
21
Positioning of Validation Set
48
X82240
39
69
M89957
46
49
Z49194
K01911
4543 56
L08895 70 72
44
41
59
71
L33930
55
M38690
42
U36922
68
PC2 32 %
AML
M12886
47
ALL-T
M57466
40
X62744
66
X04145
61
HG3576-HT3779
52
60
65 54
62
58
64
53
50
63
M87789
M63438
X00437
X76223
ALL-B
M28826
67
U93049
57
51
M57731
M57710
M28130
M83667
M27891
D83920
X05908
M84526
PC1 43 %
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
22
Golub Data Set Conclusion
• Weighted SMA allows to reduce data from 5327 genes to 27
genes without loss of important information
• Positioning validation data indicates 27 genes are also predictive
for new cases
• Further data reduction possible ?
• Quantify samples for gene specificity ?
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
23
Spectral Mapping as a Tool for
Quantitative Differential Gene Expression
AML
ALL-T
ALL-B
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
24
Conclusions
• Weighted SMA outperforms the other projection methods in:
•
•
•
•
•
Visualizing data
Class detection of biological samples
Identifying important genes, confirmed by literature search
Reducing the data
Identifying & interpreting relations between genes and
subjects
• Quantifying differential gene expression
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
25
SPM Library
• Principal component analysis
PCA<-spectralmap(DATA,center="column",normal="column")
plot(PCA,scale="uvc",label.tol=c(1,1),col.group=GRP)
• Correspondence analysis
CFA<spectralmap(DATA,logtrans=False,row.weight="mean”,
col.weight="mean",closure="double")
plot(CFA,scale="uvc",label.tol=c(1,1),col.group=GRP)
• Spectral map analysis, weighted by marginal totals
SPM<-spectralmap(DATA,row.weight="mean“,col.weight=“mean”)
plot(SPM,scale="uvc",col.group=GRP)
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
26
References
• Wouters, L., Göhlmann, H.W., Bijnens, L., Kass, S.U., Molenberghs, G.,
Lewi, P.J. (2002). Graphical Exploration of Gene Expression Data: A
Comparative Study of Three Multivariate Methods. submitted Biometrics.
• SPM-library availability:
email: [email protected]
http://alpha.luc.ac.be/~lucp1456/
September 2002
Center for Statistics, transnational University Limburg, Hasselt, Belgium
and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
27