Simple Decision Rules for Classifying Human Cancers from Gene Expression Profiles
Aik Choon TAN
Post-Doc Research Fellow
[email protected]
Prof. Raimond L. Winslow [email protected], Director, ICM & CCBM,
Prof. Donald Geman [email protected],
Prof. Daniel Naiman [email protected],
Lei Xu [email protected],
Troy Anderson [email protected]
The Institute for Computational Medicine (ICM) and
Center for Cardiovascular Bioinformatics and Modeling (CCBM),
Johns Hopkins University
Biomarkers Discovery Workflow
[Figure: workflow from disease and patients through sample collection to candidate biomarkers, decision rules, follow-up study and clinical applications. Transcriptomics pipeline: gene expression profiling experiments, stored in and queried from MAGE-DB2. Proteomics pipeline: mass spectrometry and difference gel electrophoresis, stored in and queried from PROTEIN-DB2. Both pipelines feed machine learning (relative expression reversal classifiers). Available at ICM/CCBM.]
AC TAN 2006
Outline
1) Relative Expression Reversal Classifiers
• TSP classifier
• k-TSP classifier
2) Results on binary & multi-class disease gene
expression classification problems
3) Data Integration and Cross-platform analysis
4) Applications to other “–omics” data
5) Conclusions
Disease Classification
From http://research.dfci.harvard.edu/korsmeyer/Home.html
Microarray Gene Expression Profiles
(Golub et al 1999)
AML
acute myeloid leukemia
(myeloid precursor)
ALL
acute lymphoblastic leukemia
(lymphoid precursors)
Learning Approaches
(Ramaswamy and Golub 2002)
Gene Expression Profiles
P × N matrix: P genes (indexed 1,…,P) by N arrays (indexed 1,…,N); class label Y ∈ {Cancer, Normal}

Geneid   Array 1   Array 2   …   Array N
Label    Cancer    Normal    …   Cancer
g1       103.02    58.79     …   101.54
g2       40.55     1246.87   …   1432.12
…        …         …         …   …
gP       78.13     66.25     …   823.09
Microarray Data Analysis
• A P × N matrix where
– P is the number of genes
– N is the number of experiments
– The columns are “gene expression profiles”
Sample Size Dilemma
• Small N (typically tens to hundreds)
• Large P (typically thousands)
• Consequence: Standard methods in
machine learning often lead to over-fitting
and inflated estimates of performance.
Interpretability Dilemma
(Biological Perspective)
• The “decision boundary” generated by
standard machine learning methods is
often highly complex.
• Examples: support vector machines,
neural networks, random forests, nearest
neighbors.
• Consequence: Decision-making is a
mystery and does not readily generate
hypotheses or suggest follow-up studies.
Relative Expression Reversal Classifiers
• Pairwise rank-based comparisons (relative
expression values within each array)
• Generates accurate and simple decision rules
– TSP classifier: Top Scoring Pair
– k-TSP classifier: k-disjoint Top Scoring Pairs
• Data driven, parameter-free learning algorithm
• Performance comparable to or exceeds that of
other machine learning methods
• Easy to interpret, facilitating follow-up study
(small number of genes)
(Tan et al., 2005, Bioinformatics, 21:3896-3904)
Rank-based Classification
• Novelty: Replace the measured expression values by
their ranks within profiles, hence obtaining invariance to
normalization.
• Example: Differentiate between classes by finding pairs
of genes whose ordering typically changes from Normal
to Disease.
• Simple Interpretation: Inversion of mRNA (protein)
abundance.
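The invariance claim can be checked with a minimal sketch. The function name `ranks_within_profile` and the intensity values are illustrative assumptions, not part of the original slides; rank 1 is assigned to the highest value, and any strictly increasing transformation of a profile leaves the ranks unchanged.

```python
import math

def ranks_within_profile(values):
    """Rank the genes of one array; rank 1 = highest expression (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for rank, gene_index in enumerate(order, start=1):
        ranks[gene_index] = rank
    return ranks

# Illustrative raw intensities for five genes on one array (assumed values).
profile = [1000.0, 289.0, 634.0, 367.0, 2500.0]

# Two monotone "normalizations": an affine rescaling and a log transform.
rescaled = [2.0 * v + 10.0 for v in profile]
logged = [math.log2(v) for v in profile]

raw_ranks = ranks_within_profile(profile)
print(raw_ranks)
print(raw_ranks == ranks_within_profile(rescaled) == ranks_within_profile(logged))
```

Because only the within-array ordering is used, any per-array normalization that preserves order gives the same features.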
Statistical Formulation
• The expression profile is a random vector
X = (X1,…,XP)
• The true class is also a r.v. Y, say Y=1
(Disease) or Y=2 (Normal)
• A classifier is a mapping f from X to {1,2}.
• Training data: A P × N matrix S whose
columns represent N = N1+N2 samples of
(X, Y), with N1 (resp. N2) samples for
which Y=1 (Y=2).
Statistical Formulation (cont)
• Learning algorithm: A mapping from the
training set S to a classifier f based on S.
• Generalization error: e(f) = P(f(X) ≠ Y).
This depends on S and the distribution of
(X,Y) and is extremely hard to estimate.
• Dilemmas:
– N << P
– f(X) is too complex and hard to interpret
Gene Expression Comparisons
• Features: $Z_{ij} = \mathbf{1}\{X_i < X_j\}$, $1 \le i < j \le P$.
• Feature Score:
$$\Delta_{ij} = \left| P(X_i < X_j \mid Y = 1) - P(X_i < X_j \mid Y = 2) \right| \approx \left| \frac{N_{ij}^{(1)}}{N_1} - \frac{N_{ij}^{(2)}}{N_2} \right|,$$
where
$$N_{ij}^{(k)} = \left| \{ 1 \le m \le N : Y_m = k,\ X_{im} < X_{jm} \} \right|, \quad k = 1, 2.$$
TSP Algorithm

1) Raw expression values
        n1       n2       n3       n4
        Cancer   Cancer   Normal   Normal
g1      1000     789      356      45
g2      289      150      500      1000
g3      634      450      220      150
g4      367      455      150      50
g5      2500     1800     1900     2100

2) Ranks within each array (rank 1 = highest expression)
        n1       n2       n3       n4
        Cancer   Cancer   Normal   Normal
g1      2        2        3        5
g2      5        5        2        2
g3      3        4        4        3
g4      4        3        5        4
g5      1        1        1        1

3) Score each gene pair (comparisons are made on the rank values)
P(g1 > g2 | Cancer) = 0/2 = 0
P(g1 > g2 | Normal) = 2/2 = 1
Δ12 = |P(g1 > g2 | Cancer) – P(g1 > g2 | Normal)| = |0 – 1| = 1
P(g1 > g3 | Cancer) = 0/2 = 0
P(g1 > g3 | Normal) = 1/2 = 0.5
Δ13 = |P(g1 > g3 | Cancer) – P(g1 > g3 | Normal)| = |0 – 0.5| = 0.5
P(g1 > g4 | Cancer) = 0/2 = 0
P(g1 > g4 | Normal) = 1/2 = 0.5
Δ14 = |P(g1 > g4 | Cancer) – P(g1 > g4 | Normal)| = |0 – 0.5| = 0.5
P(g1 > g5 | Cancer) = 2/2 = 1
P(g1 > g5 | Normal) = 2/2 = 1
Δ15 = |P(g1 > g5 | Cancer) – P(g1 > g5 | Normal)| = |1 – 1| = 0
…
P(g4 > g5 | Cancer) = 2/2 = 1
P(g4 > g5 | Normal) = 2/2 = 1
Δ45 = |P(g4 > g5 | Cancer) – P(g4 > g5 | Normal)| = |1 – 1| = 0

4) Sort the pairs by score
High Δ
Δ12 = 1
…
Δ13 = 0.5
Δ14 = 0.5
…
Δ15 = 0
Δ45 = 0
Low Δ
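The toy example on this slide can be reproduced with a short sketch. The helper names `rank_columns` and `pair_score` are assumptions introduced for illustration; the data are the five genes and four arrays shown above, ranked within each array with rank 1 as the highest expression, and each pair is scored by the absolute difference of the class-conditional frequencies of "rank of gi > rank of gj".

```python
from itertools import combinations

genes = ["g1", "g2", "g3", "g4", "g5"]
labels = ["Cancer", "Cancer", "Normal", "Normal"]  # arrays n1..n4
expression = {
    "g1": [1000, 789, 356, 45],
    "g2": [289, 150, 500, 1000],
    "g3": [634, 450, 220, 150],
    "g4": [367, 455, 150, 50],
    "g5": [2500, 1800, 1900, 2100],
}

def rank_columns(expr, gene_names):
    """Rank genes within each array; rank 1 = highest expression."""
    n_arrays = len(expr[gene_names[0]])
    ranks = {g: [0] * n_arrays for g in gene_names}
    for m in range(n_arrays):
        ordered = sorted(gene_names, key=lambda g: expr[g][m], reverse=True)
        for r, g in enumerate(ordered, start=1):
            ranks[g][m] = r
    return ranks

def pair_score(ranks, labels, gi, gj):
    """|P(R_gi > R_gj | Cancer) - P(R_gi > R_gj | Normal)| estimated by counting."""
    def frequency(cls):
        members = [m for m, lab in enumerate(labels) if lab == cls]
        return sum(ranks[gi][m] > ranks[gj][m] for m in members) / len(members)
    return abs(frequency("Cancer") - frequency("Normal"))

ranks = rank_columns(expression, genes)
scores = {pair: pair_score(ranks, labels, *pair) for pair in combinations(genes, 2)}
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, score)
```

On this tiny data set more than one pair reaches the maximum score of 1, which is why a tie-breaking step is needed before a single top scoring pair can be chosen.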
TSP Classifier
• Select only the top scoring pairs:
– { (i*, j*): Δi*j* = Δmax }
• TSP classifier (hTSP) is based on these pairs:
– Example: Let all the top scoring pairs “vote” (Geman et al, 2004)
– Example: Select one unique top scoring pair, based on maximizing the difference in
ranks (i, j) (Tan et al, 2005)
• Prediction: Suppose Pij(Normal) > Pij(Disease), and let Xnew be a new profile:
ynew = hTSP(Xnew) = Normal, if Ri,new > Rj,new; Disease, otherwise. (1)
– If, on the other hand, Pij(Disease) > Pij(Normal), then the decision rule is
reversed.
(Tan et al., 2005, Bioinformatics, 21:3896-3904)
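Rule (1) and its reversed form can be written as a small function. This is a sketch under the assumption that Pij(C) denotes the training frequency of "rank of gene i > rank of gene j" in class C; the function name `tsp_predict` and the example numbers are illustrative, not taken from the slides.

```python
def tsp_predict(rank_i, rank_j, p_normal, p_disease):
    """Classify a new profile from the ranks of the two genes in the selected pair.

    p_normal / p_disease: training frequencies of (rank_i > rank_j) in each class.
    """
    i_exceeds_j = rank_i > rank_j
    if p_normal > p_disease:
        # Rule (1): the ordering R_i > R_j is typical of Normal.
        return "Normal" if i_exceeds_j else "Disease"
    # Reversed rule: the ordering R_i > R_j is typical of Disease.
    return "Disease" if i_exceeds_j else "Normal"

# Assumed training frequencies for a pair whose ordering flips between classes.
print(tsp_predict(rank_i=4, rank_j=2, p_normal=1.0, p_disease=0.0))
print(tsp_predict(rank_i=1, rank_j=3, p_normal=1.0, p_disease=0.0))
```

The classifier stores only the pair of gene indices and the direction of the comparison, which is what makes the resulting rule easy to read.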
Initial Conclusions
• There may be many pairs of genes with an
informative ordering
– Motivation for k-TSP
• The TSP classifier is sensitive to S for
small samples but invariant to
normalization
– Motivation for “data integration”
k-TSP Classifier
• Uses exactly k top disjoint pairs in prediction.
• k is determined by internal cross-validation
• Ensemble learning – to combine the discriminating power of
many “weaker” rules to make more reliable predictions.
• Prediction:
– Suppose Xnew = new profile; each gene pair (iu, ju), u = 1,…, k, votes
according to rule (1).
– The k-TSP classifier hk-TSP employs an unweighted majority voting
procedure to obtain the final prediction of ynew.
(Tan et al., 2005, Bioinformatics, 21:3896-3904)
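The unweighted majority vote can be sketched as follows. The function `ktsp_predict` and the rule tuple layout are assumptions made for illustration; the three rules are taken from the leukemia k-TSP rules reported later in these slides, and the expression values are hypothetical.

```python
from collections import Counter

def ktsp_predict(profile, rules):
    """Each rule (gene_a, gene_b, class_if_a_ge_b, class_otherwise) casts one vote."""
    votes = Counter(
        if_true if profile[gene_a] >= profile[gene_b] else otherwise
        for gene_a, gene_b, if_true, otherwise in rules
    )
    # With an odd number of rules and two classes there is always a strict majority.
    return votes.most_common(1)[0][0]

rules = [
    ("SPTAN1", "CD33", "ALL", "AML"),
    ("HA-1", "ZYX", "ALL", "AML"),
    ("ATP2A3", "CST3", "ALL", "AML"),
]

# Hypothetical expression values for one new sample.
new_profile = {"SPTAN1": 820, "CD33": 310, "HA-1": 150, "ZYX": 900,
               "ATP2A3": 640, "CST3": 200}

prediction = ktsp_predict(new_profile, rules)
print(prediction)
```

Here two of the three rules vote ALL and one votes AML, so the sample is assigned to ALL.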
Microarray Data Sets
(Binary class Problems)
Data set    Platform   # genes   # samples C1   # samples C2   Reference
Colon       cDNA       2,000     40 (T)         22 (N)         (Alon et al. 1998)
Leukemia    Affy       7,129     25 (AML)       47 (ALL)       (Golub et al. 1999)
CNS         Affy       7,129     25 (C)         9 (D)          (Pomeroy et al. 2002)
DLBCL       Affy       7,129     58 (D)         19 (F)         (Shipp et al. 2002)
Lung        Affy       12,533    150 (A)        31 (M)         (Gordon et al. 2002)
Prostate1   Affy       12,600    52 (T)         50 (N)         (Singh et al. 2002)
Prostate2   Affy       12,625    38 (T)         50 (N)         (Stuart et al. 2004)
Prostate3   Affy       12,626    24 (T)         9 (N)          (Welsh et al. 2001)
GCM         Affy       16,063    190 (C)        90 (N)         (Ramaswamy et al. 2001)
(Multi-class Problems)
Data set    Platform   # classes   # genes   # training samples   # testing samples   Reference
Leukemia1   Affy       3           7,129     38                   34                  (Golub et al. 1999)
Lung1       Affy       3           7,129     64                   32                  (Beer et al. 2002)
Leukemia2   Affy       3           12,582    57                   15                  (Armstrong et al. 2002)
SRBCT       cDNA       4           2,308     63                   20                  (Khan et al. 2001)
Breast      Affy       5           9,216     54                   30                  (Perou et al. 2000)
Lung2       Affy       5           12,600    136                  67                  (Bhattacharjee et al. 2001)
DLBCL       cDNA       6           4,026     58                   30                  (Alizadeh et al. 2000)
Leukemia3   Affy       7           12,558    215                  112                 (Yeoh et al. 2002)
Cancers     Affy       11          12,533    100                  74                  (Su et al. 2001)
GCM         Affy       14          16,063    144                  46                  (Ramaswamy et al. 2001)
Results
(LOOCV Binary Class Problems)
LOOCV accuracy (%)
Data set    TSP     k-TSP   DT      NB       k-NN    SVM      PAM
Leukemia    93.80   95.83   73.61   100.00   84.72   98.61    97.22
CNS         77.90   97.10   67.65   82.35    76.47   82.35    82.35
DLBCL       98.10   97.40   80.52   80.52    84.42   97.40    85.71
Colon       91.10   90.30   80.65   58.06    74.19   82.26    85.48
Prostate1   95.10   91.18   87.25   62.75    76.47   91.18    91.18
Prostate2   67.60   75.00   64.77   73.86    69.32   76.14    79.55
Prostate3   97.00   97.00   84.85   90.91    87.88   100.00   100.00
Lung        98.30   98.90   96.13   97.79    98.34   99.45    99.45
GCM         75.40   85.40   77.86   84.29    82.86   93.21    79.29
Average     88.26   92.01   79.25   81.17    81.63   91.18    88.91
Number of Informative Genes
Data set    TSP   k-TSP   DT   PAM
Leukemia    2     18      2    2296
CNS         2     10      2    4
DLBCL       2     2       3    17
Colon       2     2       3    15
Prostate1   2     2       4    47
Prostate2   2     18      4    13
Prostate3   2     2       1    701
Lung        2     10      3    9
GCM         2     10      14   47
(Tan et al., 2005, Bioinformatics, 21:3896-3904)
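Leave-one-out cross-validation (LOOCV) for a pair-based classifier can be sketched as below. This is an illustrative implementation, not the authors' code: the functions `train_tsp`, `predict_tsp` and `loocv_accuracy` are assumed names, ties between equally scored pairs are broken by taking the first pair found, and the pair is re-selected inside every fold so that the held-out sample never influences gene selection. It assumes every training fold still contains both classes.

```python
from itertools import combinations

def train_tsp(X, y):
    """X: list of samples (lists of P values); y: labels in {1, 2}. Returns (i, j, cls)."""
    n1 = sum(1 for lab in y if lab == 1)
    n2 = len(y) - n1
    best = None
    for i, j in combinations(range(len(X[0])), 2):
        p1 = sum(x[i] < x[j] for x, lab in zip(X, y) if lab == 1) / n1
        p2 = sum(x[i] < x[j] for x, lab in zip(X, y) if lab == 2) / n2
        score = abs(p1 - p2)
        if best is None or score > best[0]:
            # Remember which class is predicted when x[i] < x[j].
            best = (score, i, j, 1 if p1 > p2 else 2)
    return best[1:]

def predict_tsp(model, x):
    i, j, class_if_less = model
    return class_if_less if x[i] < x[j] else 3 - class_if_less

def loocv_accuracy(X, y):
    correct = 0
    for held_out in range(len(X)):
        X_train = X[:held_out] + X[held_out + 1:]
        y_train = y[:held_out] + y[held_out + 1:]
        model = train_tsp(X_train, y_train)  # selection repeated in every fold
        correct += predict_tsp(model, X[held_out]) == y[held_out]
    return correct / len(X)

# Toy data: four arrays (rows) by five genes; labels 1 = Cancer, 2 = Normal.
X = [
    [1000, 289, 634, 367, 2500],
    [789, 150, 450, 455, 1800],
    [356, 500, 220, 150, 1900],
    [45, 1000, 150, 50, 2100],
]
y = [1, 1, 2, 2]
print(loocv_accuracy(X, y))
```

Exhaustive pair search costs on the order of P² comparisons per fold, so for thousands of genes a vectorized implementation would be preferable to this pure-Python loop.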
(a) TSP (ALL vs AML)
IF SPTAN1 ≥ CD33* THEN ALL; ELSE AML       ∆ = 0.9787
(b) k-TSP (ALL vs AML)
IF SPTAN1 ≥ CD33* THEN ALL; ELSE AML       ∆ = 0.9787
IF HA-1 ≥ ZYX* THEN ALL; ELSE AML          ∆ = 0.9787
IF TCF3* > APLP2 THEN ALL; ELSE AML        ∆ = 0.9574
IF ATP2A3* ≥ CST3* THEN ALL; ELSE AML      ∆ = 0.9387
IF DGKD > MGST1 THEN ALL; ELSE AML         ∆ = 0.9387
IF CCND3* ≥ NPC2 THEN ALL; ELSE AML        ∆ = 0.9387
IF TOP2B* > PLCB2 THEN ALL; ELSE AML       ∆ = 0.9387
IF Macmarcks ≥ CTSD* THEN ALL; ELSE AML    ∆ = 0.9362
IF PSMB8 ≥ DF* THEN ALL; ELSE AML          ∆ = 0.9200
[Figure: heat maps of normalized expression (low to high) for the selected genes in ALL and AML samples]
* Genes previously identified by Golub et al (1999)
(Tan et al., 2005, Bioinformatics, 21:3896-3904)
Results
(Test Accuracy for Multi-Class Problems)
Test accuracy (%)
Data set   HC-TSP   HC-k-TSP   DT      NB      k-NN    1-vs-1-SVM   PAM
Leuk1      97.06    97.06      85.29   85.29   67.65   79.41        97.06
Lung1      71.88    78.13      78.13   81.25   75.00   87.50        78.13
Leuk2      80.00    100        80.00   100     86.67   100          93.33
SRBCT      95.00    100        75.00   60.00   30.00   100          95.00
Breast     66.67    66.67      73.33   66.67   63.33   83.33        93.33
Lung2      83.58    94.03      88.06   88.06   88.06   97.01        100
DLBCL      83.33    83.33      86.67   86.67   93.33   100          90.00
Leuk3      77.68    82.14      75.89   32.14   75.89   84.82        93.75
Cancers    74.32    82.43      68.92   79.73   64.86   83.78        87.84
GCM        52.17    67.39      52.17   52.17   34.78   65.22        56.52
Average    78.17    85.12      76.35   73.20   67.96   88.11        88.50
Number of Informative Genes
Data set   HC-TSP   HC-k-TSP   DT   PAM
Leuk1      4        36         2    44
Lung1      4        20         4    13
Leuk2      4        24         2    62
SRBCT      6        30         3    285
Breast     8        24         4    4822
Lung2      8        28         5    614
DLBCL      10       46         5    3949
Leuk3      12       64         16   3338
Cancers    20       128        10   2008
GCM        26       134        18   1253
(Tan et al., 2005, Bioinformatics, 21:3896-3904)
h1: ALL vs {AML, MLL}
IF WFS1* ≥ MEIS1 THEN ALL; ELSE {AML,MLL}
IF DNTT* ≥ LGALS1* THEN ALL; ELSE {AML,MLL}
IF MYLK* > LGALS1* THEN ALL; ELSE {AML,MLL}
h2: AML vs MLL
IF SCRN1 ≥ HIST2H4 THEN AML; ELSE MLL
IF ANPEP* ≥ P29 THEN AML; ELSE MLL
IF CHRNA7 > TLR1 THEN AML; ELSE MLL
IF ATF5 > NFYC THEN AML; ELSE MLL
IF C6orf106 ≥ MEF2C THEN AML; ELSE MLL
IF PHGDH ≥ CTGF THEN AML; ELSE MLL
IF STAT4 ≥ MEIS1 THEN AML; ELSE MLL
IF AMELX ≥ PQBP1 THEN AML; ELSE MLL
IF DVL1 > ZNF148 THEN AML; ELSE MLL
[Figure: classification tree with nodes h1 and h2, with heat maps of normalized expression (low to high) for the selected genes]
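The two-level scheme on this slide, where one node separates ALL from {AML, MLL} and a second node separates AML from MLL, can be sketched as follows. The function names and the tuple layout are assumptions; the first node uses the three ALL rules from the slide, the second uses only the first three of the nine AML/MLL rules to keep the example short, and all expression values are hypothetical.

```python
import operator

def majority_vote(profile, rules):
    """Each rule (gene_a, compare, gene_b, class_if_true, class_otherwise) casts one vote."""
    tally = {}
    for gene_a, compare, gene_b, if_true, otherwise in rules:
        label = if_true if compare(profile[gene_a], profile[gene_b]) else otherwise
        tally[label] = tally.get(label, 0) + 1
    return max(tally, key=tally.get)

GROUP = "{AML,MLL}"
H1_RULES = [
    ("WFS1", operator.ge, "MEIS1", "ALL", GROUP),
    ("DNTT", operator.ge, "LGALS1", "ALL", GROUP),
    ("MYLK", operator.gt, "LGALS1", "ALL", GROUP),
]
H2_RULES = [  # subset of the AML vs MLL rules, for brevity
    ("SCRN1", operator.ge, "HIST2H4", "AML", "MLL"),
    ("ANPEP", operator.ge, "P29", "AML", "MLL"),
    ("CHRNA7", operator.gt, "TLR1", "AML", "MLL"),
]

def classify(profile):
    """Apply the first node; descend to the second node only if ALL is rejected."""
    first = majority_vote(profile, H1_RULES)
    return first if first == "ALL" else majority_vote(profile, H2_RULES)

# Hypothetical samples.
sample_a = {"WFS1": 500, "MEIS1": 40, "DNTT": 800, "LGALS1": 100, "MYLK": 200}
sample_b = {"WFS1": 50, "MEIS1": 400, "DNTT": 30, "LGALS1": 900, "MYLK": 80,
            "SCRN1": 300, "HIST2H4": 100, "ANPEP": 700, "P29": 120,
            "CHRNA7": 40, "TLR1": 210}
print(classify(sample_a), classify(sample_b))
```

Each node is an ordinary binary k-TSP classifier, so a multi-class problem is reduced to a short sequence of binary decisions.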
“Direct” Data Integration
(Lei Xu et al, 2005, Bioinformatics, 21:3905-3911)
[Figure: data sets from different laboratories (Lab A, Lab B, Lab C, Lab X, Lab Y)]
Data Sets
Data Set     Role           Microarray Platform   Number of Probe Sets   No. of Normal Samples   No. of Cancer Samples
Singh        Training Set   Affy. HG_U95Av2       12600                  50                      52
Stuart       Training Set   Affy. HG_U95Av2       12625                  50                      38
Welsh        Training Set   Affy. HG_U95Av2       12626                  9                       24
LaTulippe    Testing Set    Affy. HG_U95Av2       12626                  3                       23
Lapointe     Testing Set    Spotted cDNA          44160/43008*           41                      62
* 22 samples (9 normal / 13 cancer) have 44160 probes and 81 samples (32 normal / 49 cancer) have 43008 probes.
(Lei Xu et al, 2005, Bioinformatics, 21:3905-3911)
TSPs from Data Integration
Training Data Set     Sample Size   Probe Set ID of TSP (HG_U95Av2)   Gene Symbol of TSP   Score of TSP   Classification Accuracy (%)
Welsh                 33            39608_at, 32526_at                SIM2, JAM3           1.00           97.0
Stuart                88            41732_at, 456_at                  CTNNB1, SMARCD3      0.74           69.3
Singh                 102           40282_s_at, 2035_s_at             DF, ENO1             0.90           95.1
Welsh_Stuart*         121           31971_at, 34213_at                TP73L, KIBRA         0.79           77.7
Welsh_Singh           135           37639_at, 32198_at                HPN, COMMD4          0.88           83.7
Stuart_Singh          190           37639_at, 41222_at                HPN, STAT6           0.75           86.8
Welsh_Stuart_Singh    223           37639_at, 41222_at                HPN, STAT6           0.78           88.8
* Welsh_Stuart is the integrated data set of Welsh and Stuart data sets. Other integrated data sets use similar symbols.
(Lei Xu et al, 2005, Bioinformatics, 21:3905-3911)
Results on Test Set
Testing Data Set   Microarray Platform   No. of Normal Samples   No. of Cancer Samples   Accuracy (%)   Sensitivity (%)   Specificity (%)
LaTulippe          Affy. HG_U95Av2       3                       23                      96.2           95.7              100
Lapointe           Spotted cDNA          41                      61*                     93.1           90.2              97.6
Overall            Cross-platform        44                      84                      93.8           91.7              97.7
* One of the cancer samples has missing value for HPN and is removed from the testing set.
Comparisons of Marker TSP with Individual TSPs
Testing Data Set        TSP                  Accuracy (%)   Sensitivity (%)   Specificity (%)
LaTulippe (HG_U95Av2)   Welsh                69.2           69.6              66.7
                        Stuart               84.5           82.6              100
                        Singh                88.5           87.0              100
                        Welsh_Stuart_Singh   96.2           95.7              100
Lapointe (cDNA)         Welsh                70.9           95.2              34.1
                        Stuart               43.6           6.7               97.6
                        Singh                43.7           6.4               100
                        Welsh_Stuart_Singh   93.1           90.2              97.6
(Lei Xu et al, 2005, Bioinformatics, 21:3905-3911)
Marker TSP for Prostate Cancer
• HPN (Hepsin) [biomarker candidate for prostate cancer]
• STAT6 (Signal transducer and activator of transcription 6)
IF HPN > STAT6 THEN Prostate Cancer
ELSE Normal
PSA (Prostate Specific Antigen): Sn = 67.5% – 80% , Sp = 60% - 70%
TSP (HPN, STAT6): Sn = 91.7%, Sp = 97.7% (From this study!)
(Lei Xu et al, 2005, Bioinformatics, 21:3905-3911)
DIGE Technology
(From http://www5.amershambiosciences.com)
Proteomics Data
Experimental Settings:
Name      Model state                Tetracycline present?   Beta-estradiol present?
Tet (+)   Normal non-proliferating   yes                     no
Beta(+)   Normal proliferating       yes                     yes
Tet (-)   Cancer                     no                      no
Gels: 18 experiments
Cy2 – Internal Standards (18)
Cy3 – Cancer gels (18)
Cy5 – Normal gels (18)
1098 protein spots (BVA ratios from DeCyder software)
(Troy Anderson et al)
Decision Rule
[Figure: spot ratios in Normal and Cancer samples]
Decision Rule:
IF Ratio530 ≥ Ratio786 THEN Cancer,
ELSE Normal.
LOOCV Results:
Accuracy: 97.2% (35/36)
Sensitivity: 100% (18/18)
Specificity: 94.4% (17/18)
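The three figures follow directly from the counts in parentheses. In this small sketch the function name `summarize` is an assumption, and the confusion counts (18 cancer gels all called Cancer, 17 of 18 normal gels called Normal) are read off the fractions above.

```python
def summarize(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # fraction of cancer samples called Cancer
        "specificity": tn / (tn + fp),   # fraction of normal samples called Normal
    }

metrics = summarize(tp=18, fn=0, tn=17, fp=1)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```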
(Troy Anderson et al)
Protein Marker Spots
(Troy Anderson et al)
M. Bibikova et al (2006). High-throughput DNA methylation profiling
using universal bead arrays. Genome
Research 16: 383-393.
Bibikova et al (2006) Genome Research 16: 383
Cluster analysis of lung adenocarcinoma samples
• 55 CpG sites that are differentially methylated in cancer versus normal tissues with high confidence level
(adjusted P-value < 0.001) and significant change in absolute methylation level (|β| > 0.15).
• Cancer sample G12029 was mistakenly co-clustered with normal samples.
• Cancer sample D12162 was co-clustered with normal samples.
• Normal samples are underlined in green, cancer in red.
• The asterisks indicate misclassified samples.
k-TSP Results
k-TSP decision rules:
IF ATP5G1_1451 ≥ HS3ST2_311# THEN Normal, ELSE Cancer.
IF CDKN1B_1480 ≥ NEFL_524# THEN Normal, ELSE Cancer.
IF SLC22A18_679 ≥ ADAMTS12_1369 THEN Normal, ELSE Cancer.
# CpG sites used in Bibikova et al (adjusted p-values = 0.000112, top in their list).
Training Set (LOOCV = 95.5%)
Test Set (Prediction Accuracy = 85.42%)
[Figure: heat maps for Cancer and Normal samples in the training and test sets; the asterisks indicate misclassified samples]
Test Results:
[Figure legend: methylation level, low to high]
Sample     True Class   TSP      KTSP
D12152a    cancer       cancer   cancer
D12152b    cancer       cancer   cancer
D12165a    cancer       normal   cancer
D12165b    cancer       normal   cancer
D12155a    cancer       normal   normal
D12155b    cancer       cancer   cancer
D12170a    cancer       cancer   cancer
D12170b    cancer       cancer   cancer
D12197a    cancer       cancer   cancer
D12197b    cancer       normal   cancer
D12158a    cancer       cancer   cancer
D12158b    cancer       cancer   cancer
D12160a    cancer       cancer   cancer
D12160b    cancer       cancer   cancer
D12181a    cancer       cancer   cancer
D12181b    cancer       cancer   cancer
D12203a    cancer       cancer   cancer
D12203b    cancer       cancer   cancer
D12162a    cancer       normal   cancer
D12162b    cancer       normal   cancer
D12163a    cancer       cancer   cancer
D12163b    cancer       cancer   cancer
D12207a    cancer       cancer   cancer
D12207b    cancer       cancer   cancer
D12195a    normal       normal   normal
D12195b    normal       normal   normal
D12157a    normal       cancer   normal
D12157b    normal       normal   normal
D12173a    normal       normal   normal
D12173b    normal       cancer   normal
D12198a    normal       normal   normal
D12198b    normal       normal   normal
D12180a    normal       cancer   cancer
D12180b    normal       normal   cancer
D12202a    normal       cancer   cancer
D12202b    normal       normal   normal
D12182a    normal       normal   normal
D12182b    normal       normal   normal
D12205a    normal       normal   normal
D12205b    normal       normal   normal
D12184a    normal       normal   normal
D12184b    normal       normal   normal
D12164a    normal       cancer   cancer
D12164b    normal       normal   cancer
D12188a    normal       normal   normal
D12188b    normal       normal   normal
D12209a    normal       normal   normal
D12209b    normal       normal   cancer
SARS (Zhu et al, PNAS 2006)
PNAS (2006) 103(11): 4011-4016
[Figure panels: Unsupervised Learning; k-NN; LR]
Results on SARS Data
(Zhu et al 2006 PNAS)
(A) Canadian Sera Data (Training Set) [Data obtained from Supporting Table 4]
[Figure: heat maps of the protein features selected by TSP and k-TSP for SARS_NEG and SARS_POS sera]
TSP Decision Rule:
IF SARS-N pEGH-B4* > 229E-S 3/4 THEN SARS_POS, ELSE SARS_NEG
k-TSP Decision Rules:
IF SARS-N pEGH-B4* > 229E-S 3/4 THEN SARS_POS, ELSE SARS_NEG
IF SARS-NC1 pEGH-B7* > 229E-P 4/8 THEN SARS_POS, ELSE SARS_NEG
IF SARS-N-C2 pEGH-B8 #2 > SARS pEGH-109(Y) THEN SARS_POS, ELSE SARS_NEG
* Proteins identified by k-NN & LR methods in Zhu et al
Leave-One-Out Cross-Validation:
TSP accuracy: 87.7%
k-TSP accuracy: 90%
Results on the SARS Data
(Zhu et al 2006 PNAS)
(B) Chinese Sera Data (Testing Set) [Data obtained from Supporting Table 6]
[Figure: TSP and k-TSP heat maps for SARS_NEG and SARS_POS sera; the asterisk (*) marks the misclassified sample]
Both classifiers misclassified 1 sample (*) on the Chinese SARS data (56 samples)
Integrating Gene Expression Data and Clinical
Information for Cancer Outcome Prediction
Preliminary Results
Data Sets:
#genes
#samples
Dead
Alive
# Clinical features
Cancer
Beer et al West et al Huang et al Bullinger et al
7129
6283
7129
12625
49
100
86
89
24
15
36
66
34
34
62
53
7
5
8
12
Lung
Breast
Breast
Leukemia
Rosenwald et al Ovarian Pittman et al
7399
22215
12625
240
133
171
138
61
43
102
72
128
2
1
1
Lymphoma
Ovarian
Breast
Lung
54613
85
43
42
2
Lung
Miller et al Ma et al
44928
22575
236
60
181
28
55
32
7
8
Breast
Breast
LOOCV Accuracy (%):
Data set          TSP     k-TSP   DT(50-TSP + Clinical)
Beer et al        44.19   66.28   60.47
West et al        48.98   55.10   83.67
Huang et al       46.07   42.70   56.18
Bullinger et al   35.00   60.00   80.00
Rosenwald et al   63.33   61.25   57.50
Ovarian           64.70   77.40   72.18
Pittman et al     51.50   56.70   70.18
Lung              20.00   47.10   60.00
Miller et al      32.20   51.30   74.15
Ma et al          41.70   46.70   90.00
Software Availability
http://www.ccbm.jhu.edu
Conclusions
• Bioinformatics tools to facilitate biomarker discovery
• k-TSP is comparable with state-of-the-art
classifiers (PAM, SVM) in classifying gene
expression profiles
• k-TSP generates simple and accurate decision rules
– Biological significance
– Easy to interpret
– Potential clinical applications
• Allows “direct” data integration without performing
normalization
• Allows cross-platform analysis
• Applicable to a wide range of high-throughput data
Acknowledgements
• Prof. Raimond Winslow
• Prof. Donald Geman
• Prof. Daniel Naiman
• Lei Xu
• Troy Anderson
• DIMACS Travel Fellowships
• THANK YOU !
EMAIL: [email protected]
References:
• D. Geman, C. d’Avignon, D.Q. Naiman and R.L. Winslow
(2004). Classifying gene expression profiles from
pairwise mRNA comparison. Statistical Applications in
Genetics and Molecular Biology, 3: Article 19.
• A.C. Tan, D.Q. Naiman, L. Xu, R.L. Winslow and D.
Geman (2005). Simple decision rules for classifying
human cancers from gene expression profiles.
Bioinformatics, 21(20): 3896-3904.
• L. Xu, A.C. Tan, D.Q. Naiman, D. Geman and R.L.
Winslow (2005). Robust prostate cancer marker genes
emerge from direct integration of inter-study microarray
data. Bioinformatics, 21(20): 3905-3911.