Simple Decision Rules for Classifying Human Cancers from Gene Expression Profiles Aik Choon TAN Post-Doc Research Fellow [email protected] Prof. Raimond L. Winslow [email protected], Director, ICM & CCBM, Prof. Donald Geman [email protected], Prof. Daniel Naiman [email protected], Lei Xu [email protected], Troy Anderson [email protected] The Institute for Computational Medicine (ICM) and Center for Cardiovascular Bioinformatics and Modeling (CCBM), Johns Hopkins University Biomarkers Discovery Workflow Disease Clinical Applications Sample Collection Candidate Biomarkers Follow-up Study Patients Decision Rules Transcriptomics Pipeline Store MAGE-DB2 Gene Expression Profiling Experiments Disease 2500 Normal 5000 7500 10000 12500 6524.9+H 6 Query 4 CACO2 2 0 30 20 HCT116 10 0 15 Machine Learning Relative Expression Reversal Classifiers 6528.9+H 6517.8+H 10 BE 5 0 7.5 6519.1+H 5 A2780 2.5 0 15 10 6518.2+H DLD-1 5 0 10 6516.6+H 7.5 5 HT29 2.5 0 2500 Normal Store Mass Spectrometry 5000 7500 10000 12500 PROTEIN-DB2 Proteomics Pipeline Store Query Available at ICM/CCBM Difference Gel Electrophoresis AC TAN 2006 2 Outline 1) Relative Expression Reversal Classifiers • • TSP classifier k-TSP classifier 2) Results on binary & multi-class disease gene expression classification problems 3) Data Integration and Cross-platform analysis 4) Applications to other “–omics” data 5) Conclusions AC TAN 2006 3 Disease Classification AC TAN 2006 From http://research.dfci.harvard.edu/korsmeyer/Home.html 4 Microarray Gene Expression Profiles (Golub et al 1999) AML acute myeloid leukemia (myeloid precursor) ALL acute lymphoblastic leukemia (lymphoid precursors) AC TAN 2006 5 Learning Approaches (Ramaswamy and Golub 2002) AC TAN 2006 6 Gene Expression Profiles P × N matrix Y= {Cancer, Normal} N arrays (N = {1,2,…N} Label Cancer Normal … Cancer Geneid Array 1 Array 2 … Array N g1 103.02 58.79 … 101.54 P genes g2 40.55 1246.87 … 1432.12 P= {1,…,P} … … … … … gP 78.13 66.25 … 823.09 AC TAN 2006 7 Microarray Data Analysis • A P × N matrix where – P is the number of genes – N is the number of experiments – The columns are “gene expression profiles” AC TAN 2006 8 Sample Size Dilemma • Small N (typically tens to hundreds) • Large P (typically thousands) • Consequence: Standard methods in machine learning often lead to over-fitting and inflated estimates of performance. AC TAN 2006 9 Interpretability Dilemma (Biological Perspective) • The “decision boundary” generated by standard machine learning methods is often highly complex. • Examples: support vector machines, neural networks, random forests, nearest neighbors. • Consequence: Decision-making is a mystery and does not readily generate hypotheses or suggest follow-up studies. AC TAN 2006 10 Relative Expression Reversal Classifiers • Pairwise rank-based comparisons (relative expression values within each array) • Generates accurate and simple decision rules – TSP classifier: Top Scoring Pair – k-TSP classifier: k-disjoint Top Scoring Pairs • Data driven, parameter-free learning algorithm • Performance comparable to or exceeds that of other machine learning methods • Easy to interpret, facilitating follow-up study (small number of genes) AC TAN 2006 (Tan et al., 2005, Bioinformatics, 21:3896-3904) 11 Rank-based Classification • Novelty: Replace the measured expression values by their ranks within profiles, hence obtaining invariance to normalization. • Example: Differentiate between classes by finding pairs of genes whose ordering typically changes from Normal to Disease. • Simple Interpretation: Inversion of mRNA (protein) abundance. AC TAN 2006 12 Statistical Formulation • The expression profile is a random vector X = (X1,…,XP) • The true class is also a r.v. Y, say Y=1 (Disease) or Y=2 (Normal) • A classifier is a mapping f from X to {1,2}. • Training data: A P × N matrix S whose columns represent N = N1+N2 samples of (X, Y), with N1 (resp. N2) samples for which Y=1 (Y=2). AC TAN 2006 13 Statistical Formulation (cont) • Learning algorithm: A mapping from the training set S to a classifier f based on S. • Generalization error: e(f) = P(f(X)?Y). This depends on S and the distribution of (X,Y) and is extremely hard to estimate. • Dilemmas: – N << P – f(X) is too complex and hard to interpret AC TAN 2006 14 Gene Expression Comparisons • Features: Z ij = 1{ X < X } , • Feature Score: i j 1 ≤ i < j ≤ P. ∆ ij =| P ( X i < X j | Y = 1) − P ( X i < X j | Y = 2) ≈ N ij(1) N1 − N ij( 2 ) N2 . where N ij( k ) =| {1 ≤ m ≤ N : Ym = k , X im < X jm } |, k = 1,2. AC TAN 2006 15 1 g1 g2 g3 g4 g5 TSP Algorithm n1 n2 n3 n4 Cancer Cancer Normal Normal 1000 789 356 45 289 150 500 1000 634 450 220 150 367 455 150 50 2500 1800 1900 2100 2 n1 g1 g2 g3 g4 g5 3 n2 n3 n4 Cancer Cancer Normal Normal 2 2 3 5 5 5 2 2 3 4 4 3 4 3 5 4 1 1 1 1 4 P(g1 > g2 | Cancer) = 0/2 = 0 P(g1 > g2 | Normal) = 2/2 = 1 ? 12 = |P(g1>g2|Cancer) – P(g1 > g2|Normal)| = |0 – 1| = 1 High ? P(g1 > g3 | Cancer) = 0/2 = 0 P(g1 > g2 | Normal) = 1/2 = 0.5 ? 13 = |P(g1>g3|Cancer) – P(g1 > g3|Normal)| = |0 – 0.5| = 0.5 P(g1 > g4 | Cancer) = 0/2 = 0 P(g1 > g4 | Normal) = 1/2 = 0.5 ? 14 = |P(g1>g4|Cancer) – P(g1 > g4|Normal)| = |0 – 0.5| = 0.5 P(g1 > g5 | Cancer) = 2/2 = 1 P(g1 > g5 | Normal) = 2/2 = 1 ? 15 = |P(g1>g5|Cancer) – P(g1 > g5|Normal)| = |1 – 1| = 0 … P(g4 > g5 | Cancer) = 2/2 = 1 P(g4 > g5 | Normal) = 2/2 = 1 ? 45 = |P(g4>g5|Cancer) – P(g4 > g5|Normal)| = |1 – 1| = 0 AC TAN 2006 Low ? ? 12 = 1 . . . ? 13 = 0.5 ? 14 = 0.5 . . . ? 15 = 0 ? 45 = 0 16 TSP Classifier • Select only the top scoring pairs : – { (i*, j*): ? i*j* = ? max } • TSP classifier (hTSP) is based on these pairs: – Example: Let all the top scoring pairs “vote” (Geman et al, 2004) – Example: Select one unique top scoring pair, based on maximizing difference in ranks (i, j) (Tan et al, 2005) • Prediction: Suppose Pij(Normal) > Pij(Disease), Xnew = new profile: Normal, if Ri,new > Rj,new, ynew = hTSP(Xnew) = (1) Disease, otherwise. – If, on the other hand, if Pij(Disease) > Pij(Normal), then the decision rule is reversed. (Tan et al., 2005, Bioinformatics, 21:3896-3904) AC TAN 2006 17 Initial Conclusions • There may be many pairs of genes with an informative ordering – Motivation for k-TSP • The TSP classifier is sensitive to S for small samples but invariant to normalization – Motivation for “data integration” AC TAN 2006 18 k-TSP Classifier • Uses exactly k top disjoint pairs in prediction. • k is determined by internal cross-validation • Ensemble learning – to combine the discriminating power of many “weaker” rules to make more reliable predictions. • Prediction: – Suppose Xnew = new profile, each gene pair (iu, ju), u = 1,…, k, votes according (1). – The k-TSP classifier hk-TSP employs an unweighted majority voting procedure to obtain the final prediction of ynew. (Tan et al., 2005, Bioinformatics, 21:3896-3904) AC TAN 2006 19 Microarray Data Sets (Binary class Problems) # samples Data set Colon Leukemia CNS DLBCL Lung Prostate1 Prostate2 Prostate3 GCM Platform cDNA Affy Affy Affy Affy Affy Affy Affy Affy # genes 2,000 7,129 7,129 7,129 12,533 12,600 12,625 12,626 16,063 C1 40 (T) 25 (AML) 25 (C) 58 (D) 150 (A) 52 (T) 38 (T) 24 (T) 190 (C) C2 22 (N) 47 (ALL) 9 (D) 19 (F) 31 (M) 50 (N) 50 (N) 9 (N) 90 (N) Reference (Alon et al. 1998) (Golub et al. 1999) (Pomeroy et al. 2002) (Shipp et al. 2002) (Gordon et al. 2002) (Singh et al. 2002) (Stuart et al. 2004) (Welsh et al. 2001) (Ramaswamy et al. 2001) (Multi-class Problems) Data set Leukemia1 Lung1 Leukemia2 SRBCT Breast Lung2 DLBCL Leukemia3 Cancers GCM Platform Affy Affy Affy cDNA Affy Affy cDNA Affy Affy Affy # classes 3 3 3 4 5 5 6 7 11 14 # genes 7,129 7,129 12,582 2,308 9,216 12,600 4,026 12,558 12,533 16,063 # samples Training Testing 38 34 64 32 57 15 63 20 54 30 136 67 58 30 215 112 100 74 144 46 AC TAN 2006 Reference (Golub et al. 1999) (Beer et al. 2002) (Armstrong et al. 2002) (Khan et al. 2001) (Perou et al. 2000) (Bhattacharjee et al. 2001) (Alizadeh et al. 2000) (Yeoh et al. 2002) (Su et al. 2001) (Ramaswamy et al. 2001) 20 Results (LOOCV Binary Class Problems) Method TSP k-TSP DT NB k-NN SVM PAM Leukemia 93.80 95.83 73.61 100.00 84.72 98.61 97.22 CNS 77.90 97.10 67.65 82.35 76.47 82.35 82.35 DLBCL 98.10 97.40 80.52 80.52 84.42 97.40 85.71 Colon 91.10 90.30 80.65 58.06 74.19 82.26 85.48 Prostate1 95.10 91.18 87.25 62.75 76.47 91.18 91.18 Prostate2 67.60 75.00 64.77 73.86 69.32 76.14 79.55 Prostate3 97.00 97.00 84.85 90.91 87.88 100.00 100.00 Lung 98.30 98.90 96.13 97.79 98.34 99.45 99.45 GCM 75.40 85.40 77.86 84.29 82.86 93.21 79.29 Average 88.26 92.01 79.25 81.17 81.63 91.18 88.91 Number of Informative Genes Method TSP k-TSP DT PAM Leukemia 2 18 2 2296 CNS 2 10 2 4 DLBCL 2 2 3 17 Colon 2 2 3 15 Prostate1 2 2 4 47 Prostate2 2 18 4 13 Prostate3 2 2 1 701 Lung 2 10 3 9 GCM 2 10 14 47 (Tan et al., 2005, Bioinformatics, 21:3896-3904) AC TAN 2006 21 (a) TSP ALL AML IF SPTAN1 ≥ CD33* THEN ALL; ELSE AML ∆ = 0.9787 IF SPTAN1 ≥ CD33* THEN ALL; ELSE AML IF HA-1 ≥ ZYX* THEN ALL; ELSE AML IF TCF3* > APLP2 THEN ALL; ELSE AML IF ATP2A3* ≥ CST3* THEN ALL; ELSE AML IF DGKD > MGST1 THEN ALL; ELSE AML IF CCND3* ≥ NPC2 THEN ALL; ELSE AML IF TOP2B* > PLCB2 THEN ALL; ELSE AML IF Macmarcks ≥ CTSD* THEN ALL; ELSE AML IF PSMB8 ≥ DF* THEN ALL; ELSE AML ∆ = 0.9787 ∆ = 0.9787 ∆ = 0.9574 ∆ = 0.9387 ∆ = 0.9387 ∆ = 0.9387 ∆ = 0.9387 ∆ = 0.9362 ∆ = 0.9200 (b) k-TSP Normalized Expression Low High * Genes previously identified by Golub et al (1999) AC TAN 2006 (Tan et al., 2005, Bioinformatics, 21:3896-3904) 22 Results (Test Accuracy for Multi-Class Problems) Method HC-TSP HC-k-TSP DT NB k-NN 1-vs-1-SVM PAM Leuk1 97.06 97.06 85.29 85.29 67.65 79.41 97.06 Lung1 71.88 78.13 78.13 81.25 75.00 87.50 78.13 Leuk2 80.00 100 80.00 100 86.67 100 93.33 SRBCT 95.00 100 75.00 60.00 30.00 100 95.00 Breast 66.67 66.67 73.33 66.67 63.33 83.33 93.33 Lung2 83.58 94.03 88.06 88.06 88.06 97.01 100 DLBCL 83.33 83.33 86.67 86.67 93.33 100 90.00 Leuk3 77.68 82.14 75.89 32.14 75.89 84.82 93.75 Cancers 74.32 82.43 68.92 79.73 64.86 83.78 87.84 GCM 52.17 67.39 52.17 52.17 34.78 65.22 56.52 Average 78.17 85.12 76.35 73.20 67.96 88.11 88.50 Number of Informative Genes Method HC-TSP HC-k-TSP DT PAM Leuk1 4 36 2 44 Lung1 4 20 4 13 Leuk2 4 24 2 62 SRBCT 6 30 3 285 Breast 8 24 4 4822 Lung2 8 28 5 614 DLBCL 10 46 5 3949 Leuk3 12 64 16 3338 Cancers 20 128 10 2008 GCM 26 134 18 1253 (Tan et al., 2005, Bioinformatics, 21:3896-3904) AC TAN 2006 23 h1 h1 {AML, MLL} ALL IF SCRN1 ≥ HIST2H4 THEN AML; ELSE MLL IF ANPEP* ≥ P29 THEN AML; ELSE MLL IF CHRNA7 > TLR1 THEN AML; ELSE MLL IF ATF5 > NFYC THEN AML; ELSE MLL IF C6orf106 ≥ MEF2C THEN AML; ELSE MLL IF PHGDH ≥ CTGF THEN AML; ELSE MLL IF STAT4 ≥ MEIS1 THEN AML; ELSE MLL IF AMELX ≥ PQBP1 THEN AML; ELSE MLL IF DVL1 > ZNF148 THEN AML; ELSE MLL Normalized Expression low IF WFS1* ≥ MEIS1 THEN ALL; ELSE {AML,MLL} IF DNTT* ≥ LGALS1* THEN ALL; ELSE {AML,MLL} IF MYLK* > LGALS1* THEN ALL; ELSE {AML,MLL} high h2 h2 MLL AC TAN 2006 AML 24 “Direct” Data Integration (Lei Xu et al, 2005, Bioinformatics, 21:3905-3911) Lab A Lab X Lab B Lab Y Lab C AC TAN 2006 25 Data Sets Data Set Training Set Singh6 Stuart7 Welsh8 Testing Set LaTulippe9 Lapointe10 Microarray Platform Number of Probe Sets No. of Normal Samples No. of Cancer Samples Affy. HG_U95Av2 Affy. HG_U95Av2 Affy. HG_U95Av2 Affy. HG_U95Av2 Spotted cDNA 12600 12625 12626 12626 44160/43008* 50 50 9 3 41 52 38 24 23 62 * 22 samples (9 normal / 13 cancer) have 44160 probes and 81 samples (32 normal / 49 cancer) have 43008 probes. (Lei Xu et al, 2005, Bioinformatics, 21:3905-3911) AC TAN 2006 26 TSPs from Data Integration Training Data Set Welsh Sample Size 33 Probe Set ID of TSP (HG_U95Av2) 39608_at, 32526_at Gene Symbol of TSP SIM2, JAM3 Score of TSP 1.00 Classification Accuracy (%) 97.0 Stuart 88 41732_at, 456_at 0.74 69.3 Singh 102 40282_s_at, 2035_s_at CTNNB1, SMARCD3 DF, ENO1 0.90 95.1 Welsh_Stuart* 121 31971_at, 34213_at TP73L, KIBRA 0.79 77.7 Welsh_Singh 135 37639_at, 32198_at HPN, COMMD4 0.88 83.7 Stuart_Singh 190 37639_at, 41222_at HPN, STAT6 0.75 86.8 Welsh_Stuart_Singh 223 37639_at, 41222_at HPN, STAT6 0.78 88.8 * Welsh_Stuart is the integrated data set of Welsh and Stuart data sets. Other integrated data sets use similar symbols. (Lei Xu et al, 2005, Bioinformatics, 21:3905-3911) AC TAN 2006 27 Results on Test Set Testing Data Set Microarray Platform No. of Normal Sample No. of Cancer Sample Accuracy (%) Sensitivity (%) Specificity (%) LaTulippe Lapointe Overall Affy. HG_U95Av2 Spotted cDNA Cross-platform 3 41 44 23 61* 84 96.2 93.1 93.8 95.7 90.2 91.7 100 97.6 97.7 * One of the cancer samples has missing value for HPN and is removed from the testing set. Comparisons of Marker TSP with Individual TSPs Testing Data Set TSP Accuracy (%) Sensitivity (%) Specificity (%) LaTulippe (HG_U95Av2) Welsh Stuart Singh Welsh_Stuart_Singh Welsh Stuart Singh Welsh_Stuart_Singh 69.2 84.5 88.5 96.2 70.9 43.6 43.7 93.1 69.6 82.6 87.0 95.7 95.2 6.7 6.4 90.2 66.7 100 100 100 34.1 97.6 100 97.6 Lapointe (cDNA) (Lei Xu et al, 2005, Bioinformatics, 21:3905-3911) AC TAN 2006 28 Marker TSP for Prostate Cancer • HPN (Hepsin) [biomarker candidate for prostate cancer] • STAT6 (Signal transduction and translation protein) IF HPN > STAT6 THEN Prostate Cancer ELSE Normal PSA (Prostate Specific Antigen): Sn = 67.5% – 80% , Sp = 60% - 70% TSP (HPN, STAT6): Sn = 91.7%, Sp = 97.7% (From this study!) (Lei Xu et al, 2005, Bioinformatics, 21:3905-3911) AC TAN 2006 29 DIGE Technology (From http://www5.amershambiosciences.com) Experimental Settings: Name: Model state: Tetracycline present? Beta- estrodiol present? Proteomics Data Tet (+) Normal non-proliferating yes no Beta(+) Normal proliferating yes yes Tet (-) Cancer no no Gels: 18 experiments Cy2 – Internal Standards (18) Cy3 – Cancer gels (18) Cy5 – Normal gels (18) 1098 protein spots (BVA ratios from DeCyder software) AC TAN 2006 (Troy Anderson et al) 30 Decision Rule Normal Cancer Decision Rule: IF Ratio530 ≥ Ratio786 THEN Cancer, ELSE Normal. LOOCV Results: Accuracy: 97.2% Sensitivity: 100% Specificity: 94.4% (35/36) (18/18) (17/18) AC TAN 2006 (Troy Anderson et al) 31 Protein Marker Spots (Troy Anderson et al) AC TAN 2006 32 M. Bibilova et al (2006). Highthroughput DNA methylation profiling using universal bead arrays. Genome Research 16: 383-393. AC TAN 2006 33 Bibikova et al (2006) Genome Research 16: 383 Cluster analysis of lung adenocarcinoma samples •55 CpG sites that are differentially methylated in cancer versus normal tissues with high confidence level (adjusted P-value < 0.001) and significant change in absolute methylation level (|β| > 0.15). •Cancer sample G12029 was mistakenly coclustered with normal samples •Cancer sample D12162 was coclustered with normal samples •Normal samples are underlined in green, cancer in red. Bibikova et al (2006) Genome Research 16: 383 •The asterisks indicate misclassified samples. AC TAN 2006 34 k-TSP Results k-TSP decision rules: IF ATP5G1_1451 ≥ HS3ST2_311# THEN Normal, ELSE Cancer. IF CDKN1B_1480 ≥ NEFL_524# THEN Normal, ELSE Cancer. IF SLC22A18_679 ≥ ADAMTS12_1369 THEN Normal, ELSE Cancer. # CpG sites used in Bibikova et al (adjusted p-values = 0.000112, top in their list). Training Set (LOOCV = 95.5%) Cancer Normal Test Set (Prediction Accuracy = 85.42%) * ** Normal ** * * The asterisks indicate misclassified samples * Cancer Test Results: Sample D12152a D12152b D12165a D12165b D12155a D12155b D12170a D12170b D12197a D12197b D12158a D12158b D12160a D12160b D12181a D12181b D12203a D12203b D12162a D12162b D12163a D12163b D12207a D12207b D12195a D12195b D12157a D12157b D12173a D12173b D12198a D12198b D12180a D12180b D12202a D12202b D12182a D12182b D12205a D12205b D12184a D12184b D12164a D12164b D12188a D12188b D12209a D12209b AC TAN 2006 True Class cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer normal normal normal normal normal normal normal normal normal normal normal normal normal normal normal normal normal normal normal normal normal normal normal normal TSP cancer cancer normal normal normal cancer cancer cancer cancer normal cancer cancer cancer cancer cancer cancer cancer cancer normal normal cancer cancer cancer cancer normal normal cancer normal normal cancer normal normal cancer normal cancer normal normal normal normal normal normal normal cancer normal normal normal normal normal 35 low Methylation level high KTSP cancer cancer cancer cancer normal cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer cancer normal normal normal normal normal normal normal normal cancer cancer cancer normal normal normal normal normal normal normal cancer cancer normal normal normal cancer SARS ( Zhu et alPNAS 2006) PNAS (2006) 103(11): 4011-4016 Unsupervised Learning AC TAN 2006 k-NN LR 36 Results on SARS Data (Zhu et al 2006 PNAS) (A) Canadian Sera Data (Training Set) [Data obtained from Supporting Table 4] SARS_NEG SARS_POS TSP SARS-N pEGH-B4* 229E-S 3/4 SARS-N pEGH-B4* SARS-NC1 pEGH-B7* SARS-N-C2 pEGH-B8 #2 229E-S 3/4 229E-P 4/8 SARS pEGH-109(Y) k-TSP TSP Decision Rule: IF SARS-N pEGH-B4* > 229E-S 3/4 THEN SARS_POS, ELSE SARS_NEG k-TSP Decision Rules: IF SARS-N pEGH-B4* > 229E-S 3/4 THEN SARS_POS, ELSE SARS_NEG IF SARS-NC1 pEGH-B7* > 229E-P 4/8 THEN SARS_POS, ELSE SARS_NEG IF SARS-N-C2 pEGH-B8 #2 > SARS pEGH-109(Y) THEN SARS_POS, ELSE SARS_NEG * Proteins identified by k-NN & LR methods in Zhu et al Leave-One-Out Cross-Validation: TSP accuracy: 87.7% k-TSP accuracy: 90% AC TAN 2006 37 Results on the SARS Data (Zhu et al 2006 PNAS) (B) Chinese Sera Data (Testing Set) [Data obtained from Supporting Table 6] TSP SARS_NEG SARS_POS * k-TSP SARS_NEG SARS_POS * Both classifiers misclassified 1 sample (*) on the Chinese SARS data (56 samples) AC TAN 2006 38 Integrating Gene Expression Data and Clinical Information for Cancer Outcome Prediction AC TAN 2006 39 Preliminary Results Data Sets: #genes #samples Dead Alive # Clinical features Cancer Beer et al West et al Huang et al Bullinger et al 7129 6283 7129 12625 49 100 86 89 24 15 36 66 34 34 62 53 7 5 8 12 Lung Breast Breast Leukemia Rosenwald et al Ovarian Pittman et al 7399 22215 12625 240 133 171 138 61 43 102 72 128 2 1 1 Lymphoma Ovarian Breast Lung 54613 85 43 42 2 Lung Miller et al Ma et al 44928 22575 236 60 181 28 55 32 7 8 Breast Breast LOOCV Accuracy: Beer et al West et al Huang et al Bullinger et al Rosenwald et al Ovarian Pittman et al TSP 44.19 48.98 46.07 35.00 63.33 64.70 51.50 k-TSP 66.28 55.10 42.70 60.00 61.25 77.40 56.70 DT(50-TSP + Clinical) 60.47 83.67 56.18 80.00 57.50 72.18 70.18 AC TAN 2006 Lung 20.00 47.10 60.00 Miller et al Ma et al 32.20 41.70 51.30 46.70 74.15 90.00 40 http://www.ccbm.jhu.edu Software Availability AC TAN 2006 41 Conclusions • Bioinformatics tools to facilitate biomarkers discovery • k-TSP is comparable with the state-of-the-art classifiers (PAM, SVM) in classifying gene expression profiles • k-TSP generates simple and accurate decision rules – Biological significance – Easy to interpret – Potential clinical applications • Allow “direct” data integration without performing normalization • Allow cross-platform analysis • Applicable to a wide-range of high-throughput data AC TAN 2006 42 Acknowledgements • • • • • • Prof. Raimond Winslow Prof. Donald Geman Prof. Daniel Naiman Lei Xu Troy Anderson DIMACS Travel Fellowships • THANK YOU ! EMAIL: AC TAN 2006 [email protected] 43 References: • D. Geman, C. d’Avignon, D.Q. Naiman and R.L. Winslow (2004). Classifying gene expression profiles from pairwise mRNA comparison. Statistical Applications in Genetics and Molecular Biology, 3: Article 19. • A.C. Tan, D.Q. Naiman, L. Xu, R.L. Winslow and D. Geman (2005). Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics, 21(20): 3896-3904. • L. Xu, A.C. Tan, D.Q. Naiman, D. Geman and R.L. Winslow (2005). Robust prostate cancer marker genes emerge from direct integration of inter-study microarray data. Bioinformatics, 21(20): 3905-3911. AC TAN 2006 44
© Copyright 2026 Paperzz