KDD-2001 Cup The Genomics Challenge Christos Hatzis, Silico Insights David Page, University of Wisconsin Co-chairs August 26, 2001 Special thanks: DuPont Pharmaceuticals Research Laboratories for providing data set 1, Chris Kostas from Silico Insights for cleaning and organizing data sets 2 and 3 http://www.cs.wisc.edu/~dpage/kddcup2001/ The Genomics Challenge • High throughput technologies in genomics, proteomics and drug screening are creating large, complex datasets • Bioinformatics datasets are typically underdetermined – very large number of features (complex domain) – small number of instances (high cost per data point) • Multi-relational nature of data – reflect complex interactions between molecules, pathways and systems – Hierarchical organization of interacting layers • Current tools and approaches do not adequately address the Genomics Challenge KDD-2001 Cup 2 Overview • Cup organization • Dataset description – Thrombin binding – Gene function/localization prediction • Statistics • Tasks and highlights • Winners talk (3x10 min) KDD-2001 Cup 3 Cup Organization • KDD-2001 Cup web site – Posting of datasets, Q&A, answer keys • Schedule – – – – – – – Training dataset available: May 31 Question period 1: June 1-10 Test set available: July 13 Question period 2: July 13-24 Entries due: July 26 Winners notified: August 1 Results to participants: August 7 • Evaluation criteria – Task 1: weighted accuracy (average of true pos, true neg) – Tasks 2, 3: non-weighted accuracy KDD-2001 Cup 4 Dataset 1: Molecular Bioactivity Dataset provided by DuPont Pharmaceuticals for the KDD-2001 Cup competition • Activity of compounds binding to thrombin • Library of compounds included: – 1909 known molecules (42 actively binding thrombin) • 139,351 binary features describe the 3-D structure of each compound • 636 new compounds with unknown capacity to bind thrombin KDD-2001 Cup 5 Dataset 2: Protein Functional Annotation • Yeast Genome dataset – Data on the protein-protein interactions from MIPS database (Munich Information Centre for Protein Sequences) – Expression profiles: DeRisi et al. (1997) Science 278: 680 • Relational dataset Strong Similarity to Known Protein 4% – Gene information – Interaction information • Predict function, localization of unknown proteins Known Proteins 52% 6449 total proteins KDD-2001 Cup Weak Similarity to Known Protein 13% Similarity to Unknown Protein 16% No Similarity 8% Questionable ORFs 7% 6 Statistics: I. Participation KDD Cup Participation Total by Task (200 submissions) 160 45 Number of Participant Groups Thrombin 136 140 Function 114 120 Localization 41 100 80 Total by Affiliation (200 submissions) 60 20 Com 40 20 16 21 24 30 Gov Univ 107 66 Other 0 Cup 97 Cup 98 Cup 99 Cup 2000 Cup 2001 7 • 136 unique groups, 200 total entries by about 300-400 participants • Almost 5-fold increase over previous years • More than half of the entries from commercial sector KDD-2001 Cup 7 Statistics: II. Data Mining Software Task 1 Task 2 21 Total Task 3 9 42 12 Custom Public Domain 19 16 88 5 Commercial 17 53 6 6 Note: Statistics from 157 responders who provided details on their approach • Mostly custom software was used • Especially for task 1, where the number of features was too large for most commercial systems • Gap points to need for commercial tools that can cope with bioinformatics datasets KDD-2001 Cup 8 • • • • 0.7 0.6 0.5 Task 1 Task 2 Task 3 0.4 0.3 0.2 Cross Validation ILP OLAP Linear Regression Decision Table Genetic Programming Bayesian Net Logistic Regression Statistical Clustering Bagging SVM Association Rules Neural Net Boosting k-Nearest Neighbor Naïve Bayes Ensemble Classifier Decision Tree 0 Feature Construction 0.1 Feature Selection Fraction of Entries by Task Statistics: III. Algorithms Feature selection used in almost 70% of the entries for Task 1 Ensemble classifiers based on more than one algorithm used extensively Decision trees among the most commonly used, with Naïve Bayes and k-NN Cross-validation to deal with small dataset size KDD-2001 Cup 9 Task 1 Highlights • Test set was challenging second round of compounds made by chemists -- change in distribution. • Far more features than data points; can’t run most commercial systems even with 1G RAM. • Varying degrees of correlation among features. • Better than 60% weighted accuracy is impressive. • Pure binary prediction task, yet the winner is a Bayes net learning system (after feature selection). KDD-2001 Cup 10 Tasks 2 & 3: Relational Prediction Gene/Protein Level Interactions Gene Expression Gene Sequence Cluster E Cluster C Cluster A Structural Motifs Cluster B ATTGCCATT-ATGGCCATT-ATC-CAATTTT ATCTTC-TT-ACTGACC---AT*GCCATTTT Cluster D Proteomic Clusters Expression Clusters -0.31 -0.50 0.03 -0.76 -0.22 -0.39 -0.48 -0.57 -0.53 -0.41 -0.23 -0.38 -0.12 0.15 0.23 0.20 0.21 0.01 -0.21 -0.05 -0.07 -0.06 -0.11 0.24 0.62 0.38 0.47 0.28 0.68 0.56 0.39 0.25 0.41 0.64 0.16 -0.12 -0.30 -0.04 -0.65 -0.35 -0.56 -0.64 -0.59 -0.65 -0.58 -0.38 -0.57 -0.32 0.02 0.02 0.15 0.18 -0.01 -0.29 -0.19 -0.12 -0.25 -0.31 0.27 0.54 0.25 0.55 0.30 0.71 0.65 0.50 0.21 0.46 0.75 0.40 0.32 0.47 0.05 0.73 0.30 0.47 0.53 0.51 0.52 0.46 0.28 0.40 0.20 -0.14 -0.19 -0.25 -0.28 -0.09 0.17 0.01 0.00 0.09 0.10 -0.32 -0.66 -0.32 -0.55 -0.30 -0.70 -0.63 -0.35 -0.19 -0.37 -0.65 -0.20 0.30 0.46 0.06 0.72 0.31 0.48 0.55 0.52 0.53 0.48 0.29 0.41 0.22 -0.13 -0.18 -0.24 -0.28 -0.08 0.18 0.02 0.01 0.11 0.12 -0.32 -0.65 -0.32 -0.55 -0.30 -0.71 -0.64 -0.36 -0.18 -0.38 -0.66 -0.22 -0.76 -0.65 -0.22 -0.34 -0.04 0.14 0.22 0.29 0.41 0.27 0.27 0.53 0.25 0.42 0.57 0.46 0.51 0.48 0.47 0.72 0.55 0.50 0.71 0.39 0.57 0.21 0.18 0.11 0.08 0.13 -0.53 -0.20 -0.35 -0.26 -0.60 Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster 20 12 13 9 8 10 4 32 29 22 21 1 7 27 6 30 3 24 23 34 2 33 26 5 31 28 15 11 25 16 19 17 14 18 35 FUNCTION LOCATION Protein Interactions Chromosomal Location KDD-2001 Cup 11 Task 2 Highlights • Average of about 3 functions per protein. • Multi-relational, as are many real-world databases. • Yet top-scoring approaches were not pure relational learners. • But top-scoring approaches did account for multi-relational structure of the data. – Krogel: novel form of feature construction to capture relational information in a feature vector. – Sese, Hayashi, and Morishita: instance-based learning, but using the interactions relation as part of the distance function. KDD-2001 Cup 12 Task 3 Highlights • Similar to task 3, but only one localization per protein. • Similar lessons. • High overlap in top scorers for both tasks. • Question: did anyone “bootstrap” by using their predictions for function to help predict localization, or vice-versa? KDD-2001 Cup 13 KDD-2001 Cup Winners • Task 1: Jie Cheng, CIBC • Task 2: Mark-A. Krogel, Magdeburg Univ. • Task 3: Hisashi Hayashi, Jun Sese, and Shinichi Morishita, Univ. of Tokyo KDD-2001 Cup 14 Task 1 Winner KDD Cup 2001 Results Distribution of Prediction Accuracy Scores for Task 1: Thrombin Activity Task 1: Thrombin Name: Rank: Weighted Accuracy: Accuracy: Jie Cheng 1 68.4435 71.1356 1 1.000 0.9 Actual Positive Negative True Positive Rate: True Negative Rate: Predicted Positive Negative 95 55 128 356 223 411 63.3% 73.6% Cumulative Frequency 0.8 150 484 634 0.7 0.6 0.5 0.4 0.3 0.2 0.1 68.444 0 30 40 50 60 70 80 90 100 Score KDD-2001 Cup 15 Task 2 Winner KDD Cup 2001 Results Distribution of Prediction Accuracy Scores for Task 2: Function Prediction Task 2: Function Name: Rank: Accuracy: Weighted Accuracy: Mark-A. Krogel 1 93.6258 84.8290 1 1.000 0.9 Actual Positive Negative True Positive Rate: True Negative Rate: Predicted Positive Negative 690 282 58 4304 748 4586 71.0% 98.7% Cumulative Frequency 0.8 972 4362 5334 0.7 0.6 0.5 0.4 0.3 0.2 0.1 93.626 0 60 65 70 75 80 85 90 95 100 Score KDD-2001 Cup 16 Task 3 Winner KDD Cup 2001 Results Task 3: Localization Name: Rank: Accuracy: Hisashi Hayashi, Jun Sese, and Shinichi Morishita 1 72.1785 Distribution of Prediction Accuracy Scores for Task 3: Localization Prediction 1 1.000 0.9 Cumulative Frequency 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 72.179 0 0 20 40 60 80 100 Score KDD-2001 Cup 17 KDD-2001 Honorable Mentions Task 1: Silander, Univ. of Helsinki Task 2: Lambert, Golden Helix; Sese & Hayashi & Morishita; Vogel & Srinivasan, A.I. Insight Task 3: Schonlau & DuMouchel & Volinsky & Cortes, RAND and AT&T Labs; Frasca & Zheng & Parekh & Kohavi, Blue Martini KDD-2001 Cup 18 KDD-2001 Cup Winners • Task 1: • Task 2: • Task 3: Jie Cheng, CIBC Mark-A. Krogel, Magdeburg Univ. Hisashi Hayashi, Jun Sese, and Shinichi Morishita, Univ. of Tokyo KDD-2001 Cup 19
© Copyright 2026 Paperzz