The pitch

KDD-2001 Cup
The Genomics Challenge
Christos Hatzis, Silico Insights
David Page, University of Wisconsin
Co-chairs
August 26, 2001
Special thanks: DuPont Pharmaceuticals Research
Laboratories for providing data set 1, Chris Kostas from
Silico Insights for cleaning and organizing data sets 2 and 3
http://www.cs.wisc.edu/~dpage/kddcup2001/
The Genomics Challenge
• High throughput technologies in genomics,
proteomics and drug screening are creating
large, complex datasets
• Bioinformatics datasets are typically underdetermined
– very large number of features (complex domain)
– small number of instances (high cost per data point)
• Multi-relational nature of data
– reflect complex interactions between molecules,
pathways and systems
– Hierarchical organization of interacting layers
• Current tools and approaches do not
adequately address the Genomics Challenge
KDD-2001 Cup
2
Overview
• Cup organization
• Dataset description
– Thrombin binding
– Gene function/localization prediction
• Statistics
• Tasks and highlights
• Winners talk (3x10 min)
KDD-2001 Cup
3
Cup Organization
• KDD-2001 Cup web site
– Posting of datasets, Q&A, answer keys
• Schedule
–
–
–
–
–
–
–
Training dataset available: May 31
Question period 1: June 1-10
Test set available: July 13
Question period 2: July 13-24
Entries due: July 26
Winners notified: August 1
Results to participants: August 7
• Evaluation criteria
– Task 1: weighted accuracy (average of true pos, true neg)
– Tasks 2, 3: non-weighted accuracy
KDD-2001 Cup
4
Dataset 1: Molecular Bioactivity
Dataset provided by DuPont Pharmaceuticals for
the KDD-2001 Cup competition
• Activity of compounds binding to thrombin
• Library of compounds included:
– 1909 known molecules (42 actively binding
thrombin)
• 139,351 binary features describe the 3-D
structure of each compound
• 636 new compounds with unknown capacity to
bind thrombin
KDD-2001 Cup
5
Dataset 2: Protein Functional Annotation
• Yeast Genome dataset
– Data on the protein-protein interactions from MIPS database
(Munich Information Centre for Protein Sequences)
– Expression profiles: DeRisi et al. (1997) Science 278: 680
• Relational dataset
Strong Similarity
to Known
Protein
4%
– Gene information
– Interaction information
• Predict function,
localization of unknown
proteins
Known Proteins
52%
6449 total proteins
KDD-2001 Cup
Weak Similarity
to Known
Protein
13%
Similarity to
Unknown
Protein
16%
No Similarity
8%
Questionable
ORFs
7%
6
Statistics: I. Participation
KDD Cup Participation
Total by Task
(200 submissions)
160
45
Number of Participant Groups
Thrombin
136
140
Function
114
120
Localization
41
100
80
Total by Affiliation
(200 submissions)
60
20
Com
40
20
16
21
24
30
Gov
Univ
107
66
Other
0
Cup 97
Cup 98
Cup 99
Cup 2000
Cup 2001
7
• 136 unique groups, 200 total entries by about 300-400
participants
• Almost 5-fold increase over previous years
• More than half of the entries from commercial sector
KDD-2001 Cup
7
Statistics: II. Data Mining Software
Task 1
Task 2
21
Total
Task 3
9
42
12
Custom
Public Domain
19
16
88
5
Commercial
17
53
6
6
Note: Statistics from 157 responders who provided details on their approach
• Mostly custom software was used
• Especially for task 1, where the number of
features was too large for most commercial
systems
• Gap points to need for commercial tools that
can cope with bioinformatics datasets
KDD-2001 Cup
8
•
•
•
•
0.7
0.6
0.5
Task 1
Task 2
Task 3
0.4
0.3
0.2
Cross Validation
ILP
OLAP
Linear Regression
Decision Table
Genetic Programming
Bayesian Net
Logistic Regression
Statistical
Clustering
Bagging
SVM
Association Rules
Neural Net
Boosting
k-Nearest Neighbor
Naïve Bayes
Ensemble Classifier
Decision Tree
0
Feature Construction
0.1
Feature Selection
Fraction of Entries by Task
Statistics: III. Algorithms
Feature selection used in almost 70% of the entries for Task 1
Ensemble classifiers based on more than one algorithm used extensively
Decision trees among the most commonly used, with Naïve Bayes and k-NN
Cross-validation to deal with small dataset size
KDD-2001 Cup
9
Task 1 Highlights
• Test set was challenging second round of
compounds made by chemists -- change in
distribution.
• Far more features than data points; can’t run
most commercial systems even with 1G RAM.
• Varying degrees of correlation among
features.
• Better than 60% weighted accuracy is
impressive.
• Pure binary prediction task, yet the winner is a
Bayes net learning system (after feature
selection).
KDD-2001 Cup
10
Tasks 2 & 3: Relational Prediction
Gene/Protein Level
Interactions
Gene Expression
Gene Sequence
Cluster E
Cluster C
Cluster A
Structural Motifs
Cluster B
ATTGCCATT-ATGGCCATT-ATC-CAATTTT
ATCTTC-TT-ACTGACC---AT*GCCATTTT
Cluster D
Proteomic Clusters
Expression
Clusters
-0.31
-0.50
0.03
-0.76
-0.22
-0.39
-0.48
-0.57
-0.53
-0.41
-0.23
-0.38
-0.12
0.15
0.23
0.20
0.21
0.01
-0.21
-0.05
-0.07
-0.06
-0.11
0.24
0.62
0.38
0.47
0.28
0.68
0.56
0.39
0.25
0.41
0.64
0.16
-0.12
-0.30
-0.04
-0.65
-0.35
-0.56
-0.64
-0.59
-0.65
-0.58
-0.38
-0.57
-0.32
0.02
0.02
0.15
0.18
-0.01
-0.29
-0.19
-0.12
-0.25
-0.31
0.27
0.54
0.25
0.55
0.30
0.71
0.65
0.50
0.21
0.46
0.75
0.40
0.32
0.47
0.05
0.73
0.30
0.47
0.53
0.51
0.52
0.46
0.28
0.40
0.20
-0.14
-0.19
-0.25
-0.28
-0.09
0.17
0.01
0.00
0.09
0.10
-0.32
-0.66
-0.32
-0.55
-0.30
-0.70
-0.63
-0.35
-0.19
-0.37
-0.65
-0.20
0.30
0.46
0.06
0.72
0.31
0.48
0.55
0.52
0.53
0.48
0.29
0.41
0.22
-0.13
-0.18
-0.24
-0.28
-0.08
0.18
0.02
0.01
0.11
0.12
-0.32
-0.65
-0.32
-0.55
-0.30
-0.71
-0.64
-0.36
-0.18
-0.38
-0.66
-0.22
-0.76
-0.65
-0.22
-0.34
-0.04
0.14
0.22
0.29
0.41
0.27
0.27
0.53
0.25
0.42
0.57
0.46
0.51
0.48
0.47
0.72
0.55
0.50
0.71
0.39
0.57
0.21
0.18
0.11
0.08
0.13
-0.53
-0.20
-0.35
-0.26
-0.60
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
Cluster
20
12
13
9
8
10
4
32
29
22
21
1
7
27
6
30
3
24
23
34
2
33
26
5
31
28
15
11
25
16
19
17
14
18
35
FUNCTION
LOCATION
Protein Interactions
Chromosomal Location
KDD-2001 Cup
11
Task 2 Highlights
• Average of about 3 functions per protein.
• Multi-relational, as are many real-world
databases.
• Yet top-scoring approaches were not pure
relational learners.
• But top-scoring approaches did account for
multi-relational structure of the data.
– Krogel: novel form of feature construction to capture
relational information in a feature vector.
– Sese, Hayashi, and Morishita: instance-based
learning, but using the interactions relation as part of
the distance function.
KDD-2001 Cup
12
Task 3 Highlights
• Similar to task 3, but only one localization per
protein.
• Similar lessons.
• High overlap in top scorers for both tasks.
• Question: did anyone “bootstrap” by using
their predictions for function to help predict
localization, or vice-versa?
KDD-2001 Cup
13
KDD-2001 Cup Winners
• Task 1:
Jie Cheng, CIBC
• Task 2:
Mark-A. Krogel, Magdeburg Univ.
• Task 3:
Hisashi Hayashi, Jun Sese, and
Shinichi Morishita, Univ. of Tokyo
KDD-2001 Cup
14
Task 1 Winner
KDD Cup 2001 Results
Distribution of Prediction Accuracy Scores for
Task 1: Thrombin Activity
Task 1: Thrombin
Name:
Rank:
Weighted Accuracy:
Accuracy:
Jie Cheng
1
68.4435
71.1356
1
1.000
0.9
Actual
Positive
Negative
True Positive Rate:
True Negative Rate:
Predicted
Positive
Negative
95
55
128
356
223
411
63.3%
73.6%
Cumulative Frequency
0.8
150
484
634
0.7
0.6
0.5
0.4
0.3
0.2
0.1
68.444
0
30
40
50
60
70
80
90
100
Score
KDD-2001 Cup
15
Task 2 Winner
KDD Cup 2001 Results
Distribution of Prediction Accuracy Scores for
Task 2: Function Prediction
Task 2: Function
Name:
Rank:
Accuracy:
Weighted Accuracy:
Mark-A. Krogel
1
93.6258
84.8290
1
1.000
0.9
Actual
Positive
Negative
True Positive Rate:
True Negative Rate:
Predicted
Positive
Negative
690
282
58
4304
748
4586
71.0%
98.7%
Cumulative Frequency
0.8
972
4362
5334
0.7
0.6
0.5
0.4
0.3
0.2
0.1
93.626
0
60
65
70
75
80
85
90
95
100
Score
KDD-2001 Cup
16
Task 3 Winner
KDD Cup 2001 Results
Task 3: Localization
Name:
Rank:
Accuracy:
Hisashi Hayashi, Jun Sese, and Shinichi Morishita
1
72.1785
Distribution of Prediction Accuracy Scores for
Task 3: Localization Prediction
1
1.000
0.9
Cumulative Frequency
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
72.179
0
0
20
40
60
80
100
Score
KDD-2001 Cup
17
KDD-2001 Honorable Mentions
Task 1:
Silander, Univ. of Helsinki
Task 2:
Lambert, Golden Helix;
Sese & Hayashi & Morishita;
Vogel & Srinivasan, A.I. Insight
Task 3:
Schonlau & DuMouchel & Volinsky &
Cortes, RAND and AT&T Labs;
Frasca & Zheng & Parekh & Kohavi,
Blue Martini
KDD-2001 Cup
18
KDD-2001 Cup Winners
• Task 1:
• Task 2:
• Task 3:
Jie Cheng, CIBC
Mark-A. Krogel, Magdeburg Univ.
Hisashi Hayashi, Jun Sese, and
Shinichi Morishita, Univ. of Tokyo
KDD-2001 Cup
19